Some missing things in the current threading implementation
Sönke Ludwig
ludwig at informatik.uni-luebeck.de
Sun Sep 12 06:35:42 PDT 2010
Recently I thought it would be a good idea to try out the new
concurrency system once again. Some time back, when 'shared' was still
new, I already tried it several times but since it was completely
unusable I gave up on it at that time (and as it seems, many others also
did this).
Now, however, after TDPL has been released and there is some
documentation + std.concurrency, the system should be in a state where
it is actually useful and only some bugs should be there to fix - which
does not include inherent system changes. The reality is quite different
once you step anywhere beside the already walked path (defined by the
book examples and similar things).
Just for the record, I've done a lot with most kinds of threading
schemes (even if the only lockless thing I implemented was a simple
Shared/WeakPtr implementation *shiver*). This may very well have the
effect that there are some patterns burned into my head that somehow
clash with some ideas behind the current system. But, for most of the
points, I am quite sure that there is no viable alternative if
performance and memory consumption should be anywhere new the optimum.
I apologize for the length of this post, although I already tried to
make it as short as possible and left out a lot of details. Also it is
very possible that I assume some false things about the concurrency
implementation because my knowledge is mostly based only on the NG and
the book chapter.
The following problems are those that I found during a one day endeavor
to convert some parts of my code base to spawn/shared (not really
successful, partly because of the very viral nature of shared).
1. spawn and objects
Spawn only supports 'function' + some bound parameters. Since taking
the address of an object method in D always yields a delegate, it is not
possible to call class members without a static wrapper function. This
can be quite disturbing when working object oriented (C++ obviously has
the same problem).
2. error messages
Right now, error messages just state that there is a shared/unshared
mismatch somewhere. For a non-shared-expert, this can be a real bummer.
You have to know a lot of implications 'shared' has to be able to
correctly interpret these messages and track down the cause. Not very
good for a feature that is meant to make threading easier.
3. everything in implicit
This may seem kind of counter-intuitive, but using 'synchronized'
classes and features like setSameMutex - which are deadly necessary, it
is stupid to neglect the importance of lock based threading in an object
oriented environment - creates a feeling of climbing without a safety
rope. Not stating how you really want to synchronize/lock and not being
able to directly read from the code how this is really done just leaves
a black-box feeling. This in turn means threading newcomers will not be
educated, they just use the system somehow and it magically works. But
as soon as you get problems such as deadlocks, you suddenly have to
understand the details and in this moment you have to read up and
remember everything that is going on in the background - plus everything
you would have to know about threading/synchronization in C. I'm not
sure if this is the right course here or if there is any better one.
4. steep learning curve - more a high learning wall to climb on
Resulting from the first points, my feeling tells me that a newcomer,
who has not followed the discussions and thoughts about the system here,
will see himself standing before a very high barrier of material to
learn, before he can actually put anything of it to use. Also I imagine
this to be a very painful process because of all the things that you
discover are not possible or those error messages that potentially make
you banging your head against the wall.
5. advanced synchronization primitives need to be considered
Things such as core.sync.condition (the most important one) need to be
considered in the 'shared'-system. This means there needs to be a
condition variable that takes a shared object instead of a mutex or you
have to be able to query an objects mutex.
6. temporary unlock
There are often situations when you do lock-based programming, in which
you need to temporarily unlock your mutex, perform some time consuming
external task (disk i/o, ...) and then reaquire the mutex. For this
feature, which is really important also because it is really difficult
and dirty to work around it, needs language support, could be something
like the inverse of a synchronized {} scope or the possibility to define
a special kind of private member function that unlocks the mutex. Then,
inside whose blocks the compiler of course has to make sure that the
appropriate access rules are not broken (could be as conservative as
disallowing access to any class member).
7. optimization of pseudo-shared objects
Since the sharability/'synchronized' of an object is already decided at
class definition time, for performance reasons it should be possible to
somehow disable the mutex of those instances that are only used thread
locally. Maybe it should be necessary to declare objects as "shared C
c;" even if the class is defined as "synchronized class C {}" or you
will get an object without a mutex which is not shared?
8. very strong split of shared and non-shared worlds
For container classes in particular it is really nasty that you have to
define two versions of the container, one shared and the other
non-shared if you want to be able to use it in both contexts and be able
to put non-shared objects in it in a non-shared context. Also there
should really be a way to declare a class to be hygienic in a way
similar to pure, so that it would be possible to allow it to be used in
a synchronized context and store shared objects, although it is not
shared itself.
9. unique
Unique objects or chunks of data are really important not only to be
able to check that a cast to 'immutable' is correct, but also to allow
for passing objects to another thread for computations without making a
superfluous copy or doing superfluous computation.
10. recursive locking
The only option right now is to have mutexes behave recursively. This
is good to easily avoid deadlocks in the same thread. However, in my
experience they are very dangerous because typically no one takes into
account what happens when an algorithm is invoked recursively from the
middle of its computation. This can happen easily in a threaded
environment where you often use signals/slots or message passing. A
deadlock or at least an assertion in debug mode is a good indicator in
90% of the situations that there just happened something that should
not. Of course objects with shared mutexes are a different matter - in
this case you actually need to have an ownership relation to do anything
useful with non-recursive mutexes.
11. holes in the system
It seems like there are a lot of ways in which you can still slip in
non-shared data into a shared context.
One example is that you can pass a shared array
---
void fnc(int[] arr);
void fnc2(){
shared int[] arr;
spawn(&fnc, arr);
}
---
compiles. This is just a bug and probably easy to fix but what about:
---
class C {
private void method();
private void method2(){
spawn( void function(C inst){ inst.method(); }, this );
}
}
---
unless private functions to recursive locking (which in turn is usually
useless overhead), method() will be invoked in a completely unprotected
context. Tthis one has to be fixed somehow in the language. I'm sure
there are other things like these.
12. more practical examples need to be considered
It seems right now, that all the examples, that are used to explore the
features needed in the system, are somehow of a very academical nature.
Either the most simple i/o or pure functional comptation, maybe a
network protocol. However, when it comes to practical high performance
computation on real systems, where memory consumption and low-level
performance can really matter, there seems to be quite some no-mans-land
here.
Here some simple examples where I immediately came to a grinding halt:
I. A an object loader with background processing
You have a shared class Loader which uses multiple threads to load
objects on demand and then fires a signal or returns from its
loadObject(x) method.
The problem is that the actual loading of an object must happen
outside of a synchronized region of the loader or you get no parallelism
out of this. Also, you have to use an external function because of
'spawn' instead of being able to directly use a member function.
Fortunately in this case this is also the solution. Defining an external
function, that takes the arguments needed to load the object, loading
it, and then passing it back to the class.
Waiting for finished objects can be implemented using message passing
without worry here because the MP overhead is probably low enough.
Features missing:
- spawn with methods
- temporary unlock
II. Implementation of a ThreadPool
The majority of applications can very well be broken up into small
chunks of work that can be processed in parallel. Instead of using a
costly thread-create, run task, thread-destroy cycle, it would be wise
to reuse the threads for later tasks. The implementation of a thread
pool that does this is of course a low-level thing and you could argue
that it is ok to use some casts and such stuff here. Anyway, there are
quite some things missing here.
Features Missing:
- spawn with methods
- temporary unlock
- condition variables (message passing too slow + you need to manage
destinations)
III. multiple threads computing separate parts of an array
Probably the most simple form of parallelism is to perform similar
operations on each element of an array (or similar things on regions of
the array) and to do this in separate threads.
The good news is that this works in the current implementation. The
bad news is that this is really slow because you have to use atomic
operations on the elements or it is unsafe and prone to low-level races.
Right now the compiler checks almost nothing.
The alternative would be to pass unique
To illustrate the current situation, this compiles and runs:
---
import std.concurrency;
import std.stdio;
void doCompute(size_t offset, int[] arr){ // arr should be shared
foreach(i, ref el; arr){
el *= 2; // should be an atomic operation, which would make this
useless because of the performance penalty
writefln("Thread %s computed element %d: %d", thisTid(), i +
offset, cast(int)el);
}
}
void waitForThread(Tid thread){
// TODO: implement in some complex way using messages or maybe there
is a simple function for this
}
void main(){
shared int[] myarray = [1, 2, 3, 4];
Tid[2] threads;
foreach( i, ref t; threads )
t = spawn(&doCompute, i, myarray[i .. i+3]); // should error out
because the slice is not shared
foreach( t; threads )
waitForThread(t);
}
---
Features missing:
- unique
- some way to safely partition/slice an array and get a set of still
unique slices
- Sönke
More information about the Digitalmars-d
mailing list