Some missing things in the current threading implementation

Sönke Ludwig ludwig at informatik.uni-luebeck.de
Sun Sep 12 06:35:42 PDT 2010


Recently I thought it would be a good idea to try out the new 
concurrency system once again. Some time back, when 'shared' was still 
new, I tried it several times, but since it was completely unusable at 
the time I gave up on it (and it seems many others did the same).

Now, however, after TDPL has been released and there is some 
documentation plus std.concurrency, the system should be in a state 
where it is actually useful, with only some bugs left to fix - none of 
which should require inherent changes to the system. The reality is 
quite different once you step anywhere off the already-walked path 
(defined by the book examples and similar things).

Just for the record, I've done a lot with most kinds of threading 
schemes (even if the only lockless thing I implemented was a simple 
Shared/WeakPtr implementation *shiver*). This may very well mean that 
some patterns are burned into my head that somehow clash with some of 
the ideas behind the current system. But for most of the points I am 
quite sure that there is no viable alternative if performance and 
memory consumption are to be anywhere near the optimum.

I apologize for the length of this post, even though I have already 
tried to make it as short as possible and left out a lot of details. It 
is also quite possible that I assume some false things about the 
concurrency implementation, because my knowledge is based mostly on the 
NG and the book chapter.

The following problems are the ones I found during a one-day endeavor 
to convert some parts of my code base to spawn/shared (not really 
successfully, partly because of the very viral nature of shared).


1. spawn and objects

	Spawn only supports 'function' plus some bound parameters. Since 
taking the address of an object method in D always yields a delegate, 
it is not possible to call class members without a static wrapper 
function. This can be quite disturbing when working in an 
object-oriented style (C++ obviously has the same problem).
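
	To illustrate, this is the kind of static wrapper that is currently 
needed (just a sketch - the Worker class is made up, and the cast of 
course bypasses exactly the safety that 'shared' is supposed to give):

	---
		import std.concurrency;

		class Worker {
			void process(int value){ /* the actual work */ }

			// static trampoline: &Worker.run is a plain function,
			// so spawn() accepts it, unlike &w.process
			static void run(shared(Worker) w, int value){
				// cast away shared only to make the call compile;
				// real code has to guarantee the synchronization itself
				(cast(Worker)w).process(value);
			}
		}

		void main(){
			auto w = cast(shared)new Worker;
			spawn(&Worker.run, w, 42);
		}
	---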


2. error messages

	Right now, error messages just state that there is a shared/unshared 
mismatch somewhere. For a non-shared-expert, this can be a real bummer. 
You have to know many of the implications of 'shared' to be able to 
correctly interpret these messages and track down the cause. Not very 
good for a feature that is meant to make threading easier.


3. everything is implicit

	This may seem kind of counter-intuitive, but using 'synchronized' 
classes and features like setSameMutex - which are absolutely 
necessary; it would be foolish to neglect the importance of lock-based 
threading in an object-oriented environment - creates a feeling of 
climbing without a safety rope. Not stating how you really want to 
synchronize/lock, and not being able to read directly from the code how 
this is really done, leaves a black-box feeling. This in turn means 
threading newcomers will not be educated; they just use the system 
somehow and it magically works. But as soon as you run into problems 
such as deadlocks, you suddenly have to understand the details, and at 
that moment you have to read up on and remember everything that is 
going on in the background - plus everything you would have to know 
about threading/synchronization in C. I'm not sure if this is the right 
course here or if there is a better one.


4. steep learning curve - more of a high learning wall to climb

	Following from the first points, my feeling is that a newcomer who 
has not followed the discussions and thoughts about the system here 
will find himself standing before a very high barrier of material to 
learn before he can actually put any of it to use. I also imagine this 
to be a very painful process, because of all the things you discover to 
be impossible, and because of error messages that potentially make you 
bang your head against the wall.

	
5. advanced synchronization primitives need to be considered

	Things such as core.sync.condition (the most important one) need to 
be considered in the 'shared' system. This means there needs to be a 
condition variable that takes a shared object instead of a mutex, or 
you have to be able to query an object's mutex.
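
	As a workaround today, druntime's Mutex can be installed as an 
object's monitor, so that a Condition and synchronized(this) at least 
use the same lock (a sketch with a made-up Queue class - and note that 
none of this works through 'shared'/'synchronized' classes):

	---
		import core.sync.condition;
		import core.sync.mutex;

		class Queue {
			private int[] items;
			private Condition cond;

			this(){
				// the Mutex constructor taking an Object installs the
				// mutex as that object's monitor
				auto m = new Mutex(this);
				cond = new Condition(m);
			}

			void put(int x){
				synchronized(this){
					items ~= x;
					cond.notify();
				}
			}

			int take(){
				synchronized(this){
					while( items.length == 0 )
						cond.wait();
					auto x = items[0];
					items = items[1 .. $];
					return x;
				}
			}
		}
	---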

	
6. temporary unlock

	In lock-based programming there are often situations in which you 
need to temporarily unlock your mutex, perform some time-consuming 
external task (disk i/o, ...) and then reacquire the mutex. This 
feature - which is really important, also because it is really 
difficult and dirty to work around - needs language support. It could 
be something like the inverse of a synchronized {} scope, or the 
possibility to define a special kind of private member function that 
unlocks the mutex. Inside such blocks the compiler of course has to 
make sure that the appropriate access rules are not broken (which could 
be as conservative as disallowing access to any class member).
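
	Today the only way to get this pattern is to manage the mutex by 
hand, roughly like the following sketch (Cache and slowLoad are made up 
for illustration); with a 'synchronized' class there is no way to 
express the unlocked region at all:

	---
		import core.sync.mutex;

		class Cache {
			private Mutex m;
			private string[string] entries;

			this(){ m = new Mutex; }

			string get(string key){
				m.lock();
				if( auto p = key in entries ){
					auto value = *p;
					m.unlock();
					return value;
				}
				m.unlock();                 // temporarily give up the lock...

				auto value = slowLoad(key); // ...during the expensive i/o

				m.lock();                   // ...and reacquire to publish
				entries[key] = value;
				m.unlock();
				return value;
			}

			private string slowLoad(string key){
				return "value of " ~ key;   // stands in for disk/network i/o
			}
		}
	---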

	
7. optimization of pseudo-shared objects

	Since the sharability/'synchronized' property of an object is already 
decided at class definition time, for performance reasons it should be 
possible to somehow disable the mutex of those instances that are only 
used thread-locally. Maybe it should be necessary to declare objects as 
"shared C c;" even if the class is defined as "synchronized class C 
{}", and otherwise you get an object without a mutex which is not 
shared?

	
8. very strong split of shared and non-shared worlds

	For container classes in particular, it is really nasty that you 
have to define two versions of the container, one shared and one 
non-shared, if you want to be able to use it in both contexts and to 
put non-shared objects into it in a non-shared context. There should 
also really be a way to declare a class to be hygienic, in a way 
similar to pure, so that it could be used in a synchronized context and 
store shared objects even though it is not shared itself.

	
9. unique

	Unique objects or chunks of data are really important, not only to 
be able to check that a cast to 'immutable' is correct, but also to 
allow passing objects to another thread for computation without making 
a superfluous copy or doing superfluous work.

	
10. recursive locking

	The only option right now is to have mutexes behave recursively. 
This makes it easy to avoid deadlocks within the same thread. However, 
in my experience recursive mutexes are very dangerous, because 
typically no one takes into account what happens when an algorithm is 
invoked recursively from the middle of its own computation. This can 
happen easily in a threaded environment where you often use 
signals/slots or message passing. In 90% of such situations a deadlock, 
or at least an assertion in debug mode, is a good indicator that 
something just happened that should not have. Of course, objects with 
shared mutexes are a different matter - in that case you actually need 
an ownership relation to do anything useful with non-recursive mutexes.
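
	A sketch of the danger (Broadcaster is made up): a class fires 
callbacks while holding its lock; if a callback re-enters the class, a 
recursive mutex silently lets it through and the loop below runs over 
state that is being mutated, whereas a non-recursive mutex would have 
deadlocked right at the bug:

	---
		import core.sync.mutex;

		class Broadcaster {
			private Mutex m;
			private void delegate()[] slots;

			this(){ m = new Mutex; } // druntime mutexes are recursive

			void subscribe(void delegate() dg){
				m.lock(); scope(exit) m.unlock();
				slots ~= dg;
			}

			void emit(){
				m.lock(); scope(exit) m.unlock();
				foreach( s; slots )
					s(); // a slot may call subscribe() again - the
					     // recursive mutex lets it through instead of
					     // flagging the re-entry
			}
		}
	---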

	
11. holes in the system

	It seems like there are a lot of ways in which you can still slip in 
non-shared data into a shared context.

	One example: the following passes a shared array to a function 
taking a non-shared one, yet it compiles:
	---
		void fnc(int[] arr){}
		void fnc2(){
			shared int[] arr;
			spawn(&fnc, arr);
		}
	---
	
	This is just a bug and probably easy to fix, but what about:
	
	---
		class C {
			private void method(){}
			private void method2(){
				spawn( function void(C inst){ inst.method(); }, this );
			}
		}
	---
	
	Unless calls to private functions also take the lock (which in turn 
is usually useless overhead), method() will be invoked in a completely 
unprotected context. This one has to be fixed somehow in the language. 
I'm sure there are other holes like these.

	
12. more practical examples need to be considered

	It seems right now that all the examples used to explore the 
features needed in the system are of a rather academic nature: either 
the most simple i/o, pure functional computation, or maybe a network 
protocol. However, when it comes to practical high-performance 
computation on real systems, where memory consumption and low-level 
performance really matter, there seems to be quite some no-man's-land 
here.
	
	Here are some simple examples where I immediately came to a grinding halt:
	
	I. An object loader with background processing
	
		You have a shared class Loader which uses multiple threads to load 
objects on demand and then fires a signal or returns from its 
loadObject(x) method.
		
		The problem is that the actual loading of an object must happen 
outside of a synchronized region of the loader, or you get no 
parallelism out of this. You also have to use an external function 
because of 'spawn', instead of being able to use a member function 
directly. Fortunately, in this case that is also the solution: define 
an external function that takes the arguments needed to load the 
object, loads it, and then passes it back to the class.
		Waiting for finished objects can be implemented using message 
passing without worry here, because the MP overhead is probably low 
enough.
		
		Features missing:
			- spawn with methods
			- temporary unlock
			
	II. Implementation of a ThreadPool
	
		The majority of applications can very well be broken up into small 
chunks of work that can be processed in parallel. Instead of using a 
costly thread-create, run-task, thread-destroy cycle, it would be wise 
to reuse the threads for later tasks. The implementation of a thread 
pool that does this is of course a low-level thing, and you could argue 
that it is ok to use some casts and similar tricks here. Still, there 
are quite a few things missing.
		
		Features Missing:
			- spawn with methods
			- temporary unlock
			- condition variables (message passing too slow + you need to manage destinations)

	III. multiple threads computing separate parts of an array
		
		Probably the simplest form of parallelism is to perform similar 
operations on each element of an array (or similar operations on 
regions of the array) in separate threads.
		The good news is that this works in the current implementation. The 
bad news is that it is really slow, because you have to use atomic 
operations on the elements or it is unsafe and prone to low-level races. 
Right now the compiler checks almost nothing.
		The alternative would be to pass unique slices.
		
		To illustrate the current situation, this compiles and runs:

		---
			import std.concurrency;
			import std.stdio;

			void doCompute(size_t offset, int[] arr){ // arr should be shared
				foreach(i, ref el; arr){
					// should be an atomic operation, which would make this
					// useless because of the performance penalty
					el *= 2;
					writefln("Thread %s computed element %d: %d",
						thisTid(), i + offset, cast(int)el);
				}
			}

			void waitForThread(Tid thread){
				// TODO: implement in some complex way using messages,
				// or maybe there is a simple function for this
			}

			void main(){
				shared int[] myarray = [1, 2, 3, 4];
				Tid[2] threads;
				foreach( i, ref t; threads )
					// should error out because the slice is not shared
					t = spawn(&doCompute, i, myarray[i .. i+3]);
				foreach( t; threads )
					waitForThread(t);
			}
		---
		
		Features missing:
			- unique
			- some way to safely partition/slice an array and get a set of still 
unique slices


- Sönke


More information about the Digitalmars-d mailing list