Asynchronicity and more

Fawzi Mohamed fawzi at gmx.ch
Sat Apr 2 17:24:50 PDT 2011


There are several difficult issues connected with asynchronicity,
high performance networking and related things.
I had to deal with them while developing blip
( http://fawzi.github.com/blip ).
My goal with blip was to have a good basis for my program dchem; as
a consequence it is not particularly optimized for non-recursive
tasks, and it is D1, but I think that the issues are generally
relevant.

i/o and asynchronicity are very important aspects, and ones that
will tend to "pollute" many parts of the library and introduce
dependencies that are difficult to remove, thus those choices have
to be made carefully.

Overview:
========

Threads vs fibers:
-----------------------

* an issue not yet brought up is that threads wire some memory, and
so have an extra cost that fibers don't.
* the evaluation strategy of fibers can be chosen by the user; this
is relevant for recursive tasks where each task spawns other tasks.
Different strategies use very different amounts of resources:
breadth first evaluation (which is what one gets with threads) uses
a *lot* more than depth first, by having many more tasks
concurrently in evaluation.

Otherwise, the relevant points already brought forth by others are:

- a context switch between fibers (assuming their memory is still
active) is much faster
- context switches with fibers are chosen by the user (cooperative
multitasking); this allows one to choose the optimal point to
switch, but a "bad" fiber can ruin the response time of the others
(see the sketch after this list)
- D is not stackless (unlike Go, for example), so each fiber needs
enough space for its stack (something that is often not easy to
predict). This makes fibers still a bit costly if one really needs
a lot of them. 64 bit can help here, because hopefully the active
part is small and can be kept in RAM, even while using a rather
large virtual address space. Still, as Brad correctly said, for
heavily uniform handling of many tasks, manual management (and
using stateless functions as much as possible) can be much more
efficient.
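
As a concrete illustration of the cooperative switching mentioned
above, here is a minimal sketch using D2's core.thread.Fiber; the
fiber, not the caller, decides where it suspends:

    import core.thread;
    import std.stdio;

    void main() {
        auto f = new Fiber({
            writeln("fiber: step 1");
            Fiber.yield();        // suspend exactly here, by choice
            writeln("fiber: step 2");
        });
        f.call();                 // runs the fiber until the yield
        writeln("main: fiber is suspended");
        f.call();                 // resumes it after the yield
    }

The switch points being explicit is the whole advantage (and the
whole risk) of cooperative multitasking.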

Closures
------------
When possible, for the low level (often used) operations, delegates
over structs and manual memory handling for "closures" are a better
solution than the automatic closures, because one can avoid the
heap allocation connected with them.
This approach cannot be avoided in D1; D2 has the very useful
automatic closures, but at low level their cost should be avoided
when possible.
About using structs there are subtle issues that I think are
connected with compiler optimizations (I never really investigated
them; I always changed the code, or resorted to heap allocation).
The main issue is that the compiler wants to optimize as much as
possible, and to do so it normally assumes that the current thread
is the only user of the stack. If you pass stack stored structs to
other threads this assumption isn't true anymore, so the memory of
a stack allocated struct might be reused even before the function
returns (unless I am mistaken and the ABI forbids it, in this case
tell me).
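
Here is a minimal sketch contrasting the two approaches in D2 (the
names are just for illustration):

    import std.stdio;

    // automatic closure: 'count' escapes the function, so the
    // compiler heap-allocates its frame behind the scenes.
    int delegate() makeCounter() {
        int count = 0;
        return () { return ++count; };
    }

    // struct based "closure": the state lives wherever one puts
    // the struct, with no hidden allocation; the delegate's
    // context pointer is just the struct's address.
    struct Counter {
        int count;
        int next() { return ++count; }
    }

    void main() {
        auto heapCounter = makeCounter(); // hidden heap allocation
        Counter c;                        // explicit, stack stored
        auto dg = &c.next;                // delegate over the struct
        writeln(heapCounter(), " ", dg()); // prints: 1 1
    }

And, per the caveat above, 'dg' must not be handed to another
thread, since 'c' lives on this thread's stack.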

Async i/o
----------

* almost always i/o is much slower than the CPU, so an i/o
operation is bound to make the cpu wait, and one wants to use that
wait efficiently.
   - A very simple way is to just use blocking i/o, and have other
threads do other work.
   - async i/o allows overlapping several operations in a single
thread.
   - for files there is an even more efficient way: sharing the
buffer with the kernel (the aio_* calls).
   - an important issue is avoiding the waste of cpu cycles while
waiting; to achieve this one can collect several pending operations
and use a single thread to wait on all of them. select, poll and
epoll allow this, and increase the efficiency of several kinds of
programs (see the sketch after this list).
   - libev and libevent are cross platform libraries that help with
an event based approach: they take care of checking a large number
of events and calling a user defined callback when they happen, in
a robust, cross platform way.
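
A minimal sketch of the select approach, done through Phobos's
std.socket (assuming its current Socket.select / SocketSet API):

    import std.socket;

    // one thread waits on many sockets at once instead of
    // blocking on each of them separately.
    void serveAll(Socket[] socks) {
        auto readable = new SocketSet();
        for (;;) {
            readable.reset();
            foreach (s; socks)
                readable.add(s);
            // block until at least one socket has data
            if (Socket.select(readable, null, null) <= 0)
                continue;
            foreach (s; socks)
                if (readable.isSet(s)) {
                    ubyte[1024] buf;
                    auto got = s.receive(buf[]);
                    // ... handle got bytes / closed socket ...
                }
        }
    }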

locks, semaphores
------------
Locks and semaphores are the standard way to synchronize between
threads.
One has to be careful when mixing them with fiber scheduling, as
one can easily deadlock (see the sketch below).
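
A minimal sketch of the classic trap: two fibers on the same
thread, and a binary semaphore used as a lock (a semaphore, unlike
druntime's Mutex, is not recursive per thread, so the block is
real):

    import core.sync.semaphore;
    import core.thread;

    __gshared Semaphore lock;

    void fiberA() {
        lock.wait();     // acquire
        Fiber.yield();   // suspend while still holding the lock
        lock.notify();   // release
    }

    void fiberB() {
        lock.wait();     // blocks the *whole thread*: fiberA can
        lock.notify();   // never be resumed to release -> deadlock
    }

    void main() {
        lock = new Semaphore(1);   // binary semaphore as a lock
        auto a = new Fiber(&fiberA);
        auto b = new Fiber(&fiberB);
        a.call();        // a acquires and yields
        b.call();        // the thread now blocks inside b forever
    }

With druntime's Mutex it is subtler still: that mutex is recursive
per thread, so two fibers on one thread would both "acquire" it and
silently break mutual exclusion instead of deadlocking. A fiber
scheduler has to either forbid yielding while holding such locks,
or provide fiber aware primitives.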

Hardware information
-----------------------------
Efficient usage of computational resources also depends on being
able to identify the available hardware.
Don did quite some hacking to get useful information out of
cpuinfo, but if one is interested in more complex computers more
info would be nice.
I use hwloc for this purpose; it is cross platform and can be
embedded.
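
For reference, Don's work is available in druntime as core.cpuid; a
minimal sketch of the kind of information it reports (assuming the
current core.cpuid API):

    import core.cpuid;
    import std.stdio;

    void main() {
        writeln("processor:       ", processor);
        writeln("cores per CPU:   ", coresPerCPU);
        writeln("threads per CPU: ", threadsPerCPU);
        writeln("hyperthreading:  ", hyperThreading);
    }

For NUMA nodes, caches shared between cores and similar topology
details, hwloc goes much further.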

Possible solutions
==============

Async i/o can be presented as normal synchronous (blocking) i/o,
but this makes sense only if one has several objects waiting, or
uses fibers and executes other fibers while waiting.
How acceptable is it to rely on (and thus introduce a dependency
on) things like libev or hwloc?
For my purposes using them was ok, and they are cross platform and
embeddable, but is the same true for phobos?

Asynchronicity means being able to have work executed concurrently
and then resynchronize at a later point.
One can use processes (which also give memory protection), threads,
or fibers to achieve this.
If one uses just threads, then asynchronous i/o makes sense only
with fully manual (explicit) handling of it; hiding it away would
be equivalent to blocking i/o.
Fibers allow one to hide async i/o and make it look blocking, but
as Sean said there are issues with using fibers with D2 TLS.
I kind of dislike the use of TLS for non low level infrastructure
stuff, but around here it seems that is just me.

In blip I chose to go with fiber based switching.
I wrapped libev both at a low level and at a higher level, in such
a way that one can use it directly (for maximum performance).
For the sockets I use non blocking calls and a single "waiting"
(i/o) thread, but hide them so that they are used just like
blocking calls; in sketch form the pattern looks like this:
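
(waitForReadable below is a hypothetical helper standing in for
blip's actual registration machinery, not a real API)

    import core.thread;
    import std.socket;

    // hypothetical: register the socket and the fiber with the
    // single i/o thread, which resumes the fiber (fiber.call())
    // once the socket becomes readable.
    void waitForReadable(Socket s, Fiber waiter) { /* ... */ }

    // looks like a blocking receive to the caller, but never
    // blocks the thread: on "would block" it parks the fiber.
    ptrdiff_t fiberReceive(Socket s, void[] buf) {
        for (;;) {
            auto n = s.receive(buf);   // non blocking socket
            if (n != Socket.ERROR)
                return n;              // data, or 0 on close
            if (!wouldHaveBlocked())
                return n;              // a real error
            waitForReadable(s, Fiber.getThis());
            Fiber.yield();             // resumed when ready
        }
    }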

An important design decision when using fibers is whether one
should be able to have a "naked" thread, or hide the fiber
scheduling in every thread.
In blip I went for allowing naked threads, because blip is realized
entirely as a normal library, but that gives some ugly corner cases
when one uses a method that wants to suspend a thread that doesn't
have a scheduling place.
Building the scheduling into all threads is probably cleaner if one
goes with fibers.
The problem of TLS and fibers remains though, especially if one
allows the migration of fibers from one thread to another (as I do
in blip).

An important design choice in blip was being able to cope with
recursive parallelism (typical of computational tasks), not just
with the concurrent parallelism that is typical of servers.
I feel that it is important, but it is something that might not be
seen as such by others.
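
A sketch of what recursive parallelism looks like, written against
the API of David Simcha's parallelism module proposed for phobos
(task, taskPool.put and yieldForce are its names, assumed here; the
cutoff value is arbitrary):

    import std.parallelism;

    // each call spawns a subtask for one half and keeps working
    // depth first on the other half itself, then joins.
    ulong fib(uint n) {
        if (n < 25)                     // serial cutoff: tiny tasks
            return n < 2 ? n : fib(n - 1) + fib(n - 2);
        auto left = task!fib(n - 1);    // subtask, may be stolen
        taskPool.put(left);
        auto right = fib(n - 2);        // keep going ourselves
        return left.yieldForce + right; // join with the subtask
    }

A server style library that only knows "one task per connection"
handles this kind of load badly; the evaluation strategy point in
the fibers section above is exactly about this case.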

To do
====
Now, about async i/o, the first step is surely to expose an
asynchronous API. This doesn't influence, or depend on, other parts
of the library much.
An important decision is if, and which, external libraries one can
rely on.

Making the async API nicer to use, or even using it "behind the
scenes" as I do in blip, needs more complex choices on the basic
handling of suspension and synchronization.
Something like that is bound to be used in several parts of phobos,
so a careful choice is needed.

These parts are also partially connected with high performance
networking (another GSoC project).

Fawzi


