A very interesting slide deck comparing sync and async IO
Sönke Ludwig via Digitalmars-d
digitalmars-d at puremagic.com
Sat Mar 5 00:55:14 PST 2016
On 03.03.2016 at 18:31, Andrei Alexandrescu wrote:
> https://www.mailinator.com/tymaPaulMultithreaded.pdf
>
> Andrei
A few points that come to mind:
- Comparing arbitrary high-level libraries is bound to give results that
mainly measure abstraction overhead and non-optimal system API use.
Comparing on a JVM instead of bare metal might skew the results further
(e.g. some JIT optimizations not kicking in due to the use of callbacks,
or something like that). It would be interesting to redo the benchmark
in C/D using plain system APIs.
- Comparing single-threaded NBIO to multi-threaded BIO is obviously wrong
when measuring peak performance. NBIO should use a pool of one thread per
core, each running an event/select loop, or alternatively one process per
core. The "Make better use of multi-cores" pro-BIO argument is pointless
for the same reason.
- There are no hints about how the benchmark was performed (e.g.
send()/recv() chunk size). For anything other than tiny packets, NBIO
for sure is not measurably slower than BIO. Latency may be a bit worse,
but that reverses once many connections come into play.
- The "simpler to write" argument also breaks down when adding fibers to
the mix.
- The main argument for NBIO is that threads are relatively heavy system
resources: context switches are rather expensive, and the number of
threads is limited (irrespective of the amount of RAM). Depending on the
kernel, the scheduler overhead may also grow with the number of threads.
For small numbers of connections, BIO for sure is perfectly fine, as long
as synchronization overhead isn't an issue.
- AIO/NBIO+fibers also allows further reducing the memory footprint by
detaching the connection from its fiber between requests (e.g. for a
keep-alive HTTP connection). This isn't possible with blocking IO.
- The optimal approach always depends on the system being modelled;
NBIO+fibers simply gives the maximum flexibility in that regard. You can
let fibers run in isolation on different threads with synchronization
between them, or you can have concurrency without CPU-level
synchronization overhead within a single thread. Especially the latter
can become really interesting with thread-local memory allocators etc.
It also becomes really interesting in situations where
thread synchronization gets difficult (lock-free structures).