Curl wrapper

Wed May 18 15:29:16 PDT 2011

On 18-05-2011 16:53, Andrei Alexandrescu wrote:
> On 5/18/11 6:07 AM, Jonas Drewsen wrote:
>> Select will wait for data to be ready and ask curl to handle the data
>> chunk. Curl in turn calls back to a registered callback handler with the
>> data read. That handler fills the buffer provided by the user. If not
>> enough data has been received, a new select is performed until the
>> requested amount of data is read. Then the blocking method can return.
>
> Perhaps this would be too complicated. In any case the core
> functionality must be paid top attention. And the core functionality is
> streaming.
>
> Currently there are two proposed ways to stream data from an HTTP
> address: (a) by using the onReceive callback, and (b) by using
> byLine/byChunk. If either of these performs slower than the
> best-of-the-breed streaming using libcurl, we have failed.
>
> The onReceive method is not particularly appealing because the client
> and libcurl block each other: the client is blocked while libcurl is
> waiting for data, and the client blocks libcurl while inside the
> callback. (Please correct me if I'm wrong.)
>
> To make byLine/byChunk fast, the basic setup should include a hidden
> thread that does the download in separation from the client's thread.
> There should be K buffers allocated (K = 2 to e.g. 10), and a simple
> protocol for passing the buffers back and forth between the client
> thread and the hidden thread. That way, in the quiescent state, there is
> no memory allocation and either both client and libcurl are busy doing
> work, or one is much slower than the other, which waits.
>
> The same mechanism should be used in byChunkAsync or byFileAsync.

If byChunk is using a hidden thread to download into buffers, then how 
does it differ from the byChunkAsync that you mention?

The current curl wrapper actually does the hidden thread trick (based on 
a hint you gave me a while ago). It does not reuse buffers because I 
thought that all data had to be immutable or by value to go through the 
message passing system. I'll fix this since it is a good place to do 
some type casting to allow passing the buffers for reuse.
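
The buffer passing discussed above (K preallocated buffers handed back
and forth between the client thread and the hidden download thread) can
be sketched roughly as follows. This is only an illustration in Python,
not the D wrapper's code: `by_chunk`, `source` and the two queues are
hypothetical names, and an in-memory stream stands in for the libcurl
transfer.

```python
import io
import queue
import threading


def by_chunk(source, chunk_size=4096, k=4):
    """Yield chunks read from `source` by a hidden thread.

    `source` is any object with readinto(); it is an assumed stand-in
    for the libcurl transfer. The k buffers are allocated once up front
    and then recycled, so the steady state allocates no new buffers.
    """
    free = queue.Queue()    # empty buffers: client -> download thread
    filled = queue.Queue()  # (buffer, length): download thread -> client
    for _ in range(k):
        free.put(bytearray(chunk_size))

    def download():
        while True:
            buf = free.get()          # waits if the client is the slow side
            n = source.readinto(buf)
            filled.put((buf, n or 0))
            if not n:                 # length 0 marks end of stream
                return

    threading.Thread(target=download, daemon=True).start()

    while True:
        buf, n = filled.get()         # waits if the download is the slow side
        if n == 0:
            return
        yield memoryview(buf)[:n]     # valid only until the next chunk is requested
        free.put(buf)                 # hand the buffer back for reuse


if __name__ == "__main__":
    stream = io.BytesIO(b"abc" * 10000)
    total = sum(len(chunk) for chunk in by_chunk(stream, chunk_size=1024, k=3))
    print(total)
```

Each yielded chunk is only valid until the next one is requested, since
its buffer then goes back to the download thread; a client that needs to
keep the data has to copy it.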

I think that we have to consider the context of the streaming before we
can tell which solution is best. I do not have any numbers to back the
following up, but this is how I see it:

If the data that is read is going to be processed (e.g. compressed) in
some way, it is most likely a benefit to spawn a thread to handle the
data buffering.

If no processing is done (e.g. a simple copy from net to disk), I believe
keeping things in the same thread and simply selecting on the sockets
(disk or net) is fastest. This way no message passing or context
switching takes place, so neither adds any overhead. libcurl can give you
access to the file descriptors for this exact purpose, but it does have
some drawbacks: you are not in control of the buffers used by libcurl.
This means that when reading from one curl connection and sending on
another, you would have to copy the data. libcurl does in fact provide
even simpler methods where you can provide your own buffers for
reads/writes. Unfortunately this is only supported for HTTP, and a lot of
the convenience features such as redirections are lost. The more you want
to control to get the last drop of performance, the more you have to
handle manually yourself.
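
As a rough sketch of the same-thread approach, assuming a plain socket
rather than a libcurl handle: wait with select, read into one
caller-owned buffer, and write it straight out. The function name and
parameters are hypothetical, and nothing here uses the libcurl API.

```python
import select
import socket


def copy_socket_to_file(sock, out, bufsize=16384, timeout=5.0):
    """Copy everything from `sock` to the file-like `out` in one thread.

    A single buffer is allocated once and reused for every read, so
    there is no message passing and no per-chunk allocation.
    Returns the number of bytes copied.
    """
    buf = bytearray(bufsize)
    view = memoryview(buf)
    total = 0
    while True:
        # Block until the socket has data (or the peer has closed).
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            raise TimeoutError("no data within timeout")
        n = sock.recv_into(view)
        if n == 0:                # peer closed the connection
            return total
        out.write(view[:n])       # write only the bytes just received
        total += n
```

The trade-off is the one described above: the copy loop is simple and
has no threading overhead, but everything a higher-level API would do
(redirects, protocol handling) is left to the caller.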

In my opinion, providing the performance of the standard libcurl API in
the D wrapper is the way to go (as done in the current curl wrapper).
Generic and efficient streaming across protocols is best done in std.net,
where buffers can be handled entirely in D. I know this is not a small
task, which is why I started out with wrapping libcurl.

Thanks
Jonas