Curl wrapper

Thu May 19 08:08:11 PDT 2011

On 19-05-2011 00:54, Andrei Alexandrescu wrote:
> On 5/18/11 5:29 PM, jdrewsen wrote:
>> On 18-05-2011 16:53, Andrei Alexandrescu wrote:
>>> On 5/18/11 6:07 AM, Jonas Drewsen wrote:
>>>> Select will wait for data to be ready and ask curl to handle the data
>>>> chunk. Curl in turn calls back to a registered callback handler with
>>>> the data read. That handler fills the buffer provided by the user. If
>>>> not enough data has been received, a new select is performed until the
>>>> requested amount of data is read. Then the blocking method can return.
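The select-until-full loop described above can be sketched roughly as
follows. This is only an illustration under simplifying assumptions: a
plain socket stands in for the curl handle, recv_into plays the role of
the registered callback that fills the user's buffer, and the name
read_exactly is hypothetical, not part of the wrapper.

```python
import select
import socket


def read_exactly(sock: socket.socket, n: int) -> bytes:
    """Block until n bytes have arrived on sock, or the peer closes early."""
    buf = bytearray(n)            # the buffer "provided by the user"
    view = memoryview(buf)
    filled = 0
    while filled < n:
        # Wait until the socket is readable (the "select" step).
        select.select([sock], [], [])
        # Copy whatever has arrived into the unfilled part of the buffer
        # (stands in for curl invoking the receive callback).
        got = sock.recv_into(view[filled:])
        if got == 0:              # connection closed before n bytes arrived
            break
        filled += got
    return bytes(view[:filled])
```

Note that the caller does nothing else while this loop waits, which is
the blocking behaviour discussed further down in the thread.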
>>>
>>> Perhaps this would be too complicated. In any case, the core
>>> functionality must get top attention. And the core functionality is
>>> streaming.
>>>
>>> Currently there are two proposed ways to stream data from an HTTP
>>> address: (a) by using the onReceive callback, and (b) by using
>>> byLine/byChunk. If either of these performs slower than
>>> best-of-breed streaming using libcurl, we have failed.
>>>
>>> The onReceive method is not particularly appealing because the client
>>> and libcurl block each other: the client is blocked while libcurl is
>>> waiting for data, and the client blocks libcurl while inside the
>>> callback. (Please correct me if I'm wrong.)
>>>
>>> To make byLine/byChunk fast, the basic setup should include a hidden
>>> thread that does the download in separation from the client's thread.
>>> There should be K buffers allocated (K = 2 to e.g. 10), and a simple
>>> protocol for passing the buffers back and forth between the client
>>> thread and the hidden thread. That way, in the quiescent state, there is
>>> no memory allocation and either both client and libcurl are busy doing
>>> work, or one is much slower than the other, which waits.
>>>
>>> The same mechanism should be used in byChunkAsync or byFileAsync.
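A minimal sketch of the K-buffer hand-off described above, written in
Python rather than D purely for illustration. Everything named here
(by_chunk, download, K, CHUNK, the two queues) is an assumption and not
the wrapper's API; source.readinto stands in for the libcurl transfer.

```python
import io
import queue
import threading

K = 4              # number of recycled buffers (assumed; the post suggests 2 to ~10)
CHUNK = 16 * 1024  # bytes per buffer (assumed)


def download(source, free, filled):
    """Hidden thread: fill free buffers from source and pass them to the client."""
    while True:
        buf = free.get()           # waits while the client holds every buffer
        n = source.readinto(buf)   # stands in for the libcurl transfer
        if not n:
            filled.put(None)       # end-of-stream marker
            return
        filled.put((buf, n))


def by_chunk(source, k=K, chunk=CHUNK):
    """Yield views of downloaded data; each view is valid only until the next step."""
    free = queue.Queue()
    filled = queue.Queue()
    for _ in range(k):
        free.put(bytearray(chunk))  # all allocation happens up front
    threading.Thread(target=download, args=(source, free, filled),
                     daemon=True).start()
    while True:
        item = filled.get()         # waits while the download is the slow side
        if item is None:
            return
        buf, n = item
        yield memoryview(buf)[:n]
        free.put(buf)               # hand the buffer back for reuse


if __name__ == "__main__":
    total = sum(len(c) for c in by_chunk(io.BytesIO(b"x" * 100_000)))
    print(total)
```

After the initial k allocations no further memory is allocated, and
whichever side is faster simply waits on its queue for the other.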
>>
>> If byChunk is using a hidden thread to download into buffers, then how
>> does it differ from the byChunkAsync that you mention?
>
> Sorry, byChunkAsync and byLineAsync (which I wrongly denoted as
> byFileAsync) would be methods of File.
>
>> The current curl wrapper actually does the hidden thread trick (based on
>> a hint you gave me a while ago). It does not reuse buffers because I
>> thought that all data had to be immutable or by value to go through the
>> message passing system. I'll fix this since it is a good place to do
>> some type casting to allow passing the buffers for reuse.
>
> Great, thanks. Don't forget there's great motivation to do so.
>
>> I think that we have to consider the context of the streaming before we
>> can tell the best solution. I do not have any numbers to back the
>> following up, but this is how I see it:
>>
>> If data that is read is going to be processed (e.g. compressed) in some
>> way it is most likely a benefit to spawn a thread to handle the data
>> buffering.
>>
>> If no processing is done (e.g. a simple copy from net to disk) I believe
>> keeping things in the same thread and simply selecting on sockets (disk
>> or net) is fastest.
>
> Not at all. If operating with the network and operating with the disk
> block each other, you're guaranteed to be slower than the slowest of them.
>
> Consider that disk speed is V1 MB/s and network speed is V2 MB/s, and
> that they're independent of each other. If you do one thing at a time,
> you need to take 1/V1 + 1/V2 seconds to transfer one MB. The speed of
> the process is therefore 1/(1/V1 + 1/V2) = V1 * V2 / (V1 + V2).
>
> If the two devices have comparable speeds, you're halving the speed. As
> soon as you do buffering with two threads you can easily reach close to
> the minimum of the two speeds, which is the theoretical best.
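To put numbers on the formula quoted above, here is a small sketch. The
function names and the sample speeds are made up for illustration; only
the two expressions come from the text.

```python
def sequential_throughput(v1: float, v2: float) -> float:
    """MB/s when each MB is handled by one device, then the other, in turn."""
    # 1 MB costs 1/v1 + 1/v2 seconds, so the rate is 1 / (1/v1 + 1/v2).
    return (v1 * v2) / (v1 + v2)


def overlapped_throughput(v1: float, v2: float) -> float:
    """Upper bound in MB/s when both devices work at the same time."""
    return min(v1, v2)


if __name__ == "__main__":
    # Sample speeds in MB/s (assumed values, not measurements).
    for v1, v2 in [(100.0, 100.0), (100.0, 10.0)]:
        print(v1, v2,
              sequential_throughput(v1, v2),
              overlapped_throughput(v1, v2))
```

With two 100 MB/s devices the one-at-a-time rate is 50 MB/s against a
bound of 100 MB/s. When one device is much slower than the other, the
two figures are close and overlapping gains little.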

I see your point. By buffering data asynchronously, the reads and writes
don't block each other, and this increases performance. The thing is that
the OS already does buffering for us. So while we're writing data to
disk, the OS is buffering incoming data from the network asynchronously.

>> In my opinion, providing the performance of the standard
>> libcurl API in the D wrapper is the way to go (as done in the current
>> curl wrapper). Generic and efficient streaming across protocols is best
>> done in std.net where buffers can be handled entirely in D. I know this
>> is not a small task which is why I started out with wrapping libcurl.
>
> Sounds reasonable, although if you take care of recycling the buffers in
> the implementation you have, your wrapper may as well be the best of breed.
>
>
> Andrei