Streaming transport interfaces: input

Thu Oct 14 11:22:07 PDT 2010

On 10/14/10 12:56 CDT, Denis Koroskin wrote:
> appendDelim *requires* buffering to be implemented. No OS provides
> an API to read from a file (be it pipe, socket, whatever) up to
> some abstract delimiter. It *always* reads in blocks.

Clear. What may not be so clear is that read(ubyte[] buf) ALSO requires
buffering. Disk I/O comes in fixed buffer sizes (sometimes aligned at 
512 bytes or whatever), so ANY protocol that allows the user to set the 
maximum bytes to read will require buffering and copying. So how is 
appendDelim worse than read?
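
To make the point concrete, here is a minimal sketch in Python (not D, and not any real library's code). It assumes a device that hands out only whole 512-byte blocks. BlockSource, BlockBufferedInput, and BLOCK are hypothetical names invented for the illustration. Serving a caller who asks for 100 bytes forces the reader to park the rest of the block somewhere and copy out of it, which is exactly buffering.

```python
import io

BLOCK = 512  # assumed device block size, for illustration only


class BlockSource:
    """Hypothetical source that can only deliver whole BLOCK-sized blocks."""

    def __init__(self, data: bytes):
        self._raw = io.BytesIO(data)

    def read_block(self) -> bytes:
        # Returns a short (or empty) block only at end of data.
        return self._raw.read(BLOCK)


class BlockBufferedInput:
    """read(buf) for any len(buf), layered over whole-block reads."""

    def __init__(self, source: BlockSource):
        self._source = source
        self._block = b""  # internal buffer: the block most recently fetched
        self._pos = 0      # how much of that block was already handed out

    def read(self, buf: bytearray) -> int:
        """Fill buf with up to len(buf) bytes; return the count (0 at end)."""
        if self._pos == len(self._block):
            self._block = self._source.read_block()
            self._pos = 0
        n = min(len(buf), len(self._block) - self._pos)
        # The unavoidable copy: internal block -> caller's buffer.
        buf[:n] = memoryview(self._block)[self._pos:self._pos + n]
        self._pos += n
        return n
```

With 1024 bytes of input and a 100-byte user buffer, successive calls return 100 five times and then 12: the tail of the first block, served from the internal buffer.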

> As such, if you
> need to read until a delimiter, you need to fetch a block to some internal
> buffer, MANUALLY search through it and THEN copy to output string.

And there's no way for the client to efficiently do that.

> I've
> implemented that on top of chunked read interface, and it was 5% faster
> than getline()/getdelim() that GNU libc provides (despite you claiming it
> to be "many times faster"). It's not.

Please post your code.

> Buffering requires an additional level of data copying, and this is bad
> for fast I/O.

Agreed. But then you define routines that also require buffering. How
do you reconcile your own requirement with your own interface?

> If you need fast I/O you must pull that out of the stream
> interface. Otherwise chunked read will be less efficient due to
> additional copies to and from buffers.
>
> On the contrary line-based reading can be implemented on top of the
> chunked read without sacrificing a tiny bit of efficiency.

Except for extra copying.

appendDelim implementation:

1. Low-level read in internal buffers

2. Search for delimiter (assume found for simplicity)

3. Resize user buffer

4. Copy

That's one copy, with the necessary corner cases when the delimiter 
isn't found yet etc. (which increase copying ONLY if the buffer is 
actually moved when reallocated).
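
The four steps above can be sketched as follows. This is a Python model under stated assumptions, not the actual appendDelim: DelimReader, append_delim, and the copied counter are hypothetical names, and the delimiter is restricted to a single byte so that it can never straddle two blocks.

```python
import io


class DelimReader:
    """Hypothetical buffered input with an appendDelim-style operation."""

    def __init__(self, raw, block_size=4096):
        self._raw = raw
        self._block_size = block_size
        self._buf = b""  # internal buffer
        self._pos = 0    # first unconsumed byte in the internal buffer
        self.copied = 0  # bytes copied into user buffers (for illustration)

    def append_delim(self, out: bytearray, delim: bytes) -> int:
        """Append bytes up to and including delim to out.

        Returns the number of bytes appended (0 at end of input).
        """
        assert len(delim) == 1, "sketch handles single-byte delimiters only"
        appended = 0
        while True:
            if self._pos == len(self._buf):
                # 1. Low-level read into the internal buffer.
                self._buf = self._raw.read(self._block_size)
                self._pos = 0
                if not self._buf:
                    return appended  # end of input
            # 2. Search for the delimiter in the internal buffer.
            i = self._buf.find(delim, self._pos)
            end = len(self._buf) if i < 0 else i + 1
            # 3 and 4. Grow the user buffer and copy straight into it;
            # the memoryview avoids a temporary slice copy.
            out += memoryview(self._buf)[self._pos:end]
            n = end - self._pos
            appended += n
            self.copied += n
            self._pos = end
            if i >= 0:
                return appended
```

Each byte moves from the internal buffer to the user's buffer exactly once, so after draining the input `copied` equals the input length. Growing `out` may relocate it, which is the reallocation corner case mentioned above.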

The implementation in your message on 10/13/2010 21:20 CDT:

1. Low-level read in internal buffers

2. Copy from internal buffers into the internal buffer provided by your 
ByLine implementation

3. Copy from the internal buffer of ByLine into the user-supplied buffer

That's two copies. Agreed?
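
For anyone who wants to count rather than argue, below is a small Python model of the layered design. It is a sketch under stated assumptions, not the code from the message referenced above: ChunkedInput, ByLine, read_line, and the copied counters are hypothetical names, and the model only tallies how many bytes each layer copies.

```python
import io


class ChunkedInput:
    """Hypothetical stream exposing only read(buf); counts bytes copied out."""

    def __init__(self, data: bytes, block_size=4096):
        self._raw = io.BytesIO(data)
        self._block_size = block_size
        self._buf = b""
        self._pos = 0
        self.copied = 0

    def read(self, buf: bytearray) -> int:
        if self._pos == len(self._buf):
            # 1. Low-level read into the stream's internal buffer.
            self._buf = self._raw.read(self._block_size)
            self._pos = 0
        n = min(len(buf), len(self._buf) - self._pos)
        # 2. Copy from the stream's buffer into the caller's (ByLine's) buffer.
        buf[:n] = memoryview(self._buf)[self._pos:self._pos + n]
        self._pos += n
        self.copied += n
        return n


class ByLine:
    """Hypothetical line reader built on top of ChunkedInput.read."""

    def __init__(self, stream: ChunkedInput, chunk_size=64):
        self._stream = stream
        self._chunk = bytearray(chunk_size)  # ByLine's own internal buffer
        self._len = 0
        self._pos = 0
        self.copied = 0

    def read_line(self, out: bytearray) -> int:
        """Append one line (including the newline) to out; 0 at end of input."""
        appended = 0
        while True:
            if self._pos == self._len:
                self._len = self._stream.read(self._chunk)
                self._pos = 0
                if self._len == 0:
                    return appended
            i = self._chunk.find(b"\n", self._pos, self._len)
            end = self._len if i < 0 else i + 1
            # 3. Copy from ByLine's buffer into the user-supplied buffer.
            out += memoryview(self._chunk)[self._pos:end]
            n = end - self._pos
            appended += n
            self.copied += n
            self._pos = end
            if i >= 0:
                return appended
```

After reading a whole input through ByLine, both `stream.copied` and `lines.copied` equal the input length: every byte was copied twice on its way to the user.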

Andrei