Streaming library

Denis Koroskin 2korden at gmail.com
Wed Oct 13 12:02:48 PDT 2010


On Wed, 13 Oct 2010 20:55:04 +0400, Andrei Alexandrescu  
<SeeWebsiteForEmail at erdani.org> wrote:

> On 10/13/10 11:16 CDT, Denis Koroskin wrote:
>> On Wed, 13 Oct 2010 18:32:15 +0400, Andrei Alexandrescu
>>> So far so good. I will point out, however, that the classic read/write
>>> routines are not all that good. For example if you want to implement a
>>> line-buffered stream on top of a block-buffered stream you'll be
>>> forced to write inefficient code.
>>>
>>
>> Never heard of filesystems that allow reading files in lines - they
>> always read in blocks, and that's what streams should do.
>
> http://www.gnu.org/s/libc/manual/html_node/Buffering-Concepts.html
>
> I don't think streams must mimic the low-level OS I/O interface.
>

I, in contrast, think that streams should be the lowest-level platform-independent abstraction possible: no buffering beyond what the OS provides, and no additional functionality. If you need to read up to some character (and besides, what should be considered a new-line separator: \r, \n, or \r\n?), this should be done manually in "byLine".

>> That's because
>> most of the streams are binary streams, and there is no such thing as a
>> "line" in them (e.g. how often do you need to read a line from a
>> SocketStream?).
>
> http://www.opengroup.org/onlinepubs/009695399/functions/isatty.html
>

These are special cases I don't like. There is no such thing on Windows  
anyway.

> You need a line when e.g. you parse an HTML header or an email header or  
> an FTP response. Again, if at a low level the transfer occurs in blocks,  
> that doesn't mean the API must do the same at all levels.
>

BSD sockets transmit in blocks. If you need to find a special sequence in  
a socket stream, you are forced to fetch a chunk and search for the  
needed sequence manually. My position is that you should do it with an  
external predicate (e.g. read until whitespace).

>> I don't think streams should buffer anything either (what an underlying
>> OS I/O API caches should suffice), buffered stream adapters can do that
>> in a stream-independent way (why duplicate code when you can do that as
>> efficiently with external methods?).
>
> Most OS primitives don't give access to their own internal buffers.  
> Instead, they ask user code to provide a buffer and transfer data into  
> it.

Right. This is why a Stream need not cache.

> So clearly buffering on the client side is a must.
>

I don't see how that follows from the above.

>> Besides, as you noted, the buffering is redundant for byChunk/byLine
>> adapter ranges. It means that byChunk/byLine should operate on
>> unbuffered streams.
>
> Chunks keep their own buffer so indeed they could operate on streams  
> that don't do additional buffering. The story with lines is a fair  
> amount more complicated if it needs to be done efficiently.
>

Yes. But line reading is a case that I see no need to handle specially.

>> I'll explain my I/O streams implementation below in case you didn't read
>> my message (I've changed some stuff a little since then).
>
> Honest, I opened it to remember to read it but somehow your fonts are  
> small and make my eyes hurt.
>
>> My Stream
>> interface is very simple:
>>
>> // A generic stream
>> interface Stream
>> {
>>     @property InputStream input();
>>     @property OutputStream output();
>>     @property SeekableStream seekable();
>>     @property bool endOfStream();
>>     void close();
>> }
>>
>> You may ask, why separate Input and Output streams?
>
> I think my first question is: why doesn't Stream inherit InputStream and  
> OutputStream? My hypothesis: you want to sometimes return null. Nice.
>

Right.

>> Well, that's because
>> you either read from them, write from them, or both.
>> Some streams are read-only (think Stdin), some write-only (Stdout), some
>> support both, like FileStream. Right?
>
> Sounds good. But then where's flush()? Must be in OutputStream.
>

That's probably because unbuffered streams don't need it.

>> Not exactly. Does FileStream support writing when you open file for
>> reading? Does it support reading when you open for writing?
>> So, you may or may not read from a generic stream, and you also may or
>> may not write to a generic stream. With a design like that you can make
>> a mistake: if a stream isn't readable, you have no reference to invoke
>> read() method on.
>
> That is indeed pretty nifty. I hope you would allow us to copy that  
> feature in Phobos (unless you are considering submitting your library  
> wholesale). Let me know.
>

I'd love to contribute both the design and the implementation.

>> Similarly, a stream is either seekable, or not. SeekableStreams allow
>> stream cursor manipulation:
>>
>> interface SeekableStream : Stream
>> {
>>     long getPosition(Anchor whence = Anchor.begin);
>>     void setPosition(long position, Anchor whence = Anchor.begin);
>> }
>
> Makes sense. Why is getPosition signed? Why do you need an anchor for  
> getPosition?
>

long was chosen to be consistent with setPosition. Also, getPosition may  
return a negative value:

long pos = getPosition(Anchor.end); // how far is it till the end of the file?

This is also how you can get the file size (you need to negate the  
result, though). It is consistent with setPosition:

setPosition(getPosition(anchor), anchor); // a no-op for any kind of anchor

I just thought, why not? I'm okay with dropping it, but I find it nice.

>> InputStream doesn't really have many methods:
>>
>> interface InputStream
>> {
>>     // reads up to buffer.length bytes from a stream
>>     // returns number of bytes read
>>     // throws on error
>>     size_t read(ubyte[] buffer);
>
> That makes implementation of line buffering inefficient :o).
>

There is no way to do it more efficiently on Windows. Fetch a chunk;  
search for a line end; found ? return : continue.

>>     // reads from current position
>>     AsyncReadRequest readAsync(ubyte[] buffer, Mailbox* mailbox = null);
>> }
>
> Why doesn't Sean's concurrency API scale for your needs? Can that be  
> fixed? Would you consider submitting some informed bug reports?
>

It's a design issue rather than a bug per se. I'll write a separate  
letter on that.

>> So is OutputStream:
>>
>> interface OutputStream
>> {
>>     // returns number of bytes written
>>     // throws on error
>>     size_t write(const(ubyte)[] buffer);
>>
>>     // writes from current position
>>     AsyncWriteRequest writeAsync(const(ubyte)[] buffer, Mailbox* mailbox = null);
>> }
>>
>> They basically support only reading and writing in blocks, nothing else.
>
> I'm surprised there's no flush().
>

No buffering, no flush().

>> However, they support asynchronous reads/writes, too (think of mailbox
>> as a std.concurrency's Tid).
>>
>> Unlike Daniel's proposal, my design reads up to buffer size bytes for
>> two reasons:
>> - it avoids potential buffering and multiple sys calls
>
> But there's a problem. It's very rare that the user knows what a good  
> buffer size is. And often there are size and alignment restrictions at  
> the low level.

I agree, but the user can guess, or the library can give a hint. E.g.  
BUFFER_SIZE is a good buffer size to start with :)

> So somewhere there is still buffering going on, and also there are  
> potential inefficiencies (if a user reads small buffers).
>
>> - it is the only way to go with SocketStreams. I mean, you often don't
>> know how many bytes an incoming socket message contains. You either have
>> to read it byte-by-byte, or your application might stall for potentially
>> infinite time (if message was shorter than your buffer, and no more
>> messages are being sent)
>
> But if you don't know how many bytes are in an incoming socket message,  
> a better design is to do this:
>
> void read(ref ubyte[] buffer);
>

That could work, too.

> and resize the buffer to accommodate the incoming packet. Your design  
> _imposes_ that the socket does additional buffering.
>

The socket API does it anyway. I just don't complicate things even  
further by providing an additional layer of buffering.

>> Why do my streams provide async methods? Because it's the modern
>> approach to I/O - blocking I/O (aka one thread per client) doesn't
>> scale. E.g. Java adds a second revision of its async I/O API in JDK7
>> (called NIO2; the first, NIO, appeared in February 2002), and C# has had
>> asynchronous operations as part of its Stream interface since .NET 1.1
>> (April 2003).
>
> Async I/O is nice, no two ways about that. I have on my list to define  
> byChunkAsync that works exactly like byChunk from the client's  
> perspective, except it does I/O concurrently with client code.
>
> [snip]
>> I strongly believe we shouldn't ignore this type of API.
>>
>> P.S. For threads this deep it's better to fork a new one, especially when
>> changing the subject.
>
> I thought I did by changing the title...
>
>
> Andrei

No, changing the title isn't enough.


More information about the Digitalmars-d mailing list