Streaming library

Wed Oct 13 09:55:04 PDT 2010

On 10/13/10 11:16 CDT, Denis Koroskin wrote:
> On Wed, 13 Oct 2010 18:32:15 +0400, Andrei Alexandrescu
>> So far so good. I will point out, however, that the classic read/write
>> routines are not all that good. For example if you want to implement a
>> line-buffered stream on top of a block-buffered stream you'll be
>> forced to write inefficient code.
>>
>
> Never heard of filesystems that allow reading files in lines - they
> always read in blocks, and that's what streams should do.

http://www.gnu.org/s/libc/manual/html_node/Buffering-Concepts.html

I don't think streams must mimic the low-level OS I/O interface.

> That's because
> most of the streams are binary streams, and there is no such thing as a
> "line" in them (e.g. how often do you need to read a line from a
> SocketStream?).

http://www.opengroup.org/onlinepubs/009695399/functions/isatty.html

You need a line when e.g. you parse an HTTP header or an email header or
an FTP response. Again, if at a low level the transfer occurs in blocks, 
that doesn't mean the API must do the same at all levels.

> I don't think streams should buffer anything either (what an underlying
> OS I/O API caches should suffice), buffered streams adapters can do that
> in a stream-independent way (why duplicate code when you can do that as
> efficiently with external methods?).

Most OS primitives don't give access to their own internal buffers. 
Instead, they ask user code to provide a buffer and transfer data into 
it. So clearly buffering on the client side is a must.

> Besides, as you noted, the buffering is redundant for byChunk/byLine
> adapter ranges. It means that byChunk/byLine should operate on
> unbuffered streams.

Chunks keep their own buffer so indeed they could operate on streams 
that don't do additional buffering. The story with lines is a fair 
amount more complicated if it needs to be done efficiently.
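
To make the line case concrete, here is a rough sketch. The names 
BlockInput, MemoryInput, and LineReader are made up for illustration; 
they are not from your library or from Phobos. A line reader layered on 
a read(ubyte[]) primitive has to keep a second buffer, because a block 
almost never ends on a line boundary and the leftover bytes cannot be 
pushed back:

```d
// Hypothetical minimal interface mirroring the read() quoted in this thread.
interface BlockInput
{
    size_t read(ubyte[] buffer); // returns 0 at end of stream
}

// Hypothetical in-memory source, here only so the sketch can be exercised.
final class MemoryInput : BlockInput
{
    private const(ubyte)[] data;
    this(const(ubyte)[] data) { this.data = data; }

    size_t read(ubyte[] buffer)
    {
        immutable n = buffer.length < data.length ? buffer.length : data.length;
        buffer[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

// Line reader built on read(): every byte is copied once into `block`
// and a second time into `pending`.
struct LineReader
{
    private BlockInput src;
    private ubyte[] pending; // bytes read from src but not yet returned
    private ubyte[] block;   // scratch space handed to read()

    this(BlockInput src, size_t blockSize = 4096)
    {
        this.src = src;
        block = new ubyte[blockSize];
    }

    // Returns false at end of stream; otherwise stores the next line
    // (without the trailing '\n') in `line`.
    bool readLine(out string line)
    {
        for (;;)
        {
            // Naive: rescans `pending` from the start after every refill.
            foreach (i, b; pending)
            {
                if (b == '\n')
                {
                    line = cast(string) pending[0 .. i].idup;
                    pending = pending[i + 1 .. $];
                    return true;
                }
            }
            immutable n = src.read(block);
            if (n == 0)
            {
                if (pending.length == 0) return false;
                line = cast(string) pending.idup; // last line, no newline
                pending = null;
                return true;
            }
            pending ~= block[0 .. n]; // the extra copy
        }
    }
}
```

If the underlying stream already buffers internally, that makes three 
copies of the same data, which is the inefficiency I am complaining 
about.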

> I'll explain my I/O streams implementation below in case you didn't read
> my message (I've changed some stuff a little since then).

Honest, I opened it to remember to read it but somehow your fonts are 
small and make my eyes hurt.

> My Stream
> interface is very simple:
>
> // A generic stream
> interface Stream
> {
> @property InputStream input();
> @property OutputStream output();
> @property SeekableStream seekable();
> @property bool endOfStream();
> void close();
> }
>
> You may ask, why separate Input and Output streams?

I think my first question is: why doesn't Stream inherit InputStream and 
OutputStream? My hypothesis: you want to sometimes return null. Nice.

> Well, that's because
> you either read from them, write to them, or both.
> Some streams are read-only (think Stdin), some write-only (Stdout), some
> support both, like FileStream. Right?

Sounds good. But then where's flush()? Must be in OutputStream.

> Not exactly. Does FileStream support writing when you open file for
> reading? Does it support reading when you open for writing?
> So, you may or may not read from a generic stream, and you also may or
> may not write to a generic stream. With a design like that you can make
> a mistake: if a stream isn't readable, you have no reference to invoke
> read() method on.

That is indeed pretty nifty. I hope you would allow us to copy that 
feature in Phobos (unless you are considering submitting your library 
wholesale). Let me know.
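
For readers following along, this is how I understand the idea, as a 
stripped-down sketch. Stream here carries only the two properties under 
discussion, and MemorySink and writeIfPossible are hypothetical names of 
mine, so do not read this as your actual code:

```d
interface InputStream  { size_t read(ubyte[] buffer); }
interface OutputStream { size_t write(const(ubyte)[] buffer); }

// Reduced version of the quoted interface (seekable, endOfStream, and
// close are left out to keep the sketch short).
interface Stream
{
    @property InputStream input();   // assumed: null when not readable
    @property OutputStream output(); // assumed: null when not writable
}

// Hypothetical write-only stream that collects bytes in memory.
final class MemorySink : Stream, OutputStream
{
    ubyte[] data;

    @property InputStream input() { return null; }  // cannot be read
    @property OutputStream output() { return this; }

    size_t write(const(ubyte)[] buffer)
    {
        data ~= buffer;
        return buffer.length;
    }
}

// Client code must obtain the capability before it can use it.
size_t writeIfPossible(Stream s, const(ubyte)[] bytes)
{
    if (auto o = s.output)
        return o.write(bytes);
    return 0; // stream is not writable
}
```

The point is that a caller holding a Stream has no read() to call by 
mistake; it has to ask for input first and handle the null.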

> Similarly, a stream is either seekable, or not. SeekableStreams allow
> stream cursor manipulation:
>
> interface SeekableStream : Stream
> {
> long getPosition(Anchor whence = Anchor.begin);
> void setPosition(long position, Anchor whence = Anchor.begin);
> }

Makes sense. Why is getPosition signed? Why do you need an anchor for 
getPosition?

> InputStream doesn't really have many methods:
>
> interface InputStream
> {
> // reads up to buffer.length bytes from a stream
> // returns number of bytes read
> // throws on error
> size_t read(ubyte[] buffer);

That makes implementation of line buffering inefficient :o).

> // reads from current position
> AsyncReadRequest readAsync(ubyte[] buffer, Mailbox* mailbox = null);
> }

Why doesn't Sean's concurrency API scale for your needs? Can that be 
fixed? Would you consider submitting some informed bug reports?

> So is OutputStream:
>
> interface OutputStream
> {
> // returns number of bytes written
> // throws on error
> size_t write(const(ubyte)[] buffer);
>
> // writes from current position
> AsyncWriteRequest writeAsync(const(ubyte)[] buffer, Mailbox* mailbox =
> null);
> }
>
> They basically support only reading and writing in blocks, nothing else.

I'm surprised there's no flush().

> However, they support asynchronous reads/writes, too (think of mailbox
> as a std.concurrency's Tid).
>
> Unlike Daniel's proposal, my design reads up to buffer size bytes for
> two reasons:
> - it avoids potential buffering and multiple sys calls

But there's a problem. It's very rare that the user knows what a good 
buffer size is. And often there are size and alignment restrictions at 
the low level. So somewhere there is still buffering going on, and also 
there are potential inefficiencies (if a user reads small buffers).

> - it is the only way to go with SocketStreams. I mean, you often don't
> know how many bytes an incoming socket message contains. You either have
> to read it byte-by-byte, or your application might stall for potentially
> infinite time (if message was shorter than your buffer, and no more
> messages are being sent)

But if you don't know how many bytes are in an incoming socket message, 
a better design is to do this:

void read(ref ubyte[] buffer);

and resize the buffer to accommodate the incoming packet. Your design 
_imposes_ that the socket does additional buffering.
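
A small sketch of what I have in mind, under loud assumptions: 
PacketSource is a hypothetical stand-in for a socket, and each element 
of its packets array plays the role of one incoming message. Nothing 
here touches a real socket API.

```d
// Hypothetical message source standing in for a socket.
final class PacketSource
{
    private const(ubyte)[][] packets;
    this(const(ubyte)[][] packets) { this.packets = packets; }

    // Resizes `buffer` to hold exactly one message. A length of 0 means
    // no more messages (a sketch-level convention, so empty messages
    // cannot be represented here).
    void read(ref ubyte[] buffer)
    {
        if (packets.length == 0)
        {
            buffer.length = 0;
            return;
        }
        auto p = packets[0];
        packets = packets[1 .. $];
        buffer.length = p.length; // grows or shrinks to fit the message
        buffer[] = p[];
    }
}
```

The caller never has to guess a size up front, and never waits for 
bytes that are not coming just because the buffer happened to be longer 
than the message.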

> Why do my streams provide async methods? Because it's the modern
> approach to I/O - blocking I/O (aka one thread per client) doesn't
> scale. E.g. Java adds a second revision of Async I/O API in JDK7 (called
> NIO2, first appeared in February, 2002), C# has asynchronous operations
> as part of their Stream interface since .NET 1.1 (April, 2003).

Async I/O is nice, no two ways about that. I have on my list to define 
byChunkAsync that works exactly like byChunk from the client's 
perspective, except it does I/O concurrently with client code.
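
Nothing like this exists yet, so take the following only as a toy 
illustration of the shape I am after. ByChunkAsync and producer are 
hypothetical names, and the "I/O" is just slicing an in-memory array on 
a second thread via std.concurrency:

```d
import std.concurrency;

// Runs on its own thread and mails chunks to the owner. A real version
// would read from a file or socket here instead of slicing an array.
void producer(Tid owner, immutable(ubyte)[] data, size_t chunkSize)
{
    while (data.length)
    {
        immutable n = chunkSize < data.length ? chunkSize : data.length;
        owner.send(data[0 .. n]);
        data = data[n .. $];
    }
    owner.send(true); // end-of-stream marker
}

// Input range with the same front/popFront/empty surface as a
// synchronous chunk range; the producer works while the client
// processes the current chunk.
struct ByChunkAsync
{
    private immutable(ubyte)[] current;
    private bool done;

    this(immutable(ubyte)[] data, size_t chunkSize)
    {
        spawn(&producer, thisTid, data, chunkSize);
        popFront(); // prime the first chunk
    }

    @property bool empty() { return done; }
    @property immutable(ubyte)[] front() { return current; }

    void popFront()
    {
        receive(
            (immutable(ubyte)[] chunk) { current = chunk; },
            (bool finished) { done = true; }
        );
    }
}
```

Client code is a plain foreach over ByChunkAsync(data, n). The sketch 
has obvious holes: the mailbox is unbounded, and two such ranges in the 
same thread would steal each other's messages.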

[snip]
> I strongly believe we shouldn't ignore this type of API.
>
> P.S. For threads this deep it's better to fork a new one, especially when
> changing the subject.

I thought I did by changing the title...

Andrei