Streaming library
Denis Koroskin
2korden at gmail.com
Wed Oct 13 12:02:48 PDT 2010
On Wed, 13 Oct 2010 20:55:04 +0400, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> wrote:
> On 10/13/10 11:16 CDT, Denis Koroskin wrote:
>> On Wed, 13 Oct 2010 18:32:15 +0400, Andrei Alexandrescu
>>> So far so good. I will point out, however, that the classic read/write
>>> routines are not all that good. For example if you want to implement a
>>> line-buffered stream on top of a block-buffered stream you'll be
>>> forced to write inefficient code.
>>>
>>
>> Never heard of filesystems that allow reading files in lines - they
>> always read in blocks, and that's what streams should do.
>
> http://www.gnu.org/s/libc/manual/html_node/Buffering-Concepts.html
>
> I don't think streams must mimic the low-level OS I/O interface.
>
In contrast, I think streams should be the lowest-level platform-independent
abstraction possible.
No buffering beyond what the OS provides, no additional functionality. If
you need to read up to some character (and which one should count as a
new-line separator anyway: \r, \n, \r\n?), that should be done manually in
"byLine".
>> That's because
>> most of the streams are binary streams, and there is no such thing as a
>> "line" in them (e.g. how often do you need to read a line from a
>> SocketStream?).
>
> http://www.opengroup.org/onlinepubs/009695399/functions/isatty.html
>
These are special cases I don't like. There is no such thing in Windows
anyway.
> You need a line when e.g. you parse an HTML header or an email header or
> an FTP response. Again, if at a low level the transfer occurs in blocks,
> that doesn't mean the API must do the same at all levels.
>
BSD sockets transmit in blocks. If you need to find a special sequence in
a socket stream, you have to fetch a chunk and search it manually for the
sequence you need. My position is that you should do it with an external
predicate (e.g. read until whitespace).
>> I don't think streams should buffer anything either (what an underlying
>> OS I/O API caches should suffice), buffered streams adapters can do that
>> in a stream-independent way (why duplicate code when you can do that as
>> efficiently with external methods?).
>
> Most OS primitives don't give access to their own internal buffers.
> Instead, they ask user code to provide a buffer and transfer data into
> it.
Right. This is why Stream may not cache.
> So clearly buffering on the client side is a must.
>
I don't see how that follows from the above.
>> Besides, as you noted, the buffering is redundant for byChunk/byLine
>> adapter ranges. It means that byChunk/byLine should operate on
>> unbuffered streams.
>
> Chunks keep their own buffer so indeed they could operate on streams
> that don't do additional buffering. The story with lines is a fair
> amount more complicated if it needs to be done efficiently.
>
Yes, but I don't see why line reading is a case that needs to be handled
specially.
>> I'll explain my I/O streams implementation below in case you didn't read
>> my message (I've changed some stuff a little since then).
>
> Honest, I opened it to remember to read it but somehow your fonts are
> small and make my eyes hurt.
>
>> My Stream
>> interface is very simple:
>>
>> // A generic stream
>> interface Stream
>> {
>>     @property InputStream input();
>>     @property OutputStream output();
>>     @property SeekableStream seekable();
>>     @property bool endOfStream();
>>     void close();
>> }
>>
>> You may ask, why separate Input and Output streams?
>
> I think my first question is: why doesn't Stream inherit InputStream and
> OutputStream? My hypothesis: you want to sometimes return null. Nice.
>
Right.
>> Well, that's because
>> you either read from them, write to them, or both.
>> Some streams are read-only (think Stdin), some write-only (Stdout), some
>> support both, like FileStream. Right?
>
> Sounds good. But then where's flush()? Must be in OutputStream.
>
That's probably because unbuffered streams don't need it.
>> Not exactly. Does FileStream support writing when you open file for
>> reading? Does it support reading when you open for writing?
>> So, you may or may not read from a generic stream, and you also may or
>> may not write to a generic stream. With a design like that you can't make
>> a mistake: if a stream isn't readable, you have no reference to invoke
>> read() method on.
>
> That is indeed pretty nifty. I hope you would allow us to copy that
> feature in Phobos (unless you are considering submitting your library
> wholesale). Let me know.
>
I would love to contribute both design and implementation.
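A minimal sketch of how the null-returning properties could look in practice. The interfaces are cut down to the bare minimum, and ReadOnlyMemory and tryWrite are hypothetical names used only for illustration, not part of any existing library:

```d
// Reduced, hypothetical versions of the interfaces quoted in this thread.
interface InputStream  { size_t read(ubyte[] buffer); }
interface OutputStream { size_t write(const(ubyte)[] buffer); }

interface Stream
{
    @property InputStream input();   // null if the stream is not readable
    @property OutputStream output(); // null if the stream is not writable
}

// A read-only stream over a block of memory (illustration only).
final class ReadOnlyMemory : Stream, InputStream
{
    private const(ubyte)[] data;

    this(const(ubyte)[] data) { this.data = data; }

    @property InputStream input() { return this; }
    @property OutputStream output() { return null; }

    // Copies up to buffer.length bytes and returns how many were copied.
    size_t read(ubyte[] buffer)
    {
        auto n = buffer.length < data.length ? buffer.length : data.length;
        buffer[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

// Client code checks the capability once, then works with a typed reference.
bool tryWrite(Stream s, const(ubyte)[] bytes)
{
    auto o = s.output;
    if (o is null) return false; // not writable: nothing to call write() on
    o.write(bytes);
    return true;
}
```

The point of the design is that an unsupported operation shows up as a missing reference at the call site, rather than as an exception thrown from inside read() or write().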
>> Similarly, a stream is either seekable, or not. SeekableStreams allow
>> stream cursor manipulation:
>>
>> interface SeekableStream : Stream
>> {
>>     long getPosition(Anchor whence = Anchor.begin);
>>     void setPosition(long position, Anchor whence = Anchor.begin);
>> }
>
> Makes sense. Why is getPosition signed? Why do you need an anchor for
> getPosition?
>
long is chosen to be consistent with setPosition. Also, getPosition may
return a negative value:
long pos = getPosition(Anchor.end); // how far is it to the end of the file?
This is also how you can get the file size (you need to negate the result,
though). It is consistent with setPosition:
setPosition(getPosition(anchor), anchor); // a no-op for any kind of anchor
I just thought, why not? I'm okay with dropping it, but I find it nice.
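To make the anchor semantics concrete, here is a small sketch over an in-memory cursor. MemoryCursor is a hypothetical stand-in for a real seekable stream, and Anchor.current is an assumption on my side (only begin and end appear above):

```d
enum Anchor { begin, current, end }

// Hypothetical stand-in for a seekable stream: just a size and a cursor.
struct MemoryCursor
{
    long size;
    long pos;

    // Offset of the cursor relative to the given anchor.
    long getPosition(Anchor whence = Anchor.begin) const
    {
        final switch (whence)
        {
            case Anchor.begin:   return pos;
            case Anchor.current: return 0;
            case Anchor.end:     return pos - size; // zero or negative
        }
    }

    // Moves the cursor to `position`, measured from the given anchor.
    void setPosition(long position, Anchor whence = Anchor.begin)
    {
        long target;
        final switch (whence)
        {
            case Anchor.begin:   target = position; break;
            case Anchor.current: target = pos + position; break;
            case Anchor.end:     target = size + position; break;
        }
        assert(target >= 0 && target <= size, "position out of range");
        pos = target;
    }
}
```

With these definitions the size is getPosition(Anchor.begin) - getPosition(Anchor.end), which reduces to plain negation of getPosition(Anchor.end) when the cursor sits at the beginning.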
>> InputStream doesn't really have many methods:
>>
>> interface InputStream
>> {
>>     // reads up to buffer.length bytes from a stream
>>     // returns number of bytes read
>>     // throws on error
>>     size_t read(ubyte[] buffer);
>
> That makes implementation of line buffering inefficient :o).
>
There is no way you can do it more efficiently on Windows. Fetch a chunk;
search for a line end; found ? return : continue.
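Roughly, the fetch/search/continue loop could be sketched like this. The InputStream here is reduced to read() only, I assume read() returns 0 at end of stream, only '\n' is treated as a separator, and MemoryInput and LineReader are hypothetical names for illustration:

```d
import std.algorithm : countUntil;

// Reduced interface; assumption: read() returns 0 at end of stream.
interface InputStream
{
    size_t read(ubyte[] buffer);
}

// Test double: serves bytes from memory, at most `step` per call.
final class MemoryInput : InputStream
{
    private const(ubyte)[] data;
    private size_t step;

    this(const(ubyte)[] data, size_t step)
    {
        this.data = data;
        this.step = step;
    }

    size_t read(ubyte[] buffer)
    {
        size_t n = buffer.length;
        if (n > step) n = step;
        if (n > data.length) n = data.length;
        buffer[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

// External line reader: it owns the pending bytes, the stream stays unbuffered.
struct LineReader
{
    InputStream input;
    ubyte[] pending; // bytes fetched but not yet handed out
    bool eof;

    // Returns false when the input is exhausted; `line` excludes the '\n'.
    bool next(out ubyte[] line)
    {
        for (;;)
        {
            auto i = pending.countUntil(cast(ubyte) '\n');
            if (i >= 0) // found a line end: return it
            {
                line = pending[0 .. i];
                pending = pending[i + 1 .. $];
                return true;
            }
            if (eof) // no more data: hand out the unterminated tail, if any
            {
                if (pending.length == 0) return false;
                line = pending;
                pending = null;
                return true;
            }
            ubyte[4096] chunk; // fetch another chunk and continue searching
            auto n = input.read(chunk[]);
            if (n == 0) eof = true;
            else pending ~= chunk[0 .. n];
        }
    }
}
```

The sketch rescans the pending bytes after each fetch, which is wasteful for very long lines; remembering how far the previous search got would avoid that.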
>>     // reads from current position
>>     AsyncReadRequest readAsync(ubyte[] buffer, Mailbox* mailbox = null);
>> }
>
> Why doesn't Sean's concurrency API scale for your needs? Can that be
> fixed? Would you consider submitting some informed bug reports?
>
It's a design issue rather than a bug as such. I'll write a separate
message about that.
>> So is OutputStream:
>>
>> interface OutputStream
>> {
>>     // returns number of bytes written
>>     // throws on error
>>     size_t write(const(ubyte)[] buffer);
>>
>>     // writes from current position
>>     AsyncWriteRequest writeAsync(const(ubyte)[] buffer, Mailbox* mailbox = null);
>> }
>>
>> They basically support only reading and writing in blocks, nothing else.
>
> I'm surprised there's no flush().
>
No buffering - no flush.
>> However, they support asynchronous reads/writes, too (think of mailbox
>> as a std.concurrency's Tid).
>>
>> Unlike Daniel's proposal, my design reads up to buffer size bytes for
>> two reasons:
>> - it avoids potential buffering and multiple sys calls
>
> But there's a problem. It's very rare that the user knows what a good
> buffer size is. And often there are size and alignment restrictions at
> the low level.
I agree, but the user can guess, or the library can give a hint. E.g.
BUFFER_SIZE is a good buffer size to start with :)
> So somewhere there is still buffering going on, and also there are
> potential inefficiencies (if a user reads small buffers).
>
>> - it is the only way to go with SocketStreams. I mean, you often don't
>> know how many bytes an incoming socket message contains. You either have
>> to read it byte-by-byte, or your application might stall for potentially
>> infinite time (if message was shorter than your buffer, and no more
>> messages are being sent)
>
> But if you don't know how many bytes are in an incoming socket message,
> a better design is to do this:
>
> void read(ref ubyte[] buffer);
>
That could work, too.
> and resize the buffer to accommodate the incoming packet. Your design
> _imposes_ that the socket does additional buffering.
>
The socket API does it anyway. I just don't complicate it even further by
providing an additional layer of buffering.
>> Why do my streams provide async methods? Because it's the modern
>> approach to I/O - blocking I/O (aka one thread per client) doesn't
>> scale. E.g. Java is adding a second revision of its async I/O API in JDK7
>> (called NIO2; the first one appeared in February, 2002), C# has had
>> asynchronous operations as part of its Stream interface since .NET 1.1
>> (April, 2003).
>
> Async I/O is nice, no two ways about that. I have on my list to define
> byChunkAsync that works exactly like byChunk from the client's
> perspective, except it does I/O concurrently with client code.
>
> [snip]
>> I strongly believe we shouldn't ignore this type of API.
>>
>> P.S. For threads this deep it's better to fork a new one, especially when
>> changing the subject.
>
> I thought I did by changing the title...
>
>
> Andrei
No, changing the title isn't enough.