Streaming transport interfaces: input

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Thu Oct 14 11:43:56 PDT 2010


On 10/14/10 13:14 CDT, Steven Schveighoffer wrote:
> On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> wrote:
>
>> On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
>>> On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
>>> <SeeWebsiteForEmail at erdani.org> wrote:
>>> Please, use the term "seek", and allow an anchor. Every OS allows this,
>>> it makes no sense not to provide it.
>>
>> I've always thought that's a crappy appendix. Every OS that ever
>> allows seek/tell with anchors allows ALL anchors, and always allows
>> either both or none of seek and tell. So I decided to cut through the
>> crap and simplify. You want to seek 100 bytes from here, you write
>> stream.position = stream.position + 100.
>
> Um.. yuck. We need to use two system calls to seek 100 bytes?

seek and tell don't always issue system calls.
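
To illustrate, here is a minimal sketch in C over a POSIX file descriptor of a wrapper that caches its offset, so that reading the position is a plain field access and only an actual move costs an lseek. The names CachedStream, cs_position and cs_set_position are hypothetical and not part of any proposed interface; a complete version would also have to update the cached offset on every read and write.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: the offset is cached in user space. */
typedef struct {
    int fd;
    uint64_t pos; /* cached offset; assumed to be updated by all I/O paths */
} CachedStream;

/* "tell": no system call, just a field read. */
uint64_t cs_position(const CachedStream *s) {
    return s->pos;
}

/* "seek": at most one system call, none if the position is unchanged. */
int cs_set_position(CachedStream *s, uint64_t pos) {
    if (pos == s->pos)
        return 0;
    off_t r = lseek(s->fd, (off_t)pos, SEEK_SET);
    if (r == (off_t)-1)
        return -1;
    s->pos = (uint64_t)r;
    return 0;
}
```

Under these assumptions, cs_set_position(&s, cs_position(&s) + 100) issues a single lseek rather than two.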

>> Oh, that reminds me I need to provide length as a property as well.
>> This would save us crap like seek(0, SEEK_END); ftell() to figure out
>> the length of a file.
>
> So now you need to do stream.position = stream.length to seek to the end
> of the file instead of stream.seek(0, Anchor.END)?

Yes.

> Plus, how will you
> implement length, probably like this:
> auto curpos = seek(0, SEEK_CUR);
> auto len = seek(0, SEEK_END);
> seek(curpos, SEEK_BEG);
> return len;

Depends. For files, you can just use stat.
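
As a rough comparison, the sketch below shows both routes in C on a POSIX file descriptor. The function names length_via_stat and length_via_seek are made up for illustration, and st_size is only meaningful for regular files, so this is not a general answer for pipes or sockets.

```c
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* One call, current position untouched: ask the file system for the size. */
int length_via_stat(int fd, uint64_t *len) {
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    *len = (uint64_t)st.st_size;
    return 0;
}

/* The three-seek route from the quoted message, for comparison. */
int length_via_seek(int fd, uint64_t *len) {
    off_t cur = lseek(fd, 0, SEEK_CUR);   /* remember where we are */
    if (cur == (off_t)-1)
        return -1;
    off_t end = lseek(fd, 0, SEEK_END);   /* jump to the end */
    if (end == (off_t)-1)
        return -1;
    if (lseek(fd, cur, SEEK_SET) == (off_t)-1) /* and go back */
        return -1;
    *len = (uint64_t)end;
    return 0;
}
```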

> So that looks like 3 system calls instead of one, plus you just wasted
> time seeking back to the current position.

Well again they don't always issue system calls, but point taken. I do 
see a need for fast positioning at end of stream. Perhaps we could 
accommodate an enum equal to ulong.max such that this goes to the end of 
stream:

stream.position = StreamBase.atEnd;
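
One way such a sentinel could map onto the OS call is sketched below in C. STREAM_AT_END and set_position_or_end are hypothetical stand-ins for StreamBase.atEnd and the position setter; the only point is that the special value turns into a single lseek with SEEK_END.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for StreamBase.atEnd (the largest 64-bit value). */
#define STREAM_AT_END UINT64_MAX

/* Sets the position with one system call either way.
   Returns the resulting offset, or -1 on failure. */
int64_t set_position_or_end(int fd, uint64_t pos) {
    off_t r = (pos == STREAM_AT_END)
        ? lseek(fd, 0, SEEK_END)          /* sentinel: go to end of stream */
        : lseek(fd, (off_t)pos, SEEK_SET); /* ordinary absolute position */
    return (r == (off_t)-1) ? -1 : (int64_t)r;
}
```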

>>> I don't like appendDelim. We don't need to define that until we have
>>> buffering.
>>
>> Why?
>
> Because appendDelim deals with buffering. If I defined a buffered
> stream, I'd include a function like this:
>
> size_t read(bool delegate(T[] data) sink);
>
> which buffers data until sink returned false (passing each read chunk
> into sink), extending the buffer as necessary.
>
> Then it's trivial to implement readDelim on top of this.

Interesting. But that would still force readDelim to store leftover bytes.

>>> The simple function of an input stream is to read data.
>>
>> It does read data.
>
> I mean, that's *all* it should do. It should not be appending to buffers.

This comes from a practical need. I've often had a buffer and wanted to 
read one more line into it, keeping the existing content. It was 
impossible without extra allocation and copying.
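
The use case can be sketched in C on top of stdio. This is only an illustration of the calling pattern, not the proposed appendDelim itself: Buffer and append_delim are hypothetical names, and the sketch leans on stdio's own buffering underneath. What it shows is that the new line lands directly after the existing content, with no intermediate line buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable buffer; data[0..len) is the existing content. */
typedef struct {
    char *data;
    size_t len;
    size_t cap;
} Buffer;

/* Appends bytes from `in` up to and including `delim` (or until EOF).
   Existing content in `buf` is kept. Returns the number of bytes
   appended, or (size_t)-1 if growing the buffer fails. */
size_t append_delim(FILE *in, Buffer *buf, int delim) {
    size_t start = buf->len;
    int c;
    while ((c = getc(in)) != EOF) {
        if (buf->len == buf->cap) {
            size_t ncap = buf->cap ? buf->cap * 2 : 64;
            char *p = realloc(buf->data, ncap);
            if (p == NULL)
                return (size_t)-1;
            buf->data = p;
            buf->cap = ncap;
        }
        buf->data[buf->len++] = (char)c;
        if (c == delim)
            break;
    }
    return buf->len - start;
}
```

Calling append_delim twice on the same Buffer leaves both lines back to back in buf.data.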

>> I think the appendDelim method allows fast and simple implementations
>> of a variety of patterns. As I (thought I) have shown elsethread,
>> without appendDelim there's no way to efficiently implement a
>> line-oriented stream on top of a block-oriented one.
>
> Um... the read system call is the same interface as the proposed
> block-oriented interface. How are you avoiding using system calls?

I think we don't have the same definition for "system call". For example 
by my definition fread is NOT a system call.

>>> Basically, appendDelim can be defined outside this class, because the
>>> primitive read is enough.
>>
>> You can only define it if you accept extra copying. I'd say one extra
>> interface function is acceptable for fast I/O.
>
> No, you can define it without extra copying.

How? Denis' implementation has two copies in the mix. (I'm not counting
.dup etc.) Anyhow, let's do this - write down your interfaces so I can
comment on them. We say "oh that's a buffering interface" and "that
requires buffering" and "that's an extra copy" and so on, but we have
few concrete contenders. I put my cards on the table, you put yours.

> If you don't allow direct
> access to the buffer, then you have extra copying. But we don't have to
> mimic C here. We should not be encouraging constant reinventing of the
> buffer wheel here. Buffering is a well-defined task that can be
> implemented once.
>
> Just as a note, Tango does this, and it's very fast. There is certainly
> no extra copying there.
>
>>> Shouldn't the text transport be defined on top of the binary transport?
>>
>> No, because there are transports that genuinely do not accept binary
>> data.
>
> I mean, a text transport uses a binary transport underneath. What text
> transport doesn't use a binary transport to do its dirty work? And what
> exactly does a text transport do so differently that it needs to be a
> separate interface?

A text transport does not accept raw binary data and requires e.g. 
Base64 encoding (e.g. mail attachments do that). The console is a text 
device - makes no sense to dump binary data on it. A JSON encoder is 
also a text transport.

> In other words, if 90% of the text transport duplicates the binary
> transport, I see an opportunity for consolidation.

Consolidation brings simplification, which is good. But I believe there 
exist text entities that do make the distinction worthwhile.

>>> And in any case, I'd expect buffering to go between the two.
>>
>> How do you define buffering? Would a buffered transport implement a
>> different interface?
>
> Yes, but if we expect to reuse code, I'd expect a buffered transport to
> use a primitive transport underneath for actually reading/writing data.
> If you have multiple versions of the class that actually reads/writes
> data (such as binary vs. text), then the buffer which uses it must
> support all of them.
>
> Text based processing to me seems to be a buffered activity (reading
> lines, ensuring you don't have sliced utf-8 data, etc.).

Yes. What may be not so obvious is that binary processing with
user-imposed data lengths is ALSO a buffered activity. This is because the
low-level buffers do NOT come at arbitrary positions (alignment
restrictions) and do NOT come at arbitrary lengths.
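
A small C sketch of the length half of that point: POSIX read may return fewer bytes than requested, so a caller who wants exactly n bytes has to accumulate across calls. The helper name read_exact is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Keeps reading until `n` bytes have arrived, EOF is hit, or an error
   occurs. Returns the number of bytes read (less than n means EOF came
   first), or -1 on error. */
ssize_t read_exact(int fd, void *dst, size_t n) {
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, (char *)dst + got, n - got);
        if (r == 0)
            break;                /* end of stream */
        if (r < 0) {
            if (errno == EINTR)
                continue;         /* interrupted, retry */
            return -1;
        }
        got += (size_t)r;         /* short read: keep accumulating */
    }
    return (ssize_t)got;
}
```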

>>> If all you
>>> are adding are the different widths of characters, I don't think you
>>> need this extra layer. It's going to make the buffering layer more
>>> difficult to implement (now it must handle both a text version and
>>> a binary version).
>>
>> I don't understand this.
>
> buffer uses a transport. If you have two different transport interfaces,
> the buffer must support them both. And if the benefit is, one simply
> defines [w|d]char versions of read, then we haven't gained much for the
> trouble of having to support both.

I'll be looking forward to seeing your interfaces.


Andrei

