Streaming transport interfaces: input

Thu Oct 14 13:47:13 PDT 2010

On Thu, 14 Oct 2010 14:43:56 -0400, Andrei Alexandrescu  
<SeeWebsiteForEmail at erdani.org> wrote:

> On 10/14/10 13:14 CDT, Steven Schveighoffer wrote:
>> On Thu, 14 Oct 2010 13:39:03 -0400, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> wrote:
>>
>>> On 10/14/10 12:27 CDT, Steven Schveighoffer wrote:
>>>> On Thu, 14 Oct 2010 11:34:12 -0400, Andrei Alexandrescu
>>>> <SeeWebsiteForEmail at erdani.org> wrote:
>>>> Please, use the term "seek", and allow an anchor. Every OS allows
>>>> this; it makes no sense not to provide it.
>>>
>>> I've always thought that's a crappy appendix. Every OS that ever
>>> allows seek/tell with anchors allows ALL anchors, and always allows
>>> either both or none of seek and tell. So I decided to cut through the
>>> crap and simplify. You want to seek 100 bytes from here, you write
>>> stream.position = stream.position + 100.
>>
>> Um.. yuck. We need to use two system calls to seek 100 bytes?
>
> seek and tell don't always issue system calls.

seek *is* a system call (lseek64).  tell is simply seeking with (0,  
SEEK_CUR).

Are we talking about the same thing?
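
To make the point concrete, here is a minimal C sketch, assuming a POSIX system and a raw file descriptor with no FILE * layer in between. The helper names (stream_tell, skip_two_calls, skip_one_call, demo_skip) are made up for illustration and are not part of any proposed interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* "tell" is just a seek of zero bytes relative to the current position. */
static off_t stream_tell(int fd)
{
    return lseek(fd, 0, SEEK_CUR);
}

/* position = position + n on a raw descriptor: one lseek to get, one to set. */
static off_t skip_two_calls(int fd, off_t n)
{
    assert(n >= 0);
    off_t cur = lseek(fd, 0, SEEK_CUR);      /* system call 1 */
    if (cur == (off_t)-1) return (off_t)-1;
    return lseek(fd, cur + n, SEEK_SET);     /* system call 2 */
}

/* The same move with an anchor: a single lseek. */
static off_t skip_one_call(int fd, off_t n)
{
    assert(n >= 0);
    return lseek(fd, n, SEEK_CUR);
}

/* Runs both variants on a scratch file; returns 0 if they land where expected. */
static int demo_skip(void)
{
    FILE *f = tmpfile();
    if (f == NULL) return -1;
    int fd = fileno(f);
    off_t a = skip_two_calls(fd, 100);       /* 0 -> 100 */
    off_t b = skip_one_call(fd, 100);        /* 100 -> 200 */
    off_t c = stream_tell(fd);               /* still 200 */
    fclose(f);
    return (a == 100 && b == 200 && c == 200) ? 0 : 1;
}
```

Both variants end up at the same offset; the only difference is how many times the kernel is entered.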

>
>>> Oh, that reminds me I need to provide length as a property as well.
>>> This would save us crap like seek(0, SEEK_END); ftell() to figure out
>>> the length of a file.
>>
>> So now you need to do stream.position = stream.length to seek to the end
>> of the file instead of stream.seek(0, Anchor.END)?
>
> Yes.
>
>> Plus, how will you
>> implement length, probably like this:
>> auto curpos = seek(0, SEEK_CUR);
>> auto len = seek(0, SEEK_END);
>> seek(curpos, SEEK_SET);
>> return len;
>
> Depends. For files, you can just use stat.

True, but still, 2 system calls (stat and then seek).
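
For reference, the two ways of getting the length look roughly like this in C. This is a sketch assuming POSIX and a regular file; the function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Length via three seeks: remember position, jump to end, jump back. */
static off_t length_via_seek(int fd)
{
    off_t cur = lseek(fd, 0, SEEK_CUR);
    if (cur == (off_t)-1) return (off_t)-1;
    off_t len = lseek(fd, 0, SEEK_END);
    if (len == (off_t)-1) return (off_t)-1;
    if (lseek(fd, cur, SEEK_SET) == (off_t)-1) return (off_t)-1;
    return len;
}

/* Length via fstat: one call, and the position is never touched. */
static off_t length_via_fstat(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0) return (off_t)-1;
    return st.st_size;
}

/* Checks that both report the same length and leave the position alone. */
static int demo_length(void)
{
    FILE *f = tmpfile();
    if (f == NULL) return -1;
    int fd = fileno(f);
    const char data[] = "hello, stream";
    const off_t n = (off_t)(sizeof data - 1);        /* bytes written, no NUL */
    if (write(fd, data, sizeof data - 1) != (ssize_t)n ||
        lseek(fd, 5, SEEK_SET) != 5) { fclose(f); return -1; }
    int ok = length_via_seek(fd) == n
          && lseek(fd, 0, SEEK_CUR) == 5             /* position restored */
          && length_via_fstat(fd) == n
          && lseek(fd, 0, SEEK_CUR) == 5;            /* position untouched */
    fclose(f);
    return ok ? 0 : 1;
}
```

Even with fstat, positioning at the end through a length property still needs a second call to actually move there.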

>
>> So that looks like 3 system calls instead of one, plus you just wasted
>> time seeking back to the current position.
>
> Well again they don't always issue system calls, but point taken. I do  
> see a need for fast positioning at end of stream. Perhaps we could  
> accommodate an enum equal to ulong.max such that this goes to the end of  
> stream:
>
> stream.position = StreamBase.atEnd;

OK, that is acceptable.  What about seeking to N bytes before the end?

What about seeking N bytes ahead of the current position (as previously  
stated)?
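
As an illustration of the first case, here is a C sketch of seeking N bytes before the end, once with an anchor and once through a length lookup. It assumes POSIX and a regular file, and the function names are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* With an anchor: one call, offset counted back from the end. */
static off_t seek_before_end(int fd, off_t n)
{
    assert(n >= 0);
    return lseek(fd, -n, SEEK_END);
}

/* Property style (position = length - n): a length lookup plus a seek. */
static off_t seek_before_end_via_length(int fd, off_t n)
{
    assert(n >= 0);
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < n) return (off_t)-1;
    return lseek(fd, st.st_size - n, SEEK_SET);
}

/* Both variants should land on the start of "stream" in the scratch file. */
static int demo_before_end(void)
{
    FILE *f = tmpfile();
    if (f == NULL) return -1;
    int fd = fileno(f);
    const char data[] = "hello, stream";             /* 13 bytes without NUL */
    if (write(fd, data, sizeof data - 1) != (ssize_t)(sizeof data - 1)) {
        fclose(f);
        return -1;
    }
    char tail[6];
    int ok = seek_before_end(fd, 6) == 7
          && read(fd, tail, sizeof tail) == 6
          && memcmp(tail, "stream", 6) == 0
          && seek_before_end_via_length(fd, 6) == 7;
    fclose(f);
    return ok ? 0 : 1;
}
```

A sentinel such as atEnd covers only the offset-zero case of the first function; any nonzero distance from the end needs either the anchor or the length.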

>
>>>> I don't like appendDelim. We don't need to define that until we have
>>>> buffering.
>>>
>>> Why?
>>
>> Because appendDelim deals with buffering. If I defined a buffered
>> stream, I'd include a function like this:
>>
>> size_t read(bool delegate(T[] data) sink);
>>
>> which buffers data until sink returned false (passing each read chunk
>> into sink), extending the buffer as necessary.
>>
>> Then it's trivial to implement readDelim on top of this.
>
> Interesting. But that would still force readDelim to store leftover  
> bytes.

What does readDelim do with them in your implementation?  It must read  
data via a block read, that's the only thing the OS provides (via read  
system call), so what do you do, just return the extra data or throw it  
away?  You need to buffer it somewhere.
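
Here is a minimal C sketch of what I mean by leftover bytes, assuming POSIX read on a raw descriptor. The delim_reader struct and read_delim function are hypothetical, and this version still copies into the caller's array, so it only shows where the leftovers live, not a zero-copy design.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical reader: bytes a block read returns past the delimiter stay in buf. */
struct delim_reader {
    int    fd;
    char   buf[64];   /* deliberately small block size */
    size_t pos;       /* index of the first unconsumed byte */
    size_t len;       /* number of valid bytes in buf */
};

/* Copies bytes up to and including delim into out (at most cap bytes).
   Returns the number of bytes stored, 0 at end of input, -1 on error. */
static ssize_t read_delim(struct delim_reader *r, char delim, char *out, size_t cap)
{
    assert(r != NULL && out != NULL);
    size_t n = 0;
    while (n < cap) {
        if (r->pos == r->len) {                       /* buffer drained: block read */
            ssize_t got = read(r->fd, r->buf, sizeof r->buf);
            if (got < 0) return -1;
            if (got == 0) break;                      /* end of input */
            r->pos = 0;
            r->len = (size_t)got;
        }
        char *start = r->buf + r->pos;
        size_t avail = r->len - r->pos;
        if (avail > cap - n) avail = cap - n;
        char *hit = memchr(start, delim, avail);
        size_t take = hit ? (size_t)(hit - start) + 1 : avail;
        memcpy(out + n, start, take);
        n += take;
        r->pos += take;                               /* the rest stays buffered */
        if (hit) break;
    }
    return (ssize_t)n;
}

/* Reads three lines from a scratch file; returns 0 if the sizes match. */
static int demo_read_delim(void)
{
    FILE *f = tmpfile();
    if (f == NULL) return -1;
    struct delim_reader r = { .fd = fileno(f) };
    const char text[] = "one\ntwo\nthree";
    if (write(r.fd, text, sizeof text - 1) != (ssize_t)(sizeof text - 1) ||
        lseek(r.fd, 0, SEEK_SET) != 0) { fclose(f); return -1; }
    char line[32];
    int ok = read_delim(&r, '\n', line, sizeof line) == 4   /* "one\n" */
          && r.len - r.pos == 9                             /* "two\nthree" left over */
          && read_delim(&r, '\n', line, sizeof line) == 4   /* "two\n" */
          && read_delim(&r, '\n', line, sizeof line) == 5   /* "three" */
          && read_delim(&r, '\n', line, sizeof line) == 0;  /* end of input */
    fclose(f);
    return ok ? 0 : 1;
}
```

The single read call pulls in all 13 bytes, so after the first line is returned the other 9 have to sit somewhere until the next call asks for them.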

>>>> The simple function of an input stream is to read data.
>>>
>>> It does read data.
>>
>> I mean, that's *all* it should do. It should not be appending to  
>> buffers.
>
> This comes from a practical need. I've often had a buffer and wanted to  
> read one more line into it, keeping the existing content. It was  
> impossible without extra allocation and copying.

This can be accommodated via a buffer type.  Buffering provides everything  
you need to implement readDelim.

>>> I think the appendDelim method allows fast and simple implementations
>>> of a variety of patterns. As I (thought I) have shown elsethread,
>>> without appendDelim there's no way to efficiently implement a
>>> line-oriented stream on top of a block-oriented one.
>>
>> Um... the read system call is the same interface as the proposed
>> block-oriented interface. How are you avoiding using system calls?
>
> I think we don't have the same definition for "system call". For example  
> by my definition fread is NOT a system call.

OK, now I think I see the issue :)  You are assuming we are implementing  
all of this on top of FILE *.  FILE * provides buffering already, so you  
are not avoiding buffering at all and you are unnecessarily using an  
extremely outdated interface.

BTW, fread eventually calls read, there's no way around it.

I think we can provide a version of the BufferedStream interface which  
uses C's FILE * for stdout/stdin/stderr (to play nice with C), but we  
should avoid FILE * for everything else.

>>>> Basically, appendDelim can be defined outside this class, because the
>>>> primitive read is enough.
>>>
>>> You can only define it if you accept extra copying. I'd say one extra
>>> interface function is acceptable for fast I/O.
>>
>> No, you can define it without extra copying.
>
> How? Denis' implementation has two copies in the mix. (I'm not counting  
> .dup etc.) Anyhow, let's do this - write down your interfaces so I can  
> comment on them. We say "oh that's a buffering interface" and "that  
> requires buffering" and "that's an extra copy" and so on, but we have  
> few concrete contenders. I put my cards on the table, you put yours.

I'll see if I can put something together.

>
>> If you don't allow direct
>> access to the buffer, then you have extra copying. But we don't have to
>> mimic C here. We should not be encouraging constant reinventing of the
>> buffer wheel here. Buffering is a well-defined task that can be
>> implemented once.
>>
>> Just as a note, Tango does this, and it's very fast. There is certainly
>> no extra copying there.
>>
>>>> Shouldn't the text transport be defined on top of the binary  
>>>> transport?
>>>
>>> No, because there are transports that genuinely do not accept binary
>>> data.
>>
>> I mean, a text transport uses a binary transport underneath. What text
>> transport doesn't use a binary transport to do its dirty work? And what
>> exactly does a text transport do so differently that it needs to be a
>> separate interface?
>
> A text transport does not accept raw binary data and requires e.g.  
> Base64 encoding (e.g. mail attachments do that). The console is a text  
> device - makes no sense to dump binary data on it. A JSON encoder is  
> also a text transport.

-- that writes to/reads from a binary transport.  A file is a binary  
transport.

My opinion is that a text reader/writer should be a class/struct that uses  
a binary transport.  If you are going to implement the binary transport  
stuff also inside the text transport, then it seems like an unnecessary  
duplication.

>>>> And in any case, I'd expect buffering to go between the two.
>>>
>>> How do you define buffering? Would a buffered transport implement a
>>> different interface?
>>
>> Yes, but if we expect to reuse code, I'd expect a buffered transport to
>> use a primitive transport underneath for actually reading/writing data.
>> If you have multiple versions of the class that actually reads/writes
>> data (such as binary vs. text), then the buffer which uses it must
>> support all of them.
>>
>> Text based processing to me seems to be a buffered activity (reading
>> lines, ensuring you don't have sliced utf-8 data, etc.).
>
> Yes. What may not be so obvious is that binary processing with  
> user-imposed data lengths is ALSO a buffered activity. This is because the  
> low-level buffers do NOT come at arbitrary positions (alignment  
> restrictions) and do NOT come at arbitrary lengths.

You can still avoid copying.  I'll see if I can show you.

-Steve