[RFC] I/O and Buffer Range

Thu Jan 16 12:00:16 PST 2014

On Thu, 16 Jan 2014 13:44:08 -0500, Dmitry Olshansky  
<dmitry.olsh at gmail.com> wrote:

> 16-Jan-2014 19:55, Steven Schveighoffer wrote:
>> On Tue, 07 Jan 2014 05:04:07 -0500, Dmitry Olshansky
>> <dmitry.olsh at gmail.com> wrote:
>>> Then our goals are aligned. Be sure to take a peek at (if you haven't
>>> already):
>>> https://github.com/schveiguy/phobos/blob/new-io/std/io.d
>>
>> Yes, I'm gearing up to revisit that after a long D hiatus, and I came
>> across this thread.
>>
>> At this point, I really really like the ideas that you have in this. It
>> solves an issue that I struggled with, and my solution was quite clunky.
>>
>> I am thinking of this layout for streams/buffers:
>>
>> 1. Unbuffered stream used for raw i/o, based on a class hierarchy (which
>> I have pretty much written)
>> 2. Buffer like you have, based on a struct, with specific primitives.
>> Its job is to collect data from the underlying stream, and present it
>> to consumers as a random-access buffer.
>
> The only interesting thing I'd add here is that some buffers may work  
> without an underlying stream. Best examples are arrays and MM-files.

Yes, but I would stress that for convenience, the buffer should forward  
some of the stream primitives (such as seeking) in cases where seeking is  
possible, at least in the case of a buffer that wraps a stream.

That actually is another point that would have sucked with my class-based  
solution -- allocating a class to use an array as backing.
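
To make the array case concrete, here is a minimal sketch of a struct
buffer that has no stream behind it and still exposes seeking. The names
ArrayBuffer, window, discard and seek are hypothetical, picked only for
illustration; they are not taken from either of our implementations.

```d
/// Hypothetical buffer over a plain array: no stream and no class allocation.
struct ArrayBuffer
{
    private const(ubyte)[] data; // the whole backing array
    private size_t pos;          // current read position

    /// Data available from the current position to the end.
    const(ubyte)[] window() const { return data[pos .. $]; }

    /// Consume n elements from the front of the window.
    void discard(size_t n)
    {
        assert(n <= data.length - pos);
        pos += n;
    }

    /// Seeking is trivially possible for an array, so it is exposed directly.
    void seek(size_t absolute)
    {
        assert(absolute <= data.length);
        pos = absolute;
    }
}
```

A buffer that wraps a seekable stream could offer the same seek primitive
by forwarding it to the stream and invalidating its window.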

>
>> 3. Filter that has access to transform the buffer data/copy it.
>> 4. Ranges that use the buffer/filter to process/present the data.
>>
>
> Yes, yes and yes. I find it surprisingly good to see our vision seems to  
> match. I was half-expecting you'd come along and destroy it all ;)

:) I've been preaching for a while that ranges don't make good streams,  
and that streams should be classes, but I hadn't considered splitting out  
the buffer. I think it's the right balance.

>
>> The problem I struggled with is the presentation of UTF data of any
>> format as char[] wchar[] or dchar[]. 2 things need to happen. First is
>> that the data needs to be post-processed to perform any necessary byte
>> swapping. The second is to transcode the data into the correct width.
>>
>> In this way, you can process UTF data of any type (I even have code to
>> detect the encoding and automatically process it), and then use it in a
>> way that makes sense for your code.
>>
>> My solution was to paste in a "processing" delegate into the class
>> hierarchy of buffered streams that allowed one read/write access to the
>> buffer. But it's clunky, and difficult to deal with in a generalized
>> fashion.
>>
>> But the idea of using a buffer in between the stream and the range, and
>> possibly bolting together multiple transformations in a clean way, makes
>> this problem easy to solve, and I think it is closer to the vision
>> Andrei/Walter have.
>
> In essence a transcoding filter for UTF-16 would wrap a buffer of ubyte  
> and itself present a buffer interface (but of wchar).

My intended interface allows you to specify the desired type per read.  
Think of the case of stdin, where the clients will be varied and written  
by many different people, and its interface is decided by Phobos.

But a transcoding buffer may make some optimizations. For instance,  
reading a UTF-32 file as UTF-8 can re-use the same buffer, as no code  
point needs more than 4 UTF-8 code units.
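
A rough sketch of that buffer re-use, assuming the data is already in
native byte order (any byte swapping would have to happen first). The
function name transcodeInPlace is made up for this example; std.utf.encode
is the existing Phobos function. Each code point is read before its UTF-8
form is written, and the write position can never overtake the read
position because each 4-byte input unit yields at most 4 output bytes.

```d
import std.utf : encode;

/// Transcodes native-endian UTF-32 held in `buf` to UTF-8, in place.
/// Returns the leading slice of `buf` that now holds the UTF-8 text.
/// Throws UTFException if a code point is invalid.
char[] transcodeInPlace(ubyte[] buf)
{
    assert(buf.length % 4 == 0, "UTF-32 data must be a multiple of 4 bytes");
    auto src = cast(const(dchar)[]) buf; // view the bytes as UTF-32 code units
    auto dst = cast(char[]) buf;         // view the same bytes as UTF-8 storage
    size_t w = 0;                        // write position, always <= 4 * i
    foreach (i; 0 .. src.length)
    {
        immutable dchar c = src[i];      // read before anything overwrites it
        char[4] tmp;
        immutable n = encode(tmp, c);    // n is between 1 and 4
        dst[w .. w + n] = tmp[0 .. n];
        w += n;
    }
    return dst[0 .. w];
}
```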

>> I am going to study your code some more and see how I can update my code
>> to use it. I still need to maintain the std.stdio.File interface, and
>> Walter is insistent that the initial state of stdout/err/in must be
>> synchronous with C (which kind of sucks, but I have plans on how to make
>> it not be so bad).
>
> I'm seriously not seeing how interfacing with the C runtime could be fast  
> enough.

It's not. But an important stipulation in order for this to all be  
accepted is that it doesn't break existing code that expects things like  
printf and writef to interleave properly.

However, I think we can have an opt-in scheme, and there are certain cases  
where we can proactively switch to a D-buffer scheme. For example, if you  
get a ByLine range, it expects to exhaust the data from the stream, and may  
not work properly with C printf.

The idea is that stdio.File can switch at runtime from FILE * to D streams  
as needed or directed.

>> There is still a lot of work left to do, but I think one of the hard
>> parts is done, namely dealing with UTF transcoding. The remaining sticky
>> part is dealing with shared. But with structs, this should make things
>> much easier.
>
> I'm thinking a generic locking wrapper is possible along the lines of:
>
> shared Locked!(GenericBuffer!char) stdin; //usage
>
> struct Locked(T){
> shared:
> private:
> 	T _this;
> 	Mutex mut;
> public:
> 	//forwarded methods
> }
>
> The wrapper will introduce a lock, and implement every method of wrapped  
> struct roughly like this:
> mut.lock();
> scope(exit) mut.unlock();
> (cast(T*)&_this).method(args);
>
> I'm sure it could be pretty automatic.

This would be a key addition for ANY type in order to properly work with  
shared. BUT, I don't see how it works safely generically because you  
necessarily have to cast away shared in order to call the methods. You  
would have to limit this to only working on types it was intended for.

I've been expecting to have to do something like this, but not looking  
forward to it :(
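
For reference, a minimal sketch of how such a wrapper might be automated
with opDispatch. This is only an illustration under assumptions: Counter
is a made-up payload type, the field names are arbitrary, and the cast
that strips shared is exactly the step that is not safe for arbitrary T
(for example, a method that hands out a reference to internal data
defeats the lock).

```d
import core.sync.mutex : Mutex;

/// Hypothetical locking wrapper; only sound for types designed for it.
struct Locked(T)
{
    private T _payload;
    private Mutex _mut;

    this(T initial) shared
    {
        _payload = cast(shared) initial;
        _mut = new shared Mutex();
    }

    /// Forwards any method call to the payload while holding the lock.
    auto opDispatch(string name, Args...)(Args args) shared
    {
        _mut.lock();
        scope (exit) _mut.unlock();
        // Casting away shared is unchecked: the caller vouches for T.
        return mixin("(cast(T*) &_payload)." ~ name ~ "(args)");
    }
}

/// Made-up payload type used only to exercise the wrapper.
struct Counter
{
    int n;
    void increment() { ++n; }
    int get() { return n; }
}
```

Usage would look like: auto c = shared(Locked!Counter)(Counter(0));
followed by c.increment(); and c.get().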

>> One question, is there a reason a buffer type has to be a range at all?
>> I can see where it's easy to make it a range, but I don't see
>> higher-level code using the range primitives when dealing with chunks of
>> a stream.
>
> Lexers/parsers enjoy it - i.e. they work pretty much as ranges  
> especially when skipping spaces and the like. As I said the main reason  
> was: if it fits as range why not? After all it makes one-pass processing  
> of data trivial as it rides on top of foreach:
>
> foreach(octet; mybuffer)
> {
> 	if(interesting(octet))
> 		do_cool_stuff();
> }
>
> Things like countUntil make perfect sense when called on buffer (e.g. to  
> find matching sentinel).
>
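
As a small illustration of the sentinel search mentioned above, this is
how countUntil behaves on a plain ubyte slice standing in for the buffer's
data. The helper name indexOfSentinel is hypothetical;
std.algorithm.searching.countUntil is the existing Phobos function and
returns -1 when the needle is absent.

```d
import std.algorithm.searching : countUntil;

/// Returns the index of the first `sentinel` in `window`, or -1 if absent.
ptrdiff_t indexOfSentinel(const(ubyte)[] window, ubyte sentinel)
{
    return window.countUntil(sentinel);
}
```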

I think I misstated my question. What I am curious about is why a type  
must be a forward range to pass isBuffer. Of course, if it makes sense for  
a buffer type to also be a range, it can certainly implement that  
interface as well. But I don't know that I would need those primitives in  
all cases. I don't have any specific use case for having a buffer that  
doesn't implement a range interface, but I am hesitant to necessarily  
couple the buffer interface to ranges just because we can't think of a  
counter-case :)
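
To show what I mean, here is a sketch of a buffer trait that checks only
buffer primitives and says nothing about ranges. The primitive names
window and discard are placeholders I made up, not the ones in your code;
isForwardRange is the existing Phobos trait.

```d
import std.range.primitives : isForwardRange;

/// Hypothetical trait: requires buffer primitives only, no range primitives.
enum isBuffer(T) = is(typeof((ref T b) {
    auto w = b.window;     // a view of the buffered data
    auto n = w.length;     // the view must have a length
    auto e = w[0];         // and be indexable
    b.discard(size_t(1));  // consumed data can be dropped
}));

/// A type that is a buffer but deliberately not a range.
struct PlainBuffer
{
    private const(ubyte)[] data;
    const(ubyte)[] window() const { return data; }
    void discard(size_t n) { data = data[n .. $]; }
}

static assert(isBuffer!PlainBuffer);
static assert(!isForwardRange!PlainBuffer);
static assert(!isBuffer!int);
```

A buffer type that also wants to be a range can still add front, popFront,
empty and save on top of this; algorithms that need them would then
constrain on isBuffer!T && isForwardRange!T.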

-Steve