stream interfaces - with ranges

Steven Schveighoffer schveiguy at yahoo.com
Fri May 18 06:44:43 PDT 2012


On Fri, 18 May 2012 03:52:51 -0400, Mehrdad <wfunction at hotmail.com> wrote:

> On Thursday, 17 May 2012 at 14:02:09 UTC, Steven Schveighoffer wrote:
>> 2. I realized that a buffering input stream of type T is actually an
>> input range of type T[].
>
> The trouble is, why a slice? Why not an std.array.Array? Why not some  
> other data source?
> (Chicken/egg problem...)

Well, because that's what I/O buffers are :)  There isn't an OS primitive
that reads a file descriptor into, say, a linked list.  Anything other than
a slice would have to go through a translation step.
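
To make that concrete, here's roughly what such a buffering range looks
like (an untested sketch, POSIX-only; the names and layout are made up
for illustration):

import core.sys.posix.unistd : read;

// A buffered input stream of T, exposed as an input range of T[].
// front() returns a slice of the internal buffer -- which is exactly
// what the OS read primitive fills in.
struct BufferedStream(T)
{
    private int fd;      // file descriptor to read from
    private T[] buf;     // fixed storage, reused on every refill
    private T[] window;  // slice of buf currently holding valid data

    this(int fd, size_t bufSize = 4096)
    {
        this.fd = fd;
        this.buf = new T[bufSize];
        popFront();      // prime the first window
    }

    @property bool empty() const { return window.length == 0; }

    // the element type is T[], not T
    @property T[] front() { return window; }

    void popFront()
    {
        auto n = read(fd, buf.ptr, buf.length * T.sizeof);
        window = n > 0 ? buf[0 .. cast(size_t) n / T.sizeof] : null;
    }
}

Note that each popFront() overwrites buf, so a slice obtained from
front() is only valid until the next refill -- which is exactly the dup
question I get to below.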

I don't know what std.array.Array is.

> Another problem I've noticed is the following:
>
>
> Say you're tokenizing some input range, and it happens to just be a  
> huge, gigantic string.
>
> It *should* be possible to turn it into tokens with slices referring to  
> the ORIGINAL string, which is VERY efficient because it doesn't require  
> *any* heap allocations whatsoever. (You just tokenize with opApply() as  
> you go, without ever requiring a heap allocation...)
>
> However, this is *only* possible if you don't use the concept of an  
> input range!

How so?  A slice is an input range, and so is a string.
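
For instance, splitting a string lazily yields tokens that are slices of
the original, with no copying at all (a minimal sketch):

import std.algorithm : splitter;
import std.stdio : writeln;

void main()
{
    string src = "one two three";
    foreach (tok; src.splitter(' '))
    {
        // tok aliases src's memory -- no heap allocation here
        assert(tok.ptr >= src.ptr && tok.ptr <= src.ptr + src.length);
        writeln(tok);
    }
}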

> Since you can't slice an input range, you'd be forced to use the front()  
> and popFront() primitives. But, as soon as you do that, you're gonna  
> have to store the data somewhere... so your next-best option is to  
> append it to some new gigantic array (instead of a bunch of small  
> arrays, which require a lot of heap allocations), but even then, it's  
> not as efficient as possible, because there's O(n) extra memory involved  
> -- which defeats the whole purpose of working on small chunks at a time  
> with no heap allocations.
> (If you're going to do that, after all, you might as well read the  
> entire thing into a giant string at the beginning, and work with an  
> array anyway, discarding the whole idea of a range while doing your  
> tokenization.)
>
>
> Any ideas on how to solve this problem?

I think I get what you are saying here -- if you are processing, say, an  
XML file, and you want to split that into tokens, you have to dup each  
token from the stream, because the buffer may be reused.

But doing the same thing for a string would be wasteful.

I think in these cases, we need two types of parsing.  One is to process
the stream as it's read into a temporary buffer.  If you need data from
the temporary buffer beyond the scope of the processing loop, you dup it.

The other is to read the entire file/stream into a buffer up front, then
process that buffer with the knowledge that it's never going to change.

We can probably have the buffer identify which situation it's in, so the
code can make a runtime decision on whether or not to dup.
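
Something along these lines (purely hypothetical names, just to sketch
the idea):

// A buffer that knows whether its contents are stable (whole file
// read up front) or transient (window into a reused stream buffer).
struct Buffer
{
    const(char)[] data;
    bool isStable;  // true: data never changes; false: buf is reused

    // Hand back a token that's safe to keep beyond the current
    // processing loop: a plain slice if stable, a dup otherwise.
    const(char)[] keep(const(char)[] slice) const
    {
        return isStable ? slice : slice.idup;
    }
}

A tokenizer would call keep() on any slice it wants to retain, and the
copy only happens in the streaming case.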

-Steve

