Proposal for SentinelInputRange
Jonathan M Davis
jmdavisProg at gmx.com
Thu Feb 28 03:29:15 PST 2013
On Thursday, February 28, 2013 02:54:24 Walter Bright wrote:
> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
> >> Again, please see how lexer.c works. I assure you, there is no double
> >> copying going on, nor is there a double test for the terminating 0.
> >
> > I know what the lexer does, and remember that it _doesn't_ operate on
> > ranges, and there are subtle differences between being able to just use
> > char* and trying to handle generic ranges.
>
> Hence the need to invent SentinelInputRange.
>
> > Given how a lexer works (and I have been working on a lexer off and on
> > recently), the only real difference is that you'd just use a couple of
> > static ifs like
> >
> > static if(!isSomeString!R)
> > {
> >
> > if(range.empty)
> >
> > break; //or whatever you do at the end
> >
> > }
> >
> > static if(isSomeString!R)
> > {
> >
> > case 0:
> > break; //or whatever you do at the end
> >
> > }
>
> There are so many places where this would occur, it cries out for a new
> type.
And you were just claiming that the lexer checked the sentinel type in only
one place. If that's indeed the case (and I think that it's quite close to
being true if it isn't true), then you _wouldn't_ need to use static ifs like
this in many places. So, which is it? If you need to check the sentinel often
enough that using static ifs is a problem, then it's probably not buying you
much of anything over checking empty anyway.
> > The idea of sentinels certainly isn't useless, but anything caring about
> > that sort of speed is likely to just use strings or arrays, and those can
> > trivially be special cased to avoid unnecessary empty checks and to add
> > the check for the sentinel, making the whole sentinel range idea an
> > unnecessary complication IMHO.
>
> You can't do efficient lookahead without sentinels, either. Lexers are
> sensitive to every instruction executed per character read. No sentinels
> mean double the number of instructions per source character.
But my point is that outside of strings or arrays, you're almost certainly
stuck with that. Most ranges would have to be wrapped in order to act as a
sentinel range, and the sentinel range would have to check empty on the
internal range on every popFront. The only time that a sentinel range would
buy you anything outside of strings or arrays is when you have a range type
written specifically to be a sentinel range and which doesn't have to call
empty on a range internally. But that means that it's going to have to manage
whatever it's holding internally itself and can't wrap another range - and
that memory will almost certainly be an array of some sort. And if that's the
case, what did we buy by wrapping it in a sentinel range? Why not just use the
array directly?
Pretty much the only type of range which isn't wrapping another range which
wouldn't have an array internally would be one that was either reading from a
file (or some other input) as it iterates or one which was generative. In
either case, you would be unable to put a zero on the end without checking
whether it was empty with each popFront, meaning that making it a sentinel
range was pointless. So, pretty much the only ranges which could gain efficiency
as sentinel ranges are those that hold arrays internally, meaning that the
only gain from having the sentinel range over special casing the code on
strings or arrays is the fact that you won't have to special case strings or
arrays at all with static ifs or template constraints, but I'm quite convinced
that in most cases, the number of static ifs involved would be quite few,
making it quite trivial to special case on strings, meaning that there's
pretty much no gain in having the sentinel ranges. It just incurs more
overhead that the compiler must optimize away, because you have to wrap
strings for them to work as sentinel ranges, or no template constraint will be
able to detect that they're sentinel ranges. Of course, you could special case
them so that they're treated as sentinel ranges in spite of that, but if
you're special casing, why use the sentinel ranges in the first place instead
of just special casing on strings?
> InputRanges are an abject failure if "anyone caring about speed" is not
> going to use them. And yes, I care very much about the D lexing speed.
Pure input ranges fail utterly as you can't save them, so you get _zero_
lookahead unless you start copying their contents somewhere else - which is
quite expensive in comparison to slicing an array and which complicates the
code. Forward ranges do better, but the lack of random access can be a killer
in some cases. For instance, anything involving unicode (which a lexer mostly
avoids, because it rarely has to decode unicode, but it still happens
sometimes) requires random access for speed at least some of the time
(especially when skipping code units), and it makes comparisons faster,
because you can compare slices at once rather than compare a character and
then pop it, compare, pop, compare, pop... For the fastest results, you need
random-access ranges.
There's tons of stuff that's fine with forward ranges (though I seriously
question the viability of pure input ranges given how insanely limiting is to
be unable to save their state), but there's also plenty of stuff that needs
random access, and if you _really_ care about speed, there's a decent chance
that you need random access (if nothing else, to be able to pop off multiple
elements at a time). That doesn't necessarily mean that you're using an array,
but odds are whatever you're using wraps an array if it isn't one. I don't
think that there's much of anything which is random access without having an
array or pointer underneath the hood. Reading from a file, I'd be very inclined
to use something like MmFile or reading the whole file into an array rather
than trying to wrap in something that's a pure input range. They're just too
limiting.
- Jonathan M Davis
More information about the Digitalmars-d
mailing list