Proposal for SentinelInputRange

Thu Feb 28 03:29:15 PST 2013

On Thursday, February 28, 2013 02:54:24 Walter Bright wrote:
> On 2/27/2013 11:55 PM, Jonathan M Davis wrote:
> >> Again, please see how lexer.c works. I assure you, there is no double
> >> copying going on, nor is there a double test for the terminating 0.
> > 
> > I know what the lexer does, and remember that it _doesn't_ operate on
> > ranges, and there are subtle differences between being able to just use
> > char* and trying to handle generic ranges.
> 
> Hence the need to invent SentinelInputRange.
> 
> > Given how a lexer works (and I have been working on a lexer off and on
> > recently), the only real difference is that you'd just use a couple of
> > static ifs like
> > 
> > static if(!isSomeString!R)
> > {
> > 
> >      if(range.empty)
> >      
> >          break; //or whatever you do at the end
> > 
> > }
> > 
> > static if(isSomeString!R)
> > {
> > 
> >      case 0:
> >          break; //or whatever you do at the end
> > 
> > }
> 
> There are so many places where this would occur, it cries out for a new
> type.

And you were just claiming that the lexer checked the sentinel type in only 
one place. If that's indeed the case (and I think that it's quite close to 
being true if it isn't true), then you _wouldn't_ need to use static ifs like 
this in many places. So, which is it? If you need to check the sentinel often 
enough that using static ifs is a problem, then it's probably not buying you 
much of anything over checking empty anyway.

> > The idea of sentinels certainly isn't useless, but anything caring about
> > that sort of speed is likely to just use strings or arrays, and those can
> > trivially be special cased to avoid unnecessary empty checks and to add
> > the check for the sentinel, making the whole sentinel range idea an
> > unnecessary complication IMHO.
> 
> You can't do efficient lookahead without sentinels, either. Lexers are
> sensitive to every instruction executed per character read. No sentinels
> mean double the number of instructions per source character.

But my point is that outside of strings or arrays, you're almost certainly 
stuck with that. Most ranges would have to be wrapped in order to act as a 
sentinel range, and the sentinel range would have to check empty on the 
internal range on every popFront. The only time that a sentinel range would 
buy you anything outside of strings or arrays is when you have a range type 
written specifically to be a sentinel range and which doesn't have to call 
empty on a range internally. But that means that it's going to have to manage 
whatever it's holding internally itself and can't wrap another range - and 
that memory will almost certainly be an array of some sort. And if that's the 
case, what did we buy by wrapping it in a sentinel range? Why not just use the 
array directly?

Pretty much the only type of range which isn't wrapping another range which 
wouldn't have an array internally would be one that was either reading from a 
file (or some other input) as it iterates or one which was generative. In 
either case, you would be unable to put a zero on the end without checking 
whether it was empty with each popFront, meaning that making it a sentinel 
range was pointless. So, pretty much the only ranges which could gain efficiency 
as sentinel ranges are those that hold arrays internally, meaning that the 
only gain from having the sentinel range over special casing the code on 
strings or arrays is the fact that you won't have to special case strings or 
arrays at all with static ifs or template constraints, but I'm quite convinced 
that in most cases, the number of static ifs involved would be quite few, 
making it quite trivial to special case on strings, meaning that there's 
pretty much no gain in having the sentinel ranges. It just incurs more 
overhead that the compiler must optimize away, because you have to wrap 
strings for them to work as sentinel ranges, or no template constraint will be 
able to detect that they're sentinel ranges. Of course, you could special case 
them so that they're treated as sentinel ranges in spite of that, but if 
you're special casing, why use the sentinel ranges in the first place instead 
of just special casing on strings?

> InputRanges are an abject failure if "anyone caring about speed" is not
> going to use them. And yes, I care very much about the D lexing speed.

Pure input ranges fail utterly as you can't save them, so you get _zero_ 
lookahead unless you start copying their contents somewhere else - which is 
quite expensive in comparison to slicing an array and which complicates the 
code. Forward ranges do better, but the lack of random access can be a killer 
in some cases. For instance, anything involving unicode (which a lexer mostly 
avoids, because it rarely has to decode unicode, but it still happens 
sometimes) requires random access for speed at least some of the time 
(especially when skipping code units), and it makes comparisons faster, 
because you can compare slices at once rather than compare a character and 
then pop it, compare, pop, compare, pop... For the fastest results, you need 
random-access ranges.

There's tons of stuff that's fine with forward ranges (though I seriously 
question the viability of pure input ranges given how insanely limiting is to 
be unable to save their state), but there's also plenty of stuff that needs 
random access, and if you _really_ care about speed, there's a decent chance 
that you need random access (if nothing else, to be able to pop off multiple 
elements at a time). That doesn't necessarily mean that you're using an array, 
but odds are whatever you're using wraps an array if it isn't one. I don't 
think that there's much of anything which is random access without having an 
array or pointer underneath the hood. Reading from a file, I'd be very inclined 
to use something like MmFile or reading the whole file into an array rather 
than trying to wrap in something that's a pure input range. They're just too 
limiting.

- Jonathan M Davis