Proposal for SentinelInputRange

Thu Feb 28 16:07:22 PST 2013

On Thursday, February 28, 2013 13:42:34 Walter Bright wrote:
> On 2/28/2013 10:52 AM, Jonathan M Davis wrote:
> > Notice that it has to check for empty every time that the front is popped,
> > and it can't avoid that,
> 
> Yes it can avoid it - that is the whole point. Notice there are no checks
> for the sentinel here - but the code is correct:
> 
> case '>':
> p++;
> if (*p == '=')
> { p++;
> t->value = TOKge; // >=
> }
> else if (*p == '>')
> { p++;
> if (*p == '=')
> { p++;
> t->value = TOKshrass; // >>=
> }
> else if (*p == '>')
> { p++;
> if (*p == '=')
> { p++;
> t->value = TOKushrass; // >>>=
> }
> else
> t->value = TOKushr; // >>>
> }
> else
> t->value = TOKshr; // >>
> }
> else
> t->value = TOKgt; // >
> return;

But that's with a pointer. You see every ++ there? That would be a popFront 
call, and for most ranges, that would mean checking empty internally if the 
range needs to have a sentinel value on its end, because most ranges will be a 
wrapper around another range, and the only way to know whether you've reached 
their end (and that therefore, front now needs to be the sentinel) is to check 
for empty. If a range isn't a wrapper around another range, then it's probably 
an array, but arrays won't work with the sentinel range scheme, because they 
won't pass isSentinelRange. So, they'll still need a wrapper which allows them 
to be treated as a sentinel range. And so the pretty much the only case where 
you get any performance gain is with strings wrapped in a sentinel range, and 
if that's the case, I'd just special case them to use sentinels and avoid the 
sentinel range concept entirely. I just don't see how you're going to get a 
performance gain from much of anything other than strings.

For the lexer I'm working on, I just use string mixins for any operation which 
might differ between range types (e.g. strings can't use front or they'll 
decode, whereas with a range of code units, you need to use front). So, 
instead of p++ like above, I might have something like

mixin(popCodeUnit!R());

But if you're dealing with strings, and you want your function to operate on 
ranges of code units, then you're pretty much stuck either casting the strings 
to arrays of ubyte to operate on them, wrapping them in another range, or 
special casing them. And wrapping them (like your solution would require) is 
arguably the worst, because you lose out on any possibility of any helper 
functions you use optimizing for your strings, unless you write all of them 
yourself, because they won't be recognized as strings - and that includes 
functions like std.utf.decode, which any unicode-aware lexer will have to use 
at least some of the time.

- Jonathan M Davis