Request for comments: std.d.lexer

Jonathan M Davis jmdavisProg at gmx.com
Fri Feb 8 00:01:25 PST 2013


On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
> I think it would be reasonable for a lexer to require a range of ubyte
> as input, and carry its own decoding. In the first approximation it may
> even require a random-access range of ubyte.

Another big issue is the fact that in some ways, using a pointer like dmd's 
lexer does is actually superior to using a range. In particular, it's trivial 
to determine where in the text a token is, because you can simply subtract the 
pointer in the token from the initial pointer. Strings would be okay too, 
because you can subtract their ptr properties. But the closest that you'll get 
with ranges is to subtract their lengths, and the only ranges that are likely 
to define length are random-access ranges. And to do that, you'd either have to 
keep calculating the index for each token as it's generated or save the range 
with every token (rather than just having a pointer) so that you could 
determine the index later if you needed to. And depending on the range, all of 
that saving could be expensive.
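
To make the pointer arithmetic concrete, here is a minimal sketch. The Token 
struct and the offsetOf helper are hypothetical names made up for illustration 
(they are not dmd's or std.d.lexer's actual types), and it assumes each token 
holds a slice of the original source string:

```d
// Hypothetical token type (an assumption for illustration):
// it stores a slice of the source text.
struct Token
{
    string text;
}

// Offset of the token within the source, in code units.
// Only valid if tok.text really is a slice of source.
size_t offsetOf(string source, Token tok)
{
    return cast(size_t)(tok.text.ptr - source.ptr);
}

void main()
{
    string source = "int x = 42;";
    auto tok = Token(source[4 .. 5]); // the identifier "x"
    assert(offsetOf(source, tok) == 4);
}
```

Nothing beyond the slice has to be stored in the token; the offset is 
recovered on demand with a single subtraction.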

And for any other type of range, you'd literally have to count the code units 
as you iterated in order to figure out what the index is (and you'd have to 
keep saving the range as you went along if you wanted to slice it at all, 
since it wouldn't actually be sliceable, and so getting to a particular index 
in the range would be very expensive even if you kept track of it). And for 
syntax highlighting and some error reporting and a variety of other uses, you 
need to be able to determine where in the text a token was (not just its 
column and line number). And that's simply a lot easier with a pointer.
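
For comparison, this is roughly what the counting looks like for a range with 
no length and no slicing. Counted is a hypothetical wrapper written for this 
example (not something in Phobos), and the filter call is only there to 
produce a range that is neither random-access nor sliceable:

```d
import std.algorithm : filter;
import std.range;

// Hypothetical wrapper (an assumption for illustration): counts how many
// elements have been popped, since the range itself can't say.
struct Counted(R) if (isInputRange!R)
{
    R source;
    size_t index; // code units consumed so far

    @property bool empty() { return source.empty; }
    @property auto front() { return source.front; }
    void popFront() { source.popFront(); ++index; }
}

void main()
{
    auto bytes = cast(immutable(ubyte)[]) "int x;";
    auto src = bytes.filter!(b => true); // no length, no slicing
    auto r = Counted!(typeof(src))(src);

    foreach (i; 0 .. 4)
        r.popFront();

    assert(r.index == 4);
    assert(r.front == 'x');
}
```

Every popFront pays for the increment, and getting back to index 4 later 
still means walking the range again from a saved copy.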

So, dealing with generic ranges is a bit problematic - far more so than any 
issues with character types. If the lexer is well-written, the extra overhead 
can be obviated by having the lexer function do things a bit differently 
depending on the type of the range, but regardless, restricting it to strings 
or pointers would be cleaner in that regard. It's not quite a use case where 
ranges shine - especially when efficiency is a top priority.
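
As a rough sketch of what doing things differently per range type could look 
like, the helper below picks a strategy at compile time. The name consumed is 
made up for this example, and it assumes rest is what remains of whole after 
some elements have been popped:

```d
import std.range;
import std.traits : isDynamicArray;

// Hypothetical helper (an assumption for illustration): how many elements
// of `whole` were consumed to arrive at `rest`.
size_t consumed(R)(R whole, R rest)
{
    static if (isDynamicArray!R)
        // Strings and other arrays: plain pointer subtraction.
        return cast(size_t)(rest.ptr - whole.ptr);
    else static if (hasLength!R)
        // Ranges that define length: subtract the lengths.
        return whole.length - rest.length;
    else
        static assert(0, "the index must be tracked manually for "
            ~ R.stringof);
}

void main()
{
    string s = "int x;";
    assert(consumed(s, s[4 .. $]) == 4);

    auto r = iota(10);
    assert(consumed(r, r.drop(3)) == 3);
}
```

The array branch is the only one that needs nothing but the two values 
themselves; the other branches depend on what the range happens to provide.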

- Jonathan M Davis

