Request for comments: std.d.lexer
Jonathan M Davis
jmdavisProg at gmx.com
Fri Feb 8 00:01:25 PST 2013
On Tuesday, February 05, 2013 22:51:32 Andrei Alexandrescu wrote:
> I think it would be reasonable for a lexer to require a range of ubyte
> as input, and carry its own decoding. In the first approximation it may
> even require a random-access range of ubyte.
Another big issue is the fact that in some ways, using a pointer like dmd's
lexer does is actually superior to using a range. In particular, it's trivial
to determine where in the text a token is, because you can simply subtract the
pointer in the token from the initial pointer. Strings would be okay too,
because you can subtract their ptr properties. But the closest that you'll get
with ranges is to subtract their lengths, and the only ranges that are likely
to define length are random-access ranges. And to do that, you'd either have to
keep calculating the index for each token as it's generated or save the range
with every token (rather than just having a pointer) so that you could
determine the index later if you needed to. And depending on the range, all of
that saving could be expensive.
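
To make the pointer case concrete, here is a minimal C++ sketch of the idea.
It is not dmd's actual code. The Token struct and the tokenOffset function are
hypothetical names made up for illustration. It assumes each token keeps a
pointer into the original source buffer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical token type: it only remembers where it starts in the
// source buffer and how many code units it spans.
struct Token {
    const char* ptr;     // start of the token inside the source buffer
    std::size_t length;  // number of code units in the token
};

// Position of the token in the text, found by subtracting the start of
// the buffer from the pointer stored in the token.
std::size_t tokenOffset(const char* sourceStart, const Token& tok) {
    // Assumed precondition: the token points into the same buffer.
    assert(tok.ptr >= sourceStart);
    return static_cast<std::size_t>(tok.ptr - sourceStart);
}
```

The offset is computed only when somebody asks for it, and the token carries
nothing beyond a pointer and a length.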
And for any other type of range, you'd literally have to count the code units
as you iterated in order to figure out what the index is (and you'd have to
keep saving the range as you went along if you wanted to slice it at all,
since it wouldn't actually be sliceable, and so getting to a particular index
in the range would be very expensive even if you kept track of it). And for
syntax highlighting and some error reporting and a variety of other uses, you
need to be able to determine where in the text a token was (not just its
column and line number). And that's simply a lot easier with a pointer.
So, dealing with generic ranges is a bit problematic - far more so than any
issues with character types. If the lexer is well-written, the extra overhead
can be obviated by having the lexer function do stuff a bit differently
depending on the type of the range, but regardless, restricting it to strings
or pointers would be cleaner in that regard. It's not quite a use case where
ranges shine - especially when efficiency is a top priority.
- Jonathan M Davis