Lexers (was: std.d.lexer : voting thread)

Mon Oct 7 09:35:34 PDT 2013

On 10/06/13 10:57, Jacob Carlborg wrote:
> On 2013-10-05 19:52, Artur Skawina wrote:
> 
>> The assumption, that a hand-written lexer will be much faster than a generated
>> one, is wrong.
> 
> I never said that the generated one would be slow. I only said that the hand written would be fast :)

I know, but you said that having both is an option -- that would not
make sense unless there's a significant advantage.
A lexer is really a rather trivial piece of software, there's not much
room for improvement over the obvious "fetch-a-character, use-it-to-
determine-a-new-state, repeat-until-done, return the found state
( == matched token)" approach. So the core of an efficient hand-written
lexer will not be very different from this:
http://repo.or.cz/w/girtod.git/blob/refs/heads/lexer:/mainloop.d
That is already ~2kLOC and it's *just* the top-level loop; it does not
include handling of nontrivial tokens (matches just keywords, punctuators
and identifiers). Could a handwritten lexer be faster? Not by much, and
any trick that would help the manually-written one could also be used
by the generator. In fact, working on the generator is much easier than
dealing with this kind of fragile hand-tuned mess. Imagine changing the
lexical grammar a bit, or introducing a new kind of literal. With a
more declarative solution this only involves a local change spanning
a few lines and is relatively risk-free. Updating a handwritten lexer
would involve many more changes, often in several different areas, and
lots of opportunities for making mistakes.

> Would it be able to lex Scala and Ruby? Method names in Scala can contain many symbols that is not usually allowed in other languages. You can have a method named "==". In Ruby method names are allowed to end with "=", "?" or "!".

Yes, D makes it easy, you can for example simply define a function
that determines what is and what isn't an identifier and pass that as
an alias or mixin parameter. "Lexing" binary formats would be
possible too :^).

A complete D lexer can look as simple as this:
http://repo.or.cz/w/girtod.git/blob/refs/heads/lexer:/dlanglexer.d
which should also give you a good idea of how easy supporting
other languages would be. (The "actions" are defined in separate
modules, so that the grammars can be reused everywhere).
There's a D PEG lexical grammar in there too, btw.

I forgot to change the subject previously, sorry; was not trying
to attempt or influence the voting. I'm just saying that Andrei's
approach goes into the right direction (even if i disagree with
the details). And IMHO the time before a useful std-lib-worthy
lexer infrastructure materializes is measured in months, if not years.
So if I was voting I'd probably say "yes" - because waiting for a
better, but non-existent alternative is not going to help anybody.
The hard part of the required work isn't coding - it's the design.
If a better solution appears later, it should be able to /replace/
the hand-written one. And in the mean time, the experience from using
the less-generic lexer can only help any "new" design.

artur