std.d.lexer : voting thread

Artur Skawina art.08.09 at gmail.com
Sat Oct 5 10:52:29 PDT 2013


On 10/05/13 13:45, Jacob Carlborg wrote:
> I think we can have both. A hand written lexer, specifically targeted for D that is very fast. Then a more general lexer that can be used for many languages.

The assumption that a hand-written lexer will be much faster than a
generated one is wrong.
If there's any significant perf difference, it's just a matter of improving
the generator. An automatically generated lexer is also much more flexible
(the source spec can be reused without a single modification for anything
from an intelligent LOC-like counter or a syntax highlighter to a compiler),
easier to maintain and review, and less buggy.
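
To make the reuse point concrete, here's a minimal, self-contained sketch --
emphatically NOT my generator (all the names are made up, and recompiling
the regex on every step makes it slow); it only shows how one declarative
spec can serve several consumers unchanged:

import std.algorithm : count;
import std.regex, std.stdio, std.typecons;

struct Rule { string name; string pattern; }

// The "spec" -- the only language-specific part; no consumer logic here.
immutable Rule[] spec = [
    Rule("whitespace", `\s+`),
    Rule("identifier", `[A-Za-z_]\w*`),
    Rule("number",     `\d+`),
    Rule("operator",   `[-+*/=;]`),
];

// A generic driver; every consumer reuses it with the same spec.
Tuple!(string, string)[] tokenize(string src)
{
    typeof(return) toks;
    while (src.length)
    {
        bool matched = false;
        foreach (rule; spec)
        {
            auto m = matchFirst(src, "^(?:" ~ rule.pattern ~ ")");
            if (m.empty) continue;
            if (rule.name != "whitespace")
                toks ~= tuple(rule.name, m.hit);
            src = src[m.hit.length .. $];
            matched = true;
            break;
        }
        if (!matched)
            src = src[1 .. $]; // skip input the spec doesn't cover
    }
    return toks;
}

void main()
{
    auto toks = tokenize("x = 42; y = x + 1;");
    // Consumer 1, a trivial token counter:       prints "10 tokens"
    writefln("%s tokens", toks.length);
    // Consumer 2, a LOC-like identifier counter: prints "3 identifiers"
    writefln("%s identifiers", toks.count!(t => t[0] == "identifier"));
}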

Compare the perf numbers previously posted here for the various lexers with:

$ time ./tokenstats stats std/datetime.d  
Lexed 1589336 bytes, found 461315 tokens, 13770 keywords, 65946 identifiers.
Comments:  Line: 958 @ ~40.16  Block: 1 @ ~16  Nesting: 534 @ ~441.7 [count @ avg_len]
0m0.010s user   0m0.001s system   0m0.011s elapsed   99.61% CPU
$ time ./tokenstats dump-no-io std/datetime.d  
0m0.013s user   0m0.001s system   0m0.014s elapsed   99.78% CPU

'tokenstats' is built from a PEG-like spec plus a bit of CT magic. The
generator supports inline rules written in D too, but the only ones actually
written in D are the ones defining what an identifier is, matching EOLs and
handling DelimitedStrings.
Initially, performance was not a consideration at all, and there's some very
low-hanging fruit in there; there's still room for improvement.
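
Just to illustrate what "inline rules written in D" means -- the real spec
format is different and not public yet, so this fragment is purely
hypothetical -- a declarative grammar sits next to ordinary D predicates:

// Hypothetical PEG-like fragment; the actual spec syntax differs.
enum grammar = `
    Token   <- Comment / Identifier / Operator
    Comment <- "//" (!EOL .)* EOL
`;

// Inline rules as plain D predicates -- the identifier/EOL/
// DelimitedString rules mentioned above are this kind of thing.
bool isIdentStart(dchar c)
{
    return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || c == '_';
}

bool isIdentCont(dchar c)
{
    return isIdentStart(c) || ('0' <= c && c <= '9');
}
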
Unfortunately, the language and compiler situation has prevented me from doing
any work on this for the last half year or so. The code won't work with any
current compiler and needs a lot of cleanups (which I have been planning to do
/after/ updating the tooling, which seems very unlikely to be possible now), hence
it's not in a releasable state. [1]

artur

[1] If anyone wants to play with it, use as a reference etc and isn't
    afraid of running a binary, a linux x86 one can be gotten from
    http://d-h.st/xtX
    The only really useful functionality is 'tokenstats dump file.d',
    which will dump all found tokens with line and column numbers.
    It's just a tool I've been using for identifying regressions and benching.

