Request for comments: std.d.lexer

Brian Schott briancschott at gmail.com
Wed Jan 30 01:49:07 PST 2013


On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
> Better, but still slow.

I implemented the various suggestions from a past thread, made the 
lexer work only on ubyte[] (to avoid Phobos converting everything 
to dchar all the time), and gave the tokenizer instance a character 
buffer that it re-uses.
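
Roughly speaking, the hot loop now looks something like the sketch 
below (the names here are invented for illustration, not the actual 
std.d.lexer code): bytes are compared directly, with no dchar 
decoding, and identifiers are copied into a scratch buffer that is 
grown once and re-used for every token.

struct ByteLexer
{
    const(ubyte)[] input;
    size_t index;
    ubyte[] buffer; // scratch space re-used for every token

    // Reads an identifier by copying its bytes into the shared buffer;
    // no decoding to dchar and no per-token allocation happens here.
    const(ubyte)[] lexIdentifier()
    {
        size_t i;
        while (index < input.length && isIdentChar(input[index]))
        {
            if (i >= buffer.length)
                buffer.length = buffer.length ? buffer.length * 2 : 64;
            buffer[i++] = input[index++];
        }
        return buffer[0 .. i];
    }

    static bool isIdentChar(ubyte c) pure nothrow
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
    }
}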

Results:

$ avgtime -q -r 200 ./dscanner --tokenCount 
../phobos/std/datetime.d

------------------------
Total time (ms): 13861.8
Repetitions    : 200
Sample mode    : 69 (90 ocurrences)
Median time    : 69.0745
Avg time       : 69.3088
Std dev.       : 0.670203
Minimum        : 68.613
Maximum        : 72.635
95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207

If my math is right, that means it's getting 4.9 million 
tokens/second now. According to Valgrind, the only way to really 
improve things now is to require that the input to the lexer 
support slicing. (Remember the secret of Tango's XML parser...) 
The bottleneck is now the calls to .idup that construct the token 
strings from slices of the buffer.
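
To make that trade-off concrete, here is a hypothetical illustration 
(the Token type and function names are made up for this example, not 
taken from the lexer): with an arbitrary input range the token text 
has to be copied out with .idup, whereas a sliceable input lets the 
token simply alias the original bytes.

struct Token
{
    string value;
}

// Copying path: works with any input, but every token pays for an .idup.
Token makeTokenByCopy(const(ubyte)[] buffer, size_t len)
{
    return Token((cast(const(char)[]) buffer[0 .. len]).idup);
}

// Slicing path: requires keeping the whole source in memory, but the
// token just aliases the original bytes -- no allocation per token.
Token makeTokenBySlice(immutable(ubyte)[] source, size_t start, size_t end)
{
    return Token(cast(string) source[start .. end]);
}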

> I guess that at some point
>
> pure nothrow TokenType lookupTokenType(const string input)
>
> might become a bottleneck. (DMD does not generate near-optimal 
> string switches, I think.)

Right now that's a fairly small box on KCachegrind.
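
For anyone curious what that function amounts to, a keyword lookup 
like the one quoted above can be written as a plain string switch; 
the sketch below is only illustrative (the TokenType members are 
made up and don't mirror the real enum), and it's this kind of 
construct Timon's remark about DMD's string switches refers to.

enum TokenType { identifier, if_, else_, while_, return_ }

pure nothrow TokenType lookupTokenType(const string input)
{
    switch (input)
    {
        case "if":     return TokenType.if_;
        case "else":   return TokenType.else_;
        case "while":  return TokenType.while_;
        case "return": return TokenType.return_;
        default:       return TokenType.identifier;
    }
}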


