Request for comments: std.d.lexer
Dmitry Olshansky
dmitry.olsh at gmail.com
Wed Jan 30 08:50:56 PST 2013
On 30-Jan-2013 13:49, Brian Schott wrote:
> On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
>> Better, but still slow.
>
> I implemented the various suggestions from a past thread, made the
> lexer work only on ubyte[] (to avoid Phobos converting everything to
> dchar all the time), and gave the tokenizer instance a character
> buffer that it re-uses.
>
> Results:
>
> $ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d
>
> ------------------------
> Total time (ms): 13861.8
> Repetitions : 200
> Sample mode : 69 (90 occurrences)
> Median time : 69.0745
> Avg time : 69.3088
> Std dev. : 0.670203
> Minimum : 68.613
> Maximum : 72.635
> 95% conf.int. : [67.9952, 70.6223] e = 1.31357
> 99% conf.int. : [67.5824, 71.0351] e = 1.72633
> EstimatedAvg95%: [69.2159, 69.4016] e = 0.0928836
> EstimatedAvg99%: [69.1867, 69.4308] e = 0.12207
>
> If my math is right, that means it's getting 4.9 million tokens/second
> now. According to Valgrind the only way to really improve things now is
> to require that the input to the lexer support slicing. (Remember the
> secret of Tango's XML parser...) The bottleneck is now the calls to
> .idup that construct the token strings from slices of the buffer.
>
idup --> allocation
Instead, I suggest allocating a big block of fixed size (say
16-64K) upfront and copying identifiers into it one by one. When it
fills up, just allocate another block and move on.
If an identifier is exceptionally long, you can just idup it as before.
This should bring the number of GC allocations down significantly.
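A minimal sketch of the idea (hypothetical names, not Brian's actual
code):

struct IdentifierPool
{
    enum blockSize = 64 * 1024;  // 64K per block
    ubyte[] block;               // current block
    size_t used;                 // bytes used so far in the block

    // Copies a slice of the input buffer into the pool and returns
    // an immutable view of the copy. The pool never overwrites a
    // filled region, so the cast to string is safe once the bytes
    // are written.
    string intern(const(ubyte)[] ident)
    {
        // Exceptionally long identifiers still get their own
        // allocation, as before.
        if (ident.length > blockSize / 4)
            return cast(string) ident.idup;

        // Block full (or not yet allocated): start a fresh one; the
        // GC keeps the old block alive through the returned slices.
        if (block.length - used < ident.length)
        {
            block = new ubyte[blockSize];
            used = 0;
        }

        auto slot = block[used .. used + ident.length];
        slot[] = ident[];
        used += ident.length;
        return cast(string) slot;
    }
}

The lexer would then call pool.intern(buffer[start .. end]) where it
currently calls .idup - one GC allocation per 64K of identifier data
instead of one per token.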
>> I guess that at some point
>>
>> pure nothrow TokenType lookupTokenType(const string input)
>>
>> might become a bottleneck. (DMD does not generate near-optimal string
>> switches, I think.)
>
> Right now that's a fairly small box on KCachegrind.
>
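For context, lookupTokenType is the kind of keyword classifier that is
usually written as a string switch; a rough, hypothetical fragment
(not the actual std.d.lexer code) would look like:

enum TokenType { identifier, if_, else_, while_ /* one per keyword */ }

pure nothrow TokenType lookupTokenType(const string input)
{
    // DMD lowers a string switch into a druntime lookup over the
    // case strings; that lowering is what Timon suspects is not
    // near-optimal.
    switch (input)
    {
        case "if":    return TokenType.if_;
        case "else":  return TokenType.else_;
        case "while": return TokenType.while_;
        // ... one case per D keyword ...
        default:      return TokenType.identifier;
    }
}

If it ever does show up as hot in the profile, a hand-rolled trie or
perfect hash keyed on length and first character is a common
replacement.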
--
Dmitry Olshansky