Request for comments: std.d.lexer

Timon Gehr timon.gehr at gmx.ch
Mon Jan 28 13:03:21 PST 2013


On 01/28/2013 01:53 AM, Brian Schott wrote:
> ...
>
> On the topic of performance, I realized that the numbers posted
> previously were actually for a debug build. Fail.
>
> For whatever reason, the current version of the lexer code isn't
> triggering my heisenbug[1] and I was able to build with -release -inline
> -O.
>
> Here's what avgtime has to say:
>
> $ avgtime -q -h -r 200 dscanner --tokenCount ../phobos/std/datetime.d
>
> ------------------------
> Total time (ms): 51409.8
> Repetitions    : 200
> Sample mode    : 250 (169 ocurrences)
> Median time    : 255.57
> Avg time       : 257.049
> Std dev.       : 4.39338
> Minimum        : 252.931
> Maximum        : 278.658
> 95% conf.int.  : [248.438, 265.66]  e = 8.61087
> 99% conf.int.  : [245.733, 268.366]  e = 11.3166
> EstimatedAvg95%: [256.44, 257.658]  e = 0.608881
> EstimatedAvg99%: [256.249, 257.849]  e = 0.800205
> Histogram      :
>      msecs: count  normalized bar
>        250:   169  ########################################
>        260:    22  #####
>        270:     9  ##
>
> Which works out to 1,327,784 tokens per second on my Ivy Bridge i7.
>

Better, but still slow.
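
(For scale: 1,327,784 tokens/s times the ~257 ms average works out to
roughly 340,000 tokens per pass over std/datetime.d.)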

> I created a small program that demangles the output of valgrind so that
> tools like KCachegrind can display profiling information more clearly.
> It's now on the wiki[2]
>
> The bottleneck in std.d.lexer as it stands is the appender instances
> that assemble Token.value during iteration and front() on the array of
> char[]. (As I'm sure everyone expected)
>

I see; there should probably be an option to do this by slicing the
input instead. Also, try to treat narrow strings in such a way that
they do not incur undue decoding overhead.
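
Roughly what I have in mind (a sketch only; the names here are made up
and not from std.d.lexer): keep the whole source around as an
immutable(char)[] and make Token.value a slice of it, so lexing an
identifier is just remembering where it started. Working on the raw
bytes also sidesteps front()'s UTF decoding, since the ASCII characters
the lexer dispatches on never occur inside a multi-byte UTF-8 sequence.

import std.ascii : isAlphaNum;

struct Token
{
    string value;     // slice into the original source, no copying
    size_t position;  // byte offset of the token start
}

// Lex an identifier by remembering where it starts and slicing at the
// end, instead of appending characters one by one.
Token lexIdentifier(string source, ref size_t i)
{
    immutable start = i;
    while (i < source.length && (isAlphaNum(source[i]) || source[i] == '_'))
        ++i;
    return Token(source[start .. i], start);
}

unittest
{
    size_t i = 0;
    auto tok = lexIdentifier("foo42 + bar", i);
    assert(tok.value == "foo42" && i == 5);
}

Slicing of course requires keeping the source buffer alive for as long
as the tokens are, hence "an option" rather than the default.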

I guess that at some point

pure nothrow TokenType lookupTokenType(const string input)

might become a bottleneck. (DMD does not generate near-optimal string 
switches, I think.)
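
To illustrate (hypothetical keywords and names, not the actual
std.d.lexer code): as far as I know DMD lowers a string switch into a
druntime call that compares the scrutinee against the case strings at
run time, so a hand-rolled dispatch that switches on the length first
and only then compares can end up cheaper for keyword lookup. A
generated trie or perfect hash over D's keywords would take the idea
further.

enum TokenType { identifier, if_, else_, while_, return_ /* ... */ }

// Hypothetical replacement: dispatch on the length first so that most
// non-keywords are rejected without any string comparison at all.
pure nothrow TokenType lookupTokenType(const string input)
{
    switch (input.length)
    {
        case 2:
            if (input == "if") return TokenType.if_;
            break;
        case 4:
            if (input == "else") return TokenType.else_;
            break;
        case 5:
            if (input == "while") return TokenType.while_;
            break;
        case 6:
            if (input == "return") return TokenType.return_;
            break;
        default:
            break;
    }
    return TokenType.identifier;
}

Whether that actually beats the compiler-generated string switch would
of course have to be measured.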


