String compare performance

bearophile bearophileHUGS at lycos.com
Sun Nov 28 03:00:49 PST 2010


Austin Hastings:

> It's not clear to me if the point of your post is "how do I make this go 
> faster?" or "Waa! D underperforms Python by default".

The third D version in my post is fast enough, so the point of that post is that a very common string-compare operation, written in the natural way, is a bit too slow in D/DMD.


> If it is the former, a state machine is probably the right answer for 
> long input strings.

This case is very simple, so a simple tree (as in the third D version of my post) is probably good enough.

In some similar but more complex genomic-processing situations I've found that in C it's useful to have computed gotos, to implement faster state machines; see my recent post "explicitly unimplemented computed gotos" :-) That was my use case, besides the desire to write interpreters in D (a similar problem).


> Since the probability of failure is about .75, 0.5, {.5 or .75}, the 
> overall comparison cost is about 1 + .25 + .125 = 1.375, times 
> compare+loop times n, plus all the function call overhead, etc.

All 3 codons start with T, so if you test that first letter before the three string compares you speed up the original code a lot. But I was looking for a more general solution. Implementing a state machine is good, but in this case I'd like the compiler to do more of the optimization by itself. A simple but good-enough compiler optimization is to replace comparisons against strings known at compile time to be short with inlined comparisons of their individual chars.


> Using a DFA should reduce the cost to 1 * index, times n, assuming you 
> are using a small alphabet (I'm guessing that your codons are 
> ascii+uppercase only, table-size = 26).

In this case the alphabet was just 4 symbols: {'A', 'C', 'G', 'T'} (with no ambiguity codes or modified nucleotides).

Bye,
bearophile


More information about the Digitalmars-d mailing list