Optimization fun

Thu Nov 6 14:58:16 PST 2014

So today, I was playing around with profiling and optimizing my sliding
block puzzle solver, and found some interesting things:

1) The GC could use some serious improvement: it just so happens that
the solver's algorithm only ever needs to allocate memory, never release
it (it keeps a hash of visited states that only grows, never shrinks).
The profiler indicated that one of the hotspots is the GC. So I added an
option to the program to call GC.disable() if a command line option is
given. The result is a 40%-50% reduction in running time.

2) Auto-decoding is evil: the solver currently uses a naïve char[]
representation for the puzzle board, and uses countUntil() extensively
to locate things on the board. The original code had:

	auto offset = currentBoard.countUntil(ch);

which, of course, triggers autodecoding (even though there is actually
no reason to do so: ch is a char, and therefore countUntil could've used
strchr instead). Simply changing that to:

	auto offset = currentBoard.representation.countUntil(ch);

gave an additional 20%-30% performance boost.

3) However, countUntil remains at the top of the profiler's list of
hotspots. DMD's optimizer was rather disappointing here, so I looked at
gdc's output instead. Digging into the disassembly, I saw that it was
using a foreach loop over the char[], but for some reason even gdc -O3
failed to simplify that loop significantly (I'm not sure why -- maybe
what threw off the optimizer is the array indexing + length check
operation, where in C one would normally jump bump the pointer
instead?). Eventually, I tried writing a manual loop:

            auto ptr = current.ptr;
            auto c = current.length;
            while (c > 0 && *ptr != m.id)
                ptr++;
            cert.headOffset = ptr - current.ptr;

This pushed gdc far enough in the right direction that it finally
produced a 3-instruction inner loop. Strangely enough, the assembly had
no check for (c > 0)... could it be because there is an assert following
the loop claiming that that will never happen, therefore gdc elided the
check? I thought Walter hasn't gotten around to implementing
optimizations based on assert yet??  Anyway, with this optimization, I
managed to shave off another 3%-5% of the total running time.

On that note, at one point I was wondering why gdc -O3 didn't generate a
"rep scasb" for the inner loop instead of manually incrementing the loop
pointer; but a little research revealed that I'm about 6-7 years out of
date: since about 2008 gcc's optimizer has not used "rep scasb" because
on modern hardware it has atrocious performance -- much worse than a
manually-written C loop! So much for "built-in" string processing that
used to be touted in the old days. Yet another proof of the rule that
overly-complex CPU instruction sets rarely actually perform better.

Anyway, after everything is said and done, a puzzle that used to take
about 20 seconds to solve now only takes about 6 seconds. W00t!

T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert