Optimization fun

Ary Borenszweig via Digitalmars-d digitalmars-d at puremagic.com
Fri Nov 7 11:06:44 PST 2014


On 11/6/14, 7:58 PM, H. S. Teoh via Digitalmars-d wrote:
> So today, I was playing around with profiling and optimizing my sliding
> block puzzle solver, and found some interesting things:
>
> 1) The GC could use some serious improvement: it just so happens that
> the solver's algorithm only ever needs to allocate memory, never release
> it (it keeps a hash of visited states that only grows, never shrinks).
> The profiler indicated that one of the hotspots is the GC. So I added an
> option to the program to call GC.disable() if a command line option is
> given. The result is a 40%-50% reduction in running time.
>
> 2) Auto-decoding is evil: the solver currently uses a naïve char[]
> representation for the puzzle board, and uses countUntil() extensively
> to locate things on the board. The original code had:
>
> 	auto offset = currentBoard.countUntil(ch);
>
> which, of course, triggers autodecoding (even though there is actually
> no reason to do so: ch is a char, and therefore countUntil could've used
> strchr instead). Simply changing that to:
>
> 	auto offset = currentBoard.representation.countUntil(ch);
>
> gave an additional 20%-30% performance boost.
>
>
> 3) However, countUntil remains at the top of the profiler's list of
> hotspots. DMD's optimizer was rather disappointing here, so I looked at
> gdc's output instead. Digging into the disassembly, I saw that it was
> using a foreach loop over the char[], but for some reason even gdc -O3
> failed to simplify that loop significantly (I'm not sure why -- maybe
> what threw off the optimizer is the array indexing + length check
> operation, where in C one would normally jump bump the pointer
> instead?). Eventually, I tried writing a manual loop:
>
>              auto ptr = current.ptr;
>              auto c = current.length;
>              while (c > 0 && *ptr != m.id)
>                  ptr++;
>              cert.headOffset = ptr - current.ptr;
>
> This pushed gdc far enough in the right direction that it finally
> produced a 3-instruction inner loop. Strangely enough, the assembly had
> no check for (c > 0)... could it be because there is an assert following
> the loop claiming that that will never happen, therefore gdc elided the
> check? I thought Walter hasn't gotten around to implementing
> optimizations based on assert yet??  Anyway, with this optimization, I
> managed to shave off another 3%-5% of the total running time.
>
> On that note, at one point I was wondering why gdc -O3 didn't generate a
> "rep scasb" for the inner loop instead of manually incrementing the loop
> pointer; but a little research revealed that I'm about 6-7 years out of
> date: since about 2008 gcc's optimizer has not used "rep scasb" because
> on modern hardware it has atrocious performance -- much worse than a
> manually-written C loop! So much for "built-in" string processing that
> used to be touted in the old days. Yet another proof of the rule that
> overly-complex CPU instruction sets rarely actually perform better.
>
>
> Anyway, after everything is said and done, a puzzle that used to take
> about 20 seconds to solve now only takes about 6 seconds. W00t!
>
>
> T
>

Hi,

Is the code public? I'd like to port it to other languages and see how 
they behave, see if this is a general problem or just something specific 
to D.

Thanks!


More information about the Digitalmars-d mailing list