howto count lines - fast

H. S. Teoh via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed May 31 22:50:19 PDT 2017


On Wed, May 31, 2017 at 10:05:50PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Thursday, June 01, 2017 04:52:40 Patrick Schluter via Digitalmars-d-learn 
[...]
> > See my link above to realworldtech. Using SIMD can give good
> > results in micro-benchmarks but completely screw up performance
> > of other things in practice (the alignment requirements are heavy
> > and result in code bloat, cache misses, TLB misses, cost of
> > context switches, AVX warm up time (Agner Fog observed around
> > 10000 cycles before AVX switches from 128 bits to 256 bits
> > operations), reduced turboing, etc.).
> 
> Whenever you attempt more complicated optimizations, it becomes harder
> to get it right, and you always have the problem of figuring out
> whether you really did make it better in general. It's the sort of
> thing that's easier when you have a specific use case and it's very
> difficult to get right when dealing with a general solution for a
> standard library. So, it doesn't surprise me at all if a particular
> optimization turns out to be a bad idea for Phobos even if it's great
> for some use cases.
[...]

It makes me want to just say: write a naïve loop expressing exactly what
you intend to achieve, and let the compiler's optimizer figure out how
best to optimize it for your target architecture.
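
Something along these lines is what I have in mind by a naïve loop (just
a sketch, and the names are mine, not Phobos'): count occurrences of a
byte with a plain foreach and leave any vectorization to the compiler.

size_t countByte(const(ubyte)[] haystack, ubyte needle)
{
    size_t n = 0;
    foreach (b; haystack)
        if (b == needle)
            ++n;
    return n;
}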

Unfortunately, just earlier today while testing an incomplete version of
count() that uses the ulong iteration optimization, I discovered to my
horror that ldc (at -O3) apparently recognizes that exact hack and turns
the loop into a massive bowl of SSE/MMX/AVX/etc soup that's many times
the size of the "unoptimized" loop.  After reading the thread Patrick
pointed out from realworldtech, I'm starting to wonder if the result is
actually faster in practice, or if it only looks good in benchmarks,
because that code bloat is going to add instruction cache pressure and
probably TLB misses, etc.  If your program mostly calls count() on
smallish arrays (which I'd say is rather likely in the cases that matter,
because I can't imagine someone wanting to count bytes in 1MB arrays
inside an inner loop -- in inner loops you'd tend to be working with
smaller chunks of data, so you'd want count() to be fast for small- to
medium-sized arrays), then a small, tight "unoptimized" loop is going
to perform much better: it's easily inlineable, it won't add 1KB to
your function body size, and so your function body is less likely to
overflow the instruction cache. Plus, reducing the number of
complicated branches and other paraphernalia makes the CPU's branch
predictor more likely to get it right, so you're less likely to cause
pipeline stalls.
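
For reference, the kind of "ulong iteration optimization" I mean looks
roughly like this -- a sketch only, not the actual code I was testing;
the names are made up and it assumes unaligned ulong loads are fine on
the target.  It XORs each 8-byte chunk with the needle repeated in every
byte lane, counts the resulting zero bytes with a carry-free SWAR step,
and finishes the tail with a plain loop:

import core.bitop : popcnt;

size_t countByteWordwise(const(ubyte)[] haystack, ubyte needle)
{
    enum ulong ones = 0x0101010101010101UL;
    enum ulong low7 = 0x7f7f7f7f7f7f7f7fUL;
    immutable ulong pattern = ones * needle; // needle in every byte lane

    size_t n = 0;
    size_t i = 0;

    for (; i + 8 <= haystack.length; i += 8)
    {
        // Unaligned 8-byte load; bytes equal to needle become zero
        // after the XOR.
        ulong x = *cast(const(ulong)*)(haystack.ptr + i);
        x ^= pattern;

        // Set the high bit of each zero byte; no carries cross byte
        // lanes, so popcnt yields exactly one bit per matching byte.
        ulong y = (x & low7) + low7;
        y = ~(y | x | low7);
        n += popcnt(y);
    }

    foreach (b; haystack[i .. $]) // leftover 0..7 bytes
        if (b == needle)
            ++n;

    return n;
}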

Perhaps the right approach is to check whether the array length is below
some threshold, use a naïve loop in that case, and only switch to the
complicated, hackish stuff when you're sure it will actually help,
rather than hurt, performance.
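
In code, the thresholding idea might look something like this (the
cutoff here is a placeholder, not a measured value, and it reuses the
two sketches above):

size_t countByteAdaptive(const(ubyte)[] haystack, ubyte needle)
{
    enum size_t threshold = 64; // made-up cutoff; needs real benchmarking
    return haystack.length < threshold
        ? countByte(haystack, needle)          // naïve loop above
        : countByteWordwise(haystack, needle); // ulong-iteration sketch above
}

That keeps the common small-array case small and inlineable, and only
pays for the bigger code when the array is long enough to amortize it.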


T

-- 
Not all rumours are as misleading as this one.

