Andrei Alexandrescu needs to read this

Thu Oct 24 00:53:27 UTC 2019

On Wed, Oct 23, 2019 at 11:20:07PM +0000, welkam via Digitalmars-d wrote:
[...]
> When performance change is in order of magnitude then you can safely
> assume it was because of code change you made but when the difference
> is less than 10% it becomes unclear what actually is responsible for
> that difference. If you had read the paper you would find out that
> gcc's -O3 changes performance over -O2 from -8% to +12% on the same
> application.
> 
> Simple measurement is not sufficient to conclude that your change in
> algorithm is what is responsible for measured performance increase.

Yeah, I tend to get skeptical when people start micro-optimizing for
small performance increases. When it's a 30%-40% or higher increase,
then I'd say it's reasonably safe to conclude that the algorithm change
was responsible. But if it's 2% or 5% then it's harder to be confident
you aren't being misled by other factors.  Also, I tend to ignore
differences of less than 1s or 0.5s in benchmarks, because it's hard to
tell whether the 0.002s increase is caused by the code, or by other
factors. When people start optimizing over sub-second performance
increases I start getting suspicious.

As a general principle I'd say that if a set of benchmarks are being
compared, they need to be run in the *exact* same environment and then
compared. If a set of measurements were made 5 days ago and we compare
them with measurements made today, there's no accounting for what subtle
system differences may have crept in in the meantime, that may throw off
the results.

But this paper reveals even more interesting points, though, about
performance differences across systems or CPU models. For this, I'd
propose that any benchmarks we're basing algorithm decisions on ought to
be verified on at least two (preferably more) divergent systems. E.g.,
if running benchmark B on a Debian Linux server leads to the conclusion
that algorithm A1 is better than A2, then I'd say we should check
whether running B on a Windows machine leads to the same conclusion. Or
if the code change is Linux-specific, I'd say test it also on a
Slackware desktop system to see if we get different conclusions from a
Debian server system. The more variety incorporated into the sample set
the better.

However, I felt the point the paper makes about address randomization
should actually be a *beneficial* point: rather than turn it off, run
the exact same benchmark, say, 500 times with ASLR *enabled*, which
should balance out any biases by incorporating into the results both
beneficial and detrimental address alignments that may have resulted
from ASLR.  If you make conclusions based on benchmarks taken with ASLR
disabled, then you run the risk that the algorithm performs better
without ASLR but worse in typical user environments which *would* have
ASLR enabled.

Another problem with benchmark-based optimizations, that the paper
didn't mention, is that you run into the risk of optimizing for the
benchmark at the expense of actual, real-world software. Typical
software, for example, doesn't run memcpy 50 million times in a tight
loop; typically memcpy calls are sprinkled among a whole bunch of other
stuff. If you micro-optimize memcpy to beat the benchmark, you could be
misled by CPU cache / branch predictor effects, which would *not*
trigger in actual application software because the usage pattern is
completely different. You could potentially be making memcpy run
*slower* in Excel even though it runs faster in a benchmark, for
example.

Anecdote.  Once I wrote several D analogues of the Unix wc utility and
ran a comparison with my Debian distro's stock wc utility (which is the
GNU version). It basically amounted to calling memchr on newline
characters, for GNU wc. In the D versions I used various different
algorithms, including using std.mmap, std.parallelism, reading the file
the traditional way by blocks, etc..  Then I ran the various versions of
wc on various sets of data with different characteristics.

I discovered something very interesting: GNU wc was generally on par
with, or outperformed the D versions of the code for files that
contained long lines, but performed more poorly when given files that
contained short lines.

Glancing at the glibc source code revealed why: glibc's memchr used an
elaborate bit hack based algorithm that scanned the target string 8
bytes at a time. This required the data to be aligned, however, so when
the string was not aligned, it had to manually process up to 7 bytes at
either end of the string with a different algorithm.  So when the lines
were long, the overall performance was dominated by the 8-byte at a time
scanning code, which was very fast for large buffers.  However, when
given a large number of short strings, the overhead of setting up for
the 8-byte scan became more costly than the savings, so it performed
more poorly than a naïve byte-by-byte scan.

I confirmed this conclusion by artificially constructing an input file
with extremely long lines, and another input file with extremely short
lines.  Then I compared GNU wc's performance with a naïve D function
that basically did a byte-by-byte scan.  As expected, wc lost to the
naïve scanner on the input file with extremely short lines, but won by a
large margin on the file with extremely long lines.  Subsequently I was
able to duplicate this result in D by writing the same 8-byte at a time
scanner in D.

Taking a step back, I realized that this was a classic case of
optimizing for the benchmark: it seemed likely that glibc's memchr was
optimized for scanning large buffers, given the kind of algorithm used
to implement it. Since benchmarks tend to be written for large test
cases, my suspicion was that this algorithm was favored because the
author(s) were testing the code on large buffers. But this optimization
came at the expense of performance in the small buffer case.  Which
means the actual benefit of this optimization depended on what your
application uses memchr for.  If you regularly scan large buffers, then
glibc's implementation will give you better performance.  However, if
your application deals with a lot of small strings, you might be better
off writing your own naïve byte-by-byte scanning code, because it will
actually outperform glibc.

My gut feeling is that a lot of software actually frequently deals with
short strings: think about your typical Facebook post or text message,
or database of customer names. Your customers aren't going to have names
that are several KB long, so if the software uses glibc's memchr on
names, then it's performing rather poorly. A live chat system sends
short messages at a time, and a lot of network protocols also center
around short(ish) messages.  Large data like video or music generally
don't use memchr() because they aren't textual. And even then, you tend
to process them in blocks typically 4KB each or so.  So it's
questionable whether glibc's choice of memchr implementation is really
optimal in the sense of benefitting the most common applications, rather
than just excelling at an artificial benchmark that doesn't accurately
represent typical real-world usage.

Another corollary from all this, is that sometimes, there is no unique
optimization that will work best for everybody.  There is no "optimal"
code that's detached from its surrounding context; you optimize for one
use case at the detriment of another.  And unless you have specific use
cases in mind, it's hard, or even impossible, to make the "right"
decisions -- and this is especially the case for standard libraries that
must be as generic as possible. The more generic you get, the harder it
is to choose the best algorithm. At a certain point it becomes outright
impossible because "best" becomes ill-defined (best for who?).

And this comes back to my point that I get suspicious when people start
trying to squeeze that last 5-10% performance out of their code. Beware,
because you could be optimizing for your benchmark rather than the
user's actual environment.

T

-- 
Старый друг лучше новых двух.