Humble benchmark (Fisher's exact test)

H. S. Teoh hsteoh at quickfur.ath.cx
Mon Aug 23 18:07:09 UTC 2021


On Mon, Aug 23, 2021 at 05:37:01PM +0000, russhy via Digitalmars-d wrote:
> On Monday, 23 August 2021 at 07:52:01 UTC, Imperatorn wrote:
[...]
> > Interesting. I know people say benchmarks aren't important, but I
> > disagree. I think it's healthy to compare from time to time 👍
> 
> I agree

I wouldn't say benchmarks aren't *important*; I think the issue is how
you interpret the results.  Sometimes people read too much into them,
i.e., over-generalize them far beyond the scope of the actual test,
forgetting that a benchmark is just that: a benchmark. It's a rough
estimate of performance in a contrived use case that may or may not
reflect real-world usage.

Many of the simpler benchmarks consist of repeating a task 100000...
times, which can be useful to measure, but hardly represents real-world
usage.  You can optimize the heck out of memcpy (substitute your
favorite function to measure here) until it beats everybody else in the
benchmark, but that does not necessarily mean it will actually make your
real-life program run faster. It may, it may not.  By the 100th
iteration, all the code/data you're using is sitting in cache, so it
runs much faster than the first few iterations did.  But in a real
program you usually don't call memcpy (etc.) 100000 times in a row;
usually you do something else in between, and that something else may
evict the memcpy-related data from cache, change the state of the
branch predictor, and so on.  So your ultra-optimized code may not
behave the same way in a real-life program as it does in the benchmark,
and may actually turn out to be slower than expected.
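
The kind of benchmark I have in mind looks more or less like this
(a minimal sketch, nobody's actual benchmark; the buffer size and
iteration count are arbitrary placeholders):

    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.stdio : writefln;

    void main()
    {
        enum N = 100_000;
        auto src = new ubyte[](4096);
        auto dst = new ubyte[](4096);

        auto sw = StopWatch(AutoStart.yes);
        foreach (i; 0 .. N)
            dst[] = src[];      // array copy, stand-in for memcpy
        sw.stop();
        writefln("%d copies in %s", N, sw.peek);
    }

After the first few iterations both buffers are sitting in L1 cache, so
the number you get says more about hot-cache copies than about what a
real program copying cold data would see.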

Note that this does not mean optimizing for a benchmark is completely
worthless; usually the first few iterations represent actual
bottlenecks in the code, and improving those will improve performance
in a real-life scenario too.  But only up to a certain point.  Push the
optimization beyond that, and you risk tuning for your specific
benchmark instead of the general use case, and may end up pessimizing
the code instead of optimizing it.

Real-life example: GNU wc, which counts the number of lines in a text
file.  Some time ago I compared it with several different D
implementations of the same program that I wrote.  Because GNU wc uses
glibc's memchr, which is ultra-optimized for scanning large buffers, wc
runs faster if your text file contains long lines; but if the text file
contains many short lines, a naïve File.byLine / foreach (ch; line) D
implementation will actually beat wc.
The reason is that glibc's memchr implementation is optimized for large
buffers, and has a rather expensive setup overhead before the main
fast-scanning loop.  For long lines, the savings from the fast scan
more than outweigh this overhead, but for short lines, the overhead
dominates the running time, so a naïve char-by-char implementation
comes out faster.
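
Not the code from that comparison, but the naïve version is essentially
this (byLine strips the newline, so the char count here excludes line
terminators):

    import std.stdio : File, writefln;

    void main(string[] args)
    {
        size_t lines, chars;
        foreach (line; File(args[1]).byLine)
        {
            ++lines;
            foreach (ch; line)  // plain char-by-char loop: no memchr, no setup cost
                ++chars;
        }
        writefln("%s lines, %s chars", lines, chars);
    }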

I don't know the history behind glibc's memchr implementation, but I'll
bet that at some point somebody came up with a benchmark that tests
scanning of large buffers and said, look, this complicated bit-hack
fast-scanning loop improves the benchmark by X%!  So the code change was
accepted, without anyone realizing that this fast scanning of large
buffers comes at the cost of pessimizing the small-buffer case.
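
To illustrate the kind of trick involved (this is NOT glibc's actual
code, just a rough sketch of the usual align-then-scan-a-word-at-a-time
pattern, here searching for '\n'; findNewline is a made-up name):

    // Returns the index of the first '\n' in buf, or buf.length if none.
    size_t findNewline(const(ubyte)[] buf)
    {
        enum ubyte NL = '\n';
        size_t i = 0;

        // Setup: advance byte by byte until the pointer is 8-byte aligned.
        while (i < buf.length && ((cast(size_t) &buf[i]) & 7) != 0)
        {
            if (buf[i] == NL) return i;
            ++i;
        }

        // Fast loop: examine 8 bytes at a time; XOR with the pattern turns
        // matching bytes into zero, and the bit trick detects any zero byte.
        enum ulong ones  = 0x0101010101010101UL;
        enum ulong highs = 0x8080808080808080UL;
        enum ulong pattern = ones * NL;
        while (i + 8 <= buf.length)
        {
            ulong word = *cast(const(ulong)*) &buf[i] ^ pattern;
            if (((word - ones) & ~word & highs) != 0)
                break;  // one of these 8 bytes is a newline
            i += 8;
        }

        // Tail: finish byte by byte (also re-scans the word that matched).
        while (i < buf.length && buf[i] != NL)
            ++i;
        return i;
    }

The alignment loop and the word-at-a-time machinery are pure overhead
when the newline you're looking for is only a couple of bytes away,
which is exactly the many-short-lines case.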

Past a certain point, optimization becomes a trade-off, and which route
you take should depend on your specific application, not on some
general benchmark that does not represent your real-world usage
patterns.


T

-- 
Famous last words: I *think* this will work...

