TypeFunction example: ImplictConvTargets

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Oct 7 04:17:59 UTC 2020


On Tue, Oct 06, 2020 at 08:47:33PM -0700, Walter Bright via Digitalmars-d wrote:
> On 10/6/2020 11:09 AM, Adam D. Ruppe wrote:
> > The Phobos implementation started life with a very simple
> > implementation too. It became what it is because it *had to*,
> > specifically for performance reasons.
> 
> Professional C Standard library implementations tend to be hideous
> code to perform objectively simple operations. The reason is speed is
> so desirable in foundational code that it drives out all other
> considerations. (memcpy() is a standout example.)

A tangential aside: one time, as a little coffee-break challenge, I
decided to see how D compares to GNU wc in terms of the speed of
counting lines in a text file.  The core of wc, of course, is in memchr
-- because it's scanning for newline characters in a buffer.  In the
course of ferreting out what made wc so fast, I studied how GNU libc
implemented memchr.  Basically, in order to speed up scanning large
buffers, it uses a series of fancy bit operations on 64-bit words in
order to scan 8 bytes at a time -- the goal presumably being to achieve
close to an 8x speedup, and also to reduce the number of branches per
iteration to capitalize on the CPU's pipeline.
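
To illustrate the technique, here's a rough sketch in D of the general
idea -- not glibc's actual code; countNewlines is just an illustrative
name, and the careful alignment handling at the ends of the buffer is
omitted:

    import std.stdio;

    // Word-at-a-time scan: detect whether a 64-bit word contains a
    // given byte using the classic "has-zero-byte" bit trick, and only
    // drop to a per-byte scan of that word when it does.
    size_t countNewlines(const(ubyte)[] buf)
    {
        enum ulong lows  = 0x0101010101010101UL; // 0x01 in every byte
        enum ulong highs = 0x8080808080808080UL; // 0x80 in every byte
        enum ulong pattern = lows * '\n';        // '\n' in every byte

        size_t count = 0;
        size_t i = 0;

        // Main loop: 8 bytes per iteration (unaligned loads assumed to
        // be OK, which holds on x86 but not on every architecture).
        for (; i + 8 <= buf.length; i += 8)
        {
            ulong w = *cast(const(ulong)*)(buf.ptr + i);
            ulong x = w ^ pattern;  // bytes equal to '\n' become zero
            if (((x - lows) & ~x & highs) != 0)
            {
                // At least one match in this word; count byte by byte.
                foreach (b; buf[i .. i + 8])
                    if (b == '\n') ++count;
            }
        }

        // Leftover tail of fewer than 8 bytes.
        foreach (b; buf[i .. $])
            if (b == '\n') ++count;

        return count;
    }

    void main()
    {
        auto text = cast(const(ubyte)[]) "one\ntwo\nthree\n";
        writeln(countNewlines(text));  // 3
    }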

In order to do this, however, some not-so-cheap setup was necessary at
the beginning and end of the buffer, which generally are not 8-byte
aligned.  When a particular call to memchr scanned many bytes, of
course, the speed of scanning 8 bytes at a time outweighed this setup /
teardown cost, and wc generally outperformed my D code.

However, when the lines being scanned were on the short side, the
overhead of setting up the 8-byte-at-a-time scanning added up
significantly -- the shorter lines meant less time was spent scanning 8
bytes at a time, and more time spent in the complicated code dealing
with non-aligned start/end of the buffer. In this case, I discovered
that a simplistic for-loop in D outperformed wc.
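
The loop in question was roughly of this shape (a simplified sketch,
not the exact code I benchmarked):

    // Naive byte-at-a-time newline counter: branchy in the loop body,
    // but with essentially no per-call setup cost.
    size_t countLinesSimple(const(ubyte)[] buf)
    {
        size_t lines = 0;
        foreach (b; buf)
            if (b == '\n')
                ++lines;
        return lines;
    }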

This led me to think, realistically speaking, how long do lines in a
text file tend to be?  My wild guess is that they tend to be on the
shortish side -- maybe about 60-70 characters on average? For code, a
lot less, since code generally has more whitespace for readability
reasons.  Files that have very long lines tend to be things like XML or
compressed JavaScript, which you generally don't run wc on anyway, so in
my mind they seem to be more the exception than the norm when it comes
to the applicability of wc.  Furthermore, I venture to propose that your
average string length in C is probably closer to the 80-120 character
ballpark than to large buffers of 1K or 8K, which is where the
8-byte-at-a-time scanning would perform much better.  Sure,
large text buffers do get handled in C code routinely; but how many of
them realistically would you find being handed to strchr to scan for
some given byte?  The kind of C strings you'd want to search for a
particular byte in, IME, are usually the shorter kind.

What this means is that, yes, memchr may be "optimized" to next week and
back -- but that optimization came with some implicit assumptions about
how long it takes to find a character in a string buffer. These
assumptions unfortunately pessimized the common case with the overhead
of a fancy 8-byte-at-a-time algorithm: one that does better in the
*less* common case of looking for a particular byte in a very large
buffer.

My point behind all this is that what's considered optimal sometimes
changes depending on what kind of use case you anticipate; and with
different usage patterns, one man's optimal algorithm may be another
man's suboptimal algorithm, and one man's slow, humble for-loop may be
another man's speed demon.  Optimizing for the general case in a
standard library is generally a very tough problem, because how do you
make any assumptions about the general case?  Whichever way you choose
to optimize a primitive like memchr, you're imposing additional
assumptions that may make it worse for some people, even if you think
that you're improving it for your chosen target metric.



> I remember back in the 80's when Borland came out with Turbo C. The
> compiler didn't generate very good code, but applications built with
> it were reasonably fast. How was this done?
> 
> Borland carefully implemented the C library in hand-optimized
> assembler by some very good programmers. Even printf was coded this
> way. The speedups gained there sped up every Turbo C program. At the
> time, nobody else had done that.
> 
> Much as I hate to admit it, Borland made the smart move there.

So what does this imply in terms of Phobos code in D? ;-)  Should we
uglify Phobos for the sake of performance?  Keeping in mind, of course,
what I said about the difficulty of optimizing for the general case
without pessimizing some legitimate use cases -- or maybe even the
*common* case, if we lose sight of the forest of common use cases while
trying to optimize our chosen benchmark trees.


T

-- 
Life begins when you can spend your spare time programming instead of watching television. -- Cal Keegan

