Andrei Alexandrescu needs to read this
Jon Degenhardt
jond at noreply.com
Thu Oct 24 19:25:29 UTC 2019
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:
> I discovered something very interesting: GNU wc was generally
> on par with, or outperformed the D versions of the code for
> files that contained long lines, but performed more poorly when
> given files that contained short lines.
>
> Glancing at the glibc source code revealed why: glibc's memchr
> used an elaborate bit-hack-based algorithm that scanned the
> target string 8 bytes at a time. This required the data to be
> aligned, however, so when the string was not aligned, it had to
> manually process up to 7 bytes at either end of the string with
> a different algorithm. So when the lines were long, the
> overall performance was dominated by the 8-byte-at-a-time
> scanning code, which was very fast for large buffers. However,
> when given a large number of short strings, the overhead of
> setting up for the 8-byte scan became more costly than the
> savings, so it performed more poorly than a naïve byte-by-byte
> scan.
Interesting observation. On the surface it seems this might also
apply to splitter and find when used on narrow strings, since I
believe these call memchr in that case. A common paradigm is to
read lines, then call splitter to identify individual fields
(rough sketches of both below). Fields are often short, even when
lines are long.
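If I've read the description right, the word-at-a-time scan is
something like the following. This is only a rough sketch in D of
the bit hack, not glibc's actual memchr; the name wordMemchr is
just for illustration:

import std.stdio;

// Rough sketch of a word-at-a-time byte search: align first, then
// scan 8 bytes per step using the "does this word contain a zero
// byte" bit trick, then finish byte by byte.
const(ubyte)* wordMemchr(const(ubyte)* p, ubyte c, size_t n)
{
    const(ubyte)* end = p + n;

    // Byte by byte until p is 8-byte aligned (the setup cost that
    // hurts on very short inputs).
    while (p < end && (cast(size_t) p & 7) != 0)
    {
        if (*p == c) return p;
        ++p;
    }

    enum ulong ones  = 0x0101010101010101UL;
    enum ulong highs = 0x8080808080808080UL;
    immutable ulong pattern = ones * c;  // c replicated into every byte

    // 8 bytes per iteration: XOR turns matching bytes into zero
    // bytes, and (x - ones) & ~x & highs detects any zero byte.
    while (end - p >= 8)
    {
        ulong x = *cast(const(ulong)*) p ^ pattern;
        if (((x - ones) & ~x & highs) != 0)
            break;                       // match somewhere in this word
        p += 8;
    }

    // Tail scan (and exact position of a match found above).
    while (p < end)
    {
        if (*p == c) return p;
        ++p;
    }
    return null;
}

void main()
{
    auto s = cast(const(ubyte)[]) "hello, world";
    auto hit = wordMemchr(s.ptr, cast(ubyte) ',', s.length);
    writeln(hit is null ? -1 : hit - s.ptr);  // prints 5
}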
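And the read-lines-then-split-fields pattern I mean is the usual
one, e.g. (tab-separated fields assumed just for the example):

import std.algorithm : splitter;
import std.stdio;

void main(string[] args)
{
    // Read a file (or stdin) line by line, then split each line
    // into fields. Each splitter step searches for the next
    // delimiter within what is usually a short field.
    auto input = args.length > 1 ? File(args[1]) : stdin;
    foreach (line; input.byLine)
    {
        size_t nFields = 0;
        foreach (field; line.splitter('\t'))
            ++nFields;
        writeln(nFields);
    }
}

In that loop each delimiter search covers only a short span, so
per-call setup overhead matters more than raw scan throughput.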
--Jon