howto count lines - fast

Wed May 31 16:03:54 PDT 2017

On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn 
> wrote:
> > I did some digging around, and it seems that wc is using glibc's
> > memchr, which is highly-optimized, whereas std.algorithm.count just
> > uses a simplistic loop. Which is strange, because I'm pretty sure
> > somebody optimized std.algorithm some time ago to use memchr()
> > instead of a loop when searching for a byte value in an array.
> > Whatever happened to that??
> 
> I don't know, but memchr wouldn't work with CTFE, so someone might
> have removed it to make it work in CTFE (though that could be done
> with a different branch for CTFE). Or maybe it never made it into
> std.algorithm for one reason or another.
[...]

I checked the Phobos code again, and it appears that my memory deceived
me. Somebody *did* add memchr optimization to find() and its friends,
but not to count().

CTFE compatibility is not a problem, since we can just if(__ctfe) the
optimized block away.

I'm currently experimenting with a memchr-optimized version of count(),
but I'm getting mixed results: on small arrays or large arrays densely
packed with matching elements, the memchr version runs rather slowly,
because it involves a function call into the C library per matching
element.  On large arrays only sparsely populated with matching
elements, though, the memchr-optimized version beats the current code by
about an order of magnitude.

Since it wouldn't be a wise idea to assume sparsity of matches in
Phobos, I decided to do a little more digging, and looked up the glibc
implementation of memchr. The main optimization is that it iterates over
the array not by byte, as a naïve loop would do, but by ulong's. (Of
course, the first n bytes and last n bytes that are not ulong-aligned
are checked with a per-byte loop; so for very short arrays it doesn't
lose out to the naïve loop.)  In each iteration over ulong, it performs
the bit-twiddling hack alluded to by Nitram to detect the presence of
matching bytes, upon which it breaks out to the closing per-byte loop to
find the first match. For short arrays, or arrays where a match is
quickly found, it's comparable in performance to the naïve loop; for
large arrays where the match is not found until later, it easily
outperforms the naïve loop.

My current thought is to adopt the same approach: iterate over size_t or
some such larger unit, and adapt the bit-twiddling hack to be able to
count the number of matches in each size_t.  This is turning out to be
trickier than I'd like, though, because there is a case where carry
propagation makes it unclear how to derive the number of matches without
iterating over the bytes a second time.

But this may not be a big problem, since size_t.sizeof is relatively
small, so I can probably loop over individual bytes when one or more
matches is detected, and a sufficiently-capable optimizer like ldc or
gdc would be able to unroll this into a series of sete + add
instructions, no branches that might stall the CPU pipeline. For
densely-matching arrays, this should still have comparable performance
to the naïve loops; for sparsely-matching arrays this should show
significant speedups.

T

-- 
My program has no bugs! Only unintentional features...