howto count lines - fast

Jonathan M Davis via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed May 31 21:39:17 PDT 2017


On Wednesday, May 31, 2017 16:03:54 H. S. Teoh via Digitalmars-d-learn wrote:
> On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via
> Digitalmars-d-learn wrote:
> > On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn
> > wrote:
> > > I did some digging around, and it seems that wc is using glibc's
> > > memchr, which is highly-optimized, whereas std.algorithm.count just
> > > uses a simplistic loop. Which is strange, because I'm pretty sure
> > > somebody optimized std.algorithm some time ago to use memchr()
> > > instead of a loop when searching for a byte value in an array.
> > > Whatever happened to that??
> >
> > I don't know, but memchr wouldn't work with CTFE, so someone might
> > have removed it to make it work in CTFE (though that could be done
> > with a different branch for CTFE). Or maybe it never made it into
> > std.algorithm for one reason or another.
>
> [...]
>
> I checked the Phobos code again, and it appears that my memory deceived
> me. Somebody *did* add memchr optimization to find() and its friends,
> but not to count().
>
> CTFE compatibility is not a problem, since we can just if(__ctfe) the
> optimized block away.
>
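> For illustration, a minimal sketch of that guard (not the actual
> Phobos code; the fallback is just the plain loop):
>
>     size_t countByte(const(ubyte)[] haystack, ubyte needle)
>     {
>         if (!__ctfe)
>         {
>             // A runtime-only optimized path (e.g. memchr) would
>             // go here; CTFE falls through to the loop below.
>         }
>         // CTFE-compatible fallback: a plain byte loop.
>         size_t n;
>         foreach (b; haystack)
>             n += (b == needle);
>         return n;
>     }
>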
> I'm currently experimenting with a memchr-optimized version of count(),
> but I'm getting mixed results: on small arrays or large arrays densely
> packed with matching elements, the memchr version runs rather slowly,
> because it involves a function call into the C library per matching
> element.  On large arrays only sparsely populated with matching
> elements, though, the memchr-optimized version beats the current code by
> about an order of magnitude.
>
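> Roughly what that looks like (a sketch; the helper name is mine, not
> Phobos's):
>
>     import core.stdc.string : memchr;
>
>     size_t memchrCount(const(ubyte)[] haystack, ubyte needle)
>     {
>         size_t n;
>         const(ubyte)* p = haystack.ptr;
>         const(ubyte)* end = p + haystack.length;
>         while (p < end)
>         {
>             // One C library call per match: cheap when matches are
>             // sparse, expensive when they are dense.
>             p = cast(const(ubyte)*) memchr(p, needle, end - p);
>             if (p is null) break;
>             ++n;
>             ++p; // resume just past this match
>         }
>         return n;
>     }
>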
> Since it wouldn't be a wise idea to assume sparsity of matches in
> Phobos, I decided to do a little more digging, and looked up the glibc
> implementation of memchr. The main optimization is that it iterates over
> the array not by byte, as a naïve loop would do, but by ulongs. (Of
> course, the first n bytes and last n bytes that are not ulong-aligned
> are checked with a per-byte loop; so for very short arrays it doesn't
> lose out to the naïve loop.)  In each iteration over ulong, it performs
> the bit-twiddling hack alluded to by Nitram to detect the presence of
> matching bytes, upon which it breaks out to the closing per-byte loop to
> find the first match. For short arrays, or arrays where a match is
> quickly found, it's comparable in performance to the naïve loop; for
> large arrays where the match is not found until later, it easily
> outperforms the naïve loop.
>
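> The presence test is small; something like this sketch of the classic
> "haszero" hack (not glibc's exact code):
>
>     bool wordHasMatch(ulong word, ubyte needle)
>     {
>         enum ulong lows  = 0x0101010101010101;
>         enum ulong highs = 0x8080808080808080;
>         // XOR against the byte-broadcast needle: matching bytes
>         // become zero, and the hack detects any zero byte.
>         immutable x = word ^ (lows * needle);
>         return ((x - lows) & ~x & highs) != 0;
>     }
>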
> My current thought is to adopt the same approach: iterate over size_t or
> some such larger unit, and adapt the bit-twiddling hack to be able to
> count the number of matches in each size_t.  This is turning out to be
> trickier than I'd like, though, because there is a case where carry
> propagation makes it unclear how to derive the number of matches without
> iterating over the bytes a second time.
>
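> One carry-free variant (my sketch for 64-bit words, not necessarily
> what should go into Phobos): turn each matching byte into exactly one
> marker bit, with per-byte sums that cannot carry across byte
> boundaries, then popcount the markers:
>
>     import core.bitop : popcnt;
>
>     int wordMatchCount(ulong word, ubyte needle)
>     {
>         enum ulong lo7 = 0x7f7f7f7f7f7f7f7f;
>         immutable x = word ^ (0x0101010101010101 * needle);
>         // High bit of each byte of t is set unless that byte of x
>         // is zero; the adds stay inside each byte, so no carries.
>         immutable t = ((x & lo7) + lo7) | x;
>         return popcnt(~(t | lo7)); // one 0x80 per matching byte
>     }
>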
> But this may not be a big problem: since size_t.sizeof is relatively
> small, I can probably loop over individual bytes when one or more
> matches are detected, and a sufficiently capable optimizer like ldc or
> gdc should be able to unroll this into a series of sete + add
> instructions, with no branches that might stall the CPU pipeline. For
> densely-matching arrays, this should still have comparable performance
> to the naïve loops; for sparsely-matching arrays this should show
> significant speedups.
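>
> The per-byte cleanup could be as simple as this sketch, which a good
> optimizer should be able to turn into branchless sete + add:
>
>     size_t countInWord(size_t word, ubyte needle)
>     {
>         size_t n;
>         foreach (i; 0 .. size_t.sizeof)
>             n += (cast(ubyte)(word >> (8 * i)) == needle);
>         return n;
>     }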

If you're really trying to make it fast, there may be something that you can
do with SIMD. IIRC, Brian Schott did that with his lexer (or maybe he was
just talking about it - I don't remember for sure).

- Jonathan M Davis
