howto count lines - fast

Patrick Schluter via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed May 31 21:37:59 PDT 2017


On Wednesday, 31 May 2017 at 23:03:54 UTC, H. S. Teoh wrote:
> On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via 
> Digitalmars-d-learn wrote:
>> On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via 
>> Digitalmars-d-learn wrote:
>> > I did some digging around, and it seems that wc is using 
>> > glibc's memchr, which is highly-optimized, whereas 
>> > std.algorithm.count just uses a simplistic loop. Which is 
>> > strange, because I'm pretty sure somebody optimized 
>> > std.algorithm some time ago to use memchr() instead of a 
>> > loop when searching for a byte value in an array. Whatever 
>> > happened to that??
>> 
>> I don't know, but memchr wouldn't work with CTFE, so someone 
>> might have removed it to make it work in CTFE (though that 
>> could be done with a different branch for CTFE). Or maybe it 
>> never made it into std.algorithm for one reason or another.
> [...]
>
> I checked the Phobos code again, and it appears that my memory
> deceived me. Somebody *did* add memchr optimization to find()
> and its friends, but not to count().
>
> CTFE compatibility is not a problem, since we can just
> if(__ctfe) the optimized block away.
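
As a minimal sketch of that if(__ctfe) branching (illustrative
code, not the actual Phobos implementation), a byte counter can
fall back to a plain loop at compile time and call memchr at run
time:

    size_t countByte(const(ubyte)[] haystack, ubyte needle)
    {
        if (__ctfe)
        {
            // CTFE cannot call into the C library; use a loop.
            size_t n;
            foreach (b; haystack)
                if (b == needle) ++n;
            return n;
        }
        import core.stdc.string : memchr;
        size_t n;
        auto p = haystack.ptr;
        auto len = haystack.length;
        while (len)
        {
            auto hit = cast(const(ubyte)*) memchr(p, needle, len);
            if (hit is null) break;
            ++n;
            len -= (hit - p) + 1; // one C call per match: this is
            p = hit + 1;          // what hurts on dense matches
        }
        return n;
    }

Note the one-call-per-match pattern; it is exactly what makes
this approach slow on densely matching input, as the next
paragraph's measurements suggest.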
>
> I'm currently experimenting with a memchr-optimized version of 
> count(), but I'm getting mixed results: on small arrays or 
> large arrays densely packed with matching elements, the memchr 
> version runs rather slowly, because it involves a function call 
> into the C library per matching element.  On large arrays only 
> sparsely populated with matching elements, though, the 
> memchr-optimized version beats the current code by about an 
> order of magnitude.
>
> Since it wouldn't be a wise idea to assume sparsity of matches 
> in Phobos, I decided to do a little more digging, and looked up 
> the glibc implementation of memchr. The main optimization is 
> that it iterates over the array not by byte, as a naïve loop 
> would do, but by ulong's.

That's what I suggested above. It's the first optimisation to do 
when looping over a buffer (memcpy, memset, memchr etc.).


> (Of course, the first n bytes and
> last n bytes that are not ulong-aligned are checked with a 
> per-byte loop; so for very short arrays it doesn't lose out to 
> the naïve loop.)  In each iteration over ulong, it performs the 
> bit-twiddling hack alluded to by Nitram to detect the presence 
> of matching bytes, upon which it breaks out to the closing 
> per-byte loop to find the first match. For short arrays, or 
> arrays where a match is quickly found, it's comparable in 
> performance to the naïve loop; for large arrays where the match 
> is not found until later, it easily outperforms the naïve loop.
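
For reference, that bit-twiddling presence test is the classic
zero-byte trick (see Sean Anderson's "Bit Twiddling Hacks"). A
sketch in D, with my own names:

    // True iff some byte of v equals the needle byte.
    bool hasMatchingByte(ulong v, ubyte needle)
    {
        enum ulong lows  = 0x0101010101010101;
        enum ulong highs = 0x8080808080808080;
        // Replicate the needle into all 8 bytes; matching
        // bytes of x become zero.
        immutable ulong x = v ^ (lows * needle);
        // Classic test: nonzero iff x contains a zero byte.
        return ((x - lows) & ~x & highs) != 0;
    }

When the test fires, the closing per-byte loop locates the
actual match, as described above.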

It is also important not to overdo the optimisations, as the
overhead they generate can manifest as pessimisations that are
not visible in a specific benchmark. Code size explosion may
induce I-cache misses, and it can also cost I-TLB misses. Worse,
using SSE or AVX can hurt thread switch times or even reduce
the turbo clocking of the CPU.
It's currently a hot topic on realworldtech[1]: Linus Torvalds
rants about this issue with memcpy(), which is over-engineered
and does more harm than good in practice, but has nice benchmark
results.

>
> My current thought is to adopt the same approach: iterate over 
> size_t or some such larger unit, and adapt the bit-twiddling 
> hack to be able to count the number of matches in each size_t.  
> This is turning out to be trickier than I'd like, though, 
> because there is a case where carry propagation makes it 
> unclear how to derive the number of matches without iterating 
> over the bytes a second time.
>
> But this may not be a big problem, since size_t.sizeof is 
> relatively small, so I can probably loop over individual bytes 
> when one or more matches are detected, and a
> sufficiently capable optimizer like ldc or gdc would be able to
> unroll this into a series of sete + add instructions, no 
> branches that might stall the CPU pipeline. For 
> densely-matching arrays, this should still have comparable 
> performance to the naïve loops; for sparsely-matching arrays 
> this should show significant speedups.
>
That's what I think too: a small and simple loop to count the
matching bytes in the ulong would be somewhat faster than the
bit-twiddling trick, which requires a population count of the
bits.
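
A side-by-side sketch of the two counting strategies (my own
illustration, not the code under discussion). The carry
propagation mentioned above can be avoided with the carry-free
form of the zero-byte mask, since adding 0x7F to the low seven
bits of a byte can never carry into the next byte:

    import core.bitop : popcnt;

    enum ulong lows  = 0x0101010101010101;
    enum ulong low7s = 0x7F7F7F7F7F7F7F7F;

    // Variant 1: carry-free zero-byte mask + population count.
    // mask has 0x80 in exactly the bytes of v that match the
    // needle, so popcnt(mask) is the exact match count.
    uint countInWord(ulong v, ubyte needle)
    {
        immutable ulong x = v ^ (lows * needle);
        immutable ulong mask = ~(((x & low7s) + low7s) | x | low7s);
        return popcnt(mask);
    }

    // Variant 2: plain per-byte loop over the word; ldc/gdc can
    // lower the comparison to branchless sete + add sequences.
    uint countInWordLoop(ulong v, ubyte needle)
    {
        uint n;
        foreach (i; 0 .. 8)
            n += ((v >>> (8 * i)) & 0xFF) == needle;
        return n;
    }

Which variant wins presumably depends on whether popcnt is a
single instruction on the target and on how well the compiler
unrolls the loop.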

[1]: 
http://www.realworldtech.com/forum/?threadid=168200&curpostid=168700


