Fast linear search in D array (slice)

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Nov 7 21:34:09 UTC 2018


On Wed, Nov 07, 2018 at 09:06:14PM +0000, Per Nordlöw via Digitalmars-d wrote:
[...]
> - Is this the preferred way of implementing a linear search in D?
> - Andrei has mentioned that sentinel-based search can be faster if the
> haystack is allowed to be temporarily mutated at the end. But I
> presume such a solution is only faster above a certain length for the
> haystack. Has anybody benchmarked and found a suitable length limit
> for this?

I would think the suitable length limit would be highly CPU-dependent
and memory-configuration-dependent.  There is no single magic number
that's going to work everywhere.  We can sample a wide variety of
hardware configurations and arbitrarily choose a middle-ground figure, I
suppose, but it's not going to be optimal for everybody, and it will
probably quickly become outdated with each new generation of hardware.
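For concreteness, here is a minimal sketch of the sentinel idea in C
(the function name and the len-means-not-found convention are mine, not
anything from D or Phobos). The point of the trick is that planting the
needle in the last slot lets the inner loop drop its bounds check
entirely:

```c
#include <stddef.h>

/* Sentinel-based linear search: temporarily overwrite the last byte
 * with the needle so the inner loop needs no bounds check.  Requires a
 * mutable haystack.  Returns the index of the first match, or len if
 * the needle is absent.  Sketch only, not production code. */
size_t sentinel_find(unsigned char *haystack, size_t len, unsigned char needle)
{
    if (len == 0)
        return 0;
    unsigned char saved = haystack[len - 1];
    haystack[len - 1] = needle;        /* plant the sentinel */
    size_t i = 0;
    while (haystack[i] != needle)      /* no `i < len` test needed */
        i++;
    haystack[len - 1] = saved;         /* restore the original byte */
    if (i == len - 1 && saved != needle)
        return len;                    /* we hit the sentinel, not a match */
    return i;
}
```

The win, if any, is one branch saved per iteration; whether that pays
for the two extra writes depends on haystack length and the memory
system, which is exactly the threshold question above.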


> - As a special-case of the above there's also the possibility of
> mutating the last character to be a null and then reusing `strchr`.

That's only because strchr is in the C library, and the assumption is
that whoever implemented strchr for that platform has already done his
homework to find the best-performing implementation for that platform.
If you really wanted to, you could do the same in D and figure out the
"best" implementation that way.
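In the same spirit, a hedged C sketch of that null-sentinel trick
(names are mine; it is only valid for a mutable buffer and a needle
that isn't NUL itself, and it punts on those degenerate cases):

```c
#include <string.h>
#include <stddef.h>

/* Reuse the C library's strchr as the search kernel by temporarily
 * NUL-terminating the buffer.  Returns the index of the first match,
 * or len if the needle is absent.  Sketch only. */
size_t strchr_find(char *haystack, size_t len, char needle)
{
    if (len == 0 || needle == '\0')
        return len;                    /* punt on the degenerate cases */
    char saved = haystack[len - 1];
    haystack[len - 1] = '\0';          /* sentinel terminator */
    const char *p = strchr(haystack, needle);
    haystack[len - 1] = saved;         /* restore before returning */
    if (p != NULL)
        return (size_t)(p - haystack);
    return saved == needle ? len - 1 : len;  /* check last byte by hand */
}
```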


> - When is it preferred to search in blocks of 2, 4, 8 or 16 bytes? I
> presume for a certain alignment and, again, length limit.

I don't have the luxury of being able to compare performance on a wide
variety of platforms, but I did peruse the Glibc source code for memchr
once, and it basically used a clever bit-mask hack to scan for byte
patterns 64 bits at a time, with extra code at the start/end to account
for non-aligned boundaries of the array.  I ran some tests against it,
and found that while it was definitely faster when the byte being
searched occurred deep into the array, it was actually slower when the
byte occurred near the beginning, because of the complexity of setting
up the bit-mask hack and the starting/ending boundary code.  So when
the byte being searched for occurred early and often across a series of
repeated calls, the glibc memchr actually performed worse than a naïve
for-loop that compared byte by byte.
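The core of that bit-mask hack can be sketched as follows.  This is my
simplified C version built on the well-known "does this word contain a
zero byte" mask, not glibc's actual code; in particular, real memchr
aligns the head of the scan instead of relying on unaligned loads, and
its boundary handling is where the setup cost mentioned above comes
from:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Word-at-a-time byte search: XOR each 8-byte word with the needle
 * replicated into every byte, so matching bytes become zero, then use
 * the classic (v - 0x01..01) & ~v & 0x80..80 test, which is nonzero
 * exactly when some byte of v is zero.  The tail is a plain byte loop
 * for simplicity.  Returns a pointer to the first match, or NULL. */
const unsigned char *word_memchr(const unsigned char *s, size_t len,
                                 unsigned char needle)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    uint64_t repl = ones * needle;       /* needle in every byte lane */

    while (len >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);                /* safe unaligned load */
        uint64_t v = w ^ repl;           /* matching bytes are now zero */
        if ((v - ones) & ~v & highs) {   /* some byte of v is zero */
            for (int i = 0; i < 8; i++)  /* locate the first one */
                if (s[i] == needle)
                    return s + i;
        }
        s += 8;
        len -= 8;
    }
    while (len--) {                      /* tail: naive byte loop */
        if (*s == needle)
            return s;
        s++;
    }
    return NULL;
}
```

The setup and the final bytewise locate are exactly the overheads that
make this lose to the naive loop when matches occur near the front.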

The bottom line is that even supposedly optimized-to-hell-and-back code
like glibc's memchr may have poor performance characteristics for
certain use cases.  As with all things, it's always a matter of
compromise, and the best solution for you will be heavily dependent on
exactly what you're going to be using the code for.


> - Would it be relevant to make such a function a compiler intrinsic
> that handles all these cases for the generic non-decoding case where
> haystack is a `T[]` and needle a `T`?

I thought somebody had already put in some specializations in
std.algorithm.find to actually call C's memchr when the haystack is an
array and the needle is a 1-byte value?  I'm not sure a compiler
intrinsic is necessary for something like this... sounds like something
that the standard library (i.e., Phobos) should be responsible for.
Have you tried profiling std.algorithm.find to see how it performs on
your use case?

And if indeed std.algorithm.find is suboptimal, we could consider
whether your use case is general enough to be worth adding another
specialization to handle it, and that way it would benefit everybody,
instead of everyone rolling their own probably-suboptimal version.

(One thing to watch out for with std.algorithm.find is that it may
trigger the dreaded autodecoding machinery. You may want to cast to
ubyte[] just to eliminate that possibility. Or use .byCodeUnit, but I'm
unsure if .find is able to recognize that as an array, so the optimized
path may not be chosen.)


T

-- 
Let's call it an accidental feature. -- Larry Wall
