in not working for arrays is silly, change my view

Steven Schveighoffer schveiguy at gmail.com
Mon Mar 2 23:27:22 UTC 2020


On 3/2/20 5:21 PM, aliak wrote:
> On Monday, 2 March 2020 at 21:33:37 UTC, Steven Schveighoffer wrote:
>> On 3/2/20 3:52 PM, aliak wrote:
>>> On Monday, 2 March 2020 at 15:47:26 UTC, Steven Schveighoffer wrote:
>>>> On 3/2/20 6:52 AM, Andrea Fontana wrote:
>>>>> On Saturday, 29 February 2020 at 20:11:24 UTC, Steven Schveighoffer 
>>>>> wrote:
>>>>>> 1. in is supposed to be O(lg(n)) or better. Generic code may 
>>>>>> depend on this property. Searching an array is O(n).
>>>>>
>>>>> Probably it should work if we're using a "SortedRange".
>>>>>
>>>>>
>>>>> int[] a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
>>>>> auto p = assumeSorted(a);
>>>>>
>>>>> assert(3 in p);
>>>>>
>>>>>
>>>>
>>>> That could work. Currently, you need to use p.contains(3). opIn 
>>>> could be added as a shortcut.
>>>>
>>>> It only makes sense if you have it as a literal though, as 
>>>> p.contains(3) isn't that bad to use:
>>>>
>>>> assert(3 in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted);
>>>>
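To make the opIn shortcut idea concrete: a minimal sketch of a user-side 
wrapper. The WithIn type and withIn helper are hypothetical names of my 
own invention; nothing like this exists in Phobos today.

```d
import std.range : assumeSorted;

// Hypothetical wrapper that forwards `in` to SortedRange's O(lg n)
// contains member -- not part of Phobos, just an illustration.
struct WithIn(R)
{
    R sorted;

    // `x in range` is rewritten to range.opBinaryRight!"in"(x)
    bool opBinaryRight(string op : "in", T)(T value)
    {
        return sorted.contains(value);
    }
}

auto withIn(R)(R sorted)
{
    return WithIn!R(sorted);
}

void main()
{
    auto p = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted.withIn;
    assert(3 in p);   // binary search under the hood
    assert(42 !in p); // `!in` is lowered to !(42 in p)
}
```
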
>>> There's no guarantee that checking whether a value is in a sorted list 
>>> is any faster than checking whether it's in a non-sorted list. It's why 
>>> sort implementations usually switch from a divide-and-conquer algorithm 
>>> to a simpler linear-pass one (such as insertion sort) below a certain size.
>>
>> Well of course! A binary search needs lg(n) comparisons for pretty 
>> much any value, whereas a linear search is going to end early when it 
>> finds it. So there's no guarantee that searching for a particular element 
>> in the list is going to be faster one way or the other. But binary search 
>> is going to be faster overall because its complexity is favorable.
> 
> Overall as n tends towards infinity, maybe, but not in the average 
> case, it would seem. Branch prediction in CPUs changes that: with a 
> binary search the branch is essentially unpredictable, so it's always a 
> miss, whereas with a linear search it's always a hit.
> 
>>
>>> The list could potentially need to be _very_ large for p.contains to 
>>> make a significant impact over canFind(p) AFAIK.
>>>
>>> Here's a small test program, try playing with the numbers and see 
>>> what happens:
>>>
>>> import std.random;
>>> import std.range;
>>> import std.algorithm;
>>> import std.datetime.stopwatch;
>>> import std.stdio;
>>>
>>> void main()
>>> {
>>>      auto count = 1_000;
>>>      auto max = int.max;
>>>
>>>      alias randoms = generate!(() => uniform(0, max));
>>>
>>>      auto r1 = randoms.take(count).array;
>>>      auto r2 = r1.dup.sort;
>>>      auto elem = r1[uniform(0, count)];
>>
>> auto elem = r1[$-1]; // try this instead
>>
>>>
>>>      benchmark!(
>>>          () => r1.canFind(elem),
>>>          () => r2.contains(elem),
>>>      )(1_000).writeln;
>>> }
>>>
>>> Use LDC and -O3 of course. I was hard pressed to get the sorted 
>>> contains to be any faster than canFind.
>>>
>>> This raises the question then: do these requirements on in make any 
>>> sense? An algorithm can be O(lg n) (à la the sorted search) but still 
>>> be an order of magnitude slower than a linear search... what has the 
>>> world come to 🤦‍♂️
>>>
>>> PS: Why is it named contains if it's on a SortedRange and canFind 
>>> otherwise?
>>>
>>
>> A SortedRange's contains uses O(lg n) steps vs. canFind, which uses O(n) steps.
> 
> canFind is supposed to tell the reader that it's O(n), and contains O(lg n)?

canFind means that find will not return an empty range, and find is a 
linear search. Probably not the clearest distinction, but contains isn't 
a free-standing algorithm, it's a member of SortedRange. I don't think 
there's any general naming convention for members like this, but I would 
say that if contains means O(lg n), then that's what we should use 
everywhere to mean that.

I suppose the naming could be improved.
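To make the distinction concrete, here are the two names side by side in 
a minimal example of my own (both calls are standard Phobos):

```d
import std.algorithm.searching : canFind;
import std.range : assumeSorted;

void main()
{
    auto a = [1, 3, 5, 7, 9];
    assert(a.canFind(5));               // free function, O(n) linear search
    assert(a.assumeSorted.contains(5)); // SortedRange member, O(lg n) binary search
    assert(!a.canFind(4));
    assert(!a.assumeSorted.contains(4));
}
```
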

> 
>>
>> If you change your code to testing 1000 random numbers, instead of a 
>> random number guaranteed to be included, then you will see a 
>> significant improvement with the sorted version. I found it to be 
>> about 10x faster. (most of the time, none of the other random numbers 
>> are included). Even if you randomly select 1000 numbers from the 
>> elements, the binary search will be faster. In my tests, it was about 
>> 5x faster.
> 
> Hmm... What am I doing wrong with this code? And also how are you 
> compiling?:
> 
> void main()
> {
>      auto count = 1_000_000;
>      auto max = int.max;
> 
>      alias randoms = generate!(() => uniform(0, max - 1));
> 
>      auto r1 = randoms.take(count).array;
>      auto r2 = r1.dup.sort;
>      auto r3 = r1.dup.randomShuffle;
> 
>      auto results = benchmark!(
>          () => r1.canFind(max),
>          () => r2.contains(max),
>          () => r3.canFind(max),
>      )(5_000);
> 
>      results.writeln;
> }
> 
> 
> $ ldc2 -O3 test.d && ./test
> [1 hnsec, 84 μs and 7 hnsecs, 0 hnsecs]

Yeah, this looked very fishy to me. ldc can do some nasty "helpful" 
things to save you time! When I posted my results, I was using DMD.

I used run.dlang.io with ldc, and verified I get the same (similar) 
result as you. But searching through 5 billion integers can't be 
instantaneous, on any modern hardware.

So I also tried this.

     auto results = benchmark!(
         () => false,
         () => r2.contains(max),
         () => r3.canFind(max),
     )(5_000);

[4 μs and 9 hnsecs, 166 μs and 8 hnsecs, 4 μs and 7 hnsecs]

Hey look! Returning a boolean is more expensive than a 1-million-element 
linear search!

What I think is happening is that the optimizer determines that nobody 
is using the result, and the function is pure, so it doesn't bother 
calling that function (probably not even the lambda, and then it 
probably removes the loop completely).

I'm assuming that for some reason the binary search is not flagged pure, 
so it's not being skipped.

If I change to this to ensure side effects:

bool makeImpure; // TLS variable outside of main

...

     auto results = benchmark!(
         () => makeImpure = r1.canFind(max),
         () => makeImpure = r2.contains(max),
         () => makeImpure = r3.canFind(max),
     )(5_000);

writefln("%(%s\n%)", results); // modified to help with the comma confusion

I now get:
4 secs, 428 ms, and 3 hnsecs
221 μs and 9 hnsecs
4 secs, 49 ms, 982 μs, and 5 hnsecs

More like what I expected!

> General would be the most common case. I don't think extremely large 
> (for some definition of large) lists are the more common ones. Or maybe 
> they are. But I'd be surprised. I also don't think phobos is a very 
> data-driven library. But, that's a whole other conversation :)

I don't think the general case is going to be large data sets either. 
But that doesn't mean Phobos should assume they are all small. And as 
you can see, when the code actually has to run, the binary search is 
significantly faster.
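One way to see that without fighting the optimizer is to count predicate 
calls instead of wall time. A quick sketch of my own (the global cmps 
counter and the bounds in the asserts are illustrative assumptions, not 
taken from the measurements above):

```d
import std.algorithm.searching : canFind;
import std.range : assumeSorted, iota;
import std.array : array;

int cmps; // global comparison counter (illustration only)

void main()
{
    // 1M even numbers, already sorted: searching for an odd number
    // is a guaranteed miss for both strategies.
    auto data = iota(0, 2_000_000, 2).array;

    cmps = 0;
    assert(!data.canFind!((int x) { ++cmps; return x == 1; }));
    immutable linearCmps = cmps;

    cmps = 0;
    auto sorted = data.assumeSorted!((a, b) { ++cmps; return a < b; });
    assert(!sorted.contains(1));
    immutable binaryCmps = cmps;

    assert(linearCmps == 1_000_000); // linear miss: every element inspected
    assert(binaryCmps < 1000);       // binary miss: a few dozen probes at most
}
```

The counts, unlike the timings, are immune to dead-code elimination, 
since the side effect on cmps forces every predicate call to happen.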

-Steve


More information about the Digitalmars-d-learn mailing list