in not working for arrays is silly, change my view
aliak
something at something.com
Mon Mar 2 22:21:49 UTC 2020
On Monday, 2 March 2020 at 21:33:37 UTC, Steven Schveighoffer
wrote:
> On 3/2/20 3:52 PM, aliak wrote:
>> On Monday, 2 March 2020 at 15:47:26 UTC, Steven Schveighoffer
>> wrote:
>>> On 3/2/20 6:52 AM, Andrea Fontana wrote:
>>>> On Saturday, 29 February 2020 at 20:11:24 UTC, Steven
>>>> Schveighoffer wrote:
>>>>> 1. in is supposed to be O(lg(n)) or better. Generic code
>>>>> may depend on this property. Searching an array is O(n).
>>>>
>>>> Probably it should work if we're using a "SortedRange".
>>>>
>>>>
>>>> int[] a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
>>>> auto p = assumeSorted(a);
>>>>
>>>> assert(3 in p);
>>>>
>>>>
>>>
>>> That could work. Currently, you need to use p.contains(3).
>>> opIn could be added as a shortcut.
>>>
>>> It only makes sense if you have it as a literal though, as
>>> p.contains(3) isn't that bad to use:
>>>
>>> assert(3 in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted);
>>>
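A rough sketch of what such a shortcut could look like, for what
it's worth (hypothetical code, not actual Phobos API; the
WithIn/withIn names are made up):

import std.range;

// Hypothetical wrapper: forwards the in operator to
// SortedRange.contains, which does an O(lg n) binary search.
// Only a sketch of the proposed shortcut, not something Phobos
// provides.
struct WithIn(R)
{
    R sorted; // assumed to be a SortedRange

    bool opBinaryRight(string op : "in")(ElementType!R value)
    {
        return sorted.contains(value);
    }
}

auto withIn(R)(R sorted) { return WithIn!R(sorted); }

unittest
{
    auto p = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted.withIn;
    assert(3 in p);
    assert(42 !in p);
}
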
>> There's no guarantee that checking whether a value is in a
>> sorted list is any faster than checking whether it's in an
>> unsorted list. That's why sort implementations usually switch
>> from a divide-and-conquer algorithm to a linear one (e.g.
>> insertion sort) below a certain size.
>
> Well of course! A binary search needs lg(n) comparisons for
> pretty much any value, whereas a linear search ends early when
> it finds the element. So for any single search there's no
> guarantee that one will be faster than the other. But binary
> search is going to be faster overall, because its complexity is
> favorable.
Overall as n tends towards infinity, maybe, but not in the
average case, it would seem. Branch prediction in CPUs changes
that: with a binary search the branch is essentially always a
miss, whereas with a linear search it's essentially always a hit.
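Back-of-envelope, with assumed (not measured) numbers: for
n = 1000, a linear search for a present element averages around
n/2 = 500 comparisons, while a binary search does about
lg(1000) ≈ 10. But if each binary-search comparison is an
unpredictable branch costing on the order of 15 cycles when
mispredicted, while each linear step is a correctly predicted
branch costing about a cycle, that's roughly 500 cycles vs. 150:
nowhere near the 50x gap the comparison counts alone suggest.
And a vectorized linear search narrows it further.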
>
>> The list could potentially need to be _very_ large for
>> p.contains to make a significant impact over canFind, AFAIK.
>>
>> Here's a small test program; try playing with the numbers and
>> see what happens:
>>
>> import std.random;
>> import std.range;
>> import std.algorithm;
>> import std.datetime.stopwatch;
>> import std.stdio;
>>
>> void main()
>> {
>>     auto count = 1_000;
>>     auto max = int.max;
>>
>>     alias randoms = generate!(() => uniform(0, max));
>>
>>     auto r1 = randoms.take(count).array;
>>     auto r2 = r1.dup.sort;
>>     auto elem = r1[uniform(0, count)];
>
> auto elem = r1[$-1]; // try this instead
>
>>
>>     benchmark!(
>>         () => r1.canFind(elem),
>>         () => r2.contains(elem),
>>     )(1_000).writeln;
>> }
>>
>> Use LDC and -O3, of course. I was hard pressed to get the
>> sorted contains to be any faster than canFind.
>>
>> This raises the question, then: do these requirements on in
>> make any sense? An algorithm can be O(lg n) (à la the sorted
>> search) but still be an order of magnitude slower than a linear
>> search... what has the world come to 🤦‍♂️
>>
>> PS: Why is it named contains if it's on a SortedRange and
>> canFind otherwise?
>>
>
> A SortedRange uses O(lgn) steps vs. canFind which uses O(n)
> steps.
So canFind is supposed to tell the reader that it's O(n), and
contains that it's O(lg n)?
>
> If you change your code to testing 1000 random numbers, instead
> of a random number guaranteed to be included, then you will see
> a significant improvement with the sorted version. I found it
> to be about 10x faster. (most of the time, none of the other
> random numbers are included). Even if you randomly select 1000
> numbers from the elements, the binary search will be faster. In
> my tests, it was about 5x faster.
Hmm... What am I doing wrong with this code? And also, how are
you compiling?

import std.random;
import std.range;
import std.algorithm;
import std.datetime.stopwatch;
import std.stdio;

void main()
{
    auto count = 1_000_000;
    auto max = int.max;

    // uniform(0, max - 1) never yields max, so every search misses.
    alias randoms = generate!(() => uniform(0, max - 1));

    auto r1 = randoms.take(count).array;
    auto r2 = r1.dup.sort;
    auto r3 = r1.dup.randomShuffle;

    auto results = benchmark!(
        () => r1.canFind(max),
        () => r2.contains(max),
        () => r3.canFind(max),
    )(5_000);

    results.writeln;
}
$ ldc2 -O3 test.d && ./test
[1 hnsec, 84 μs and 7 hnsecs, 0 hnsecs]
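For reference, here's my stab at the "test 1000 random numbers"
variant described above, in case the problem is that I'm only
probing a single absent value (just a sketch; the probes are
drawn independently, so with these ranges almost none of them
are present):

import std.random;
import std.range;
import std.algorithm;
import std.datetime.stopwatch;
import std.stdio;

void main()
{
    auto count = 1_000;
    alias randoms = generate!(() => uniform(0, int.max));

    auto r1 = randoms.take(count).array;
    auto r2 = r1.dup.sort;

    // 1000 independent probe values; nearly all will be absent.
    auto probes = randoms.take(1_000).array;

    benchmark!(
        () => probes.count!(e => r1.canFind(e)),
        () => probes.count!(e => r2.contains(e)),
    )(100).writeln;
}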
>
> Note that the compiler can do a lot more tricks for linear
> searches, and CPUs are REALLY good at searching sequential
> data. But complexity is still going to win out eventually over
> heuristics. Phobos needs to be a general library, not one that
> only caters to certain situations.
General would mean the most common case. I don't think extremely
large (for some definition of large) lists are the more common
case. Or maybe they are, but I'd be surprised. I also don't think
Phobos is a very data-driven library. But that's a whole other
conversation :)
>
> -Steve