# in not working for arrays is silly, change my view

aliak something at something.com
Mon Mar 2 22:21:49 UTC 2020

```
On Monday, 2 March 2020 at 21:33:37 UTC, Steven Schveighoffer
wrote:
> On 3/2/20 3:52 PM, aliak wrote:
>> On Monday, 2 March 2020 at 15:47:26 UTC, Steven Schveighoffer
>> wrote:
>>> On 3/2/20 6:52 AM, Andrea Fontana wrote:
>>>> On Saturday, 29 February 2020 at 20:11:24 UTC, Steven
>>>> Schveighoffer wrote:
>>>>> 1. in is supposed to be O(lg(n)) or better. Generic code
>>>>> may depend on this property. Searching an array is O(n).
>>>>
>>>> Probably it should work if we're using a "SortedRange".
>>>>
>>>>
>>>> int[] a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
>>>> auto p = assumeSorted(a);
>>>>
>>>> assert(3 in p);
>>>>
>>>>
>>>
>>> That could work. Currently, you need to use p.contains(3).
>>> opIn could be added as a shortcut.
>>>
>>> It only makes sense if you have it as a literal though, as
>>> p.contains(3) isn't that bad to use:
>>>
>>> assert(3 in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted);
>>>
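A user-level sketch of that shortcut (InSorted and inSorted are
made-up names here; the wrapper just forwards the in operator to
SortedRange's contains, i.e. a binary search):

import std.range : assumeSorted, ElementType;

// Hypothetical wrapper: gives a SortedRange an `in` operator by
// forwarding to its contains member (binary search, O(lg n)).
struct InSorted(R)
{
    R sorted; // expected to be a SortedRange

    bool opBinaryRight(string op : "in")(ElementType!R needle)
    {
        return sorted.contains(needle);
    }
}

auto inSorted(R)(R sorted)
{
    return InSorted!R(sorted);
}

void main()
{
    auto p = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted.inSorted;
    assert(3 in p);
    assert(10 !in p);
}

(!in falls out for free, since the compiler rewrites it as the
negation of in.)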
>> There's no guarantee that checking whether a value is in a
>> sorted list is any faster than checking whether it's in an
>> unsorted one. It's why sort implementations usually switch
>> from a divide-and-conquer algorithm to a linear insertion
>> sort below a certain size.
>
> Well of course! A binary search needs Lg(n) comparisons for
> pretty much any value, whereas a linear search is going to end
> early when it finds it. So there's no guarantee that searching
> for an element in the list is going to be faster one way or the
> other. But Binary search is going to be faster overall because
> the complexity is favorable.

Overall as n tends towards infinity, maybe, but not in the
average case, it would seem. Branch prediction in CPUs changes
that: with a binary search the branch is essentially
unpredictable, so it misses about half the time, whereas with a
linear search it's predicted correctly almost every time.
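One way to separate the comparison counts from the branch
effects is to count loop iterations instead of time. A sketch
(helper names made up, and this deliberately ignores prediction
and cache):

import std.stdio : writeln;

// Count search-loop iterations: the raw work each strategy does
// before finding the needle.
size_t linearSteps(const int[] a, int needle)
{
    size_t steps;
    foreach (x; a)
    {
        ++steps;
        if (x == needle) break;
    }
    return steps;
}

size_t binarySteps(const int[] a, int needle)
{
    size_t steps, lo, hi = a.length;
    while (lo < hi)
    {
        auto mid = lo + (hi - lo) / 2;
        ++steps;
        if (a[mid] == needle) break;
        if (a[mid] < needle) lo = mid + 1;
        else hi = mid;
    }
    return steps;
}

void main()
{
    auto a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    writeln(linearSteps(a, 9), " vs ", binarySteps(a, 9)); // 10 vs 3
}

Binary search does fewer steps, but each step's branch is
data-dependent, which is exactly where the misprediction cost
comes in.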

>
>> The list could potentially need to be _very_ large for
>> p.contains to make a significant impact over canFind(p) AFAIK.
>>
>> Here's a small test program, try playing with the numbers and
>> see what happens:
>>
>> import std.random;
>> import std.range;
>> import std.algorithm;
>> import std.datetime.stopwatch;
>> import std.stdio;
>>
>> void main()
>> {
>>      auto count = 1_000;
>>      auto max = int.max;
>>
>>      alias randoms = generate!(() => uniform(0, max));
>>
>>      auto r1 = randoms.take(count).array;
>>      auto r2 = r1.dup.sort;
>>      auto elem = r1[uniform(0, count)];
>
> auto elem = r1[$-1]; // try this instead
>
>>
>>      benchmark!(
>>          () => r1.canFind(elem),
>>          () => r2.contains(elem),
>>      )(1_000).writeln;
>> }
>>
>> Use LDC and -O3 of course. I was hard pressed to get the
>> sorted contains to be any faster than canFind.
>>
>> This raises the question then: do these requirements on in
>> make any sense? An algorithm can be O(lg n) (a la the sorted
>> search) but still be an order of magnitude slower than a
>> linear search... what has the world come to 🤦‍♂️
>>
>> PS: Why is it named contains if it's on a SortedRange and
>> canFind otherwise?
>>
>
> A SortedRange uses O(lgn) steps vs. canFind which uses O(n)
> steps.

canFind is supposed to tell the reader that it's O(n) and
contains O(lgn)?

>
> If you change your code to testing 1000 random numbers, instead
> of a random number guaranteed to be included, then you will see
> a significant improvement with the sorted version. I found it
> to be about 10x faster. (most of the time, none of the other
> random numbers are included). Even if you randomly select 1000
> numbers from the elements, the binary search will be faster. In
> my tests, it was about 5x faster.
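My reading of that setup, as a sketch (not Steve's actual code;
the probe count and iteration count are guesses):

import std.algorithm : canFind, sort;
import std.array : array;
import std.datetime.stopwatch : benchmark;
import std.random : uniform;
import std.range : generate, take;
import std.stdio : writeln;

void main()
{
    auto count = 1_000;
    alias randoms = generate!(() => uniform(0, int.max));

    auto haystack = randoms.take(count).array;
    auto sorted = haystack.dup.sort;
    // Fresh random probes: almost none of them are in the
    // haystack, so the linear search scans the whole array.
    auto probes = randoms.take(count).array;

    benchmark!(
        () { foreach (p; probes) cast(void) haystack.canFind(p); },
        () { foreach (p; probes) cast(void) sorted.contains(p); },
    )(100).writeln;
}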

Hmm... What am I doing wrong with this code? And also how are you
compiling?:

import std.algorithm;
import std.datetime.stopwatch;
import std.random;
import std.range;
import std.stdio;

void main()
{
    auto count = 1_000_000;
    auto max = int.max;

    alias randoms = generate!(() => uniform(0, max - 1));

    auto r1 = randoms.take(count).array;
    auto r2 = r1.dup.sort;
    auto r3 = r1.dup.randomShuffle;

    auto results = benchmark!(
        () => r1.canFind(max),
        () => r2.contains(max),
        () => r3.canFind(max),
    )(5_000);

    results.writeln;
}

$ ldc2 -O3 test.d && ./test
[1 hnsec, 84 μs and 7 hnsecs, 0 hnsecs]

>
> Note that the compiler can do a lot more tricks for linear
> searches, and CPUs are REALLY good at searching sequential
> data. But complexity is still going to win out eventually over
> heuristics. Phobos needs to be a general library, not one that
> only caters to certain situations.

General would be the most common case. I don't think extremely
large (for some definition of large) lists are the more common
ones. Or maybe they are, but I'd be surprised. I also don't
think Phobos is a very data-driven library. But that's a whole
other conversation :)

>
> -Steve

```