in not working for arrays is silly, change my view
aliak
something at something.com
Mon Mar 2 22:21:49 UTC 2020
On Monday, 2 March 2020 at 21:33:37 UTC, Steven Schveighoffer
wrote:
> On 3/2/20 3:52 PM, aliak wrote:
>> On Monday, 2 March 2020 at 15:47:26 UTC, Steven Schveighoffer
>> wrote:
>>> On 3/2/20 6:52 AM, Andrea Fontana wrote:
>>>> On Saturday, 29 February 2020 at 20:11:24 UTC, Steven
>>>> Schveighoffer wrote:
>>>>> 1. in is supposed to be O(lg(n)) or better. Generic code
>>>>> may depend on this property. Searching an array is O(n).
>>>>
>>>> Probably it should work if we're using a "SortedRange".
>>>>
>>>>
>>>> int[] a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
>>>> auto p = assumeSorted(a);
>>>>
>>>> assert(3 in p);
>>>>
>>>>
>>>
>>> That could work. Currently, you need to use p.contains(3).
>>> opIn could be added as a shortcut.
>>>
>>> It only makes sense if you have it as a literal though, as
>>> p.contains(3) isn't that bad to use:
>>>
>>> assert(3 in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted);
>>>
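A rough sketch of what such a shortcut could look like, for what
it's worth (hypothetical code, not actual Phobos API; the
WithIn/withIn names are made up):

import std.range;

// Hypothetical wrapper: forwards the in operator to
// SortedRange.contains, which does an O(lg n) binary search.
// Only a sketch of the proposed shortcut, not something Phobos
// provides.
struct WithIn(R)
{
    R sorted; // assumed to be a SortedRange

    bool opBinaryRight(string op : "in")(ElementType!R value)
    {
        return sorted.contains(value);
    }
}

auto withIn(R)(R sorted) { return WithIn!R(sorted); }

unittest
{
    auto p = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].assumeSorted.withIn;
    assert(3 in p);
    assert(42 !in p);
}
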
>> There's no guarantee that checking whether a value is in a
>> sorted list is any faster than checking whether it's in an
>> unsorted list. That's why sort implementations usually switch
>> from a divide-and-conquer algorithm to a linear one (e.g.
>> insertion sort) below a certain size.
>
> Well of course! A binary search needs lg(n) comparisons for
> pretty much any value, whereas a linear search ends early when
> it finds the element. So for any single search there's no
> guarantee that one will be faster than the other. But binary
> search is going to be faster overall, because its complexity is
> favorable.
Overall as n tends towards infinity, maybe, but not in the
average case, it would seem. Branch prediction in CPUs changes
that: with a binary search the branch is essentially always a
miss, whereas with a linear search it's essentially always a hit.
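Back-of-envelope, with assumed (not measured) numbers: for
n = 1000, a linear search for a present element averages around
n/2 = 500 comparisons, while a binary search does about
lg(1000) ≈ 10. But if each binary-search comparison is an
unpredictable branch costing on the order of 15 cycles when
mispredicted, while each linear step is a correctly predicted
branch costing about a cycle, that's roughly 500 cycles vs. 150:
nowhere near the 50x gap the comparison counts alone suggest.
And a vectorized linear search narrows it further.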
>
>> The list could potentially need to be _very_ large for
>> p.contains to make a significant impact over canFind, AFAIK.
>>
>> Here's a small test program; try playing with the numbers and
>> see what happens:
>>
>> import std.random;
>> import std.range;
>> import std.algorithm;
>> import std.datetime.stopwatch;
>> import std.stdio;
>>
>> void main()
>> {
>>     auto count = 1_000;
>>     auto max = int.max;
>>
>>     alias randoms = generate!(() => uniform(0, max));
>>
>>     auto r1 = randoms.take(count).array;
>>     auto r2 = r1.dup.sort;
>>     auto elem = r1[uniform(0, count)];
>
> auto elem = r1[$-1]; // try this instead
>
>>
>>     benchmark!(
>>         () => r1.canFind(elem),
>>         () => r2.contains(elem),
>>     )(1_000).writeln;
>> }
>>
>> Use LDC and -O3, of course. I was hard pressed to get the
>> sorted contains to be any faster than canFind.
>>
>> This raises the question, then: do these requirements on in
>> make any sense? An algorithm can be O(lg n) (à la the sorted
>> search) but still be an order of magnitude slower than a linear
>> search... what has the world come to 🤦‍♂️
>>
>> PS: Why is it named contains if it's on a SortedRange and
>> canFind otherwise?
>>
>
> A SortedRange uses O(lgn) steps vs. canFind which uses O(n)
> steps.
So canFind is supposed to tell the reader that it's O(n), and
contains that it's O(lg n)?
>
> If you change your code to testing 1000 random numbers, instead
> of a random number guaranteed to be included, then you will see
> a significant improvement with the sorted version. I found it
> to be about 10x faster. (most of the time, none of the other
> random numbers are included). Even if you randomly select 1000
> numbers from the elements, the binary search will be faster. In
> my tests, it was about 5x faster.
Hmm... What am I doing wrong with this code? And also, how are
you compiling?

import std.random;
import std.range;
import std.algorithm;
import std.datetime.stopwatch;
import std.stdio;

void main()
{
    auto count = 1_000_000;
    auto max = int.max;

    // uniform(0, max - 1) never yields max, so every search misses.
    alias randoms = generate!(() => uniform(0, max - 1));

    auto r1 = randoms.take(count).array;
    auto r2 = r1.dup.sort;
    auto r3 = r1.dup.randomShuffle;

    auto results = benchmark!(
        () => r1.canFind(max),
        () => r2.contains(max),
        () => r3.canFind(max),
    )(5_000);

    results.writeln;
}
$ ldc2 -O3 test.d && ./test
[1 hnsec, 84 μs and 7 hnsecs, 0 hnsecs]
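For reference, here's my stab at the "test 1000 random numbers"
variant described above, in case the problem is that I'm only
probing a single absent value (just a sketch; the probes are
drawn independently, so with these ranges almost none of them
are present):

import std.random;
import std.range;
import std.algorithm;
import std.datetime.stopwatch;
import std.stdio;

void main()
{
    auto count = 1_000;
    alias randoms = generate!(() => uniform(0, int.max));

    auto r1 = randoms.take(count).array;
    auto r2 = r1.dup.sort;

    // 1000 independent probe values; nearly all will be absent.
    auto probes = randoms.take(1_000).array;

    benchmark!(
        () => probes.count!(e => r1.canFind(e)),
        () => probes.count!(e => r2.contains(e)),
    )(100).writeln;
}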
>
> Note that the compiler can do a lot more tricks for linear
> searches, and CPUs are REALLY good at searching sequential
> data. But complexity is still going to win out eventually over
> heuristics. Phobos needs to be a general library, not one that
> only caters to certain situations.
General would mean the most common case. I don't think extremely
large (for some definition of large) lists are the more common
case. Or maybe they are, but I'd be surprised. I also don't think
Phobos is a very data-driven library. But that's a whole other
conversation :)
>
> -Steve