Replacing tango.text.Ascii.isearch

Siarhei Siamashka siarhei.siamashka at gmail.com
Sun Oct 9 01:08:50 UTC 2022


On Saturday, 8 October 2022 at 01:07:46 UTC, rassoc wrote:
> On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn 
> wrote:
>> On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
>> python -c "print(('a' * 49 + 'b') * 20000)" > test.lst
>
> That's generating a file with a single line:
>
> $> wc -l test.lst
> 1 test.lst

If you insist on having 100K lines in the input data, then you 
can try a different test (run it as a shell script):

```bash
python -c "print(('a' * 9999 + 'b') * 89 + '\\n' * 99999 + 'a' * 
10000)" > words.txt
cp words.txt test.lst
wc -l words.txt

NEEDLE=`python -c "print('a' * 10000, end='')"`

echo "=== Python ==="
time python search.py "$NEEDLE"

echo "=== Ruby ==="
time ruby search.rb "$NEEDLE"

echo "=== Crystal ==="
time ./search-cr "$NEEDLE"

echo "=== D (LDC) ==="
time ./search-ldc "$NEEDLE"
```

It is testing a 1MB file with 100K lines and a 10K-character 
needle. The first line consists of 89 blocks of 9999 'a' 
characters followed by a 'b' (each block a long near-miss), and 
only the last line, exactly 10000 'a' characters, actually 
matches. Results from my computer (with Python 3.10, because 
earlier versions of Python are slower):

```
100000 words.txt
=== Python ===
1

real    0m0.083s
user    0m0.072s
sys     0m0.010s
=== Ruby ===
1

real    0m0.313s
user    0m0.313s
sys     0m0.000s
=== Crystal ===
1

real    0m1.492s
user    0m1.482s
sys     0m0.010s
=== D (LDC) ===
1

real    1m10.314s
user    1m10.234s
sys     0m0.050s
```

> Going with an appropriate 100k mixed line file and your 
> mentioned needle, D is still quite a bit slower, but the 
> results aren't as drastic

Your "appropriate" file is also entirely artificial and happens 
to be specifically crafted to work best with the current 
".canFind" implementation from Phobos. The primary source of 
slowness are partial matches. Such as the `"int i = "` prefix 
when searching for `"int i = 0;"` substring in a string, which 
contains `"int i = 1;"`. The time is wasted on comparing the 
initial 8 characters before encountering a mismatch and bailing 
out. But when searching in a randomly generated gibberish, the 
chances of encountering long partial matches are very low. At 
least lower than in the real text files.
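
To make the worst case tangible, here is a minimal standalone 
sketch (not from the thread) that rebuilds the all-near-misses 
haystack from the script above and compares plain canFind 
against Phobos' boyerMooreFinder, which uses the mismatch to 
skip ahead instead of restarting the comparison at every 
position:

```d
import std;

void main()
{
    // Same shape as the first line of words.txt above: 89 blocks of
    // 9999 'a's plus one 'b', searched for a needle of 10000 'a's.
    // Every block is a 9999-character partial match.
    auto haystack = ("a".replicate(9999) ~ "b").replicate(89);
    auto needle = "a".replicate(10_000);

    auto sw = StopWatch(AutoStart.yes);
    // Naive search: recompares up to 9999 characters at each start position.
    auto naive = haystack.canFind(needle);
    writefln("canFind:          %s in %s", naive, sw.peek);

    sw.reset();
    // Boyer-Moore skips ahead on a mismatch instead of rescanning.
    auto bm = !find(haystack, boyerMooreFinder(needle)).empty;
    writefln("boyerMooreFinder: %s in %s", bm, sw.peek);
}
```

This is also presumably why Python 3.10 matters in the results 
above: CPython switched its substring search to the two-way 
(Crochemore-Perrin) algorithm for long needles in 3.10, which 
avoids exactly this kind of rescanning.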

> and nowhere near "two orders of magnitude".

Does the difference really have to be two orders of magnitude for 
you to acknowledge that there might be a performance problem in 
Phobos? Moreover, you are quoting torhu, who compared two D 
implementations compiled by DMD (regex vs. canFind), rather than 
the standard libraries of different programming languages.
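
For reference, a std.regex variant of the same search, of the 
kind torhu presumably compared against canFind, could look 
something like the sketch below. This is hypothetical: torhu's 
actual code is not shown in this message, and the file name and 
program shape are borrowed from the snippet quoted next.

```d
// Hypothetical regex-based counterpart of the canFind one-liner below.
import std;

void main(string[] args)
{
    enforce(args.length > 1, "Need a needle argument.");
    // escaper makes the needle match literally; the "i" flag makes
    // the pattern case-insensitive.
    auto re = regex(args[1].escaper.to!string, "i");
    File("words.txt").byLine
        .count!(ln => !ln.matchFirst(re).empty)
        .writeln;
}
```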

> ```D
> import std;
> void main(string[] args) {
>     enforce(args.length > 1, "Need a needle argument.");
>     auto needle = args[1].toLower;
>     File("words.txt").byLine.count!(ln => 
> ln.asLowerCase.canFind(needle)).writeln;
> }
> ```

It's a nice-looking one-liner implementation in the D language, 
and everything about it is fine, except that similar one-liners 
implemented in other programming languages are faster and more 
versatile (they can handle any input data without catastrophic 
performance pitfalls).

BTW, changing ".asLowerCase" to ".toLower" introduces an extra 
memory allocation, but still significantly improves handling of 
the worst case. Conversion to lowercase is expensive for unicode 
characters and revisiting the same character to repeat such 
conversion again and again is bad for performance.
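
Applied to the snippet above, the suggested change looks like 
this (a sketch under the same assumptions as the original: 
words.txt as input, needle from the command line):

```d
import std;

void main(string[] args)
{
    enforce(args.length > 1, "Need a needle argument.");
    auto needle = args[1].toLower;
    // .toLower eagerly allocates a lowercased copy of each line, but
    // converts every character exactly once; the lazy .asLowerCase
    // range re-converts characters each time a partial match rescans them.
    File("words.txt").byLine
        .count!(ln => ln.toLower.canFind(needle))
        .writeln;
}
```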

