200-600x slower Dlang performance with nested foreach loop

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Jan 26 18:17:31 UTC 2021


On Tue, Jan 26, 2021 at 05:40:36PM +0000, methonash via Digitalmars-d-learn wrote:
[...]
> 1) Read a list of strings from a file
> 2) De-duplicate all strings into the subset of unique strings
> 3) Sort the subset of unique strings by descending length and then by
> ascending lexicographic identity
> 4) Iterate through the sorted subset of unique strings, identifying smaller
> sequences with perfect identity to their largest possible parent string
[...]
> Things went sideways at step #4: because multiSort() returns a SortedRange,
> I used .array to convert the returned SortedRange into an array of type
> string[].

Do not do this. Every time you call .array it allocates a new array and
copies all its contents over. If this code runs frequently, it will
cause a big performance hit, not to mention high GC load.

The function you're looking for is .release, not .array.
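
For example (a minimal sketch; the array name and the sort keys are my
assumptions based on your description of steps 1-3):

    import std.algorithm.sorting : multiSort;

    string[] arr = [ "bcd", "abcde", "xyz" ];

    // multiSort sorts the array in place and hands back a SortedRange
    // wrapper around it.
    auto sorted = arr.multiSort!((a, b) => a.length > b.length,
                                 (a, b) => a < b);

    // .release returns the underlying string[] without allocating or
    // copying anything; .array would duplicate the entire contents.
    string[] sortedArr = sorted.release;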


[...]
> With the formally returned array, I then attempted to construct a
> double foreach loop to iterate through the sorted array of unique
> strings and find substring matches.
> 
> foreach( i, ref pStr; sortedArr )
> {
>     foreach( j, ref cStr; sortedArr[ i + 1 .. $ ] )
>     {
>         if( indexOf( pStr, cStr ) > -1 )
>         {
>             // ...
>         }
>     }
> }
> 
> Before adding the code excerpt above, the Dlang program was taking ~1
> second on an input file containing approx. 64,000 strings.
> 
> By adding the code above, the program now takes 6 minutes to complete.

That nested loop is an O(n^2) algorithm, meaning it will slow down
*very* quickly as the size of the array n increases. With ~64,000
strings, that's on the order of 2 billion inner-loop substring
searches. You might want to think about how to improve this algorithm.


> An attempt was made to more efficiently perform ASCII-only substring
> searching by converting the sorted string[] into ubyte[][] and then
> using countUntil() instead of indexOf(), but this had an effect that
> was completely opposite to what I had previously experienced: the
> program then took over 20 minutes to complete!

How are you doing the conversion?  If you're using std.conv.to or
something like that, it will definitely cause a big performance hit
because of the needless allocations and copying.  You probably want a
direct cast instead.  I.e., you want to reinterpret the array reference,
not transcribe a copy of it into a ubyte[][].
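
A minimal sketch of the cast approach (assuming, as you said, the data
is ASCII-only):

    string[] sortedArr = [ "abcde", "bcd" ];

    // Reinterpret the existing slices as byte slices: same pointers,
    // same lengths, zero allocations. char and ubyte have the same
    // size, so the cast is purely a type-level reinterpretation.
    auto raw = cast(immutable(ubyte)[][]) sortedArr;

    // raw[0] aliases the same memory as sortedArr[0]; nothing was
    // copied.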

Probably what you're looking for is to use .representation and
.countUntil, or maybe just .canFind to bypass the decoding overhead. (If
indeed that is the bottleneck; it may not be.  Have you used a profiler
to identify where the hotspot is?)
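
Something along these lines, assuming pStr and cStr are the loop
variables from your snippet:

    import std.algorithm.searching : canFind;
    import std.string : representation;

    // .representation reinterprets a string as immutable(ubyte)[] for
    // free, so the substring search runs on raw bytes and skips the
    // UTF decoding machinery entirely.
    if (pStr.representation.canFind(cStr.representation))
    {
        // substring match found
    }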


> Thus, I am entirely baffled.
> 
> My first attempt to solve this problem space used a small Perl program
> to perform steps 1 through 3, which would then pipe intermediate
> output to a small Dlang program handling only step #4 using dynamic
> arrays (no use of AAs) of ubyte[][] with use of countUntil().

Using AA's may not necessarily improve performance.  It depends on what
your code does with them.  Because AA lookups require random access to
memory, they're not friendly to the CPU cache hierarchy, whereas
traversing linear arrays is more cache-friendly and in some cases will
out-perform AA's.


> The Dlang code for the nested foreach block above is essentially
> near-identical between my two Dlang implementations. Yet, the second
> implementation--where I'm trying to solve the entire problem space in
> D--has absolutely failed in terms of performance.
> 
> Perl+D, ubyte[][], countUntil() :: under 2 seconds
> only D, string[], indexOf() :: ~6 minutes
> only D, ubyte[][], countUntil() :: >20 minutes
> 
> Please advise. This nightmarish experience is shaking my confidence in
> using D.

First of all, you need to use a profiler to identify where the hotspots
are.  Otherwise you could well be pouring tons of effort into
"optimizing" code that doesn't actually need optimizing, while
completely missing the real source of the problem.  Whenever you run
into performance problems, do not assume you know where the problem is,
profile, profile, profile!
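
If you don't have a profiler handy, dmd's built-in one is a quick way
to get started (assuming dmd; ldc and gdc have their own equivalents):

    dmd -profile myprogram.d
    ./myprogram     # profiling data is written to trace.log on exit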

Second, you only posted a small fragment of your code, so it's hard to
say where the problem really is.  I can only guess based on what you
described.  If you could post the entire program, or at least a
complete, compilable and runnable excerpt thereof that displays the same
(or similar) performance problems, then we could better help you
pinpoint where the problem is.

In general, though, things to look for are:

(1) Unnecessary memory allocations, e.g., calling .array on a
SortedRange when you really should be using .release, or calling a
conversion function to transcribe an array instead of just casting to
reinterpret it;

(2) Algorithms with poor big-O characteristics, e.g., the O(n^2) nested
loop you have above;

(3) Expensive operations inside inner loops, because the loop nesting
magnifies any slowness in the code (see the sketch after this list).
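
For instance, putting (1) and (3) together, the substring scan could
hoist the byte reinterpretation out of the loops entirely (a sketch,
reusing the hypothetical sortedArr from above):

    import std.algorithm.iteration : map;
    import std.algorithm.searching : canFind;
    import std.array : array;
    import std.string : representation;

    // One up-front pass: each element is just a reinterpreted slice of
    // the original string data, so only the outer array is allocated.
    auto reps = sortedArr.map!representation.array;

    foreach (i, pRep; reps)
    {
        foreach (cRep; reps[i + 1 .. $])
        {
            if (pRep.canFind(cRep))
            {
                // substring match found
            }
        }
    }

Note that this only addresses (1) and (3); the quadratic shape from (2)
is still there.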

But above all, before randomly changing your code in the hopes that you
will hit upon a solution, use a profiler.  Don't shoot in the dark; use
a profiler to identify the actual performance bottlenecks.  Don't waste
time optimizing things that don't need optimizing; focus your efforts on
the hotspots identified by your profiler. (And don't think you're
smarter than your profiler -- I've been coding for ~20 years, and I
still find that my bottlenecks are usually not where I think they are.
Profile, profile, profile!)


T

-- 
What do you call optometrist jokes? Vitreous humor.

