200-600x slower Dlang performance with nested foreach loop

Sun Jan 31 00:53:05 UTC 2021

On 1/30/21 6:13 PM, methonash wrote:
> Greetings all,
> 
> Many thanks for sharing your collective perspective and advice thus far! 
> It has been very helpful and instructive. I return bearing live data and 
> a minimally complete, compilable, and executable program to experiment 
> with and potentially optimize. The dataset can be pulled from here:
> 
> https://filebin.net/qf2km1ea9qgd5skp/seqs.fasta.gz?t=97kgpebg
> 
> Running "cksum" on this file:
> 
> 1477520542 2199192 seqs.fasta.gz
> 
> Naturally, you'll need to gunzip this file. The decompressed file 
> contains strings on every even-numbered line that have already been 
> reduced to the unique de-duplicated subset, and they have already been 
> sorted by descending length and alphabetical identity.
> 
>  From my initial post, the focus is now entirely on step #4: 
> finding/removing strings that can be entirely absorbed (substringed) by 
> their largest possible parent.
> 
> And now for the code:
> 
> 
> import std.stdio : writefln, File, stdin;
> import std.conv : to;
> import std.string : indexOf;
> 
> void main()
> {
>          string[] seqs;
> 
>          foreach( line; stdin.byLine() )
>          {
>                  if( line[ 0 ] == '>' ) continue;
>                  else seqs ~= to!string( line );
>          }
> 
>          foreach( i; 0 .. seqs.length )
>          {
>                  if( seqs[ i ].length == 0 ) continue;
> 
>                  foreach( j; i + 1 .. seqs.length )
>                  {
>                          if( seqs[ j ].length == 0 ||
>                              seqs[ i ].length == seqs[ j ].length ) 
> continue;
> 
>                          if( indexOf( seqs[ i ], seqs[ j ] ) > -1 )
>                          {
>                                  seqs[ j ] = "";
> 
>                                  writefln( "%s contains %s", i, j );
>                          }
>                  }
>          }
> }
> 
> 
> Compile the source and then run the executable via redirecting stdin:
> 
> ./substr < seqs.fasta
> 
> See any straightforward optimization paths here?
> 
> For curiosity, I experimented with use of string[] and ubyte[][] and 
> several functions (indexOf, canFind, countUntil) to assess for the best 
> potential performer. My off-the-cuff results:
> 
> string[] with indexOf() :: 26.5-27 minutes
> string[] with canFind() :: >28 minutes
> ubyte[][] with canFind() :: 27.5 minutes
> ubyte[][] with countUntil() :: 27.5 minutes
> 
> Resultantly, the code above uses string[] with indexOf(). Tests were 
> performed with an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz.
> 
> I have additional questions/concerns/confusion surrounding the foreach() 
> syntax I have had to apply above, but performance remains my chief 
> immediate concern.

The code looks pretty minimal.

I'd suggest trying it in reverse. If you have the sequence "cba", "ba", 
"a", then determining "a" is in "ba" is probably cheaper than 
determining "a" is in "cba".

Are you still convinced that it's possible to do it in under 2 seconds? 
That would seem a huge discrepancy. If not, what specifically are you 
looking for in terms of performance?

-Steve