Code improvement for DNA reverse complement?

Mon May 22 01:58:24 PDT 2017

On Monday, 22 May 2017 at 06:50:45 UTC, Biotronic wrote:
> On Friday, 19 May 2017 at 22:53:39 UTC, crimaniak wrote:
>> On Friday, 19 May 2017 at 12:55:05 UTC, Biotronic wrote:
>>> revComp6 seems to be the fastest, but it's probably also the 
>>> least readable (a common trade-off).
>> Try revComp7 with -release :)
>>
>> string revComp7(string bps)
>> {
>>     char[] result = new char[bps.length];
>>     auto p1 = result.ptr;
>>     auto p2 = &bps[$ - 1];
>>     enum AT = 'A'^'T';
>>     enum CG = 'C'^'G';
>>
>>     while (p2 > bps.ptr)
>>     {
>>        *p1 = *p2 ^ ((*p2 == 'A' || *p2 == 'T') ? AT : CG);
>>         p1++;
>>         p2--;
>>     }
>>     return result.assumeUnique;
>> }
>>
>> In fact, when the size of the sequence is growing time 
>> difference between procedures is shrinking, so it's much more 
>> important to use memory-efficient presentation than to 
>> optimize logic.
>
> revComp7 is pretty fast, but I followed ag0aep6g's advice:
>
> On Friday, 19 May 2017 at 13:38:20 UTC, ag0aep6g wrote:
>> Use `static immutable` instead. It still forces compile-time 
>> calculation, but it doesn't have copy/paste behavior. Speeds 
>> up revComp3 a lot.
>
> Also, with DMD (2.073.0) -release -O instead of -debug from 
> this point. I'd blame someone else, but this is my fault. :p
>
> Anyways, full collection of the various versions I've written, 
> plus crimaniak's revComp7 (rebranded as revComp8, because I 
> already had 7 at the time):
>
> https://gist.github.com/Biotronic/20daaf0ed1262d313830bc8cd4199896
>
> Timings:
> revComp0:        158 ms, 926 us
> revComp1: 1 sec, 157 ms,  15 us
> revComp2:        604 ms,  37 us
> revComp3:         18 ms, 545 us
> revComp4:         92 ms, 293 us
> revComp5:         86 ms, 731 us
> revComp6:         94 ms,  56 us
> revComp7:        917 ms, 576 us
> revComp8:         62 ms, 917 us
>
> This actually matches my expectations - the table lookup 
> version should be crazy fast, and it is. It beats even your 
> revComp7 (revComp8 in the table).
>
> LDC (-release -O3) timings:
>
> revComp0: 166 ms, 190 us
> revComp1: 352 ms, 917 us
> revComp2: 300 ms, 493 us
> revComp3:  10 ms, 950 us
> revComp4: 148 ms, 106 us
> revComp5: 144 ms, 152 us
> revComp6: 142 ms, 307 us
> revComp7: 604 ms, 274 us
> revComp8:  26 ms, 612 us
>
> Interesting how revComp4-6 got slower. What I really wanted to 
> see with this though, was the effect on revComp1, which uses 
> ranges all the way.

Wow!!! Someone grab me a chair, I need to sit down. I can't tell 
enough how grateful I am to all you guys. This is so much fun to 
learn. Some specific comments and replies:

@Nicolas Wilson: Your explanation of the enum is clear and very 
helpful. I can recall to the same technique used in kh_hash in 
samtools and the associated. With that said, the chars enum is 
only to 'T' (85) elements.

  Regarding BioD, I have plan to work on it to add some more 
functionality. But first I need to sharpen my D skills a bit more.

@Laeeth Isharc: I do like ldc as well. I've came across several 
projects that use ldc, and learnt that it is a good choice for 
speed in general.

@ag0aep6g
>You fell into a trap there. The value is calculated at compile 
>time, but it has >copy/paste-like behavior. That is, whenever 
>you use `chars`, the code behaves as if you >typed out the array 
>literal. That means, the whole array is re-created on every 
>iteration.

>Use `static immutable` instead. It still forces compile-time 
>calculation, but it doesn't > have copy/paste behavior. Speeds 
>up revComp3 a lot.

With  'iteration' here you mean running lifetime of the function, 
or in other words, each one of the 10_000 cycles in the benchmark?

Could you provide some more reading for what you are telling 
here? I can only guess it is intrinsic behavior of an 'enum'.

@crimaniak, Nicolas Wilson and Biotronic: I've realized before 
the reversible/negate property of XOR: 'A'^'T'^'T' = 'A' and 
'A'^'T'^'A' = 'T'; To help myself and see it in bit patterns, I 
wrote this snippet:

void main(){

   enum AT = 'A'^'T';
   enum CG = 'C'^'G';
   enum chars = [Repeat!('A'-'\0', '\0'), 'T',
                 Repeat!('C'-'A'-1, '\0'), 'G',
                 Repeat!('G'-'C'-1, '\0'), 'C',
                 Repeat!('T'-'G'-1, '\0'), 'A'];

   writef("BIN %0 8b DEC %d\n", 'A', 'A');
   writef("BIN %0 8b DEC %d\n", 'T', 'T');
   writef("XOR %0 8b DEC %d\n", AT, AT);
   writef("TOR %0 8b DEC %d\n", AT^'T', AT^'T', AT^'T');
   writef("AOR %0 8b DEC %d\n", AT^'A', AT^'A', AT^'A');

   foreach (i, c; chars){
     if (i >= 60) writef("%02d: %0 8b, %d\n",i, c, c); // elements 
before 60 are all \0
   }
}

// Output
BIN 01000001 DEC 65
BIN 01010100 DEC 84
XOR 00010101 DEC 21
TOR 01000001 DEC 65
AOR 01010100 DEC 84
60: 00000000, 0
61: 00000000, 0
62: 00000000, 0
63: 00000000, 0
64: 00000000, 0
65: 01010100, 84
66: 00000000, 0
67: 01000111, 71
68: 00000000, 0
69: 00000000, 0
70: 00000000, 0
71: 01000011, 67
72: 00000000, 0
73: 00000000, 0
74: 00000000, 0
75: 00000000, 0
76: 00000000, 0
77: 00000000, 0
78: 00000000, 0
79: 00000000, 0
80: 00000000, 0
81: 00000000, 0
82: 00000000, 0
83: 00000000, 0
84: 01000001, 65