Code improvement for DNA reverse complement?
Biotronic via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon May 22 03:21:13 PDT 2017
On Monday, 22 May 2017 at 08:58:24 UTC, biocyberman wrote:
> @Nicolas Wilson: Your explanation of the enum is clear and very
> helpful. I can recall to the same technique used in kh_hash in
> samtools and the associated. With that said, the chars enum is
> only to 'T' (85) elements.
The reason for having only 85 elements in the array was pure
laziness - the problem description seems to forbid non-ACGT
letters, so I saw no reason to write more code to handle them. :p
My code will crash if the input string contains lower-case
letters, Zs, or any other letter beyond 'T' (or read random bits
of memory if you're lucky).
> @ag0aep6g
>>You fell into a trap there. The value is calculated at compile
>>time, but it has >copy/paste-like behavior. That is, whenever
>>you use `chars`, the code behaves as if you >typed out the
>>array literal. That means, the whole array is re-created on
>>every iteration.
>
>>Use `static immutable` instead. It still forces compile-time
>>calculation, but it doesn't > have copy/paste behavior. Speeds
>>up revComp3 a lot.
>
> With 'iteration' here you mean running lifetime of the
> function, or in other words, each one of the 10_000 cycles in
> the benchmark?
>
> Could you provide some more reading for what you are telling
> here? I can only guess it is intrinsic behavior of an 'enum'.
ag0aep6g is absolutely correct in his observation, and the
resulting code is basically this:
string revComp3(string bps) {
const N = bps.length;
static immutable chars_saved = [Repeat!('A'-'\0', '\0'), 'T',
Repeat!('C'-'A'-1, '\0'), 'G',
Repeat!('G'-'C'-1, '\0'), 'C',
Repeat!('T'-'G'-1, '\0'), 'A'];
char[] result = new char[N];
for (int i = 0; i < N; ++i) {
auto chars = chars_saved.dup; // Bad stuff happens here
result[i] = chars[bps[N-i-1]];
}
return result.assumeUnique;
}
As we can see, it copies the entire array for every character in
the input string. That's basically an allocation and a memcpy in
the innermost, hottest loop. Roughly as bad as it gets. (yup,
that's 20 million allocations)
As for why this happens, enum can be thought of as the analog of
C's #define - the compiler precalculates the data to fill the
array, and then copies that into the source code every time it's
used.
More information about the Digitalmars-d-learn
mailing list