Code improvement for DNA reverse complement?

Nicholas Wilson via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Fri May 19 06:36:32 PDT 2017


On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
> On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
>> On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
>>> [...]
>>
>> Question about your implementation: you assume the input may 
>> contain newlines, but don't handle any other non-ACGT 
>> characters. The problem definition states 'DNA string' and the 
>> sample dataset contains no non-ACGT chars. Is this an 
>> oversight on my part or yours, or did you just decide to support 
>> more than the problem requires?
>>
>> [...]
>
> Firstly, thank you for showing me various solutions, and even 
> cool benchmark code. To answer your questions: Yes, I assume the 
> input file would realistically contain newlines, even though 
> the problem does not care about them. I also thought about 
> non-CATG bases, but haven't taken care of those cases. In 
> reality we should deal with at least ambiguous bases (N).
>
> I ran your code and also see that switch is faster than AA 
> (i.e. revComp0 is the fastest). And Stefan is right about this.
> 
> Some follow up questions:
>
> 1. Why do we need to use assumeUnique in 'revComp0' and 
> 'revComp3'?
>

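Because `char[] result = new char[N];` is a mutable `char[]`, not 
a string (a.k.a. immutable(char)[]), so it cannot be returned as 
a string without a conversion.
But because it was freshly allocated from the GC in this function 
and no other reference to it exists, we know that it is safe to 
assume that it is unique and can be treated as a string.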
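
A minimal sketch of that pattern, for illustration only: 
`revCompSketch` is a hypothetical name (not the `revComp0` from 
this thread), and passing non-ACGT characters through unchanged is 
my assumption, not something the problem specifies.

```d
string revCompSketch(string bps)
{
    import std.exception : assumeUnique;

    // Mutable buffer: char[], not string, so we can write into it.
    char[] result = new char[bps.length];
    foreach (i, c; bps)
    {
        char comp;
        switch (c)
        {
            case 'A': comp = 'T'; break;
            case 'C': comp = 'G'; break;
            case 'G': comp = 'C'; break;
            case 'T': comp = 'A'; break;
            default:  comp = c;   break; // assumption: keep anything else as-is
        }
        // Write from the back so the output is reversed.
        result[$ - 1 - i] = comp;
    }
    // `result` never escaped this function, so treating it as
    // immutable is safe. assumeUnique also nulls out `result`.
    return assumeUnique(result);
}
```

Without the `assumeUnique` call the return statement would not 
compile, since `char[]` does not implicitly convert to `string`.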

> 2. What is going on with the trick of making chars enum like 
> that in 'revComp3'?

What revComp3 is doing is effectively creating a lookup table with 
an entry for each possible value of char, matching the behaviour of 
the switch. It could also be rewritten as
```
char[256] chars = '\0'; // explicit: D's char.init is 0xFF, not '\0'
chars['A'] = 'T';
chars['C'] = 'G';
chars['G'] = 'C';
chars['T'] = 'A';
```
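
For completeness, here is a sketch of how such a table might be 
used for the whole reverse complement. `revCompTable` and `table` 
are hypothetical names, and building the table once at compile 
time via a lambda is my choice here, not necessarily what revComp3 
does.

```d
string revCompTable(string bps)
{
    import std.exception : assumeUnique;

    // Built once at compile time; entries other than ACGT stay '\0'.
    static immutable char[256] table = () {
        char[256] t = '\0';
        t['A'] = 'T';
        t['C'] = 'G';
        t['G'] = 'C';
        t['T'] = 'A';
        return t;
    }();

    char[] result = new char[bps.length];
    foreach (i; 0 .. bps.length)
    {
        // Read the input back to front, complementing via the table.
        result[i] = table[bps[$ - 1 - i]];
    }
    return assumeUnique(result);
}
```

Note that any character outside ACGT (including a newline) comes 
out as '\0' with this table, so the input would need to be cleaned 
first.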

Other miscellaneous comments:

If you haven't already, check out 
[BioD](https://github.com/biod/BioD) for most (all?) of your 
bioinformatics needs.

If you're trying to be fast you probably don't want to use string 
for internal calculations, as it wastes most of each byte (ACGT 
needs only 2 bits out of 8, an ambiguous encoding 4 out of 8).
I would have at least 2 "Dictionaries": one for the standard 
nucleotides (ACGT) and another for your ambiguous representations 
(UNRYBDHVMKSW-) plus the standard nucleotides, to get a better 
information density. If you're doing anything with protein 
sequences then you should use a translation table anyway, as the 
DNA -> amino acid mapping changes between species/organelle 
(mt|cp|n)DNA.
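
To illustrate the 2-bit idea, a rough sketch that packs four bases 
into each byte. The names `pack2bit` and `baseAt` and the encoding 
A=0, C=1, G=2, T=3 are all assumptions of mine; this is not BioD's 
format. One nice property of this particular encoding is that the 
complement of a base is just its code XOR 3.

```d
// Pack an ACGT-only sequence at 2 bits per base, 4 bases per byte.
ubyte[] pack2bit(string bps)
{
    auto packed = new ubyte[(bps.length + 3) / 4]; // zero-initialised
    foreach (i, c; bps)
    {
        ubyte code;
        switch (c)
        {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: throw new Exception("not an ACGT base");
        }
        // Base i lives in byte i / 4, at bit offset (i % 4) * 2.
        packed[i / 4] |= cast(ubyte)(code << ((i % 4) * 2));
    }
    return packed;
}

// Read base i back out of the packed representation.
char baseAt(const(ubyte)[] packed, size_t i)
{
    return "ACGT"[(packed[i / 4] >> ((i % 4) * 2)) & 3];
}
```

The sequence length has to be stored separately, since the last 
byte may be only partly filled.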

