Code improvement for DNA reverse complement?
Nicholas Wilson via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Fri May 19 06:36:32 PDT 2017
On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
> On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
>> On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
>>> [...]
>>
>> Question about your implementation: you assume the input may
>> contain newlines, but don't handle any other non-ACGT
>> characters. The problem definition states 'DNA string' and the
>> sample dataset contains no non-ACGT chars. Is this an
>> oversight on my part or yours, or did you just decide to support
>> more than the problem requires?
>>
>> [...]
>
> Firstly, thank you for showing me various solutions, and even
> cool benchmark code. To answer your questions: Yes, I assume the
> input file would realistically contain newlines, even though
> the problem does not care about them. I also thought about
> non-CATG bases, but haven't taken care of those cases. In
> reality we should deal with at least ambiguous bases (N).
>
> I ran your code and also see that switch is faster than AA
> (i.e. revComp0 is the fastest). And Stefan is right about this.
>
> Some follow up questions:
>
> 1. Why do we need to use assumeUnique in 'revComp0' and
> 'revComp3'?
>
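Because `char[] result = new char[N];` is a mutable char[], not a
string (a.k.a. immutable(char)[]), and a char[] does not implicitly
convert to a string.
But because the array was allocated from the GC inside this function
and nothing else holds a reference to it, we know that it is safe to
assume it is unique and return it as a string.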
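To make that concrete, here is a minimal sketch. The function name
`reversed` is made up for illustration and is not one of the revCompN
functions from this thread; only `std.exception.assumeUnique` is the
real library call.

```d
import std.exception : assumeUnique;

// Hypothetical helper: returns the input reversed, as a string.
string reversed(string s)
{
    // Mutable buffer: we have to write into it, so it cannot be a
    // string (immutable(char)[]) yet.
    char[] result = new char[s.length];
    foreach (i, c; s)
        result[s.length - i - 1] = c;

    // `return result;` would be rejected here, since char[] does not
    // implicitly convert to string. assumeUnique casts to immutable
    // and sets `result` to null, so no mutable alias is left behind.
    return assumeUnique(result);
}

unittest
{
    assert(reversed("ACGT") == "TGCA");
}
```

The promise is yours to keep: if another mutable reference to the
buffer escaped before the call, the "immutable" string could still
change underneath its users.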
> 2. What is going on with the trick of making chars enum like
> that in 'revComp3'?
revComp3 effectively builds a lookup table with one entry for each
possible value of char, matching the behaviour of the switch.
It could also be written as
```
char[256] chars = '\0'; // set explicitly: char.init is 0xFF, not '\0'
chars['A'] = 'T';
chars['C'] = 'G';
chars['G'] = 'C';
chars['T'] = 'A';
```
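For completeness, a sketch of how such a table might be used in a
full reverse-complement function. The names `makeComplementTable`,
`complementTable` and `revCompTable` are my own, and this is not a
copy of Biotronic's revComp3. Bytes other than A, C, G, T map to
'\0' here.

```d
import std.exception : assumeUnique;

// Hypothetical helper: builds the 256-entry table. It is evaluated
// at compile time (CTFE) by the static initializer below.
char[256] makeComplementTable()
{
    char[256] t = '\0'; // char.init is 0xFF, so zero it explicitly
    t['A'] = 'T';
    t['C'] = 'G';
    t['G'] = 'C';
    t['T'] = 'A';
    return t;
}

static immutable char[256] complementTable = makeComplementTable();

// Hypothetical name: walks the input backwards and looks up each base.
string revCompTable(string bps)
{
    const n = bps.length;
    char[] result = new char[n];
    foreach (i; 0 .. n)
        result[i] = complementTable[bps[n - i - 1]];
    return assumeUnique(result);
}

unittest
{
    assert(revCompTable("AAAACCCGGT") == "ACCGGGTTTT");
}
```

Since a char can only hold 0..255, indexing a 256-entry table with it
can never go out of bounds.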
Other miscellaneous comments:
If you haven't already, check out
[BioD](https://github.com/biod/BioD) for most (all?) of your
bioinformatics needs.
If you're trying to be fast you probably don't want to use string
for internal calculations, as it wastes most of its bits: each char
takes 8 bits, but ACGT needs only 2 and an encoding with ambiguity
codes needs only 4.
I would have at least 2 "Dictionaries": one for the standard
nucleotides (ACGT) and another for your ambiguous representations
(UNRYBDHVMKSW-) and the standard nucleotides, to get a better
information density. If you're doing anything with protein
sequences then you should use a translation table anyway as the
DNA -> amino acid mapping changes between species/organelle
(mt|cp|n)DNA.
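As a rough sketch of the 2-bit idea for plain ACGT: the encoding
A=0, C=1, G=2, T=3 is an assumption of mine (chosen so that the
complement of a code is 3 - code), and all function names below are
made up for illustration. BioD may well do this differently.

```d
// Assumed encoding: A=0, C=1, G=2, T=3.
ubyte encodeBase(char c)
{
    switch (c)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default: throw new Exception("not an unambiguous base");
    }
}

// With this encoding, complementing is just 3 - code.
ubyte complementCode(ubyte code)
{
    return cast(ubyte)(3 - code);
}

// Pack four bases per byte, first base in the lowest two bits.
ubyte[] pack(string bps)
{
    auto packed = new ubyte[(bps.length + 3) / 4]; // zero-initialised
    foreach (i, c; bps)
    {
        const shift = cast(int)(2 * (i % 4));
        packed[i / 4] |= cast(ubyte)(encodeBase(c) << shift);
    }
    return packed;
}

// Read the i-th base back as a character.
char baseAt(const ubyte[] packed, size_t i)
{
    const shift = cast(int)(2 * (i % 4));
    return "ACGT"[(packed[i / 4] >> shift) & 3];
}

unittest
{
    auto p = pack("ACGTA");
    assert(p.length == 2);      // 5 bases fit in 2 bytes instead of 5
    assert(baseAt(p, 2) == 'G');
    assert(complementCode(encodeBase('A')) == encodeBase('T'));
}
```

The packed array does not record how many bases it holds, so the
length has to be stored alongside it; an ambiguous alphabet would
need a second, 4-bit table along the same lines.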