fasta parser with iopipe?

Wed Aug 23 04:07:55 PDT 2017

On Wednesday, 23 August 2017 at 09:53:49 UTC, biocyberman wrote:
> I lost my momentum to learn D and want to gain it up again. 
> Therefore I need some help with this seemingly simple task:
>
> # Fasta sequence
>
>
>> \>Entry1_ID header field1|header field2|...
>> CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
>> TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
>> TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
>> GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
>> \>Entry2_ID header field1|header field4|...
>> GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
>> TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA
>
>> \>Entry3_ID header field1|header field2|...
>> GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
>> GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA
>
> # Some characteristics:
>
> - Entry_ID is >[[:alphanumeric:]]. Where '>' marks the entry 
> start. In this post I have to put an escape character (\) to 
> make the '>' visible.
> - Headers may contain annotation information separated by some 
> delimiter (i.e. | in this case).
> - Entry ID and header is a single line, which does not contain 
> newline characters.
> - Sequence under the header line is [ATCGN\n]* (Perl regex).

(If you know your sequence has no Ns or other ambiguous bases, you 
can store 4 bases to a byte at 2 bits each, or 3 to a byte if you 
want to keep them in triplets, since a codon fits in 6 bits.)
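
A rough sketch of the 4-bases-per-byte packing, assuming the sequence 
contains only A, C, G and T. The names baseCode, pack2bit and baseAt 
are made up for illustration, and the bit layout (first base in the 
lowest two bits) is just one possible choice:

```d
// Hypothetical helpers: map each unambiguous base to a 2-bit code.
ubyte baseCode(char b) pure @safe
{
    switch (b)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default: throw new Exception("ambiguous or invalid base");
    }
}

// Pack 4 bases per byte; base i lives in bits (i % 4) * 2 of byte i / 4.
ubyte[] pack2bit(const(char)[] seq) pure @safe
{
    auto packed = new ubyte[(seq.length + 3) / 4];
    foreach (i, b; seq)
        packed[i / 4] |= cast(ubyte)(baseCode(b) << ((i % 4) * 2));
    return packed;
}

// Read base i back out of the packed form.
char baseAt(const(ubyte)[] packed, size_t i) pure @safe
{
    return "ACGT"[(packed[i / 4] >> ((i % 4) * 2)) & 3];
}

unittest
{
    auto p = pack2bit("ACGTA");
    assert(p.length == 2);      // 5 bases need 2 bytes
    assert(baseAt(p, 2) == 'G');
}
```

Note that the packed array alone does not record the sequence length, 
so you would have to store that next to it.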

> - A fasta file can be plain-text or gzip compressed.
>
>
> # Goals:
> Write a parser that uses Dlang range with iopipe library for 
> performance and ease of use. A big fasta file can be dozens of 
> gigabytes.
>
> # Questions:
>
> 1. How do I model a fasta entry with a struct or class?

You could model the headers as a struct if you know the format, 
but most of the time I think they are just ignored. If you really 
need them, just put them in a string[string].
The real info is in the sequences, which are just strings of 
characters (not UTF-8, so I'd go with ubyte or some compressed 
form).

I'd go with

struct FastaEntry
{
     string id;               // Entry_ID from the '>' line
     string[string] headers;  // header annotations, if you need them
     ubyte[] sequence;        // raw bases, one byte each
}

and then a fasta file is an array of entries. Note that with the 
expected usage patterns you probably want this in struct of arrays 
(SoA) form (see https://maikklein.github.io/post/soa-d/), given 
that, unless you're printing it out, you're unlikely to need more 
than one field at a time.
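
As a minimal sketch of what the SoA form could look like: one array 
per field instead of one array of structs. FastaSoA, add and 
gcFraction are names I made up here, and the GC count is only an 
example of a pass that touches a single field:

```d
// Hypothetical SoA container: entry i is (ids[i], headers[i], sequences[i]).
struct FastaSoA
{
    string[] ids;
    string[string][] headers;
    ubyte[][] sequences;

    size_t length() const { return ids.length; }

    void add(string id, string[string] hdr, ubyte[] seq)
    {
        ids ~= id;
        headers ~= hdr;
        sequences ~= seq;
    }
}

// Walks only the sequences array; ids and headers are never loaded.
double gcFraction(const(ubyte[])[] sequences)
{
    size_t gc, total;
    foreach (seq; sequences)
    {
        total += seq.length;
        foreach (b; seq)
            if (b == 'G' || b == 'C')
                ++gc;
    }
    return total ? cast(double) gc / total : 0.0;
}
```

You would call it as gcFraction(fa.sequences) on a filled FastaSoA.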

> 2. How to I implement a range of fasta entries with iopipe. A 
> range in this case can be a forward range, but preferably a 
> random access range.

Given the relative simplicity of the file format, I don't think 
it should be too hard. iopipe would give you a streaming 
interface for it that would work with gzip compression, but it 
wouldn't get you a random access range. You can keep the most 
recent entries for look-back, but for real random access (e.g. if 
you're trying to align them) you need them all in an array; see 
the above comment about SoA.
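
To show the streaming side, here is a sketch of an input range of 
entries built on top of any range of lines. It uses plain Phobos 
rather than iopipe, on the assumption that a line range obtained 
from iopipe could be swapped in as the source. FastaRecord, 
FastaRange, fastaRecords and isHeaderLine are hypothetical names, 
and lines are assumed to arrive without their terminators:

```d
import std.array : appender, array;
import std.range.primitives;
import std.string : lineSplitter, representation;

struct FastaRecord
{
    string header;     // header line without the leading '>'
    ubyte[] sequence;  // sequence lines concatenated
}

bool isHeaderLine(const(char)[] line) pure nothrow @nogc @safe
{
    return line.length > 0 && line[0] == '>';
}

// Input range of records over any input range of text lines.
struct FastaRange(Lines)
if (isInputRange!Lines && is(ElementType!Lines : const(char)[]))
{
    private Lines lines;
    private FastaRecord current;
    private bool done;

    this(Lines lines)
    {
        this.lines = lines;
        // Skip anything that precedes the first header line.
        while (!this.lines.empty && !isHeaderLine(this.lines.front))
            this.lines.popFront();
        popFront();
    }

    bool empty() const { return done; }
    FastaRecord front() { return current; }

    void popFront()
    {
        if (lines.empty) { done = true; return; }
        // Copy the data, since the source may reuse its line buffer.
        current.header = lines.front[1 .. $].idup;
        lines.popFront();
        auto seq = appender!(ubyte[]);
        while (!lines.empty && !isHeaderLine(lines.front))
        {
            seq.put(lines.front.representation);
            lines.popFront();
        }
        current.sequence = seq.data;
    }
}

auto fastaRecords(Lines)(Lines lines)
{
    return FastaRange!Lines(lines);
}

unittest
{
    auto text = ">Entry1_ID f1|f2\nCAGA\nTTGA\n>Entry2_ID f1|f4\nGTTA\n";
    // .array collects everything, which is what random access needs.
    auto recs = fastaRecords(text.lineSplitter).array;
    assert(recs.length == 2);
    assert(recs[0].sequence == "CAGATTGA".representation);
}
```

Only one record is held at a time while streaming, so memory stays 
bounded by the longest entry until you decide to collect them.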

> 3. I want to do with range to explore the power and elegance of 
> ranges. But if performance is a big concern, what can I do 
> alternatively?

You can probably use it for parsing. I'd take a look at the 
https://github.com/schveiguy/iopipe documentation and see what 
you can find.

OT: I'm taking "systems biology and bioinformatics" this semester, 
so I'll be very interested to see what you do. Partly that's for 
the bioinformatics itself, but I will also be giving a talk (a 
series, I hope) to the staff and some students at my university 
about computation for biology. It will be rather introductory, as 
it is for the biology department, but I still hope to get them 
interested in computational biology and D, so if you have some 
good examples, please do let me know.

Nic