fasta parser with iopipe?
Nicholas Wilson via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Wed Aug 23 04:07:55 PDT 2017
On Wednesday, 23 August 2017 at 09:53:49 UTC, biocyberman wrote:
> I lost my momentum to learn D and want to gain it up again.
> Therefore I need some help with this seemingly simple task:
>
> # Fasta sequence
>
>
>> \>Entry1_ID header field1|header field2|...
>> CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
>> TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
>> TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
>> GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
>> \>Entry2_ID header field1|header field4|...
>> GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
>> TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA
>
>> \>Entry3_ID header field1|header field2|...
>> GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
>> GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA
>
> # Some characteristics:
>
> - Entry_ID is >[[:alphanumeric:]]. Where '>' marks the entry
> start. In this post I have to put an escape character (\) to
> make the '>' visible.
> - Headers may contain annotation information separated by some
> delimiter (i.e. | in this case).
> - Entry ID and header is a single line, which does not contain
> newline characters.
> - Sequence under the header line is [ATCGN\n]* (Perl regex).
(If you know your sequence has no Ns or other ambiguous bases you
can store 4 bases to a byte, or 3 if you want them in
triplets.)
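A rough sketch of the 4-bases-to-a-byte idea, in case it helps. The
encoding (A=0, C=1, G=2, T=3) and the names baseCode, pack and unpack
are made up for illustration, not taken from any library, and it
assumes the sequence contains only upper-case A, C, G and T:

```d
/// Map one unambiguous base to a 2-bit code.
/// The encoding A=0, C=1, G=2, T=3 is an arbitrary choice for this sketch.
ubyte baseCode(char base)
{
    switch (base)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:
            throw new Exception("ambiguous or invalid base: " ~ base);
    }
}

/// Pack bases four to a byte, first base in the lowest two bits.
ubyte[] pack(const(char)[] seq)
{
    auto packed = new ubyte[(seq.length + 3) / 4];
    foreach (i, c; seq)
    {
        immutable shift = cast(uint)(i % 4) * 2;
        packed[i / 4] |= cast(ubyte)(baseCode(c) << shift);
    }
    return packed;
}

/// Recover the bases. The base count must be passed in because the
/// last byte may be only partly used.
char[] unpack(const(ubyte)[] packed, size_t length)
{
    static immutable char[4] alphabet = "ACGT";
    auto seq = new char[length];
    foreach (i; 0 .. length)
    {
        immutable shift = cast(uint)(i % 4) * 2;
        seq[i] = alphabet[(packed[i / 4] >> shift) & 3];
    }
    return seq;
}
```

With this layout a 7-base sequence such as "GATTACA" takes 2 bytes
instead of 7, at the cost of having to store the base count
separately.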
> - A fasta file can be plain-text or gzip compressed.
>
>
> # Goals:
> Write a parser that uses Dlang range with iopipe library for
> performance and ease of use. A big fasta file can be dozens of
> gigabytes.
>
> # Questions:
>
> 1. How do I model a fasta entry with a struct or class?
You could model the headers as a struct if you know the format,
but most of the time I think they are just ignored. If you really
need them, just put them in a string[string].
The real info is in the sequences, which are just strings of
characters (not UTF-8, so I'd go with ubyte or some compressed
form).
I'd go with
    struct FastaEntry
    {
        string id;
        string[string] headers;
        ubyte[] sequence;
    }
and then a fasta file is an array of entries. Note that with the
expected usage patterns you probably want this in struct-of-arrays
(SoA) form (see https://maikklein.github.io/post/soa-d/), given
that unless you're printing it out you're unlikely to need more
than one field at a time.
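To make the SoA suggestion concrete, here is a minimal hand-written
sketch. FastaEntries, add and totalBases are hypothetical names for
this example only; the linked post shows a more generic way to do
this.

```d
/// Struct-of-arrays layout: one array per field instead of one
/// array of FastaEntry structs. Index i refers to the same entry
/// in all three arrays.
struct FastaEntries
{
    string[] ids;                // ids[i] is the ID of entry i
    string[string][] headers;    // headers[i] holds its annotations
    ubyte[][] sequences;         // sequences[i] holds its bases

    /// Append one entry, keeping the three arrays the same length.
    void add(string id, string[string] header, ubyte[] sequence)
    {
        ids ~= id;
        headers ~= header;
        sequences ~= sequence;
    }

    size_t length() const { return ids.length; }
}

/// Example of a pass that needs only one field: it walks the
/// sequences array and never touches ids or headers.
size_t totalBases(ref const FastaEntries entries)
{
    size_t total;
    foreach (seq; entries.sequences)
        total += seq.length;
    return total;
}
```

A pass like totalBases only reads the sequences array, which is the
access pattern where SoA tends to pay off.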
> 2. How do I implement a range of fasta entries with iopipe? A
> range in this case can be a forward range, but preferably a
> random access range.
Given the relative simplicity of the file format, I don't think
it should be too hard. iopipe would give you a streaming
interface that works with gzip compression, but it wouldn't give
you a random access range. You can keep the most recent entries
for look-back, but for real random access (e.g. if you're trying
to align them) you need them all in an array; see the comment
above about SoA.
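As a starting point, here is a sketch of the range logic itself. It
deliberately uses only Phobos and takes any input range of lines, so
it does not show iopipe's API. Feeding it lines from an iopipe
pipeline instead of std.string.lineSplitter is an assumption I have
not tested. FastaRecord, FastaRange and fastaEntries are made-up
names, and it is a plain input range (no save, no random access):

```d
import std.algorithm.searching : findSplit, startsWith;
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.string : lineSplitter;

struct FastaRecord
{
    string id;        // text between '>' and the first space
    string header;    // rest of the header line, left unparsed
    ubyte[] sequence; // all sequence lines of the entry, joined
}

/// Input range of FastaRecord built on top of a range of lines.
struct FastaRange(Lines)
    if (isInputRange!Lines && is(ElementType!Lines : const(char)[]))
{
    private Lines lines;
    private FastaRecord current;
    private bool done;

    this(Lines lines)
    {
        this.lines = lines;
        // Skip anything before the first header line.
        while (!this.lines.empty && !this.lines.front.startsWith(">"))
            this.lines.popFront();
        popFront(); // load the first record
    }

    bool empty() const { return done; }
    FastaRecord front() { return current; }

    void popFront()
    {
        if (lines.empty) { done = true; return; }
        // The current line is a header: ">ID rest of header".
        auto parts = lines.front[1 .. $].findSplit(" ");
        current = FastaRecord(parts[0].idup, parts[2].idup, null);
        lines.popFront();
        // Join sequence lines until the next header or end of input.
        while (!lines.empty && !lines.front.startsWith(">"))
        {
            current.sequence ~= cast(const(ubyte)[]) lines.front;
            lines.popFront();
        }
    }
}

/// Convenience constructor so the line range type is inferred.
auto fastaEntries(Lines)(Lines lines)
{
    return FastaRange!Lines(lines);
}
```

Usage would be something like
foreach (rec; fastaEntries(text.lineSplitter)) { ... }. Each record
owns a freshly allocated sequence, which is simple but allocates per
entry; for files of dozens of gigabytes you would want to reuse a
buffer instead.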
> 3. I want to do this with ranges to explore the power and elegance of
> ranges. But if performance is a big concern, what can I do
> alternatively?
You can probably use iopipe for parsing. I'd take a look at the
https://github.com/schveiguy/iopipe documentation and see what
you can find.
OT: I'm taking "systems biology and bioinformatics" this semester,
so I'll be very interested to see what you do. Partly that's for
the bioinformatics stuff, but I will also be giving a talk (a
series, I hope) to the staff and some students at my university
about computation for biology. It will be rather introductory, as
it's for the biology department, but I still hope to get them
interested in computational biology and D, so if you have some
good examples please do let me know.
Nic