fasta parser with iopipe?

biocyberman via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Aug 23 02:53:49 PDT 2017


I lost my momentum learning D and want to regain it. 
So I need some help with this seemingly simple task:

# Fasta sequence


> \>Entry1_ID header field1|header field2|...
> CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
> TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
> TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
> GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
> \>Entry2_ID header field1|header field4|...
> GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
> TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA

> \>Entry3_ID header field1|header field2|...
> GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
> GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA

# Some characteristics:

- Entry_ID matches >[[:alnum:]_]+ (POSIX character class), where 
'>' marks the start of an entry. In this post I have to put an 
escape character (\) in front of the '>' to make it visible.
- Headers may contain annotation information separated by some 
delimiter (i.e. | in this case).
- The entry ID and header together form a single line, which does 
not contain newline characters.
- Sequence under the header line is [ATCGN\n]* (Perl regex).
- A fasta file can be plain-text or gzip compressed.
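
To make the characteristics above concrete, here is a rough sketch of how one entry might be modeled in D. It uses only Phobos. The names `FastaEntry` and `parseHeader` are made up for illustration, and it assumes the ID ends at the first space and that the header fields are separated by `|`, as in the example above.

```d
import std.algorithm.searching : findSplit;
import std.array : split;

/// One FASTA record. Field names are illustrative, not from any library.
struct FastaEntry
{
    string id;        // text between '>' and the first space (assumption)
    string[] fields;  // rest of the header line, split on '|'
    string sequence;  // sequence lines joined together, newlines removed
}

/// Parse a header line such as ">Entry1_ID header field1|header field2".
/// Only the id and fields are filled in; the sequence is left empty.
FastaEntry parseHeader(string line)
{
    assert(line.length > 0 && line[0] == '>', "header must start with '>'");

    // Split once at the first space: parts[0] is the ID, parts[2] the rest.
    auto parts = line[1 .. $].findSplit(" ");

    FastaEntry e;
    e.id = parts[0];
    e.fields = parts[2].length ? parts[2].split("|") : null;
    return e;
}
```

A struct seems enough here since an entry is plain data; whether slices into a shared buffer would be better than owned strings for very large files is an open question.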


# Goals:
Write a parser that exposes a D range and uses the iopipe library 
for performance and ease of use. A big fasta file can be dozens of 
gigabytes.
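
As a baseline for the goal above, here is a minimal sketch of an input range of records built on any range of lines, using only Phobos. It does not use iopipe (its API is not assumed here) and does not handle gzip. `FastaRecord`, `FastaRange` and `fastaRecords` are hypothetical names. Only one record is held in memory at a time, so file size should not matter much, but each record is copied, which is probably not the fastest approach.

```d
import std.array : appender;
import std.range.primitives : empty, front, popFront;
import std.string : strip;

/// Illustrative record type: the raw header (without '>') and the sequence.
struct FastaRecord
{
    string header;
    string sequence;
}

/// Input range of FastaRecord over any input range of lines
/// (for example string[] or File.byLine).
struct FastaRange(Lines)
{
    private Lines lines;
    private FastaRecord current;
    private bool done;

    this(Lines lines)
    {
        this.lines = lines;
        // Skip anything that precedes the first header line.
        while (!this.lines.empty && !isHeader(this.lines.front))
            this.lines.popFront();
        popFront(); // load the first record, if there is one
    }

    @property bool empty() const { return done; }
    @property FastaRecord front() { return current; }

    void popFront()
    {
        if (lines.empty)
        {
            done = true;
            return;
        }
        // The current line is a header: copy it without the leading '>'.
        current.header = lines.front[1 .. $].idup;
        lines.popFront();

        // Join all following lines up to the next header or end of input.
        auto seq = appender!string();
        while (!lines.empty && !isHeader(lines.front))
        {
            seq.put(lines.front.strip); // drops '\r' and blank lines
            lines.popFront();
        }
        current.sequence = seq.data;
    }

    private static bool isHeader(const(char)[] line)
    {
        return line.length > 0 && line[0] == '>';
    }
}

/// Convenience constructor so the line-range type is inferred.
auto fastaRecords(Lines)(Lines lines)
{
    return FastaRange!Lines(lines);
}
```

For a plain-text file this could be used as `foreach (rec; fastaRecords(File("seqs.fa").byLine))` after importing `std.stdio : File`, where `seqs.fa` is a placeholder file name. This gives an input range only; random access would need an index of entry offsets built in a first pass.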

# Questions:

1. How do I model a fasta entry with a struct or class?
2. How do I implement a range of fasta entries with iopipe? The 
range can be a forward range, but preferably a random access 
range.
3. I want to do this with ranges to explore their power and 
elegance. But if performance is a big concern, what can I do 
instead?



