fasta parser with iopipe?

Steven Schveighoffer via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Aug 23 06:06:36 PDT 2017


On 8/23/17 5:53 AM, biocyberman wrote:
> I lost my momentum to learn D and want to regain it. Therefore I need 
> some help with this seemingly simple task:
> 
> # Fasta sequence
> 
> 
>> \>Entry1_ID header field1|header field2|...
>> CAGATATCTTTGATGTCCTGATTGGAAGGACCGTTGGCCCCCCACCCTTAGGCAG
>> TGTATACTCTTCCATAAACGAGCTATTAGTTATGAGGTCCGTAGATTGAAAAGGG
>> TGACGGAATTCGGCCGAACGGGAAAGACGGACATCTAGGTATCCTGAGCACGGTT
>> GCGCGTCCGTATCAAGCTCCTCTTTATAGGCCCCG
>> \>Entry2_ID header field1|header field4|...
>> GTTACTGTTGGTCGTAGAGCCCAGAACGGGTTGGGCAGATGTACGACAATATCGCT
>> TAGTCACCCTTGGGCCACGGTCCGCTACCTTACAGGAATTGAGA
> 
>> \>Entry3_ID header field1|header field2|...
>> GGCAGTACGATCGCACGCCCCACGTGAACGATTGGTAAACCCTGTGGCCTGTGAGC
>> GACAAAAGCTTTAATGGGAAATACGCGCCCATAACTTGGTGCGA
> 
> # Some characteristics:
> 
> - Entry_ID is >[[:alphanumeric:]], where '>' marks the entry start. In 
> this post I have to put an escape character (\) to make the '>' visible.
> - Headers may contain annotation information separated by some delimiter 
> (i.e. | in this case).
> - The entry ID and header are on a single line, which does not contain 
> newline characters.
> - Sequence under the header line is [ATCGN\n]* (Perl regex).
> - A fasta file can be plain-text or gzip compressed.
> 
> 
> # Goals:
> Write a parser that uses D ranges with the iopipe library for performance 
> and ease of use. A big fasta file can be dozens of gigabytes.
> 
> # Questions:
> 
> 1. How do I model a fasta entry with a struct or class?
> 2. How do I implement a range of fasta entries with iopipe? A range in 
> this case can be a forward range, but preferably a random access range.
> 3. I want to do this with ranges to explore their power and elegance. 
> But if performance is a big concern, what can I do alternatively?

I'll respond to all your questions with what I would do, instead of 
answering each one.

I would suggest an approach similar to how I approached parsing JSON 
data. In your case, the protocol is even simpler, as there is no nesting.

1. The base layer iopipe should be something that tokenizes the input 
into reference-based structs. If you look at the jsoniopipe library 
(https://github.com/schveiguy/jsoniopipe), you can see that the lowest 
level finds the start of the next JSON token. In your case, it should be 
looking for >[[:alphanumeric:]] (or possibly just >).

This code is pretty straightforward, and roughly corresponds to this 
pseudocode:

while(cannot find start sequence in stream)
     stream.extend;

Make sure you aren't re-doing work that has already been done (i.e. save 
the last place you looked).

Once you have this, you can delimit each entry by the data between 
consecutive starts.
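
In actual D, the search might look something like this (untested, and 
I'm assuming the array/string primitives from iopipe.traits so the same 
function works on plain strings; the name and bookkeeping are just for 
illustration):

import iopipe.traits;        // assumption: window/extend for arrays live here
import std.string : indexOf;

// Extend the pipe until the window contains the next '>' at or after `start`.
// Returns the index of that '>' in chain.window, or -1 if the stream ends first.
ptrdiff_t findEntryStart(Chain)(ref Chain chain, size_t start)
{
    size_t searched = start;    // don't re-scan data we've already looked at
    while (true)
    {
        auto idx = chain.window[searched .. $].indexOf('>');
        if (idx != -1)
            return cast(ptrdiff_t)(searched + idx);
        searched = chain.window.length;
        if (chain.extend(0) == 0)   // extend(0) means "read whatever is convenient"
            return -1;              // end of stream, no further entries
    }
}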

2. The next layer should validate and parse the data into structs that 
describe data in the buffer. I recommend not storing actual ranges 
(slices) of the buffer, but the information needed to build those 
ranges. The reason for this is that the buffer can move while being 
streamed by iopipe, so your data could become invalid if you keep actual 
references into the buffer. If you look in the jsoniopipe library, the 
struct for storing a JSON item has a start and length, but not a 
reference to the buffer.
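
For a fasta entry, that might look something like this (just an 
illustration of the idea, not anything that exists in iopipe or 
jsoniopipe):

// Offsets into the iopipe's window rather than slices of it, so the entry
// stays valid even if the buffer gets moved or reallocated between extends.
struct FastaEntry
{
    size_t idStart;      // just past the '>'
    size_t idLength;
    size_t headerStart;  // header fields on the same line as the ID
    size_t headerLength;
    size_t seqStart;     // first sequence line
    size_t seqLength;    // up to the next '>' (newlines included)
}

To get the actual text, you slice the current window on demand, e.g. 
chain.window[e.idStart .. e.idStart + e.idLength].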

Potentially, you could take this mechanism and build an iopipe on top of 
the buffered data. This iopipe's elements would be the items themselves, 
with the underlying buffer hidden in the implementation details. 
Extending would parse out another set of items; releasing would allow 
those items (and the underlying stream data) to be reclaimed.

This is something I actually wanted to explore with jsoniopipe but 
didn't have time before the conference. I probably will still build it.

3. Build your real code on top of that layer. What do you want to do 
with the data? The easiest thing to do for a proof of concept is to 
build a range out of these functions. That lets you test performance 
with your lower layers. One of the awesome things about iopipe is that 
testing correctness is really easy -- every string is also an iopipe :)
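
For example (untested, and building on the findEntryStart sketch above), 
a quick proof-of-concept range over entry start positions, plus a 
unittest that feeds it a string directly:

// Proof-of-concept input range of entry start offsets (real code would
// return parsed FastaEntry values instead; this just shows the shape).
struct EntryStarts(Chain)
{
    private Chain* chain;
    private ptrdiff_t current;

    this(ref Chain c)
    {
        chain = &c;
        current = findEntryStart(c, 0);
    }

    @property bool empty() { return current == -1; }
    @property size_t front() { return current; }
    void popFront() { current = findEntryStart(*chain, current + 1); }
}

unittest
{
    import std.string : indexOf;

    // a string is a valid iopipe, so no file I/O is needed to test
    auto data = ">Entry1_ID h1|h2\nCAGA\nTTTG\n>Entry2_ID h1\nGGTT\n";
    auto starts = EntryStarts!(typeof(data))(data);

    assert(!starts.empty && starts.front == 0);
    starts.popFront();
    assert(!starts.empty && starts.front == data.indexOf("\n>") + 1);
    starts.popFront();
    assert(starts.empty);
}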

I actually worked with a person at DConf on a similar (maybe identical?) 
format and explained how it could be done in a very similar way. He was 
looking to remove data that had a low percentage of correctness (or 
something like that; I'm not in bioinformatics, so I don't understand 
the real semantics).

With this mechanism in hand, decompression is pretty easy to chain in 
with whatever actual stream you have; just use iopipe.zip.
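
Roughly like this (untested; treat openDev and the file name as 
placeholders, since the way you open the device has varied between 
iopipe versions):

import iopipe.bufpipe : bufd;
import iopipe.zip : unzip, CompressionFormat;
import iopipe.textpipe : assumeText;
import iopipe.stream : openDev;   // placeholder: may differ in newer versions

void main()
{
    // raw file -> buffered ubyte pipe -> gzip decompression -> text window.
    // The fasta tokenizer works on this chain exactly as it would on plain text.
    auto chain = openDev("reads.fa.gz")   // hypothetical file name
        .bufd                             // buffered binary iopipe
        .unzip(CompressionFormat.gzip)    // transparent gzip decompression
        .assumeText;                      // view the decompressed bytes as chars

    // ... hand `chain` to the tokenizer / entry range sketched above ...
}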

Good luck, and email me if you need more help (schveiguy at yahoo.com).

-Steve

