fasta parser with iopipe?

Mon Aug 28 07:08:25 PDT 2017

On Wednesday, 23 August 2017 at 13:06:36 UTC, Steven 
Schveighoffer wrote:
> On 8/23/17 5:53 AM, biocyberman wrote:
>> [...]
>
> I'll respond to all your questions with what I would do, 
> instead of answering each one.
>
> I would suggest an approach similar to how I approached parsing 
> JSON data. In your case, the protocol is even simpler, as there 
> is no nesting.
>
> 1. The base layer iopipe should be something that tokenizes the 
> input into reference-based structs. If you look at the 
> jsoniopipe library (https://github.com/schveiguy/jsoniopipe), 
> you can see that the lowest level finds the start of the next 
> JSON token. In your case, it should be looking for
> >[...]
>
> This code is pretty straightforward, and roughly corresponds to 
> this:
> while(cannot find start sequence in stream)
>     stream.extend;
> Make sure you aren't re-doing work that has already been done 
> (i.e. save the last place you looked).
>
> Once you have this, you can deduce each packet by the data 
> between the starts.
>
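
A minimal sketch of the scanning loop from point 1, written in Python rather than D and not using the iopipe API. `RecordScanner`, `extend`, `next_start` and `records` are hypothetical names, and the sketch assumes a record starts at a '>' that is the first character of a line. It remembers how far it has already searched, so extending the buffer never rescans old data. Unlike a real iopipe it never releases consumed data.

```python
import io


class RecordScanner:
    """Hypothetical helper: finds record starts ('>' at the start of a line)."""

    def __init__(self, stream, chunk_size=4096):
        self.stream = stream
        self.chunk_size = chunk_size
        self.buf = ""     # buffered window of the input
        self.scanned = 0  # everything before this index was already searched

    def extend(self):
        """Pull more data into the buffer; return the number of chars added."""
        chunk = self.stream.read(self.chunk_size)
        self.buf += chunk
        return len(chunk)

    def next_start(self, begin):
        """Index of the next record start at or after `begin`, or -1 at EOF."""
        self.scanned = max(self.scanned, begin)
        while True:
            i = self.scanned
            while i < len(self.buf):
                if self.buf[i] == ">" and (i == 0 or self.buf[i - 1] == "\n"):
                    self.scanned = i
                    return i
                i += 1
            self.scanned = i  # save the last place we looked
            if self.extend() == 0:
                return -1     # no start sequence left in the stream


def records(stream, chunk_size=4096):
    """Yield the raw text of each record: the data between two starts."""
    sc = RecordScanner(stream, chunk_size)
    start = sc.next_start(0)
    while start != -1:
        end = sc.next_start(start + 1)
        yield sc.buf[start:end if end != -1 else len(sc.buf)]
        start = end
```

Feeding it a tiny chunk size, e.g. `list(records(io.StringIO(text), chunk_size=3))`, is a cheap way to check that records spanning several extends are still cut correctly.
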
> 2. The next layer should validate and parse the data into 
> structs that contain referencing data from the buffer. I 
> recommend not using actual ranges from the buffer, but 
> information on how to build the ranges. The reason for this is 
> that the buffer can move while being streamed by iopipe, so 
> your data could become invalid if you take actual references to 
> the buffer. If you look in the jsoniopipe library, the struct 
> for storing a json item has a start and length, but not a 
> reference to the buffer.
>
> Potentially, you could take this mechanism and build an iopipe 
> on top of the buffered data. This iopipe's elements would be 
> the items themselves, with the underlying buffer hidden in the 
> implementation details. Extending would parse out another set 
> of items, releasing would allow those items to get reclaimed 
> (and the underlying stream data).
>
> This is something I actually wanted to explore with jsoniopipe 
> but didn't have time before the conference. I probably will 
> still build it.
>
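
To make the "start and length, not a reference" idea from point 2 concrete, here is a small Python sketch. It is not the jsoniopipe struct; `FastaItem`, `parse_item`, `header`, `sequence` and `released` are hypothetical names. The item only stores offsets into the current window, and the actual text is built on demand from whatever buffer is current.

```python
from dataclasses import dataclass


@dataclass
class FastaItem:
    start: int       # offset of '>' within the current window
    header_len: int  # length of the header line, without '>' and newline
    length: int      # total length of the record within the window


def parse_item(window, start, end):
    """Validate window[start:end] as one record and describe it by offsets."""
    if window[start] != ">":
        raise ValueError("record must start with '>'")
    nl = window.find("\n", start, end)
    header_end = nl if nl != -1 else end
    return FastaItem(start, header_end - start - 1, end - start)


def header(window, item):
    """Build the header text from the current window."""
    first = item.start + 1
    return window[first:first + item.header_len]


def sequence(window, item):
    """Build the sequence text, dropping line breaks inside the record."""
    body = window[item.start + 1 + item.header_len:item.start + item.length]
    return "".join(body.split())


def released(item, n):
    """Offsets of the same item after the first n elements are released."""
    return FastaItem(item.start - n, item.header_len, item.length)
```

If the buffer is reallocated or its front is released, the item stays usable: adjust the start with `released` and read from the new window, whereas a slice taken from the old buffer would not follow the move.
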
> 3. Build your real code on top of that layer. What do you want 
> to do with the data? Easiest thing to do for proof of concept 
> is build a range out of the functions. That can allow you to 
> test performance with your lower layers. One of the awesome 
> things about iopipe is testing correctness is really easy -- 
> every string is also an iopipe :)
>
> I actually worked with a person at dconf on a similar (maybe 
> identical?) format and explained how it could be done in a very 
> similar way. He was looking to remove data that had a low 
> percentage of correctness (or something like that, not in 
> bioinformatics, so I don't understand the real semantics).
>
> With this mechanism in hand, the decompression is pretty easy 
> to chain together with whatever actual stream you have, just 
> use iopipe.zip.
>
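
The chaining described above can be sketched in Python, with the standard `gzip` module standing in for iopipe.zip (this is not the iopipe API). `read_fasta` and `read_fasta_gz` are hypothetical names. The parser only needs an iterable of text lines, so the decompression step is simply put in front of it.

```python
import gzip
import io


def read_fasta(lines):
    """Yield (header, sequence) pairs from any iterable of text lines."""
    name, parts = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:], []
        elif name is not None:
            parts.append(line.strip())
    if name is not None:
        yield name, "".join(parts)


def read_fasta_gz(raw):
    """Chain: raw byte stream -> gzip decompression -> text -> parser."""
    with gzip.open(raw, mode="rt", encoding="ascii") as text:
        yield from read_fasta(text)
```

Because the parser does not care where its lines come from, the same function can be tested on an in-memory string via `read_fasta(io.StringIO(...))` and then run unchanged behind the decompressor.
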
> Good luck, and email me if you need more help 
> (schveiguy at yahoo.com).
>
> -Steve

Hi Nic and Steve

Thank you both very much for your input. I am trying to put it 
to use, and I will try to adapt jsoniopipe for FASTA. The code is 
a work in progress and currently broken: 
https://github.com/biocyberman/fastaq . PRs are welcome.

@Nic:
I am also very interested in bringing D to bioinformatics. I will 
be happy to share the information I have. Feel free to email me 
at vql(.at.)rn.dk and we can talk further about it.

@Steve: Yes, we talked at DConf 2017. I had to do other things, 
so my D learning slowed down. I am starting with the FASTA format 
before jumping to FASTQ again. jsoniopipe is a full-featured and 
relatively small project, which makes it a good case study. 
However, there are some aspects I still haven't fully understood. 
Would I be lucky enough to have you make the currently broken 
fastaq code work? :) That would definitely save me time and the 
headache of dealing with newbie problems.