fasta parser with iopipe?
biocyberman via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Mon Aug 28 07:08:25 PDT 2017
On Wednesday, 23 August 2017 at 13:06:36 UTC, Steven
Schveighoffer wrote:
> On 8/23/17 5:53 AM, biocyberman wrote:
>> [...]
>
> I'll respond to all your questions with what I would do,
> instead of answering each one.
>
> I would suggest an approach similar to how I approached parsing
> JSON data. In your case, the protocol is even simpler, as there
> is no nesting.
>
> 1. The base layer iopipe should be something that tokenizes the
> input into reference-based structs. If you look at the
> jsoniopipe library (https://github.com/schveiguy/jsoniopipe),
> you can see that the lowest level finds the start of the next
> JSON token. In your case, it should be looking for
> >[...]
>
> This code is pretty straightforward, and roughly corresponds to
> this:
>     while(cannot find start sequence in stream)
>         stream.extend;
>
> Make sure you aren't re-doing work that has already been done
> (i.e. save the last place you looked).
>
> Once you have this, you can deduce each packet by the data
> between the starts.
>
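To check my understanding of layer 1, here is a minimal sketch of that loop in plain D. It does not use iopipe itself: `ChunkedPipe` is a hypothetical stand-in that only borrows the `window`/`extend` naming, and `nextRecordStart` and `checked` are names I made up. I am also assuming that a FASTA record starts with '>' at the beginning of a line.

```d
import std.algorithm.comparison : min;

// Hypothetical stand-in for a buffered stream: it reveals more of `data`
// each time extend() is called. This is NOT iopipe's implementation.
struct ChunkedPipe
{
    const(char)[] data;   // the whole input, revealed chunk by chunk
    size_t visible;       // number of elements currently in the window
    size_t chunk = 4;     // tiny chunk size so the loop is exercised

    const(char)[] window() const { return data[0 .. visible]; }

    // Returns the number of new elements; 0 signals end of input.
    size_t extend()
    {
        immutable added = min(chunk, data.length - visible);
        visible += added;
        return added;
    }
}

// Finds the next '>' that begins a line, starting at offset `checked`.
// `checked` is advanced so that no element is examined twice across calls.
// Returns the offset of the '>' or size_t.max at end of input.
size_t nextRecordStart(ref ChunkedPipe pipe, ref size_t checked)
{
    for (;;)
    {
        auto w = pipe.window;
        for (; checked < w.length; ++checked)
        {
            if (w[checked] == '>' && (checked == 0 || w[checked - 1] == '\n'))
                return checked++;   // report the hit, resume after it next time
        }
        if (pipe.extend() == 0)     // nothing left to read
            return size_t.max;
    }
}

void main()
{
    auto pipe = ChunkedPipe(">seq1\nACGT\n>seq2\nGG\n");
    size_t checked = 0;
    assert(nextRecordStart(pipe, checked) == 0);
    assert(nextRecordStart(pipe, checked) == 11);
    assert(nextRecordStart(pipe, checked) == size_t.max);
}
```

The data between two consecutive offsets returned by `nextRecordStart` is then one record.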
> 2. The next layer should validate and parse the data into
> structs that contain referencing data from the buffer. I
> recommend not using actual ranges from the buffer, but
> information on how to build the ranges. The reason for this is
> that the buffer can move while being streamed by iopipe, so
> your data could become invalid if you take actual references to
> the buffer. If you look in the jsoniopipe library, the struct
> for storing a json item has a start and length, but not a
> reference to the buffer.
>
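If I read this correctly, a record would store offsets and lengths and only turn them into slices when asked. A rough sketch of that idea follows; `FastaItem`, its fields, and `rebase` are hypothetical names of mine and are not taken from jsoniopipe.

```d
// Hypothetical record descriptor: positions relative to the start of the
// current window, never slices of the buffer itself.
struct FastaItem
{
    size_t headerStart;  // offset of the first char after '>'
    size_t headerLen;
    size_t seqStart;     // offset of the first sequence char
    size_t seqLen;

    // Slices are built on demand from whatever the window is right now.
    const(char)[] header(const(char)[] window) const
    {
        return window[headerStart .. headerStart + headerLen];
    }

    const(char)[] sequence(const(char)[] window) const
    {
        return window[seqStart .. seqStart + seqLen];
    }

    // After `n` elements are released from the front of the window,
    // every stored offset shifts down by `n`.
    void rebase(size_t n)
    {
        assert(n <= headerStart, "cannot release data the item still uses");
        headerStart -= n;
        seqStart -= n;
    }
}

void main()
{
    char[] buf = "NN\n>id1\nACGT\n".dup;
    auto item = FastaItem(4, 3, 8, 4);
    assert(item.header(buf) == "id1");

    // Simulate the buffer moving: a slice taken from `buf` would now show
    // garbage, while the offsets still work against the new storage.
    char[] moved = buf.dup;
    buf[] = 'x';
    assert(item.sequence(moved) == "ACGT");

    // Simulate releasing the 3 leading elements of the window.
    const(char)[] window = moved[3 .. $];
    item.rebase(3);
    assert(item.header(window) == "id1");
    assert(item.sequence(window) == "ACGT");
}
```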
> Potentially, you could take this mechanism and build an iopipe
> on top of the buffered data. This iopipe's elements would be
> the items themselves, with the underlying buffer hidden in the
> implementation details. Extending would parse out another set
> of items, releasing would allow those items to get reclaimed
> (and the underlying stream data).
>
> This is something I actually wanted to explore with jsoniopipe
> but didn't have time before the conference. I probably will
> still build it.
>
> 3. Build your real code on top of that layer. What do you want
> to do with the data? Easiest thing to do for proof of concept
> is build a range out of the functions. That can allow you to
> test performance with your lower layers. One of the awesome
> things about iopipe is testing correctness is really easy --
> every string is also an iopipe :)
>
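For the proof of concept, I imagine a range along these lines that can be tested directly against a string. This is only a sketch using plain slices and Phobos, not iopipe; `Record` and `FastaRecords` are hypothetical names. It assumes the text begins with '>', that every header starts at the beginning of a line, and that each sequence sits on a single line.

```d
import std.string : chomp, indexOf;

struct Record
{
    const(char)[] header;    // text after '>' up to the end of the line
    const(char)[] sequence;  // text up to the next header, last newline removed
}

// Hypothetical input range over the FASTA records of an in-memory text.
struct FastaRecords
{
    private const(char)[] rest;
    private Record current;
    private bool done;

    this(const(char)[] text)
    {
        rest = text;
        popFront();              // load the first record
    }

    @property bool empty() { return done; }
    @property Record front() { return current; }

    void popFront()
    {
        if (rest.length == 0 || rest[0] != '>')
        {
            done = true;
            return;
        }
        immutable nl = rest.indexOf('\n');
        if (nl < 0)              // header only, no sequence
        {
            current = Record(rest[1 .. $], null);
            rest = null;
            return;
        }
        current.header = rest[1 .. nl];
        auto tail = rest[nl .. $];               // begins with the header's '\n'
        immutable next = tail.indexOf("\n>");    // newline before the next header
        immutable size_t end = next < 0 ? tail.length : cast(size_t) next + 1;
        current.sequence = tail[1 .. end].chomp;
        rest = tail[end .. $];
    }
}

void main()
{
    const(char)[][] headers, seqs;
    foreach (r; FastaRecords(">seq1 first\nACGT\n>seq2\nGG\n"))
    {
        headers ~= r.header;
        seqs ~= r.sequence;
    }
    assert(headers.length == 2);
    assert(headers[0] == "seq1 first" && seqs[0] == "ACGT");
    assert(headers[1] == "seq2" && seqs[1] == "GG");
}
```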
> I actually worked with a person at dconf on a similar (maybe
> identical?) format and explained how it could be done in a very
> similar way. He was looking to remove data that had a low
> percentage of correctness (or something like that, not in
> bioinformatics, so I don't understand the real semantics).
>
> With this mechanism in hand, the decompression is pretty easy
> to chain together with whatever actual stream you have, just
> use iopipe.zip.
>
> Good luck, and email me if you need more help
> (schveiguy at yahoo.com).
>
> -Steve
Hi Nic and Steve,
Thank you both very much for your input. I am trying to make use
of it. I will try to adapt jsoniopipe for FASTA. The code is a work
in progress and currently broken: https://github.com/biocyberman/fastaq .
PRs are welcome.
@Nic:
I too am very interested in bringing D to bioinformatics. I will
be happy to share the information I have. Feel free to email me at
vql(.at.)rn.dk and we can talk further about it.
@Steve: Yes, we talked at DConf 2017. I had to do other things, so
my D learning slowed down. I am trying the FASTA format first before
jumping to FASTQ again. jsoniopipe is a full-featured and
relatively small project, which makes it a good case study.
However, there are some aspects I still haven't fully understood.
Would I be lucky enough to have you make the currently broken
fastaq code work? :) That would definitely save me time and the
headache of dealing with newbie problems.