parsing fastq files with D

eastanon via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Mar 23 22:42:28 PDT 2016


FASTQ is a format for storing DNA sequences together with their 
associated quality information, which is usually encoded as ASCII 
characters. Each entry is typically made of 4 lines. For example, 
2 FASTQ entries would look like this:

@seq1
TTATTTTAAT
+
?+BBB/DHH@
@seq2
GACCCTTTGCA
+
?+BHB/DIH@
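
To make the ASCII encoding of the quality line concrete, here is a 
minimal sketch that turns a quality string into numeric scores. It 
assumes the Phred+33 (Sanger) offset, which is not stated above, and 
phredScores is a made-up helper name. Under that assumption, both "+" 
and "@" are ordinary quality values.

```d
import std.algorithm : map;
import std.array : array;
import std.string : representation;

/// Hypothetical helper: decode a FASTQ quality line into numeric scores.
/// ASSUMPTION: Phred+33 (Sanger) encoding, i.e. score = ASCII code - 33.
int[] phredScores(string quals)
{
    // representation() views the string as raw bytes, so no UTF decoding
    // happens; each byte is promoted to int by the subtraction.
    return quals.representation.map!(b => b - 33).array;
}
```

With this offset, the first quality line above decodes to 
[30, 10, 33, 33, 33, 14, 35, 39, 39, 31]: "+" is score 10 and "@" is 
score 31.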

I do not have a lot of D experience and I am writing a simple 
parser to help work with these files. Ideally it should be fast 
with a low memory footprint. I am working with very large files of 
this type, which can be up to 1GB.

module fastq;

import std.stdio;
import std.file;
import std.exception;
import std.algorithm;
import std.string;

struct Record{

     string sequence;
     string quals;
     string name;
}

auto Records(string filename){

     static auto toRecords(S)(S str){

         auto res = findSplitBefore(str,"+\n");

         auto seq = res[0];
         auto qual = res[1];

         return Record(seq,qual);
     }

     string text = cast(string)std.file.read(filename);

     enforce(text.length > 0 && text[0] == '@');
     text = text[1 .. $];

     auto entries = splitter(text,'@');

     return map!toRecords(entries);
}

The issue with this is that the "+" character can also be part of the 
quality information, and I am using it to split the quality 
information from the sequence information. As a result, the parser 
ends up splitting inside the quality information, which is wrong.
Ideally I do not want to use regex. I have heard of Ragel for 
parsing but have never used it. Such a solution would also be 
welcome, since I read it can be very fast.

What is the idiomatic way to capture the sequence name (the first 
line, which starts with the @ character), the sequence (line 2) and 
the quality scores (line 4)?
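
One possible direction, offered as a sketch and not as a definitive 
answer, is to stop splitting on "@" or "+" and rely on line position 
instead: read four lines per entry, so that those characters are only 
checked at the start of lines 1 and 3. FastqReader is a made-up name, 
and the sketch assumes each entry is exactly 4 lines (no wrapped 
sequences). It streams the file, so only one record is held in memory 
at a time.

```d
import std.exception : enforce;
import std.stdio : File;
import std.string : chomp;

struct Record
{
    string sequence;
    string quals;
    string name;
}

/// Hypothetical input range that yields one Record per 4-line entry.
/// ASSUMPTION: every entry is exactly 4 lines (no line-wrapped sequences).
struct FastqReader
{
    private File file;
    private Record current;
    private bool done;

    this(string filename)
    {
        file = File(filename, "r");
        popFront(); // load the first record
    }

    @property bool empty() const { return done; }
    @property Record front() { return current; }

    void popFront()
    {
        // readln returns an empty string at end of file; a blank line is
        // also treated as the end of the input here.
        auto header = file.readln().chomp;
        if (header.length == 0)
        {
            done = true;
            return;
        }
        enforce(header[0] == '@', "expected '@' at the start of an entry");

        auto seq = file.readln().chomp;
        auto plus = file.readln().chomp;
        auto qual = file.readln().chomp;
        enforce(plus.length > 0 && plus[0] == '+',
                "expected '+' on the third line of an entry");

        // Only line position decides what is sequence and what is quality,
        // so '+' or '@' inside the quality line is harmless.
        current = Record(seq, qual, header[1 .. $]);
    }
}
```

Usage would then be something like 
foreach (rec; FastqReader("reads.fastq")) { ... }, where "reads.fastq" 
is a placeholder path. Each readln call allocates a new string, so 
reusing a buffer would likely be faster if the records do not need to 
outlive the loop body.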
