parsing fastq files with D
eastanon via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Wed Mar 23 22:42:28 PDT 2016
FASTQ is a format for storing DNA sequences together with their
associated quality scores, which are usually encoded as ASCII characters.
Each entry typically consists of 4 lines. For example, 2 FASTQ entries
would look like this:
@seq1
TTATTTTAAT
+
?+BBB/DHH@
@seq2
GACCCTTTGCA
+
?+BHB/DIH@
I do not have a lot of D experience and I am writing a simple
parser to help work with these files. Ideally it should be fast
with a low memory footprint. I am working with very large files of
this type, which can be up to 1GB. This is what I have so far:
module fastq;

import std.stdio;
import std.file;
import std.exception;
import std.algorithm;
import std.string;

struct Record{
    string sequence;
    string quals;
    string name;
}

auto Records(string filename){
    static auto toRecords(S)(S str){
        auto res = findSplitBefore(str,"+\n");
        auto seq = res[0];
        auto qual = res[1];
        return Record(seq,qual);
    }
    string text = cast(string)std.file.read(filename);
    enforce(text.length > 0 && text[0] == '@');
    text = text[1 .. $];
    auto entries = splitter(text,'@');
    return map!toRecords(entries);
}
The issue with this is that the "+" character can also be part of the
quality information, and I am using it to split the quality
information from the sequence information. The parser therefore ends up
splitting inside the quality string, which is wrong.
Ideally I do not want to use regex. I have heard of Ragel for
parsing but have never used it. Such a solution would also be welcome,
since I have read it can be very fast.
What is the idiomatic way to capture the sequence name (line 1,
which starts with the @ character), the sequence (line 2) and the
quality scores (line 4)?
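One way to avoid splitting on delimiter characters at all is to read the
file line by line and group every 4 lines into a record, since both "+"
and "@" can legally appear in the quality line (the first example entry
above ends in "@"). The following is only a minimal sketch of that idea,
not a tuned implementation. It assumes every record is exactly 4 lines
with no wrapped sequences and no blank lines; the names FastqRecords and
fastqRecords are made up for illustration, and the field order of Record
is changed from the module above.

```d
import std.exception : enforce;
import std.range.primitives : empty, front, popFront;
import std.stdio : writeln;
import std.string : chomp;

struct Record
{
    string name;
    string sequence;
    string quals;
}

/// Lazy input range that turns a range of lines into Records.
/// Assumption: each record occupies exactly 4 lines.
struct FastqRecords(R)
{
    private R lines;
    private Record current;
    private bool done;

    this(R lines)
    {
        this.lines = lines;
        popFront(); // load the first record, if there is one
    }

    @property bool empty() const { return done; }
    @property Record front() { return current; }

    void popFront()
    {
        if (lines.empty)
        {
            done = true;
            return;
        }
        string header = nextLine();
        enforce(header.length > 0 && header[0] == '@',
                "record header must start with '@'");
        string sequence = nextLine();
        string separator = nextLine();
        enforce(separator.length > 0 && separator[0] == '+',
                "third line of a record must start with '+'");
        // The quality line is taken whole, so '+' or '@' inside it is harmless.
        string quals = nextLine();
        current = Record(header[1 .. $], sequence, quals);
    }

    // Returns the next line without its line terminator and advances.
    private string nextLine()
    {
        enforce(!lines.empty, "truncated FASTQ record");
        string line = lines.front.chomp;
        lines.popFront();
        return line;
    }
}

/// Hypothetical convenience wrapper so the range type is inferred.
auto fastqRecords(R)(R lines)
{
    return FastqRecords!R(lines);
}

void main()
{
    // In-memory demo using the first example entry from above.
    auto demo = ["@seq1", "TTATTTTAAT", "+", "?+BBB/DHH@"];
    foreach (rec; fastqRecords(demo))
        writeln(rec.name, " ", rec.sequence, " ", rec.quals);
}
```

For a real file, passing std.stdio.File("reads.fastq").byLineCopy
(the filename is a placeholder) instead of the array should keep only
the current record in memory rather than the whole 1GB file, at the cost
of one string allocation per line.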