parsing fastq files with D

rikki cattermole via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Thu Mar 24 01:32:04 PDT 2016


On 24/03/16 9:24 PM, eastanon wrote:
> On Thursday, 24 March 2016 at 06:34:51 UTC, rikki cattermole wrote:
>> As a little fun thing to do I implemented it for you.
>>
>> It won't allocate. Making this perfect for you.
>> With a bit of work you could make Result have buffers for result
>> instead of using the input array allow for the source to be an input
>> range itself.
>>
>> I made this up on dpaste and single quotes were not playing nicely
>> there. So you'll see "\r"[0] as a workaround.
>
> Thank you very much. I think you have exposed me to a number of new
> concepts that I will go through and annotate the code with. I read all
> input from file as follows.
>
> string text = cast(string)std.file.read(inputfile);
> foreach(record;FastQRecord.parse(text)){
>     writeln(record);
> }
>
> <naivequestion>Does this mean that text is allocated in memory? And is
> there a better way to read and process the inputfile?</naivequestion>

Yup, std.file.read loads the whole file into memory, and text refers to 
that buffer. The way I designed parse, it won't allocate when returning a 
record, since a record just refers back to the input array. Just don't go 
around modifying that memory or keeping copies (not critical, but I 
wouldn't).
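
To illustrate why handing back a piece of the input does not allocate, here 
is a minimal sketch of D array slicing. It is not the parse code from the 
earlier message; firstSequence is a hypothetical helper made up for this 
example, and it assumes the sequence is the second line of the input.

```d
import std.string : indexOf;

/// Hypothetical helper: returns the second line of `text` (the sequence
/// line of the first FASTQ record) as a slice of the input.
/// No copy is made; the result points into the memory owned by `text`.
string firstSequence(string text)
{
    immutable headerEnd = text.indexOf('\n');
    if (headerEnd < 0)
        return null; // no complete header line

    immutable seqStart = headerEnd + 1;
    immutable seqEnd = text.indexOf('\n', seqStart);

    // A slice is just a pointer plus a length into the existing buffer.
    return seqEnd < 0 ? text[seqStart .. $] : text[seqStart .. seqEnd];
}
```

The returned slice shares memory with text, which is why the buffer has to 
stay alive and unmodified for as long as the slices are in use.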

There are better ways to read the input file. For example, byChunk would 
read it in smaller pieces of, say, 4096 bytes at a time. But you would 
need to build line parsing on top of that yourself, since a chunk can end 
in the middle of a line.
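
A rough sketch of the chunked approach, assuming all you need is a line 
count: File.byChunk from std.stdio hands out one reused buffer at a time, 
so only chunkSize bytes are held at once. countLines is a name made up for 
this example. Real record parsing would also have to carry a partial line 
over from one chunk to the next, which this sketch avoids by only counting 
newline bytes.

```d
import std.stdio : File;

/// Counts the lines in a file by streaming it in fixed-size chunks.
/// byChunk reuses its buffer, so a chunk must not be kept past one iteration.
size_t countLines(string path, size_t chunkSize = 4096)
{
    size_t lines = 0;
    bool endsWithNewline = true; // an empty file has zero lines

    foreach (ubyte[] chunk; File(path, "rb").byChunk(chunkSize))
    {
        foreach (b; chunk)
            if (b == '\n')
                ++lines;
        if (chunk.length)
            endsWithNewline = chunk[$ - 1] == '\n';
    }

    // Count a final line that has no trailing newline.
    if (!endsWithNewline)
        ++lines;
    return lines;
}
```

For FASTQ files that use exactly four lines per record, countLines(path) / 4 
gives the record count. That layout is an assumption; multi-line records 
would need a real parser.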

The way I'd go is memory mapped files as an input range.
After all, let the OS help you out with deciding when to put the file 
and parts of it into memory.
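
A minimal sketch of the memory-mapped route, using MmFile from std.mmfile 
and lineSplitter from std.string. countLinesMapped is a name made up for 
this example, and it only counts lines rather than calling the parse 
function from the earlier message. The mapped view is const(char)[] rather 
than string, so a parser written against string would need its signature 
adjusted.

```d
import std.mmfile : MmFile;
import std.string : lineSplitter;

/// Counts the lines of a memory-mapped file without first copying the
/// whole file into a GC-allocated buffer.
size_t countLinesMapped(string path)
{
    // Read-only mapping; the OS pages the data in on demand.
    auto mmf = new MmFile(path);
    scope (exit) destroy(mmf); // unmap promptly instead of waiting for the GC

    // View the mapped bytes as text. The slice is only valid while mmf is alive.
    auto text = cast(const(char)[]) mmf[];

    size_t lines = 0;
    foreach (line; text.lineSplitter) // each line is a slice of the mapping
        ++lines;
    return lines;
}
```

Any slices taken from the mapping become invalid once it is destroyed, so 
records have to be consumed or copied before the function returns. Mapping 
an empty file may fail, so a real program would want to check the file 
size first.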

If you have 1 GB files you want to parse, could you just as easily have 
2 TB worth? Probably, since it is DNA after all.
You kinda don't want that all in memory at once, if you know what I mean.

