How to read files fast (I/O operation)

monarch_dodra monarchdodra at gmail.com
Mon Feb 4 11:39:44 PST 2013


On Monday, 4 February 2013 at 19:30:59 UTC, Dejan Lekic wrote:
> FG wrote:
>
>> On 2013-02-04 15:04, bioinfornatics wrote:
>>> I am looking to parse a huge file efficiently, but I think D
>>> is lacking for this purpose. To parse 12 GB I need 11 minutes,
>>> whereas fastxtoolkit (written in C++) needs 2 minutes.
>>>
>>> My code is maybe not easy to read, since a fastq file is not
>>> easy to parse, and it gets even harder when using a
>>> memory-mapped file.
>> 
>> Why are you using mmap? Don't you just go through the file 
>> sequentially?
>> In that case it should be faster to read in chunks:
>> 
>>      foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
>
> I would go even further and organise the file so that N Data
> objects fit in one page, and read the file page by page. The
> page size can easily be obtained from the system. IMHO that
> would beat this fastxtoolkit. :)
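
For reference, a minimal sketch combining the two suggestions 
above could read the file in page-sized chunks. The file name is 
hypothetical, and the sysconf call assumes a POSIX system:

    import std.stdio;
    import core.sys.posix.unistd : sysconf, _SC_PAGESIZE; // POSIX only

    void main()
    {
        // Size each read to the system's memory page.
        immutable pageSize = cast(size_t) sysconf(_SC_PAGESIZE);
        auto file = File("reads.fastq"); // hypothetical input file
        foreach (ubyte[] buffer; file.byChunk(pageSize))
        {
            // process `buffer` here
        }
    }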

AFAIK, he is reading text data that needs to be parsed line by 
line, so byChunk may not be the best approach. Or at least, not 
the easiest approach.
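
If the data really does need line-by-line processing, File.byLine 
is the easier starting point. A minimal sketch (the file name is 
hypothetical):

    import std.stdio;

    void main()
    {
        auto file = File("reads.fastq"); // hypothetical input file
        foreach (line; file.byLine())    // `line` is a char[] into a reused buffer
        {
            // parse `line` here; .dup it if it must outlive this iteration
        }
    }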

I'm just wondering if maybe the reason the D code is slow is 
simply:
- Unicode decoding.
- front + popFront overhead.

Ranges in D are "notorious" for being slow when iterating over 
text, due to the "double decode": front decodes a code point to 
return it, and popFront has to decode again just to know how many 
code units to skip.
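
A tiny example of the effect; front on a string always hands back 
a decoded dchar, even when the content is pure ASCII:

    import std.array;

    void main()
    {
        string s = "abc";
        // front decodes the leading UTF-8 sequence into a dchar:
        static assert(is(typeof(s.front) == dchar));
        // popFront must decode again to know how many code units to skip.
        s.popFront();
        assert(s == "bc");
    }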

If you are *certain* that the file contains nothing but ASCII 
(which should be the case for fastq, right?), you can get more 
bang for your buck by iterating over it as an array of bytes and 
converting the bytes to char on the fly, bypassing any and all 
Unicode processing.
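
Something along these lines (a sketch only; the file name is 
hypothetical, and counting newlines stands in for real fastq 
parsing):

    import std.stdio;

    void main()
    {
        auto file = File("reads.fastq"); // hypothetical input file
        size_t lines;
        foreach (ubyte[] chunk; file.byChunk(64 * 1024))
        {
            // Reinterpret the raw bytes as chars; for pure ASCII this
            // is safe and bypasses UTF-8 validation and decoding.
            foreach (char c; cast(char[]) chunk)
            {
                if (c == '\n')
                    ++lines;
            }
        }
        writeln(lines, " lines");
    }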

