How to read files fast (I/O operation)
monarch_dodra
monarchdodra at gmail.com
Mon Feb 4 11:39:44 PST 2013
On Monday, 4 February 2013 at 19:30:59 UTC, Dejan Lekic wrote:
> FG wrote:
>
>> On 2013-02-04 15:04, bioinfornatics wrote:
>>> I am looking to parse a huge file efficiently, but I think D
>>> is lacking for this purpose. To parse 12 GB I need 11 minutes,
>>> whereas fastxtoolkit (written in C++) needs 2 min.
>>>
>>> My code is maybe not easy to follow, as it is not easy to
>>> parse a fastq file, and it is even harder when using a
>>> memory-mapped file.
>>
>> Why are you using mmap? Don't you just go through the file
>> sequentially?
>> In that case it should be faster to read in chunks:
>>
>> foreach (ubyte[] buffer; file.byChunk(chunkSize)) { ... }
>
> I would go even further, and organise the file so that N Data
> objects fit in one page, and read the file page by page. The
> page size can easily be obtained from the system. IMHO that
> would beat this fastxtoolkit. :)
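(Side note: the page size really is easy to obtain. A minimal
sketch, assuming a POSIX system; sysconf and _SC_PAGESIZE come
from druntime's POSIX bindings:)

    import core.sys.posix.unistd; // sysconf and the _SC_* constants

    void main()
    {
        // Typically 4096 on Linux/x86; sysconf returns -1 if the
        // query is unsupported.
        auto pageSize = sysconf(_SC_PAGESIZE);
    }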
AFAIK, he is reading text data that needs to be parsed line by
line, so byChunk may not be the best approach. Or at least, not
the easiest approach.
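For line-oriented input, the path of least resistance is
File.byLine. A minimal sketch (the file name is made up):

    import std.stdio;

    void main()
    {
        auto file = File("input.fastq", "r");
        foreach (line; file.byLine())
        {
            // line is a char[] view into a buffer that byLine
            // reuses on the next iteration; .dup it if it must
            // outlive the loop body.
            // ... parse one fastq line here ...
        }
    }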
I'm just wondering if maybe the reason the D code is slow is
simply because of:
- Unicode,
- front + popFront.
Ranges in D are "notorious" for being slow when iterating over
text, due to the "double decode".
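To make the "double decode" concrete: for a char[], front decodes
the current UTF-8 sequence into a dchar, and popFront then decodes
it *again* just to know how many bytes to skip. A hedged sketch of
what a range-based loop over text boils down to:

    import std.range.primitives : empty, front, popFront;

    void eachDchar(const(char)[] text)
    {
        while (!text.empty)
        {
            dchar c = text.front; // decode #1: read the code point
            // ... use c ...
            text.popFront();      // decode #2: recompute its length
        }
    }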
If you are *certain* that the file contains nothing but ASCII
(which should be the case for fastq, right?), you can get more
bang for your buck by iterating over it as an array of bytes and
converting the bytes to char on the fly, bypassing any and all
Unicode processing.
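A minimal sketch of that idea, assuming pure-ASCII input (the file
name and chunk size are arbitrary):

    import std.stdio;

    void main()
    {
        auto file = File("input.fastq", "r");
        foreach (ubyte[] chunk; file.byChunk(64 * 1024))
        {
            foreach (ubyte b; chunk)
            {
                // Safe only because the input is known to be
                // ASCII; no UTF-8 decoding ever happens.
                char c = cast(char) b;
                // ... feed c to the parser ...
            }
        }
    }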