D and i/o
Steven Schveighoffer
schveiguy at gmail.com
Mon Nov 11 16:20:35 UTC 2019
On 11/10/19 2:16 AM, bioinfornatics wrote:
> On Saturday, 9 November 2019 at 23:39:09 UTC, bioinfornatics wrote:
>> Dear,
>>
>> In my field we are I/O bound, so I would like our tools to be as fast
>> as I can read a file.
>>
>> So I started a dummy benchmark that counts the number of lines.
>> The result is compared to the wc -l command. Line counting is only a
>> pretext to evaluate the I/O; it can be swapped for any other I/O
>> processing. So we use the buffer as much as possible instead of the
>> byLine range. Moreover, such a range implies that the buffer has
>> already been read once before it is ready to process.
>>
>>
>> https://github.com/bioinfornatics/test_io
>>
>> Ideally I would like to process a shared buffer across multiple cores
>> and run a SIMD computation. But that is not done yet.
>
> If you have some scripts or enhancements, they are welcome.
>
> Currently the results show that a naive implementation is at least
> twice as slow as wc, and up to 5 times slower for the parallel (//) scripts.

From my experience with iopipe, I will say that the secret to counting
lines is memchr.

After switching to memchr to find single bytes as an optimization, I was
beating Linux getline. Both use memchr, but getline does extra
processing to ensure the FILE * state is maintained.
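
For readers who want to see the idea in isolation, here is a minimal sketch of memchr-based newline counting. It is written in C because memchr is a C library routine (D code can reach the same function through core.stdc.string). It is not the iopipe implementation: the function names count_newlines and count_newlines_in_stream and the 64 KiB buffer size are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Count '\n' bytes in buf[0..len) by letting memchr jump from one
 * newline to the next instead of testing every byte in a hand loop. */
size_t count_newlines(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        if (nl == NULL)
            break;          /* no newline left in this buffer */
        ++count;
        p = nl + 1;         /* resume the search just past the newline */
    }
    return count;
}

/* Read the stream in large blocks and count newlines block by block.
 * The 64 KiB block size is an arbitrary choice for this sketch. */
size_t count_newlines_in_stream(FILE *fp)
{
    static char buf[64 * 1024];
    size_t total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += count_newlines(buf, n);
    return total;
}
```

Calling count_newlines_in_stream(stdin) on redirected input counts newline bytes, which is also what wc -l reports, so a final line without a trailing newline is not counted.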
See
https://github.com/schveiguy/iopipe/blob/6fa58b67bc9cadeb5ccded0d686f0fd116aed1ed/examples/byline/byline.d
If you run that like:

iopipe_byline -nooutput < filetocheck.txt

that's about as fast as I can get without using mmap, and it should be
comparable to wc -l. It should also work fine with all encodings, though
only UTF-8 is currently optimized with memchr (I should work on the
others).
-Steve