D and i/o

Mon Nov 11 16:20:35 UTC 2019

On 11/10/19 2:16 AM, bioinfornatics wrote:
> On Saturday, 9 November 2019 at 23:39:09 UTC, bioinfornatics wrote:
>> Dear,
>>
>> In my field we are I/O bound, so I would like our tools to be as fast 
>> as I can read a file.
>>
>> So I started a dummy benchmark which counts the number of lines.
>> The result is compared to the wc -l command. Line counting is only a 
>> pretext to evaluate the I/O; it can be swapped for any other I/O 
>> processing. So we use the buffer as much as possible instead of the 
>> byLine range. Moreover, such a range implies that the buffer was already 
>> read once before it is ready to process.
>>
>>
>> https://github.com/bioinfornatics/test_io
>>
>> Ideally I would like to process a shared buffer across multiple cores 
>> and run a SIMD computation. But that is not done yet.
> 
> If you have some scripts or enhancements, you are welcome to share them.
> 
> Currently, results show that the naïve implementation is at least twice 
> as slow as wc, and up to 5 times slower for // scripts

I will say from my experience with iopipe, the secret to counting lines 
is memchr.

After switching to memchr to find single bytes as an optimization, I was 
beating Linux getline. Both use memchr, but getline does extra 
processing to ensure the FILE * state is maintained.
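
To make the idea concrete, here is a rough sketch in C of counting lines with memchr. This is not the iopipe code: the function names count_newlines and count_lines_in_stream and the 64 KiB block size are made up for illustration, and it simply counts '\n' bytes the way wc -l does. The same memchr is callable from D via core.stdc.string.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Count '\n' bytes in buf[0..len) by jumping from hit to hit with memchr,
   instead of testing every byte in a hand-written loop. */
size_t count_newlines(const char *buf, size_t len)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *hit = memchr(p, '\n', (size_t)(end - p));
        if (hit == NULL)
            break;          /* no more newlines in this block */
        ++count;
        p = hit + 1;        /* resume just past the newline */
    }
    return count;
}

/* Read the stream in large blocks and sum the newline counts.
   The 64 KiB block size is an arbitrary choice for this sketch. */
size_t count_lines_in_stream(FILE *fp)
{
    char block[65536];
    size_t total = 0;
    size_t got;

    while ((got = fread(block, 1, sizeof block, fp)) > 0)
        total += count_newlines(block, got);
    return total;
}
```

Because only whole blocks are scanned and no per-line state is kept, a newline that falls on a block boundary needs no special handling here. A real byLine range has to do more, since it must hand back each line's contents.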

See 
https://github.com/schveiguy/iopipe/blob/6fa58b67bc9cadeb5ccded0d686f0fd116aed1ed/examples/byline/byline.d

If you run that like:

iopipe_byline -nooutput < filetocheck.txt

That's about as fast as I can get without using mmap, and it should be 
comparable to wc -l. It should also work fine with all encodings (though 
only UTF-8 is optimized with memchr; I should work on that).

-Steve