Speed of csvReader

Thu Jan 21 15:43:36 PST 2016

On Thursday, 21 January 2016 at 22:20:28 UTC, H. S. Teoh wrote:
> On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via 
> Digitalmars-d-learn wrote: [...]
>> FWIW - I've been implementing a few programs manipulating 
>> delimited files, e.g. tab-delimited. Simpler than CSV files 
>> because there is no escaping inside the data. I've been trying 
>> to do this in relatively straightforward ways, e.g. using 
>> byLine rather than byChunk. (Goal is to explore the power of D 
>> standard libraries).
>> 
>> I've gotten significant speed-ups in a couple different ways:
>> * DMD libraries 2.068+  -  byLine is dramatically faster
>> * LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the 
>> DMD compiler
>
> While byLine has improved a lot, it's still not the fastest 
> thing in the world, because it still performs (at least) one OS 
> roundtrip per line, not to mention it will auto-reencode to 
> UTF-8. If your data is already in a known encoding, reading in 
> the entire file and casting to (|w|d)string then splitting it 
> by line will be a lot faster, since you can eliminate a lot of 
> I/O roundtrips that way.
>
No disagreement, but I had other goals. At a high level, I'm 
trying to learn and evaluate D, which partly involves 
understanding the strengths and weaknesses of the standard 
library. From this perspective, byLine was a logical starting 
point. More specifically, the tools I'm writing are often used in 
Unix pipelines, so input can be a mixture of standard input and 
files, and the files can be arbitrarily large. In these cases, 
reading the entire file is not always appropriate. Buffering 
usually is, and since my code knows whether it is dealing with a 
file or standard input, it could handle the two differently. 
However, standard library code could make the same distinction, 
which was part of the reason for trying the straightforward 
approach.
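
To make the "straightforward approach" concrete, here is a minimal sketch of the pattern, not the actual code from my tools. The name countFields and the convention that "-" (or no arguments) means standard input are made up for illustration. It streams every input through byLine, so memory use stays bounded however large a file is:

```d
import std.algorithm : splitter;
import std.range : walkLength;
import std.stdio;

/// Counts tab-separated fields over any range of lines.
/// (Hypothetical helper, named for this example only.)
size_t countFields(Lines)(Lines lines)
{
    size_t total;
    foreach (line; lines)
        total += line.splitter('\t').walkLength;
    return total;
}

void main(string[] args)
{
    // Assumed convention: no arguments, or "-", means standard input.
    auto inputs = args.length > 1 ? args[1 .. $] : ["-"];
    size_t total;
    foreach (name; inputs)
    {
        auto f = (name == "-") ? stdin : File(name);
        // byLine reuses its buffer between iterations, so a line must be
        // copied (e.g. with .idup) if it has to outlive the loop body.
        total += countFields(f.byLine);
    }
    writeln(total);
}
```

The same loop serves both pipes and files, which is the attraction. The whole-file alternative would replace f.byLine with a splitter over the file contents, at the cost of holding the entire input in memory.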

Aside - Despite the 'learning D' motivation, the tools are real 
tools, and writing them in D has been a clear win, especially 
with the byLine performance improvements in 2.068.