Speed of csvReader
Jon D via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Thu Jan 21 15:43:36 PST 2016
On Thursday, 21 January 2016 at 22:20:28 UTC, H. S. Teoh wrote:
> On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via
> Digitalmars-d-learn wrote: [...]
>> FWIW - I've been implementing a few programs manipulating
>> delimited files, e.g. tab-delimited. Simpler than CSV files
>> because there is no escaping inside the data. I've been trying
>> to do this in relatively straightforward ways, e.g. using
>> byLine rather than byChunk. (Goal is to explore the power of D
>> standard libraries).
>>
>> I've gotten significant speed-ups in a couple different ways:
>> * DMD libraries 2.068+ - byLine is dramatically faster
>> * LDC 0.17 (alpha) - Based on DMD 2.068, and faster than the
>> DMD compiler
>
> While byLine has improved a lot, it's still not the fastest
> thing in the world, because it still performs (at least) one OS
> roundtrip per line, not to mention it will auto-reencode to
> UTF-8. If your data is already in a known encoding, reading in
> the entire file and casting to (|w|d)string then splitting it
> by line will be a lot faster, since you can eliminate a lot of
> I/O roundtrips that way.
>
No disagreement, but I had other goals. At a high level, I'm
trying to learn and evaluate D, which partly involves
understanding the strengths and weaknesses of the standard
library. From this perspective, byLine was a logical starting
point. More specifically, the tools I'm writing are often used in
Unix pipelines, so input can be a mixture of standard input and
files, and the files can be arbitrarily large. In these cases,
reading the entire file is not always appropriate. Buffering
usually is, and my code knows when it is dealing with files vs
standard input and could handle these differently. However,
standard library code could handle these distinctions as well,
which was part of the reason for trying the straightforward
approach.
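
To make the "straightforward approach" concrete, here is a minimal sketch of the kind of tool described above: it reads tab-delimited lines with byLine from a mix of standard input and named files. This is not the actual code of the tools mentioned. The function name sumField, the choice of summing the first column, and the convention that "-" (or no arguments) means standard input are all assumptions made for illustration.

```d
import std.algorithm : splitter;
import std.conv : to;
import std.range : enumerate;
import std.stdio;

/// Hypothetical helper: sums the numeric values in one tab-delimited
/// column of `input`. `fieldIndex` is zero-based; lines with fewer
/// fields than that are skipped.
double sumField(File input, size_t fieldIndex)
{
    double total = 0;
    // byLine reuses its internal buffer, so each `line` is only valid
    // until the next iteration. That is fine here: nothing is retained.
    foreach (line; input.byLine)
    {
        // splitter is lazy, so fields after the one we need are never split.
        foreach (i, field; line.splitter('\t').enumerate)
        {
            if (i == fieldIndex)
            {
                total += field.to!double;
                break;
            }
        }
    }
    return total;
}

void main(string[] args)
{
    // Assumed convention: no arguments, or "-", means standard input.
    auto names = args.length > 1 ? args[1 .. $] : ["-"];
    double total = 0;
    foreach (name; names)
        total += sumField(name == "-" ? stdin : File(name), 0);
    writeln(total);
}
```

Built as, say, sumfield (a made-up name), it could be run as `cat a.tsv | ./sumfield - b.tsv`. Memory use stays bounded by the longest line rather than the file size, which is the property that matters for arbitrarily large inputs.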
Aside - Despite the 'learning D' motivation, the tools are real
tools, and writing them in D has been a clear win, especially
with the byLine performance improvements in 2.068.