Speed of csvReader

Thu Jan 21 14:20:28 PST 2016

On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via Digitalmars-d-learn wrote:
[...]
> FWIW - I've been implementing a few programs manipulating delimited
> files, e.g. tab-delimited. Simpler than CSV files because there is no
> escaping inside the data. I've been trying to do this in relatively
> straightforward ways, e.g. using byLine rather than byChunk. (Goal is
> to explore the power of D standard libraries).
> 
> I've gotten significant speed-ups in a couple different ways:
> * DMD libraries 2.068+  -  byLine is dramatically faster
> * LDC 0.17 (alpha)  -  Based on DMD 2.068, and faster than the DMD compiler

While byLine has improved a lot, it's still not the fastest thing in the
world, because it still performs (at least) one OS roundtrip per line,
not to mention it will auto-reencode to UTF-8. If your data is already
in a known encoding, reading in the entire file, casting it to string
(or wstring / dstring, as appropriate), and then splitting it by line
will be a lot faster, since you can eliminate a lot of I/O roundtrips
that way.
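
A minimal sketch of that approach, assuming the input is tab-delimited,
already valid UTF-8 (the cast does not check this), and small enough to
fit in memory. The helper name countLinesAndFields is made up for
illustration, and Windows-style "\r\n" line endings are not handled:

```d
import std.algorithm.iteration : splitter;
import std.file : read;
import std.stdio : writeln;

/// Hypothetical helper: counts non-empty lines and tab-separated
/// fields in a buffer that is already in memory.
size_t[2] countLinesAndFields(string data)
{
    size_t lines, fields;
    foreach (line; data.splitter('\n'))
    {
        // Skip blank lines, including the empty slice after a trailing '\n'.
        if (line.length == 0)
            continue;
        ++lines;
        foreach (field; line.splitter('\t'))
            ++fields;
    }
    return [lines, fields];
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: prog FILE");
        return;
    }
    // One read for the whole file; the cast assumes the bytes are
    // already valid UTF-8 and performs no validation.
    auto data = cast(string) read(args[1]);
    auto counts = countLinesAndFields(data);
    writeln(counts[0], " lines, ", counts[1], " fields");
}
```

Each line and field here is a slice of the original buffer, so the loop
itself does not copy or allocate per line.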

In any case, it's well known that gdc/ldc generally produce code that's
about 20%-30% faster than dmd-compiled code, sometimes a lot more. While
DMD has gotten some improvements in this area recently, it still has a
long way to go before it can catch up.  For performance-sensitive code I
always reach for gdc instead of dmd.

> * Avoid utf-8 to dchar conversion - This conversion often occurs
> silently when working with ranges, but is generally not needed when
> manipulating data.
[...]

Yet another nail in the coffin of auto-decoding.  I wonder how many more
nails we will need before Andrei is convinced...
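
For what it's worth, here is one way to sidestep the decoding, sketched
with std.utf.byCodeUnit. The function name countTabs is made up for
illustration, and this only makes sense when the delimiter is a single
ASCII code unit, as tab is:

```d
import std.algorithm.searching : count;
import std.stdio : writeln;
import std.utf : byCodeUnit;

/// Hypothetical helper: counts tab characters without decoding to dchar.
size_t countTabs(string line)
{
    // byCodeUnit presents the string as a range of char, so range
    // algorithms compare raw code units instead of decoded dchars.
    return line.byCodeUnit.count('\t');
}

void main()
{
    writeln(countTabs("a\tb\tc"));
}
```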

T

-- 
The diminished 7th chord is the most flexible and fear-instilling chord. Use it often, use it unsparingly, to subdue your listeners into submission!