Speed of csvReader

H. S. Teoh via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue Jan 26 14:36:31 PST 2016


On Tue, Jan 26, 2016 at 08:54:34PM +0000, Chris Wright via Digitalmars-d-learn wrote:
> On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
> 
> > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >>
> >> While this is no fancy range-based code, and one might say it's
> >> more hackish and C-like than idiomatic D, the problem is that
> >> current D compilers can't quite optimize range-based code to this
> >> extent yet.  Perhaps in the future optimizers will improve so that
> >> more idiomatic, range-based code will have comparable performance
> >> with fastcsv. (At least in theory this should be possible.)
> > 
> > As a D novice still struggling with the concept that composable
> > range-based functions can be more efficient than good-old looping
> > (ya, I know, cache friendliness and GC avoidance), I find it
> > extremely interesting that someone as expert as yourself would reach
> > for a C-like approach for serious data crunching. Given that data
> > crunching is the kind of thing I need to do a lot, I'm wondering how
> > general your statement above might be at this time w.r.t. this and
> > possibly other domains.
> 
> You want to reduce allocations. Ranges often let you do that. However,
> it's sometimes unsafe to reuse range values that aren't immutable.
> That means, if you want to keep the values around, you need to copy
> them -- which introduces an allocation.
> 
> You can get fewer large allocations by reading the whole file at once
> manually and using slices into that large allocation.

Yeah, in the course of this exercise, I found that the one thing that
has had the biggest impact on performance is the number of allocations
involved.  Basically, I noted that the fewer allocations made, the
more efficient the code. I'm not sure exactly why this is so, but it
probably has something to do with the fact that tracing GCs work better
with fewer allocations of larger objects than with many allocations of
small objects.  I have also noted in the past that D's current GC runs
collections a little too often; in past projects I've obtained a
significant speedup (in one case, up to a 40% reduction in total
runtime) by suppressing automatic collections and scheduling them
manually at a lower frequency.
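
To illustrate the general idea (this is just a sketch, not code from any
of those projects; processFile is a made-up stand-in for the per-file
work, and the collection interval is arbitrary):

import core.memory : GC;

void processFile(string file) { /* hypothetical per-file work */ }

void processAll(string[] files)
{
    GC.disable();             // suppress automatic collections
    scope(exit) GC.enable();  // restore normal GC behavior afterwards

    size_t n;
    foreach (file; files)
    {
        processFile(file);
        if (++n % 100 == 0)
            GC.collect();     // collect on our own, lower-frequency schedule
    }
}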

In short, I've found that reducing GC load plays a much bigger role in
performance than the range vs. loops issue.

The reason I chose to write manual loops at first was to eliminate any
unexpected overhead that might hide behind range primitives, and to
sidestep compiler limitations: current optimizers aren't exactly tuned
for range-based idioms, and may fail to recognize certain patterns that
would otherwise lead to much more efficient code. However, in my second
iteration, when I made the fastcsv parser return an input range instead
of an array, I found only negligible performance differences.  This
suggests that range-based code may not perform that badly after all. I
have yet to test this hypothesis fully, since the inner loop that parses
the fields of a single row is still a manual loop; but my suspicion is
that it wouldn't do too badly in range-based form either.
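
To give a rough idea of the shape of that second iteration (this is a
simplified sketch, not the actual fastcsv code: there is no quoting
support, and a real implementation would reuse its field array rather
than rebuild it for every row):

struct CsvRows
{
    const(char)[] data;
    const(char)[][] current;

    this(const(char)[] input)
    {
        data = input;
        if (data.length) advance();
    }

    @property bool empty() const { return current.length == 0; }
    @property const(char)[][] front() { return current; }
    void popFront() { advance(); }

    private void advance()
    {
        current = null;
        if (data.length == 0) return;
        size_t start = 0, i = 0;
        // The manual inner loop: slice fields out of the current line.
        for (; i < data.length && data[i] != '\n'; ++i)
        {
            if (data[i] == ',')
            {
                current ~= data[start .. i];
                start = i + 1;
            }
        }
        current ~= data[start .. i];
        data = (i < data.length) ? data[i + 1 .. $] : null;
    }
}

Each element of the range is the array of field slices for one row, and
every field points into the original input, so iterating the range
copies no field data at all.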

What might make a big difference, though, is the part where slicing is
used, since that is essential for reducing the number of allocations.
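
The difference is easy to see in miniature (illustrative only):

void main()
{
    const(char)[] line = "123,abc,4.5";
    auto field = line[4 .. 7];        // slice: pointer + length, no allocation
    string copy = line[4 .. 7].idup;  // copy: allocates a new immutable string
    assert(field == "abc" && copy == "abc");
}

A parser that returns slices like field does no per-field allocation at
all; one that returns copies like copy pays for an allocation on every
single field.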

The current iteration of the struct-based parsing code, for instance,
went through an initial version that was excruciatingly slow for structs
with string fields. Why? Because the function takes const(char)[] as
input, and you can't legally get a string out of that without copying
the data (const means *you* cannot modify it, but somebody else still
might). So std.conv.to would allocate a new string and copy the contents
over every time a string field was parsed, resulting in a large number
of small allocations.

To solve this, I decided to use a string buffer: instead of one
allocation per string, pre-allocate a largish char[] buffer, and every
time a string field is parsed, append its data to the buffer, allocating
a new buffer if the current one becomes full. Then take the slice of the
buffer corresponding to that field and cast it to string (this is safe
because the algorithm is constructed never to write over previous parts
of the buffer).  This seemingly trivial optimization won me a
performance improvement of an order of magnitude(!).
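
In sketch form, the buffer looks something like this (simplified; the
name, layout, and chunk size are made up for illustration, and the real
fastcsv code differs in detail):

struct StringBuf
{
    enum chunkSize = 64 * 1024;
    char[] buf;
    size_t used;

    string put(const(char)[] field)
    {
        if (buf.length - used < field.length)
        {
            // Buffer full: start a fresh chunk. Slices handed out from
            // the old buffer stay valid, since it is never written again.
            buf = new char[](field.length > chunkSize ? field.length
                                                      : chunkSize);
            used = 0;
        }
        auto slot = buf[used .. used + field.length];
        slot[] = field[];
        used += field.length;
        // Safe cast: this region of the buffer is never overwritten.
        return cast(string) slot;
    }
}

With a 64 KB chunk, parsing a million short string fields costs on the
order of a hundred buffer allocations instead of a million small ones.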

This is particularly enlightening, since it suggests that even the
overhead of copying all the string fields out of the original data into
a new buffer doesn't add up to that much.  The new struct-based parser
also returns an input range rather than an array; I found that
constructing the array directly vs. copying from an input range didn't
really make that big of a difference either. What did make a huge
difference was reducing the number of allocations.

So the moral of the story is: avoid large numbers of small allocations.
If you have to do it, consider consolidating your allocations into a
series of allocations of large(ish) buffers instead, and taking slices
of the buffers.

(And on a tangential note, this backs up Walter's claim that string
manipulation in C/C++ will ultimately lose, because of strcpy() and
strlen(). Think of how many times in C/C++ code you have to copy string
data just because you can't guarantee the incoming string will still be
around after you return, and how many times you have to iterate over
strings just because arrays decay to pointers and thus carry no length.
You couldn't write the equivalent of fastcsv in C/C++: you'd leak
memory and/or end up with dangling pointers, since you don't know what
will happen to the incoming data after you return, so you can't just
take slices of it. You'd be forced to malloc() all your strings and then
somehow ensure the caller cleans up properly. Ultimately, you'd need a
convoluted, unnatural API just to make sure the memory housekeeping is
taken care of. Whereas in D, even though the GC is atrociously slow, it
*does* let you freely slice things to your heart's content, with zero
API complication and no memory leaks, and, when done right, it can even
rival C/C++ performance, all at a fraction of the mental load required
to write leak-free, pointer-bug-free C/C++ code.)


T

-- 
Tell me and I forget. Teach me and I remember. Involve me and I understand. -- Benjamin Franklin

