std.csv Performance Review

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Sat Jun 3 23:15:24 PDT 2017


On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via Digitalmars-d wrote:
> On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
> > Do you know what happened with fastcsv [0], original thread [1].
> > 
> > [0] https://github.com/quickfur/fastcsv
> > [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn@puremagic.com
> 
> I do not. Rereading that in light of this new article, I'm a little
> sceptical of the 51-times-faster claim, since I'm seeing only 10x
> against these other implementations.
[...]

You don't have to be skeptical, nor do you have to take my claim on
faith.  I posted the entire code I used in the original thread, as
well as the URLs of the exact data files I used for testing.  You can
just run it and see the results for yourself.

And yes, fastcsv has its limitations (the file has to fit in memory, no
validation is done, etc.), which are also documented up-front in the
README file.  I wrote the code targeting a specific use case mentioned
by the OP of the original thread, so I do not expect or claim you will
see the same kind of performance for other use cases. If you want
validation, then it's a given that you won't get maximum performance,
simply because there's just more work to do.  For data sets that don't
fit into memory, I already have some ideas about how to extend my
algorithm to handle them, so some of the performance may still be
retained. But obviously it's not going to be as fast as if you can just
read the entire file into memory first.

(Note that this is much less of a limitation than it seems; for example,
you could use std.mmfile to memory-map the file into your address space
so that it doesn't actually have to fit into memory, and you can still
take slices of it. The OS will manage the paging from/to disk for you.
Of course, it will be slower when something has to be paged in from
disk, but IME this is often still much faster overall than reading the
data into memory yourself. Again, you don't have to believe me: the
fastcsv code is
there, just import std.mmfile, mmap the largest CSV file you can find,
call fastcsv on it, and measure the performance yourself. If your code
performs better, great, tell us all about it. I'm interested to learn
how you did it.)
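
Something along these lines is all it takes (an untested sketch; I'm
using a hypothetical parseCsv as a stand-in for whatever fastcsv entry
point you end up calling):

import std.mmfile;
import std.stdio;

void main()
{
    // Memory-map the file: the OS pages it in on demand, so the whole
    // thing never has to be resident in RAM at once, yet we can still
    // slice it like an ordinary array.
    auto mmf = new MmFile("huge.csv");
    auto data = cast(const(char)[]) mmf[];

    // auto records = parseCsv(data); // hypothetical fastcsv call
    writeln("mapped ", data.length, " bytes");
}

The only thing that changes compared to the in-memory version is where
data comes from; the slicing-based parsing works the same either way.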

Note that besides slicing, another major part of the performance boost
in fastcsv is in minimizing GC allocations.  If you allocate a string
for each field in a row, it will be much slower than if you either
slice the original string, or allocate a large buffer for holding the
data and just take slices of it for each field.  Furthermore, if
you allocate a new array per row to hold the list of fields, it will be
much slower than if you allocate a large array for holding all the
fields of all the rows, and merely slice this array to get your rows.
Of course, you cannot know ahead of time exactly how many rows there
will be, so the next best thing is to allocate a series of large arrays,
capable of holding the field slices of k rows, for sufficiently large k.
Once the current array runs out of space, copy the (partial) slices of
the last row to the beginning of a new large array, and continue from
there.
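
In code, the idea looks roughly like this (a simplified, untested
sketch: it splits naively on '\n' and ',' with no quote handling, just
to show the allocation pattern):

import std.algorithm : max, splitter;

// Parse rows into slices of the input; field slices for many rows share
// one large buffer, and each row is itself just a slice of that buffer.
const(char)[][][] parseAll(const(char)[] input, size_t k = 1024)
{
    enum fieldsPerRow = 16;                  // a guess; tune to your data
    const(char)[][][] rows;
    auto buf = new const(char)[][](k * fieldsPerRow);
    size_t used;

    foreach (line; input.splitter('\n'))     // naive: no quoted newlines
    {
        if (line.length == 0) continue;      // skip trailing blank line
        size_t rowStart = used;
        foreach (field; line.splitter(','))  // naive: no quoted commas
        {
            if (used == buf.length)
            {
                // Buffer full: start a fresh (and, if need be, larger)
                // block, carrying over only the partial slices of the
                // current row.
                auto next = new const(char)[][](
                        max(buf.length, (used - rowStart) * 2));
                next[0 .. used - rowStart] = buf[rowStart .. used];
                used -= rowStart;
                rowStart = 0;
                buf = next;
            }
            buf[used++] = field;     // a slice of the input, not a copy
        }
        rows ~= buf[rowStart .. used];  // the row is a slice of buf
    }
    return rows;
}

Rows completed in an earlier block keep that block alive through the
GC, so nothing is invalidated when we move on to a fresh one; the point
is simply that we allocate once per k rows instead of once per row (and
once per field).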

This way, you will be making n/k allocations, where n is the number of
rows and k is the number of rows that fit into each buffer, as opposed
to n allocations. For large values of k, this greatly reduces the GC
load and significantly speeds things up.  Again, don't take my word for
it. Run a profiler on the fastcsv code and see for yourself.
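
If you don't want to set up a full profiler run, even a crude stopwatch
comparison on the same input will make the difference obvious. A
sketch, reusing the parseAll function from above:

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.file : readText;
import std.stdio;

void main()
{
    auto data = readText("huge.csv");    // or memory-map it as above

    auto sw = StopWatch(AutoStart.yes);
    auto rows = parseAll(data);          // the sketch from above
    sw.stop();

    writeln(rows.length, " rows in ", sw.peek.total!"msecs", " ms");
}

Swap in a version that allocates a new string per field and a new array
per row, and compare the numbers.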


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG

