std.csv Performance Review

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Mon Jun 5 13:15:56 PDT 2017


On Sun, Jun 04, 2017 at 03:59:03PM +0000, Jesse Phillips via Digitalmars-d wrote:
[...]
> Ok, I took you up on that, I'm still skeptical:
> 
> LDC2 -O3 -release -enable-cross-module-inlining
> 
> std.csv: 12487 msecs
> fastcsv (no gc): 1376 msecs
> csvslicing: 3039 msecs
> 
> That looks like about 10 times faster to me. Using the slicing version
> failed because of \r\n line endings (I guess multi-part separators are
> broken), so I changed the data file so I could get the execution time.

Thank you for testing it yourself.  I also tried to run the tests again
on my machine, and I can't reproduce the 102136 msecs reading.  It seems
that different compilers give somewhat different readings, and we are
also using different compile flags.  In any case, in the spirit of full
disclosure, here are my results with the 3 compilers, run just now to be
sure I'm not copying old, bad measurements:


$ dmd -O -inline benchmark.d fastcsv.d
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 33119 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 2596 msecs

$ gdc -O3 -finline benchmark.d fastcsv.d -o benchmark
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 23103 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1909 msecs

$ ldc2 -O3 benchmark.d fastcsv.d 
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 20776 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1813 msecs
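
For reference, numbers like the msecs figures above can come from a
simple wall-clock measurement.  The following is only a minimal sketch
of such a timing harness; parseCsvSomehow is a placeholder, not the
actual benchmark code.

    import std.datetime.stopwatch : AutoStart, StopWatch;
    import std.stdio : writefln;

    void main()
    {
        auto sw = StopWatch(AutoStart.yes);
        // auto records = parseCsvSomehow("input.csv"); // placeholder
        sw.stop();
        writefln("parse: %s msecs", sw.peek.total!"msecs");
    }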


So, it looks like your 10x figure is more or less on target.  I've no
idea where the original 102136 msecs reading came from.  Perhaps that
measurement was taken while my machine was under heavy load, or maybe it
was just a bungled copy-n-paste.


> Anyway, I'm not trying to claim fastcsv isn't good at what it does;
> all I'm trying to point out is that std.csv is doing more work than
> these faster csv parsers. And I don't even want to claim that std.csv
> is better because of that work; it actually appears that it was a
> mistake to do validation.

I never intended for fastcsv to become a point of contention or some
kind of competition with std.csv, and I apologize if I ever came across
that way.  I fully understand that std.csv does more work than fastcsv;
certainly, being able to assume an in-memory input and free slicing
gives a big advantage over being restricted to just input range
primitives. I had hoped to actually work fastcsv into a suitable form to
merge into std.csv -- to dispel wrong perceptions of D being "slow", you
see -- but it turned out to be more work than I had time for, so I
didn't get very far beyond the initial promising results.
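
To make the slicing point concrete: a parser that has the whole input in
memory can hand back views into that buffer instead of copying
characters.  The following is just a simplified sketch of that idea, not
the actual fastcsv code, and it ignores quoting and escaping entirely.

    // Each field is a slice of the original in-memory buffer, so field
    // contents are never copied; only the array of slices itself grows.
    const(char)[][] splitFields(const(char)[] line, char sep = ',')
    {
        const(char)[][] fields;
        size_t start = 0;
        foreach (i, c; line)
        {
            if (c == sep)
            {
                fields ~= line[start .. i];
                start = i + 1;
            }
        }
        fields ~= line[start .. $];
        return fields;
    }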

My original hope was that the fastcsv code would be taken as a source of
ideas that we could adopt for speeding up std.csv, rather than as "haha,
I wrote faster code than std.csv, so std.csv sux".  The latter was not
my intention at all.

Anyway, I'm glad that you're looking into using slicing in std.csv.  We
need Phobos modules to be highly performant so that newcomers don't get
the wrong impression that the language is slow.  I'd also recommend
looking into reducing GC load, as I described in my previous post, as
another angle for improving the performance of std.csv.
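
One concrete form that can take, purely as an illustrative sketch rather
than a prescription for std.csv, is reusing a single buffer across
records instead of allocating a fresh array per row, e.g. with
std.array.Appender and its clear() method:

    import std.array : appender;

    void parseAllRows(const(char)[][] lines)
    {
        // Reusing one Appender keeps per-row allocations down: clear()
        // retains the already-allocated capacity for the next record.
        auto fields = appender!(const(char)[][])();
        foreach (line; lines)
        {
            fields.clear();
            // ... fields.put(slice) for each field in the line ...
            // ... consume fields.data before the next iteration ...
        }
    }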

As for whether to validate or not: if you were to ask me, I'd leave it
in, with a caveat in the docs that it would be less performant. As the
standard library, Phobos should give the user options, including the
option to validate input files that could potentially be malformed.  But
where the user knows the input is always well-formed, we can (and
should) take advantage of that to achieve better performance.
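
One hypothetical way to offer that choice, sketched here with made-up
names rather than std.csv's actual interface, is a compile-time flag
that selects between a validating path and a trusting fast path:

    // Hypothetical sketch only, not std.csv's actual API.  Validation is
    // chosen at compile time, so the trusting path pays nothing for it.
    enum Validation { yes, no }

    void parseCsv(Validation validate = Validation.yes)(const(char)[] input)
    {
        static if (validate == Validation.yes)
        {
            // check quoting, column counts, etc.; throw on malformed input
        }
        // ... fast parsing path shared by both variants ...
    }

A caller who knows its input is well-formed would then instantiate the
non-validating variant, e.g. parseCsv!(Validation.no)(trustedInput).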


T

-- 
Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway

