std.csv Performance Review

Jesse Phillips via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 2 21:25:27 PDT 2017


Author here:

The discussion[1] and articles[2] around "Faster Command Line 
Tools" had me trying out std.csv for the task.

Now, I know std.csv isn't fast and that it allocates. When I wrote 
my CSV parser, I'd also left around a variant that used slicing 
instead of allocation[3].

I compared the two, compiled with LDC -O3 -release:

std.csv: over 10 seconds
csv slicing: under 5 seconds

A better-than-50% improvement isn't bad, but it still wasn't 
competitive with the other implementations. I didn't profile 
std.csv's implementation, but I did take a look at the slicing one.

The majority of the time was spent in std.algorithm.startsWith, 
which is called by countUntil. The calls to empty() made from 
countUntil and startsWith also add up. These functions are by no 
means slow: startsWith averaged 1 millisecond of execution time 
while countUntil took up to 5 milliseconds. The thing is, 
startsWith was called a whopping 384,806,160 times.
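To illustrate why startsWith dominates: countUntil walks the input one position at a time, effectively testing startsWith at each position until the needle matches. A minimal sketch of that kind of inner loop (illustrative only, not the actual parser code):

```d
import std.algorithm.searching : countUntil;
import std.stdio : writeln;

void main()
{
    // Finding the end of one unquoted field means scanning for the
    // next separator; countUntil does this by checking startsWith
    // at every position, which is where the call count explodes.
    string line = "a,b,c";
    auto end = line.countUntil(",");   // one startsWith test per position
    writeln(line[0 .. end]);           // prints "a"
}
```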

Keep in mind that the file itself has 10,512,769 rows of data 
with four columns. I've talked about std.csv's performance in 
the past, probably with the author of the fast command line 
tools. Essentially it came down to this: std.csv is restricted 
to parsing through the input range API alone, and you can't 
correctly parse CSV that way without allocating. But now I'm 
working outside those restrictions, and so I offer an additional 
point.
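For context on why the input range API forces allocation: with only empty/front/popFront available, there is no way to hand back a slice of the original buffer, so each field has to be copied out element by element. A hypothetical helper showing the shape of the problem (takeField is my name for it, not anything in std.csv):

```d
import std.array : appender;
import std.range.primitives : isInputRange;
import std.stdio : writeln;

// Consume one field up to the separator. Because R is only known to
// be an input range, the characters must be copied into a new buffer;
// this allocation is exactly what slicing would avoid.
string takeField(R)(ref R r, dchar sep)
    if (isInputRange!R)
{
    auto buf = appender!string();
    for (; !r.empty && r.front != sep; r.popFront())
        buf.put(r.front);
    if (!r.empty)
        r.popFront();   // consume the separator itself
    return buf.data;
}

void main()
{
    string s = "x,y";
    writeln(takeField(s, ','));   // prints "x"
    writeln(s);                   // prints "y" (remainder of the input)
}
```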

Both of these parsers do something none of the other 
implementations do: they validate that the CSV is well formed. If 
the file stops conforming to the correct CSV layout, they make a 
choice: either throw an exception, or guess and continue on 
(based on what the user requested). While the Nim implementation 
does handle escaped quotes (and newlines, unlike fast csv), its 
parsing assumes the file is well formed; std.csv was quick to 
point out that this file is in fact not well formed.
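std.csv exposes that choice through its Malformed template parameter. A rough sketch of the two modes follows; the sample data, and whether this particular input triggers the exception, are my assumptions:

```d
import std.csv : csvReader, Malformed, CSVException;
import std.stdio : writeln;

void main()
{
    // A stray quote inside an unquoted field is not valid CSV
    // per RFC 4180.
    string data = "a,b\"c,d";

    // Default mode: validate, and throw when the layout breaks down.
    try
    {
        foreach (record; csvReader!string(data))
            foreach (field; record) {}
    }
    catch (CSVException e)
        writeln("malformed input detected");

    // Malformed.ignore: guess and continue on instead of throwing.
    foreach (record; csvReader!(string, Malformed.ignore)(data))
        foreach (field; record)
            writeln(field);
}
```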

Even though the issue can be ignored, the overhead of parsing to 
identify issues still remains. I haven't attempted to write the 
algorithm assuming properly structured data, so I don't know what 
the performance would look like, but I suspect the cost isn't 
negligible. There is also likely some overhead in providing the 
tokens through range interfaces.

On another note, this slicing version of the CSV parser can and 
should be included in std.csv as a specialization. But it is by 
no means ready. The feature requirements need to be spelled out 
better (hasSlicing!Range fails for strings, yet strings are the 
primary use case for the optimization), and escaped quotes remain 
in the returned data (as I said, proper parsing requires 
allocation).
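The hasSlicing caveat is easy to demonstrate: narrow strings are auto-decoded into ranges of dchar, so even though s[i .. j] compiles for them, the trait reports false:

```d
import std.range.primitives : hasSlicing;

void main()
{
    // Narrow strings are presented as ranges of dchar (auto-decoding),
    // so the trait that would gate the specialization rejects them...
    static assert(!hasSlicing!string);

    // ...while a plain array, which has the same memory shape the
    // optimization actually wants, passes.
    static assert(hasSlicing!(int[]));
}
```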

1. 
http://forum.dlang.org/post/chvukhbscgamxecvpwlw@forum.dlang.org
2. 
https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/
3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize
