std.csv Performance Review
Jesse Phillips via Digitalmars-d
digitalmars-d at puremagic.com
Fri Jun 2 21:25:27 PDT 2017
Author here:
The discussion[1] and articles[2] around "Faster Command Line
Tools" had me trying out std.csv for the task.

Now, I know std.csv isn't fast and that it allocates. When I wrote my
CSV parser, I also left around a parser that uses slicing
instead of allocation[3].

I compared the two, built with LDC using -O3 -release:

std.csv: over 10 seconds
csv slicing: under 5 seconds
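
To make the difference between the two approaches concrete, here is a minimal sketch of what "slicing instead of allocation" means. It is not the code from [3]; FieldSlicer is a hypothetical name, and it only handles unquoted fields. Each field it yields is a view into the original buffer, so no field data is copied.

```d
import std.stdio : writeln;

/// Hypothetical lazy splitter for unquoted fields (illustration only).
/// Every field it yields is a slice of the input; nothing is copied.
struct FieldSlicer
{
    private const(char)[] rest;    // input not yet consumed
    private const(char)[] current; // field currently exposed by front
    private char delim;
    private bool lastTaken;        // the final field is already in `current`
    private bool finished;         // every field has been consumed

    this(const(char)[] line, char delim = ',')
    {
        this.rest = line;
        this.delim = delim;
        popFront(); // load the first field
    }

    @property bool empty() const { return finished; }
    @property const(char)[] front() const { return current; }

    void popFront()
    {
        if (lastTaken)
        {
            finished = true;
            return;
        }
        // Plain scan over code units: no decoding, no allocation.
        foreach (i, c; rest)
        {
            if (c == delim)
            {
                current = rest[0 .. i];
                rest = rest[i + 1 .. $];
                return;
            }
        }
        current = rest; // no delimiter left: the remainder is the last field
        lastTaken = true;
    }
}

void main()
{
    const(char)[] line = "alpha,beta,,delta";
    foreach (field; FieldSlicer(line))
        writeln(field.length, " bytes at offset ", field.ptr - line.ptr);
}
```

An allocating parser would instead build each field in a fresh buffer (for example with an appender), which is what makes unescaping possible but costs a copy per field.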

Over 50% improvement isn't bad, but this still wasn't competing
with the other implementations. I didn't profile std.csv's
implementation, but I did take a look at the one with slicing.

The majority of the time was spent in std.algorithm.startsWith, which
is called by countUntil. The calls made to empty() from within
countUntil and startsWith also add up. These functions are
by no means slow: startsWith averaged 1 millisecond execution
time, while countUntil was up to 5 milliseconds. The thing is,
startsWith was called a whopping 384,806,160 times.
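
As a rough sanity check on that call count, the arithmetic below relates it to the 10,512,769 rows of four columns in the input file. The 5 second total is taken from the timing of the slicing parser; applying it to the profiled run is an assumption, since an instrumented build would run slower.

```d
import std.stdio : writefln;

enum ulong startsWithCalls = 384_806_160; // call count from the profile
enum ulong rows = 10_512_769;             // rows in the input file
enum ulong columns = 4;                   // columns per row

double callsPerRow() { return cast(double) startsWithCalls / rows; }
double callsPerField() { return callsPerRow() / columns; }

/// Average time available per call, in nanoseconds, if the whole run
/// lasted `seconds` (assumed figure, see the note above).
double nanosPerCall(double seconds) { return seconds * 1e9 / startsWithCalls; }

void main()
{
    writefln("%.1f startsWith calls per row", callsPerRow());     // 36.6
    writefln("%.1f startsWith calls per field", callsPerField()); // 9.2
    writefln("%.1f ns per call at 5 s total", nanosPerCall(5.0)); // 13.0
}
```

Roughly nine startsWith calls per field is the number to keep in mind. The per-call budget of about 13 ns also suggests the per-call figures quoted above are best read as relative profiler numbers rather than wall-clock milliseconds.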

Keep in mind that the file itself has 10,512,769 rows of data
with four columns. I've talked about std.csv's performance in
the past, probably with the author of the fast command line
tools. Essentially, it came down to this: std.csv is restricted to
parsing with only the Input Range API, and you can't correctly
parse CSV without allocation. But now I'm working outside those
restrictions, so I offer an additional point.

Both of these do something none of the other implementations do:
they validate that the CSV is well formed. If a parser finds that
the file no longer conforms to the correct CSV layout, it makes a
choice: either throw an exception, or guess and continue on (based
on what the user requested). While the Nim implementation does
handle escaped quotes (and newlines, unlike fast csv), its parsing
assumes the file is well formed, and std.csv was quite prompt
to point out that this file is indeed not well formed.
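
The two behaviours can be seen with std.csv itself. The sketch below uses the real csvReader, CSVException and Malformed names from std.csv; the helper names and the sample input (a quote inside an unquoted field) are made up for illustration and are not taken from the benchmark file.

```d
import std.csv : csvReader, CSVException, Malformed;
import std.stdio : writeln;

/// Counts fields with std.csv's default policy, which throws a
/// CSVException when the input is malformed.
size_t countStrict(string text)
{
    size_t n;
    foreach (record; csvReader!string(text))
        foreach (field; record)
            ++n;
    return n;
}

/// Counts fields while asking std.csv to continue past malformed input.
size_t countLenient(string text)
{
    size_t n;
    foreach (record; csvReader!(string, Malformed.ignore)(text))
        foreach (field; record)
            ++n;
    return n;
}

void main()
{
    // Hypothetical malformed input: a quote inside an unquoted field.
    string bad = "a,b\"c,d\n";

    try
    {
        writeln("strict: ", countStrict(bad), " fields");
    }
    catch (CSVException e)
    {
        writeln("strict: rejected (", e.msg, ")");
    }

    writeln("lenient: ", countLenient(bad), " fields");
}
```

Either way the parser still inspects every character to decide whether the layout is valid, and that work is the overhead in question.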

Even though the issue can be ignored, the overhead of parsing to
identify issues still remains. I haven't attempted to write the
algorithm assuming a proper data structure, so I don't know what the
performance would look like, but I suspect the overhead isn't negligible.
There is also likely some overhead from providing the tokens
through range interfaces.

On another note, this slicing version of the CSV parser
can and should be included in std.csv as a specialization. But it
is by no means ready. The feature requirements need to be spelled
out better (hasSlicing!Range fails for strings, yet strings are the
primary use case for the optimization), and escaped quotes remain in
the returned data (like I said, proper parsing requires allocation).
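
Both open points can be shown in a few lines. In the sketch below, hasSlicing, isSomeString and std.array.replace are real Phobos symbols; canSliceInput and quotedBody are hypothetical names, and the combined constraint is only one possible way to spell out the requirement, not a decided design.

```d
import std.array : replace;
import std.range.primitives : hasSlicing;
import std.stdio : writeln;
import std.traits : isSomeString;

// hasSlicing is false for narrow strings, so a constraint written with it
// alone would exclude the main use case.
static assert(!hasSlicing!string);

// Assumed constraint for a slicing specialization: accept sliceable
// ranges and, explicitly, strings.
enum canSliceInput(R) = hasSlicing!R || isSomeString!R;
static assert(canSliceInput!string);
static assert(canSliceInput!(ubyte[]));

/// Hypothetical helper: drops the outer quotes of a quoted field by slicing.
string quotedBody(string field)
{
    return field[1 .. $ - 1];
}

void main()
{
    string field = `"say ""hi"""`;

    // Slicing keeps the doubled quotes, because it can only narrow the view.
    string raw = quotedBody(field);
    writeln(raw);

    // Collapsing "" into " needs a new buffer, i.e. an allocation.
    string clean = raw.replace(`""`, `"`);
    writeln(clean);
}
```

A slicing specialization would therefore either hand back fields with the escapes still in place, or fall back to allocating only for the fields that contain them.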

1. http://forum.dlang.org/post/chvukhbscgamxecvpwlw@forum.dlang.org
2. https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/
3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize