Improving CSV parsing performance, Episode 2 (Was: Re: Speed of csvReader)

H. S. Teoh via Digitalmars-d-learn <digitalmars-d-learn at puremagic.com>
Sat Jan 23 17:57:11 PST 2016


On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via Digitalmars-d-learn wrote:
[...]
> I guess the next step is allowing Tuple rows with mixed types.

Alright. I threw together a new CSV parsing function that loads CSV data
into an array of structs. The implementation is not quite polished yet
(it blindly assumes the first row is a header row, which it discards),
but it does work, and it outperforms std.csv by about an order of
magnitude.
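For reference, this is the same kind of interface std.csv already offers
for struct layouts (and what the stdstruct benchmark target exercises);
the Layout struct and input here are made up for illustration:

```d
import std.csv;
import std.stdio;

// A hypothetical record layout; the benchmark uses a struct matching
// the census.gov file instead.
struct Layout {
    int id;
    string name;
    double value;
}

void main() {
    string input = "1,apple,0.5\n2,banana,0.25\n";

    // std.csv converts each field with std.conv.to! into the
    // corresponding struct field.
    foreach (record; csvReader!Layout(input))
        writeln(record.id, " ", record.name, " ", record.value);
}
```

The new fastcsv function produces the same kind of struct array, just
much faster.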

The initial implementation was very slow (albeit still somewhat faster
than std.csv, by about 10% or so) when given a struct with string fields.
However, structs with POD fields are lightning fast (not significantly
different from before, in spite of all the calls to std.conv.to!). This
suggested that the slowdown was caused by excessive allocations of small
strings, causing a heavy GC load.  This suspicion was confirmed when I
ran the same input data with a struct where all string fields were
replaced with const(char)[] (so that std.conv.to simply returned slices
into the data) -- the running time dropped back to about 1700 msecs, a
little slower than the original version that read into an array of
arrays of const(char)[] slices, but about 58 times(!) the performance of
std.csv.
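The difference between the two field types comes down to what
std.conv.to has to do. Here's a small self-contained illustration (my
own sketch, not fastcsv code):

```d
import std.conv : to;

void main() {
    char[] field = "hello".dup;      // stands in for a slice of the input buffer

    // string is immutable(char)[], so to! must copy the data into a
    // fresh GC allocation -- one small allocation per string field:
    string s = to!string(field);
    assert(s.ptr !is field.ptr);

    // const(char)[] accepts the slice as-is, so to! can just return
    // it -- no allocation at all:
    const(char)[] c = to!(const(char)[])(field);
    assert(c.ptr is field.ptr);
}
```

Millions of those per-field allocations are what drive up the GC load in
the string case.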

So I tried a simple optimization: instead of allocating a string per
field, allocate 64KB string buffers, copy string field values into them,
and then take slices of the buffer to assign to the struct's string
fields.  With this optimization, running times came down to the 1900
msec range, which is only marginally slower than the const(char)[] case
and about 51 times faster than std.csv.
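The idea can be sketched like this (a minimal stand-in, not the actual
fastcsv code; the StringPool name and intern method are made up):

```d
// Carve string slices out of shared 64KB buffers so that each field
// costs a copy, not a separate GC allocation.
struct StringPool {
    enum size_t bufSize = 64 * 1024;
    private char[] buf;
    private size_t used;

    string intern(const(char)[] field) {
        // Allocate a fresh buffer when the current one can't fit the field.
        if (buf.length - used < field.length) {
            buf = new char[field.length > bufSize ? field.length : bufSize];
            used = 0;
        }
        buf[used .. used + field.length] = field[];
        auto slice = buf[used .. used + field.length];
        used += field.length;
        // The pool never overwrites handed-out regions, so casting the
        // slice to immutable is safe here.
        return cast(string) slice;
    }
}

void main() {
    StringPool pool;
    string a = pool.intern("alpha");
    string b = pool.intern("beta");
    assert(a == "alpha" && b == "beta");
    // Both strings live back-to-back in the same 64KB buffer, so the
    // GC sees one large allocation instead of one per field.
    assert(a.ptr + a.length is b.ptr);
}
```

This trades a small amount of copying for far fewer, far larger GC
allocations, which is why it nearly matches the zero-copy const(char)[]
numbers.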

Here are the actual benchmark values:

1) std.csv: 2126883 records, 102136 msecs

2) fastcsv (struct with string fields): 2126883 records, 1978 msecs

3) fastcsv (struct with const(char)[] fields): 2126883 records, 1743 msecs

The latest code is available on github:

	https://github.com/quickfur/fastcsv

The benchmark driver now has 3 new targets:

stdstruct	- std.csv parsing of CSV into structs
faststruct	- fastcsv parsing of CSV into struct (string fields)
faststruct2	- fastcsv parsing of CSV into struct (const(char)[] fields)

Note that the structs are hard-coded into the code, so they will only
work with the census.gov test file.

Things still left to do:

- Fix header parsing to have a consistent interface with std.csv, or at
  least allow the user to configure whether or not the first row should
  be discarded.

- Support transcription to Tuples?

- Refactor the code to have less copy-pasta.

- Ummm... make it ready for integration with std.csv maybe? ;-)


T

-- 
Fact is stranger than fiction.

