Speed of csvReader

H. S. Teoh via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Thu Jan 21 17:22:16 PST 2016


On Fri, Jan 22, 2016 at 12:56:02AM +0000, cym13 via Digitalmars-d-learn wrote:
[...]
> Great! Sorry for the separator thing, I didn't read your code
> carefully. You still lack some things like comments and surely more
> things that I don't know about but it's getting there.

Comments? You mean in the code?  'cos the CSV grammar described in
RFC-4180 doesn't seem to have the possibility of comments in the CSV
itself...


> I didn't think you'd go through the trouble of fixing those things to
> be honest, I'm impressed.

They weren't that hard to fix, because the original code already had a
separate path for quoted values, so it was just a matter of deleting
some of the loop conditions to make the quoted path accept delimiters
and newlines. In fact, the original code already accepted doubled
quotes in the unquoted field path.

Only the interpretation of doubled quotes required modifications to
both inner loops.
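
To make the quoted path concrete, here is a minimal sketch of what such a
loop could look like. It is not the code under discussion: the name
parseQuoted and its signature are made up for illustration, and it copies
into a buffer rather than trying to be fast. Delimiters and newlines are
treated as ordinary characters, a doubled quote becomes one literal quote,
and every lookahead is bounds-checked.

```d
import std.array : appender;

/// Hypothetical helper: parses one quoted field starting at data[pos],
/// which must be the opening quote. Returns the field contents and
/// leaves pos just past the closing quote (or at the end of the input
/// if the closing quote is missing).
string parseQuoted(string data, ref size_t pos)
{
    assert(pos < data.length && data[pos] == '"');
    pos++; // skip the opening quote
    auto buf = appender!string();
    while (pos < data.length)
    {
        if (data[pos] == '"')
        {
            // Doubled quote: emit one literal quote and keep going.
            // The length check guards the one-character lookahead.
            if (pos + 1 < data.length && data[pos + 1] == '"')
            {
                buf.put('"');
                pos += 2;
                continue;
            }
            pos++; // closing quote
            break;
        }
        // Inside quotes, delimiters and newlines are ordinary characters.
        buf.put(data[pos]);
        pos++;
    }
    return buf.data;
}

void main()
{
    string input = "\"a,b\n\"\"c\"\"\",next";
    size_t pos = 0;
    auto field = parseQuoted(input, pos);
    assert(field == "a,b\n\"c\"");
    assert(input[pos .. $] == ",next");
}
```

Copying through an appender is the simple option; a speed-oriented parser
would more likely slice the input and only copy when a doubled quote
actually occurs.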

Having said that, though, I think there are some bugs in the code that
might cause an array overrun... and the fix might slow things down a
bit more. There are also some fundamental limitations:

1) The CSV data has to be loadable into memory in its entirety. This may
not be possible for very large files, or on machines with low memory.

2) There is no range-based interface. I *think* this should be possible
to add, but it will probably increase the overhead and make the code
slower.

3) There is no validation of the input whatsoever. If you feed it
malformed CSV, it will give you nonsensical output. It may also crash,
though hopefully not once I fix those missing bounds checks... but even
then it will still give you nonsensical output.

4) The accepted syntax is actually a little larger than strict CSV (in
the sense of RFC-4180); Unicode input is accepted but RFC-4180 does not
allow Unicode. This may actually be a plus, though, because I'm
expecting that modern CSV may actually contain Unicode data, not just
the ASCII range defined in RFC-4180.
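
For comparison on points 2 and 3, here is a small sketch of how Phobos's
std.csv.csvReader is used as a range. The helper readAll is made up for
illustration and collects everything eagerly just to show the result;
csvReader itself hands out records lazily, and each record is in turn a
range of cells. As far as I know, its default error policy is to throw on
malformed input rather than silently return garbage.

```d
import std.array : array;
import std.csv : csvReader;
import std.stdio : writeln;

/// Hypothetical helper: eagerly collects every record into an array of rows.
string[][] readAll(string text)
{
    string[][] rows;
    // csvReader is lazy: records are parsed as the loop pulls them.
    foreach (record; csvReader!string(text))
        rows ~= record.array; // each record is itself a range of cells
    return rows;
}

void main()
{
    auto rows = readAll("1,\"Doe, J\"\n2,\"said \"\"hi\"\"\"");
    foreach (row; rows)
        writeln(row);
    assert(rows == [["1", "Doe, J"], ["2", "said \"hi\""]]);
}
```

That flexibility is also the kind of overhead a range-based layer on top
of the fast parser would have to pay for.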


T

-- 
The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst

