Review of Jesse Phillips's CSV Parser

Thu Nov 10 02:10:28 PST 2011

On Friday, October 28, 2011 09:18:27 dsimcha wrote:
> Formal review of Jesse Phillips's CSV parser module begins today and
> runs through November .  Following that, a vote will take place for one
> week.  Please post any comments about this library in this thread.
> 
> Docs:
> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
> 
> Code:
> https://github.com/he-the-great/phobos/tree/csv

- Nitpick: You should probably proofread the module documentation again. It 
has several mispellings and grammar errors.

- The functions and types mentioned in the documentation should use LREF so 
that they're links to their documentation.

- Shouldn't the term "header" be used rather than "heading?" So, you'd have 
HeaderMismatchException instead of HeadingMismatchException. Usually, the term 
heading is only used for stuff like the heading of an article. Usually, the 
term header is used for stuff like files. RFC 4180 uses the term header but 
never heading. So, unless I'm missing something, I really think that all of 
the instances of heading (both in the documentation and in the API) should be 
changed to header.

- I'd argue against Separator in csvReader being char by default. Individual 
characters should generally be dchar, not char. Using char by itself is almost 
always bad practice IMHO.

- I'd say that csvReader should say that it returns a Records struct which is 
an input range rather than saying that it provides std.range.isInputRange. We 
just don't talk that way about ranges normally. We say that something is a 
particular range type, not that it provides the API for a particular range 
type. If you want to make "is an input range" a link to 
std.range.isInputRange, that's fine, but what's currently there is not how 
ranges are typically referred to. I'd also suggest that you on the range 
functions where you say that they're part of the std.range.isInputRange 
interface, you should use the term API instead of interface, since interface 
has a specific meaning in the language which has nothing to do with this 
context.

- Why does csvNextToken take an Appender!(char[])? That seems overly 
restrictive. I'd argue that it should take an output range of dchar (which 
Appender!(char[]) is). Then it's not restricted to Appender, and it's not 
restricted to char[].

- Also, the fact that you have isSomeChar!(ElementType!Range) concerns me 
somewhat. Normally, I'd expect something more like is(ElementType!Range == 
dchar). As it is, you could have a range of char or wchar, which could do 
funny things. Phobos doesn't normally support that, and I'd be concerned about 
what would happen if someone tried. ElementType!Range will be dchar for all 
string types, so they'll work fine either way, but programmers who are overly 
adventurous or ignorant may try and use a char or wchar range, and I'm willing 
to be that there would be problems with that. Just all around, I'd advise 
checking that the ranges are ranges of dchar and using dchar for the 
separators rather than char.

- You should have unit tests testing ranges _other_ than string. The fact that 
neither wstring nor dstring is tested is worrisome, and ideally, you'd do stuff 
like call filter!"true" on a string so that you have an input range which isn't 
a string. Without at least some of that, I'd be very worried that your code 
can't actually handle generic ranges of dchar and really only works with 
string.

- You should have some unit tests with unicode values - preferably some which 
are multiple code units for both char and wchar. For instance, std.string uses 
\U00010143 in some of its tests, because it's 4 code units in UTF-8 and 2 in 
UTF-16. The complete lack of testing with unicode is worrisome in the same way 
that only testing with string is worrisome. It's too easy to screw something 
up such that it happens to work with string and ASCII but not in general. So, 
please add tests which involve unicode - not only in the data but in the 
separators.

- I find it somewhat odd that you would use a pointer to the range in Record 
rather than using a slice, but I'd have to examine it in more detail to see 
whether that's actually necessary or not. But in general, I'd argue that a 
slice should be used rather than a pointer if it can be.

- Jonathan M Davis

P.S.  Just as a note, since you seem to like not using braces if you don't 
need to for if statements and the like, I though that I'd point out that try 
and catch don't require braces in D if htey contain only one statement. You 
have several one line catches which use braces when they don't need to. I'm 
not arguing that you should or shouldn't change it, but I thought that I would 
point out that the braces aren't necessary in case you didn't know (since they 
_are_ necessary in other languages).