Review of Jesse Phillips's CSV Parser
Jonathan M Davis
jmdavisProg at gmx.com
Thu Nov 10 02:10:28 PST 2011
On Friday, October 28, 2011 09:18:27 dsimcha wrote:
> Formal review of Jesse Phillips's CSV parser module begins today and
> runs through November . Following that, a vote will take place for one
> week. Please post any comments about this library in this thread.
>
> Docs:
> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>
> Code:
> https://github.com/he-the-great/phobos/tree/csv
- Nitpick: You should probably proofread the module documentation again. It
has several mispellings and grammar errors.
- The functions and types mentioned in the documentation should use LREF so
that they're links to their documentation.
- Shouldn't the term "header" be used rather than "heading?" So, you'd have
HeaderMismatchException instead of HeadingMismatchException. Usually, the term
heading is only used for stuff like the heading of an article. Usually, the
term header is used for stuff like files. RFC 4180 uses the term header but
never heading. So, unless I'm missing something, I really think that all of
the instances of heading (both in the documentation and in the API) should be
changed to header.
- I'd argue against Separator in csvReader being char by default. Individual
characters should generally be dchar, not char. Using char by itself is almost
always bad practice IMHO.
- I'd say that csvReader should say that it returns a Records struct which is
an input range rather than saying that it provides std.range.isInputRange. We
just don't talk that way about ranges normally. We say that something is a
particular range type, not that it provides the API for a particular range
type. If you want to make "is an input range" a link to
std.range.isInputRange, that's fine, but what's currently there is not how
ranges are typically referred to. I'd also suggest that you on the range
functions where you say that they're part of the std.range.isInputRange
interface, you should use the term API instead of interface, since interface
has a specific meaning in the language which has nothing to do with this
context.
- Why does csvNextToken take an Appender!(char[])? That seems overly
restrictive. I'd argue that it should take an output range of dchar (which
Appender!(char[]) is). Then it's not restricted to Appender, and it's not
restricted to char[].
- Also, the fact that you have isSomeChar!(ElementType!Range) concerns me
somewhat. Normally, I'd expect something more like is(ElementType!Range ==
dchar). As it is, you could have a range of char or wchar, which could do
funny things. Phobos doesn't normally support that, and I'd be concerned about
what would happen if someone tried. ElementType!Range will be dchar for all
string types, so they'll work fine either way, but programmers who are overly
adventurous or ignorant may try and use a char or wchar range, and I'm willing
to be that there would be problems with that. Just all around, I'd advise
checking that the ranges are ranges of dchar and using dchar for the
separators rather than char.
- You should have unit tests testing ranges _other_ than string. The fact that
neither wstring nor dstring is tested is worrisome, and ideally, you'd do stuff
like call filter!"true" on a string so that you have an input range which isn't
a string. Without at least some of that, I'd be very worried that your code
can't actually handle generic ranges of dchar and really only works with
string.
- You should have some unit tests with unicode values - preferably some which
are multiple code units for both char and wchar. For instance, std.string uses
\U00010143 in some of its tests, because it's 4 code units in UTF-8 and 2 in
UTF-16. The complete lack of testing with unicode is worrisome in the same way
that only testing with string is worrisome. It's too easy to screw something
up such that it happens to work with string and ASCII but not in general. So,
please add tests which involve unicode - not only in the data but in the
separators.
- I find it somewhat odd that you would use a pointer to the range in Record
rather than using a slice, but I'd have to examine it in more detail to see
whether that's actually necessary or not. But in general, I'd argue that a
slice should be used rather than a pointer if it can be.
- Jonathan M Davis
P.S. Just as a note, since you seem to like not using braces if you don't
need to for if statements and the like, I though that I'd point out that try
and catch don't require braces in D if htey contain only one statement. You
have several one line catches which use braces when they don't need to. I'm
not arguing that you should or shouldn't change it, but I thought that I would
point out that the braces aren't necessary in case you didn't know (since they
_are_ necessary in other languages).
More information about the Digitalmars-d
mailing list