[RFC] CSV parser
Robert Jacques
sandford at jhu.edu
Tue Apr 5 22:53:19 PDT 2011
On Tue, 05 Apr 2011 12:45:59 -0400, Jesse Phillips
<jessekphillips+D at gmail.com> wrote:
> Robert Jacques Wrote:
>> * You should support input ranges. It's fine to detect slicing and optimize for
>> it, but you should support simple input ranges as well.
> This implementation only operates with input ranges, of text. I have
> another implementation which works on a slice-able forward range of
> anything. I stopped development because it didn't support input ranges,
> and Phobos must have this.
> https://github.com/he-the-great/JPDLibs/tree/csvoptimize
The library should work with both and be efficient with both, i.e. detect
at compile-time whether you have a string or an input range, and act
accordingly.
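A minimal sketch of that compile-time dispatch, assuming a hypothetical helper named firstField (not part of the proposed library) that returns the first comma-separated field: strings take a slicing path, any other character input range takes a copying path.

```d
import std.array : appender;
import std.range.primitives : ElementType, isInputRange;
import std.traits : isSomeChar, isSomeString;

// Hypothetical helper, for illustration only: returns the text before the
// first comma. It ignores quoting, which a real CSV parser must handle.
auto firstField(R)(R input)
if (isInputRange!R && isSomeChar!(ElementType!R))
{
    static if (isSomeString!R)
    {
        // String path: scan code units and return a slice, no allocation.
        size_t i = 0;
        while (i < input.length && input[i] != ',')
            ++i;
        return input[0 .. i];
    }
    else
    {
        // Generic input range path: no slicing available, so copy.
        auto buf = appender!(char[])();
        for (; !input.empty && input.front != ','; input.popFront())
            buf.put(input.front);
        return buf.data;
    }
}

unittest
{
    import std.algorithm.iteration : filter;

    string s = "name,grade";
    auto sliced = firstField(s);
    assert(sliced == "name");
    assert(sliced.ptr is s.ptr); // a slice of the original string

    // filter yields a plain input range of dchar, so the copying path runs.
    auto copied = firstField(s.filter!(c => c != ' '));
    assert(copied == "name");
}
```

The unittest block can be run with `dmd -unittest -main`.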
>> * I'd think being able to retrieve the headings from the csv would be a
>> good [optional] feature.
> Do you think it would be good for the interface to use an associative
> array? Maybe the heading could be the first thing to get from a
> RecordList.
> auto records = csvText(str);
> auto heading = records.heading;
> or shouldn't it be passed in as an empty array:
> string[] heading;
> auto records = csvText(str, heading);
Hmm... The latter makes it explicit that there should be headers, while
with the former, you'd need to call heading before parsing the records,
which could be a source of bugs. Then again, records.heading is more
natural. Alternative syntax ideas:
auto records = csvText(str, null);
auto heading = records.heading;
auto records = csvText!(csvOptions.headers)(str);
auto heading = records.heading;
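For comparison, here is a sketch using std.csv.csvReader from present-day Phobos rather than the csvText API being discussed in this thread. It combines the two ideas above: null is passed to request headers, and they are then read from a field on the result. The helper name headerOf is only an illustration.

```d
import std.csv : csvReader;

// Hypothetical helper: returns the header row of `text`.
string[] headerOf(string text)
{
    // Passing null as the header argument asks csvReader to take the
    // column names from the first row of the input.
    auto records = csvReader!(string[string])(text, null);
    return records.header;
}

unittest
{
    string text = "name,grade\nAlice,91\nBob,78";
    assert(headerOf(text) == ["name", "grade"]);

    // Each record is an associative array keyed by column name.
    foreach (record; csvReader!(string[string])(text, null))
        assert(record["name"] == "Alice" || record["name"] == "Bob");
}
```

Because the header is consumed when the reader is constructed, there is no ordering hazard between reading the heading and iterating the records.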
>> * Exposing the tokenizer would be useful.
> Good. I'm thinking I will use an output range of type char[] instead of
> Appender.
Output range? You mean as a buffer? I'd think Appender is the better
choice for that internally, but you could allow the user to pass in a
char[] buffer to use as the basis of the appender.
>> * Regarding buffering, it's okay for the tokenizer to expose buffering
>> in its API (and users should be able to supply their own buffers), but I
>> don't think an unbuffered version of csvText itself is correct;
> As I'm thinking of using an output range I believe it has the job of
> allowing a user specified buffer or not. I'm not sure if the interface
> for csvText or RecordList should allow for custom buffers.
An output range is not a buffer. A buffer requires a clear method and a
data retrieval method. All output ranges provide is put.
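The distinction can be sketched with Phobos types. A bare delegate satisfies isOutputRange but offers only put, while Appender also has clear and data and can be built on a caller-supplied char[]. The helper bufferField is a hypothetical name used only for this illustration.

```d
import std.array : Appender, appender;
import std.range.primitives : isOutputRange;

// A delegate is a valid output range, yet it has no way to read back or
// reset what was written: all it supports is put.
static assert(isOutputRange!(void delegate(char), char));

// Appender is also an output range, and additionally provides data/clear.
static assert(isOutputRange!(Appender!(char[]), char));

// Hypothetical helper: builds an Appender on caller-supplied storage,
// writes one field into it, and returns the written text.
char[] bufferField(char[] storage, string field)
{
    auto buf = appender(storage);
    buf.clear();    // drop the existing contents so later puts can reuse the storage
    buf.put(field); // output range interface
    return buf.data; // data retrieval, which a plain output range lacks
}

unittest
{
    auto storage = new char[](64);
    assert(bufferField(storage, "field one") == "field one");
    assert(bufferField(storage, "two") == "two");
}
```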
>> csvByRecord or csvText!(T).byRecord would be more appropriate.
> You always iterate CSV by record, what other option is there? If you
> desire an array for the record:
> auto records = csvText(str);
> foreach(record; records)
> auto arr = array(record); // ...
Well, some of your example code included
csvText!MyStruct(str,["my","headers"]); or possibly just
csvText!MyStruct(str), which would have to provide either a forward range of
MyStructs or an array of MyStruct. In either case, duplication of strings
where appropriate should occur. Also, remember that records might not be
manually processed; the user might forward them to map, filter, etc., so
the range should be safe by default. A user should have to manually state
that they want a lazy buffered version.
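To illustrate why safe-by-default matters, here is a sketch with a hypothetical range, ReusingRecords, that hands out slices of one shared buffer (a stand-in for a lazy buffered reader, not the proposed implementation). Collecting it directly aliases every element to the last record, while duplicating each record first keeps them distinct.

```d
import std.algorithm.iteration : map;
import std.array : array;

// Hypothetical lazy range: every front is a slice of the same buffer.
struct ReusingRecords
{
    string[] source; // stand-in for the underlying input
    char[] buffer;   // reused for every record; must be large enough

    @property bool empty() const { return source.length == 0; }

    @property char[] front()
    {
        immutable n = source[0].length;
        buffer[0 .. n] = source[0][]; // overwrite the shared buffer
        return buffer[0 .. n];
    }

    void popFront() { source = source[1 .. $]; }
}

// Duplicate each record before storing it, so results do not alias.
string[] safeCopy(ReusingRecords r)
{
    return r.map!(rec => rec.idup).array;
}

unittest
{
    // Storing the slices directly: both entries end up showing "bb".
    auto aliased = ReusingRecords(["aa", "bb"], new char[](8)).array;
    assert(aliased[0] == "bb" && aliased[1] == "bb");

    // Duplicating first preserves each record.
    auto copied = safeCopy(ReusingRecords(["aa", "bb"], new char[](8)));
    assert(copied == ["aa", "bb"]);
}
```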
>> And anyways, since you're only using strings, why is there any buffering
>> going on at all? string values should simply be sliced, not buffered.
>> Buffering should only come into play with input ranges.
> Input ranges do not provide slicing. On top of that you _must_ modify
> the returned data at times. So you can do buffering and slicing together
> which my other implementation will do, but as I said it won't support
> input ranges.
Why do you need to do any modification? Part of the advantage of csv is
that it's just text: you don't have to deal with escape characters, etc.
>> * There should be a way to specify other separators; I've started using
>> tab separated files as ','s show up in a lot of data.
> You can use a custom "comma" and "quote". The record break is not
> modifiable because it doesn't fit into a single character. I can make
> this available through csvText and am not exactly sure why I didn't.
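As a point of reference, std.csv.csvReader in present-day Phobos (not the csvText API under discussion here) takes the separator as a runtime argument. The sketch below reads tab-separated data whose first column contains commas; firstColumn is a hypothetical helper name.

```d
import std.csv : csvReader;
import std.typecons : Tuple;

// Hypothetical helper: returns the first column of two-column data
// (a string and an int) that uses `sep` as the field separator.
string[] firstColumn(string text, char sep)
{
    string[] names;
    foreach (record; csvReader!(Tuple!(string, int))(text, sep))
        names ~= record[0];
    return names;
}

unittest
{
    // Commas inside the data are harmless once tab is the separator.
    string text = "Smith, John\t91\nDoe, Jane\t78";
    assert(firstColumn(text, '\t') == ["Smith, John", "Doe, Jane"]);
}
```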
>> * Any thought of parsing a file into a tuple of arrays? Writing csv?
> A static CSV parser? I would hope CTFE would allow for using this
> parser, at some point.
By tuple I meant the tuple(5,7) kind, not the TypeTuple!(int,char) kind.
auto records = csvText!(string,real)(str);
string[] names = records._0;
real[] grades = records._1;
auto records = csvText!(Tuple!(string,"name",real,"grade"))(str);
foreach(record; records) {
    writeln(record.name," got ",record.grade);
}
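Both ideas can be approximated with std.csv.csvReader from present-day Phobos, which is an assumption relative to this thread since csvText is only a proposal. The sketch parses each record into a named Tuple and then gathers one array per column; the alias Row and the helper columns are hypothetical names.

```d
import std.csv : csvReader;
import std.typecons : Tuple, tuple;

// One record: fields are filled in column order and converted by type.
alias Row = Tuple!(string, "name", real, "grade");

// Hypothetical helper: gathers the records into one array per column.
auto columns(string text)
{
    string[] names;
    real[] grades;
    foreach (record; csvReader!Row(text))
    {
        names ~= record.name;
        grades ~= record.grade;
    }
    return tuple!("names", "grades")(names, grades);
}

unittest
{
    auto c = columns("Alice,91.5\nBob,78");
    assert(c.names == ["Alice", "Bob"]);
    assert(c.grades.length == 2 && c.grades[0] > c.grades[1]);
}
```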
> For writing I made a quick and dirty one to write structures to a file.
> But I have not considered adding such functionality to the library.
> Thank you for the comment.
> void writeStruct(string file, Layout data) {
>     string[] elements;
>     foreach(i, U; FieldTypeTuple!Layout) {
>         elements ~= to!string(data.tupleof[i]);
>     }
>     string record;
>     foreach(e; elements) {
>         if(find(e, `"`, ",", "\n", "\r")[1] == 0)
>             record ~= e ~ ",";
>         else
>             record ~= `"` ~ replace(e, `"`, `""`) ~ `",`;
>     }
>     std.file.append(file, record[0..$-1] ~ "\n");
> }
> https://gist.github.com/805392
