[RFC] CSV parser
Jesse Phillips
jessekphillips+D at gmail.com
Tue Apr 5 09:45:59 PDT 2011
Robert Jacques Wrote:
> * You should input ranges. It's fine to detect slicing and optimize for
> it, but you should support simple input ranges as well.
This implementation operates only on input ranges of text. I have another implementation that works on a sliceable forward range of anything. I stopped developing it because it didn't support input ranges, and Phobos must have this.
https://github.com/he-the-great/JPDLibs/tree/csvoptimize
> * I'd think being able to retrieve the headings from the csv would be a
> good [optional] feature.
Do you think it would be good for the interface to use an associative array? Maybe the heading could be the first thing you get from a RecordList:
    auto records = csvText(str);
    auto heading = records.heading;
Or should it be passed in as an empty array instead:
    string[] heading;
    auto records = csvText(str, heading);
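To make the associative array idea concrete, here is a minimal sketch that pairs a heading with the fields of one record. The name byHeading is hypothetical and not part of the library; it assumes the heading and the record have already been collected into string arrays.

```d
import std.array : assocArray;
import std.range : zip;

// Hypothetical helper (not part of csvText): map each heading to the
// field in the same column. zip stops at the shorter of the two arrays.
string[string] byHeading(string[] heading, string[] fields)
{
    return assocArray(zip(heading, fields));
}
```

A caller could then write row["name"] instead of remembering the column index.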
> * Exposing the tokenizer would be useful.
Good. I'm thinking I will use an output range of type char[] instead of Appender.
> * Regarding buffering, it's okay for the tokenizer to expose buffering in
> it's API (and users should be able to supply their own buffers), but I
> don't think an unbuffered version of csvText itself is correct;
Since I'm thinking of using an output range, I believe the output range is what should decide whether a user-specified buffer is used. I'm not sure whether the interface for csvText or RecordList should allow custom buffers.
> csvByRecord or csvText!(T).byRecord would be more appropriate.
You always iterate CSV by record; what other option is there? If you want an array for each record:
    auto records = csvText(str);
    foreach(record; records)
        auto arr = array(record); // ...
> And
> anyways, since you're only using strings, why is there any buffering going
> on at all? string values should simply be sliced, not buffered. Buffering
> should only come into play with input ranges.
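Input ranges do not provide slicing. On top of that, you _must_ modify the returned data at times: a doubled quote inside a quoted field has to be collapsed to a single quote, so that field cannot be a slice of the input. You can do buffering and slicing together, which my other implementation will do, but as I said, it won't support input ranges.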
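As a rough illustration of mixing slicing and buffering, the sketch below returns the original slice when a field needs no changes and allocates only when an escaped quote has to be rewritten. The function name unquoteField is hypothetical, and it assumes the input is a string holding the text between the outer quotes of a field; it is not how either implementation is written.

```d
import std.array : replace;
import std.string : indexOf;

// Hypothetical helper: raw is the text between the outer quotes of a field.
string unquoteField(string raw)
{
    // No escaped quote present: hand back the same slice, no copy made.
    if (raw.indexOf(`""`) < 0)
        return raw;
    // Otherwise the data must change, so a new string is built.
    return raw.replace(`""`, `"`);
}
```

With a plain input range there is nothing to slice, so every field would take the copying path.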
> * There should be a way to specify other separators; I've started using
> tab separated files as ','s show up in a lot of data.
You can use a custom "comma" and "quote". The record break is not modifiable because it doesn't fit into a single character. I can make this available through csvText, and I am not exactly sure why I didn't.
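For illustration only, here is a small sketch of what a configurable separator and quote character can look like. splitRecord is a hypothetical name, not the csvText interface; it assumes a single record held in a string, with a one-character separator and quote, and it does not deal with record breaks at all.

```d
import std.array : appender;

// Hypothetical sketch, not the csvText API: split one record into fields
// using a caller-chosen separator and quote character.
string[] splitRecord(string record, char sep = ',', char quote = '"')
{
    string[] fields;
    auto field = appender!string();
    bool inQuotes = false;

    for (size_t i = 0; i < record.length; ++i)
    {
        immutable c = record[i];
        if (inQuotes)
        {
            if (c != quote)
                field.put(c);
            else if (i + 1 < record.length && record[i + 1] == quote)
            {
                field.put(quote); // doubled quote is an escaped quote
                ++i;
            }
            else
                inQuotes = false; // closing quote
        }
        else if (c == quote)
            inQuotes = true;
        else if (c == sep)
        {
            fields ~= field.data;          // field is complete
            field = appender!string();     // start a fresh buffer
        }
        else
            field.put(c);
    }
    fields ~= field.data; // last field has no trailing separator
    return fields;
}
```

Passing '\t' as the separator handles tab-separated data, where commas inside the values no longer need quoting.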
> * Any thought of parsing a file into a tuple of arrays? Writing csv?
A static CSV parser? I would hope that CTFE will allow this parser to be used at compile time at some point.
For writing, I made a quick-and-dirty function that writes structs to a file, but I have not considered adding such functionality to the library. Thank you for the comment.
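    import std.algorithm : find;
    import std.array : replace;
    import std.conv : to;
    import std.traits : FieldTypeTuple;
    static import std.file;

    // Layout is the struct type being written; define it to match your data.
    void writeStruct(string file, Layout data) {
        // Convert each field of the struct to its string form.
        string[] elements;
        foreach(i, U; FieldTypeTuple!Layout) {
            elements ~= to!string(data.tupleof[i]);
        }

        string record;
        foreach(e; elements) {
            // find returns a tuple; index 0 means none of the needles matched.
            if(find(e, `"`, ",", "\n", "\r")[1] == 0)
                record ~= e ~ ",";
            else // quote the field and double any embedded quotes
                record ~= `"` ~ replace(e, `"`, `""`) ~ `",`;
        }
        // Drop the trailing comma and terminate the record.
        std.file.append(file, record[0..$-1] ~ "\n");
    }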
https://gist.github.com/805392