[RFC] CSV parser

Robert Jacques sandford at jhu.edu
Tue Apr 5 22:53:19 PDT 2011


On Tue, 05 Apr 2011 12:45:59 -0400, Jesse Phillips  
<jessekphillips+D at gmail.com> wrote:

> Robert Jacques Wrote:
>
>> * You should input ranges. It's fine to detect slicing and optimize for
>> it, but you should support simple input ranges as well.
>
> This implementation only operates with input ranges, of text. I have  
> another implementation which works on a slice-able forward range of  
> anything. I stopped development because it didn't support input ranges,  
> and Phobos must have this.
>
> https://github.com/he-the-great/JPDLibs/tree/csvoptimize

The library should work with both and be efficient with both. i.e. detect  
at compile-time whether you have a string or an input range, and act  
accordingly.

>> * I'd think being able to retrieve the headings from the csv would be a
>> good [optional] feature.
>
> Do you think it would be good for the interface to use an associative  
> array. Maybe the heading could be the first thing to get from a  
> RecordList.
>
> auto records  = csvText(str);
> auto heading = records.heading;
>
> or shouldn't it be passed in as an empty array:
>
> string[] heading;
> auto records = csvText(str, heading);

Hmm... The latter makes it explicit that there should be headers, while  
with the former, you'd need to call heading before parsing the records,  
which could be a source of bugs. Then again, records.heading is more  
natural. Alternative syntax ideas:

auto records = csvText(str, null);
auto heading = records.heading;

or

auto records = csvText!(csvOptions.headers)(str);
auto heading = records.heading;

>> * Exposing the tokenizer would be useful.
>
> Good. I'm thinking I will use an output range of type char[] instead of  
> Appender.

Output range? You mean as a buffer? I'd think Appender is the better  
choice for that internally, but you could allow the user to pass in a  
char[] buffer to use as the basis of the appender.

>> * Regarding buffering, it's okay for the tokenizer to expose buffering  
>> in
>> it's API (and users should be able to supply their own buffers), but I
>> don't think an unbuffered version of csvText itself is correct;
>
> As I'm thinking of using an output range I believe it has the job of  
> allowing a user specified buffer or not. I'm not sure if the interface  
> for csvText or RecordList should allow for custom buffers.

An output range is not a buffer. A buffer requires a clear method and a  
data retrieval method. All output ranges provide is put.

>> csvByRecord or csvText!(T).byRecord would be more appropriate.
>
> You always iterate CSV by record, what other option is there? If you  
> desire an array for the record:
>
> auto records = csvText(str);
>
> foreach(record: records)
>     auto arr = array(record); // ...

Well, some of your example code included  
csvText!MyStruct(str,["my","headers"]); or possibly just  
csvText!MyStruct(str). Which would have either provide a forward range of  
MyStructs or an array of MyStruct. In either case, duplication of strings  
where appropriate should occur. Also, remember that records might not be  
manually processed; the user might forward it to map,filter,etc and  
therefore should be safe by default. A user should have to manually state  
that they are going to want a lazy buffered version.

>> And
>> anyways, since you're only using strings, why is there any buffering  
>> going
>> on at all? string values should simply be sliced, not buffered.  
>> Buffering
>> should only come into play with input ranges.
>
> Input ranges do not provide slicing. On top of that you _must_ modify  
> the returned data at times. So you can do buffering and slicing together  
> which my other implementation will do, but as I said it won't support  
> input ranges.

Why do you need to do any modification? Part of the advantage of csv is  
that it's just text: you don't have to deal with escape characters, etc.

>> * There should be a way to specify other separators; I've started using
>> tab separated files as ','s show up in a lot of data.
>
> You can use a custom "comma" and "quote" the record break is not  
> modifiable because it doesn't fit into a single character. I can make  
> this available through csvText and am not exactly sure why I didn't.
>
>> * Any thought of parsing a file into a tuple of arrays? Writing csv?
>
> A static CSV parser? I would hope CTFE would allow for using this  
> parser, at some point.

By tuple I meant the tuple(5,7) kind, not the TypeTuple!(int,char) kind.  
Basically,

auto records = csvText!(string,real)(str);
string[] names   = records._0;
real[]   grades  = records._1;

vs

auto records = csvText!(Tuple!(string,"name",real,"grade"))(str);
foreach(record; records) {
	writeln(record.name," got ",record.grade);
}
	

> For writing I made a quick and dirty one to write structures to a file.  
> But I have not considered adding such functionality to the library.  
> Thank you for the comment.
>
> void writeStruct(string file, Layout data) {
>     string[] elements;
>
>     foreach(i, U; FieldTypeTuple!Layout) {
>         elements ~= to!string(data.tupleof[i]);
>     }
>
>     string record;
>     foreach(e; elements) {
>         if(find(e, `"`, ",", "\n", "\r")[1] == 0)
>             record ~= e ~ ",";
>         else
>             record ~= `"` ~ replace(e, `"`, `""`) ~ `",`;
>     }
>
>     std.file.append(file, record[0..$-1] ~ "\n");
> }
>
> https://gist.github.com/805392


More information about the Digitalmars-d mailing list