[RFC] CSV parser

Jesse Phillips jessekphillips+D at gmail.com
Tue Apr 5 09:45:59 PDT 2011


Robert Jacques Wrote:

> * You should input ranges. It's fine to detect slicing and optimize for  
> it, but you should support simple input ranges as well.

This implementation operates only on input ranges of text. I have another implementation that works on a sliceable forward range of any element type. I stopped developing it because it didn't support input ranges, and Phobos must support them.

https://github.com/he-the-great/JPDLibs/tree/csvoptimize

> * I'd think being able to retrieve the headings from the csv would be a  
> good [optional] feature.

Do you think it would be good for the interface to use an associative array? Maybe the heading could be the first thing you get from a RecordList:

auto records = csvText(str);
auto heading = records.heading;

Or should it be passed in as an empty array:

string[] heading;
auto records = csvText(str, heading);
 
> * Exposing the tokenizer would be useful.

Good. I'm thinking I will use an output range of type char[] instead of Appender.

> * Regarding buffering, it's okay for the tokenizer to expose buffering in  
> its API (and users should be able to supply their own buffers), but I  
> don't think an unbuffered version of csvText itself is correct; 

Since I'm thinking of using an output range, I believe it is the output range's job to decide whether a user-specified buffer is used. I'm not sure whether the interface for csvText or RecordList should also allow custom buffers.

> csvByRecord or csvText!(T).byRecord would be more appropriate. 

You always iterate CSV by record; what other option is there? If you want an array for each record:

auto records = csvText(str);

foreach(record; records)
    auto arr = array(record); // ...

> And  
> anyways, since you're only using strings, why is there any buffering going  
> on at all? string values should simply be sliced, not buffered. Buffering  
> should only come into play with input ranges.

Input ranges do not provide slicing. On top of that, you _must_ modify the returned data at times. You can combine buffering and slicing, which my other implementation will do, but as I said, it won't support input ranges.
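
One case where the returned data presumably has to be modified is a quoted field containing doubled quotes: the unescaped value is not a contiguous slice of the input. A minimal sketch, where unquoteField is a hypothetical helper for illustration and not part of the proposed API:

```d
import std.array : replace;

// Hypothetical helper for illustration only; not part of the proposed API.
// Assumes `raw` is a complete quoted field, including both outer quotes.
string unquoteField(string raw)
{
    // Drop the outer quotes, then collapse each doubled quote into one.
    return raw[1 .. $ - 1].replace(`""`, `"`);
}

unittest
{
    // The result differs from every slice of the input, so it needs new storage.
    assert(unquoteField(`"say ""hi"""`) == `say "hi"`);
    // Without embedded quotes the content matches a plain slice of the input.
    assert(unquoteField(`"plain"`) == "plain");
}
```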

> * There should be a way to specify other separators; I've started using  
> tab separated files as ','s show up in a lot of data.

You can use a custom "comma" and "quote"; the record break is not modifiable because it doesn't fit into a single character. I can make this available through csvText, and I'm not exactly sure why I didn't.
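
As a rough illustration of a single-character separator and quote, here is a sketch of splitting one already-isolated record. splitRecord and its parameters are hypothetical and are not the library's interface; the sketch assumes the separator and quote are ASCII characters.

```d
import std.array : appender;

// Hypothetical helper, not the library's tokenizer: splits one record
// into fields using a caller-chosen separator and quote character.
string[] splitRecord(string line, char sep = ',', char quote = '"')
{
    string[] fields;
    auto field = appender!string();
    bool inQuotes = false;

    for (size_t i = 0; i < line.length; ++i)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c != quote)
                field.put(c);
            else if (i + 1 < line.length && line[i + 1] == quote)
            {
                // A doubled quote inside a quoted field is a literal quote.
                field.put(quote);
                ++i;
            }
            else
                inQuotes = false; // closing quote
        }
        else if (c == quote)
            inQuotes = true;      // opening quote
        else if (c == sep)
        {
            // End of field: store it and start a fresh buffer.
            fields ~= field.data;
            field = appender!string();
        }
        else
            field.put(c);
    }
    fields ~= field.data; // last field has no trailing separator
    return fields;
}

unittest
{
    // Tab-separated record with a quoted field that contains a tab.
    assert(splitRecord("a\t\"b\tc\"\td", '\t') == ["a", "b\tc", "d"]);
    // Default comma separator with a doubled quote inside a quoted field.
    assert(splitRecord(`x,"say ""hi""",z`) == ["x", `say "hi"`, "z"]);
}
```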

> * Any thought of parsing a file into a tuple of arrays? Writing csv?

A static CSV parser? I would hope CTFE will allow using this parser at some point.

For writing, I made a quick-and-dirty function that writes structures to a file, but I have not considered adding such functionality to the library. Thank you for the comment.

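import std.algorithm : find;
import std.array : replace;
import std.conv : to;
import std.traits : FieldTypeTuple;
static import std.file;

// Layout is the struct type being written; its definition is not shown here.
void writeStruct(string file, Layout data) {
    // Convert every field of the struct to its string form.
    string[] elements;
    foreach(i, U; FieldTypeTuple!Layout) {
        elements ~= to!string(data.tupleof[i]);
    }

    string record;
    foreach(e; elements) {
        // find's second tuple member is 0 when none of the needles occur.
        if(find(e, `"`, ",", "\n", "\r")[1] == 0)
            record ~= e ~ ",";
        else
            // Quote the field and double any embedded quotes.
            record ~= `"` ~ replace(e, `"`, `""`) ~ `",`;
    }

    // Drop the trailing comma and terminate the record.
    std.file.append(file, record[0..$-1] ~ "\n");
}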

https://gist.github.com/805392