[RFC] CSV parser

Jesse Phillips jessekphillips+d at gmail.com
Wed Apr 6 19:12:41 PDT 2011


On Wed, 06 Apr 2011 01:53:19 -0400, Robert Jacques wrote:

> The library should work with both and be efficient with both. i.e.
> detect at compile-time whether you have a string or an input range, and
> act accordingly.

I'm considering benchmarking my two implementations. However, I don't 
know of a good setup for measuring memory usage. For now it is not a 
priority, as specialization can be added later without changing the 
interface or capabilities.
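
For reference, the compile-time detection being suggested could look 
roughly like the sketch below. It is not taken from either of my 
implementations; chosenPath is a hypothetical name, and the returned 
labels only stand in for the two code paths.

```d
import std.range : isInputRange, ElementType;
import std.traits : isSomeString, isSomeChar;

// Hypothetical helper, not part of the proposed module: reports which
// code path a CSV reader could select for a given input type.
string chosenPath(Range)(Range input)
    if (isInputRange!Range && isSomeChar!(ElementType!Range))
{
    static if (isSomeString!Range)
        return "slice"; // whole text is in memory, fields could be slices
    else
        return "copy";  // generic input range, fields must be buffered
}
```

Because the branch is chosen with static if, adding the string 
specialization later would not change what the caller writes.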

>> auto records  = csvText(str);
>> auto heading = records.heading;
>>
>> or shouldn't it be passed in as an empty array:
>>
>> string[] heading;
>> auto records = csvText(str, heading);
> 
> Hmm... The latter makes it explicit that there should be headers, while
> with the former, you'd need to call heading before parsing the records,
> which could be a source of bugs. Then again, records.heading is more
> natural. Alternative syntax ideas:
> 
> auto records = csvText(str, null);
> auto heading = records.heading;
> 
> or
> 
> auto records = csvText!(csvOptions.headers)(str);
> auto heading = records.heading;

I'm not fond of the csvOptions approach, but combining an explicit 
request for the heading with the ability to retrieve it from the records 
seems reasonable.

>>> * Regarding buffering, it's okay for the tokenizer to expose buffering
>>> in its API (and users should be able to supply their own buffers), but
>>> I don't think an unbuffered version of csvText itself is correct;
>>
>> As I'm thinking of using an output range I believe it has the job of
>> allowing a user specified buffer or not. I'm not sure if the interface
>> for csvText or RecordList should allow for custom buffers.
> 
> An output range is not a buffer. A buffer requires a clear method and a
> data retrieval method. All output ranges provide is put.

Apart from the exceptions it throws, the tokenizer only needs an output 
range of char.

csvText is buffered by using Appender internally. However, by default it 
copies the data out of the buffer, which is safer.
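
A minimal sketch of that copy-by-default idea, assuming one Appender 
reused for every field. collectCopies is a hypothetical helper written 
for this message, not the code inside csvText.

```d
import std.array : Appender;

// Hypothetical helper: pushes each field through one shared buffer and
// returns an independent copy of every field.
string[] collectCopies(string[] fields)
{
    Appender!(char[]) buffer;
    string[] result;
    foreach (field; fields)
    {
        buffer.clear();             // reset length, keep allocated memory
        buffer.put(field);
        result ~= buffer.data.idup; // copy out, so later puts cannot alias it
    }
    return result;
}
```

Storing buffer.data directly instead of the idup copy would avoid the 
allocation, but every stored slice would then refer to memory that the 
next field is allowed to overwrite.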
 
>>> csvByRecord or csvText!(T).byRecord would be more appropriate.
>>
>> You always iterate CSV by record, what other option is there? If you
>> desire an array for the record:
>>
>> auto records = csvText(str);
>>
>> foreach(record; records)
>>     auto arr = array(record); // ...
> 
> Well, some of your example code included
> csvText!MyStruct(str,["my","headers"]); or possibly just
> csvText!MyStruct(str), which would provide either a forward range
> of MyStructs or an array of MyStruct. In either case, duplication of
> strings where appropriate should occur. Also, remember that records
> might not be manually processed; the user might forward it to
> map,filter,etc and therefore should be safe by default. A user should
> have to manually state that they are going to want a lazy buffered
> version.

When the input is a string the fields are duplicated, but when it is a 
char[] they are not. I believe this gives access to both safe and 
memory-efficient iteration. But it is confusing and probably doesn't need 
to be provided.

> Why do you need to do any modification? Part of the advantage of csv is
> that it's just text: you don't have to deal with escape characters, etc.

This is wrong. CSV does have escape characters, or more precisely an 
escaped quote.
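
As an illustration, assuming the common convention (the one in RFC 4180) 
that a quote inside a quoted field is written as two quotes, the stored 
text differs from the field's value. unquoteField is a hypothetical 
helper for this message, not a function from the module.

```d
import std.array : replace;

// Hypothetical helper: strips the enclosing quotes from a quoted CSV
// field and collapses each doubled quote into a single quote.
string unquoteField(string field)
{
    if (field.length >= 2 && field[0] == '"' && field[$ - 1] == '"')
        return field[1 .. $ - 1].replace(`""`, `"`);
    return field; // unquoted fields are used as-is
}
```

So a field containing an escaped quote cannot simply be a slice of the 
input; the doubled quote has to be removed somewhere.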

> By tuple I meant the tuple(5,7) kind, not the TypeTuple!(int,char) kind.
> Basically,
> 
> auto records = csvText!(string,real)(str);
> string[] names   = records._0;
> real[]   grades  = records._1;
> 
> vs
> 
> auto records = csvText!(Tuple!(string,"name",real,"grade"))(str);
> foreach(record; records) {
> 	writeln(record.name," got ",record.grade);
> }

I'm not sure the second example would work with the current 
implementation. But thinking on this part, it might be better to wait for 
the DB API to be worked out and then it could use this module to provide 
such database queries.

