[phobos] CSVRange: RFC

David Simcha dsimcha at gmail.com
Sun Jan 30 11:38:22 PST 2011


Thanks for the review.  As I said before, this was a prototype.  I'm 
aware it' a ways from being Phobos quality, but I posted here anyway 
because I wanted to gauge whether there was sufficient interest and 
whether the high-level design was good enough for it to be worth 
cleaning up the details.  Responses to specific points below:

On 1/30/2011 2:11 PM, Andrei Alexandrescu wrote:
> Looks like a good candidate for std.format, but I think it's a ways 
> from getting there.
>
> Code review:
>
> #50 RefRange is really ForceInputRange. What do you need it for? It's 
> unusual to want to reduce the capabilities of a range.

Basically, when CsvLine makes changes to the range, the changes need to 
be visible from CsvFile, necessitating reference semantics.  
Furthermore, the assumption is that this lib will mostly be used with 
input ranges anyhow, so I don't see how that's much of a restriction in 
practice.
>
> #78 isCharRange is incorrect. Correct version:
>
> enum bool isCharRange = isInputRange!R && isSomeChar!(ElementType!R);

Good point.

>
> #100 Why not struct?

Again, because it needs reference semantics so that changes made to it 
(popping, etc.) by the user of the library are visible to the CsvFile 
struct.

>
> #102 private Appender!(char[]) _front;

I thought we were thinking of making Appender a nested struct defined in 
a function.
>
> #305 No comment?

It's type returned by csvFile().  The csvFile() function pretty much 
tells you what you need to know about how to use a CsvFile object.
>
> #306 This is CsvRange not a CsvFile as it builds on another range 
> (that may or may not be backed up by a file)

Good point.

>
> #386 No comment?

Again, it's supposed to be instantiated from the csvStructRange() 
function, so this is where the documentation is.
>
> #387 The name is confusing - it's a class with struct in its name.

Agreed.  I was hoping someone would come up with a better name.  Since I'm
>
> #582 We also need a way to read CSV files into string arrays in case 
> the user just wants to do the parsing and decide on typing later. 
> Seemingly the current design forces choice of type before parsing.

Trivial with the rest of the lib, assuming you have a character range 
called charRange, though it might deserve a convenience function.

string[][] res;
auto csvIter = csvFile(charRange, ',');
foreach(row; csvIter) {
     res ~= array(map!"a.idup"(row));
}

>
> Documentation review:
>
> * No spellchecking (e.g. 'teh')
>
> * Malformatted Wikipedia URL
>
> * No need for copying the license, a URL is sufficient.
>

Good points.

> * O(1) is a bit inaccurate - memory consumed is proportional to that 
> of one element. What you might have meant is that it does not depend 
> on the number of lines in a file or on the number of CSV elements in a 
> line.

Ok, very good point.  That's exactly what I meant.
>
> * A few artifacts have no examples.
>
> * The example should compile. getCharRange() does not exist. FWIW your 
> design should work with byLine().

I would have liked to make it work with byLine(), but the problem is 
that a CSV file can have a newline character inside quotation marks, in 
which case this does not mean "start a new row".  Therefore, the proper 
way to parse a CSV file is character-by-character, not line by line.  
Ideally, std.stdio.File should have a .byChar range.  As far as I can 
tell, it doesn't.  Should I add one?  (I guess the proper way to do it 
would be to use byChunk() to buffer things and then pop off one 
character at a time from this.)

>
> * It's unclear what colHeaders do from the code and the documentation.
I'm not sure how to make it much more clear.  colHeaders selects the 
columns to be read from the CSV file by name.  It's assumed that the 
first row is a "header" and gives the name of each column.  The first 
entry of colHeaders should be what you want read into the first field of 
the struct, etc.

>
>
> Andrei
>
> On 01/29/2011 05:44 PM, David Simcha wrote:
>> I've written a small module for reading CSV and similar delimited files.
>> I've been meaning to do this for a while. Basically, it allows reading a
>> CSV file with O(1) memory usage (i.e. it can be parsed one character at
>> a time) to a range of ranges of cells. Quotes, escaped quotes, etc. are
>> handled properly. I tested it on a nasty CSV file produced by
>> Affymetrix, and it works rather well.
>>
>> CSVRange also allows for iteration over rows as a range of structs. For
>> example, let's say you had a file:
>>
>> Height,Weight,Shoe Size
>> 6.5,210,13
>> ...
>>
>> You could read this file lazily into a range of structs with something
>> like:
>>
>> struct Person
>> {
>> float height;
>> uint weight;
>> uint shoeSize;
>> }
>>
>> auto csvRange = csvFile(someCharacterRange, ',');
>> auto structs = csvStructRange(csvRange, ["Height", "Weight", "Shoe 
>> Size"]);
>>
>> // Iterate lazily through the rows.
>> foreach(s; structs) {
>> // Do stuff.
>> }
>>
>> Note that this still works even if you have tons of columns you don't
>> care about in the file.
>>
>> Code:
>>
>> http://dsource.org/projects/scrapple/browser/trunk/csvRange/csvRange.d
>>
>> Docs:
>>
>> http://cis.jhu.edu/~dsimcha/csvRange.html
>>
>>
>> _______________________________________________
>> phobos mailing list
>> phobos at puremagic.com
>> http://lists.puremagic.com/mailman/listinfo/phobos
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
>



More information about the phobos mailing list