[phobos] CSVRange: RFC

Andrei Alexandrescu andrei at erdani.com
Sun Jan 30 22:52:17 PST 2011


Without having studied the code closely, I could say that asking for an 
input range with slicing is quite a tall order that virtually restricts 
you to random-access ranges.

An input range only allows you to move one character forward and never 
save your position or go back. A range with slicing in this context 
means that we can confidently calculate how much of the range we need to 
take, and that automatically requires the range to be able to go forward 
and then restart from a previous position.

Regarding overall design and user-level API, it may be reasonable to 
assume that:

1. CSV readers are usually often for reading an entire file through the 
end, so optimizations that are mostly applicable to reading one single 
line are unnecessary. At the same time, optimizations for repeated use 
of empty/front/popFront are likely to be beneficial.

2. An entire line's representation as strings must fit in memory as a 
requirement.

As such, David's implementation that works on a character stream is the 
most general and the theoretical perfect one because one character of 
lookahead is all CSV needs. At the same time, if an implementation 
assuming (1) and (2) above has considerable advantages (speed, 
convenience) then it might trump the theoretically perfect one.

David, LockingTextReader in std.stdio implements character level input 
straight from a file. It does so very slowly by means of getc/ungetc, 
which is the only portable way. I know how to make it faster on Linux 
and OSX and Walter knows how to make it faster on Windows, but we never 
got around to it.


Andrei

On 01/30/2011 02:21 PM, Jesse Phillips wrote:
> In some way your comments apply to my implementation, I'll answer
> assuming the latest design I have.
>
> https://github.com/he-the-great/JPDLibs/tree/separator
>
> And as David mentioned, it probably would be best to go with mine, but
> it too is not ready for inclusion.
>
> On Sun, Jan 30, 2011 at 11:11 AM, Andrei Alexandrescu<andrei at erdani.com>  wrote:
>> Code review:
>>
>> #50 RefRange is really ForceInputRange. What do you need it for? It's
>> unusual to want to reduce the capabilities of a range.
>>
>> #78 isCharRange is incorrect. Correct version:
>>
>> enum bool isCharRange = isInputRange!R&&  isSomeChar!(ElementType!R);
>>
>> #100 Why not struct?
>
> Answering for all of the above as mine doesn't use any of these.
>
> I make use of two ranges and a token function. Most of the work is
> done by csvNextToken and has the most unittests against it. I then
> build a range which iterates each token until the end of the record,
> this is struct Record. On top of that I have a range which iterates
> over each Record, this is RecordList.
>
> I avoid the need to create reference semantics by having both Record
> and csvNextToken take a ref parameter of the Range.
>
>> #102 private Appender!(char[]) _front;
>
> As my result may not be an array, I use contactination. I do not have
> a constraint on this yet...
>
>> #306 This is CsvRange not a CsvFile as it builds on another range (that may
>> or may not be backed up by a file)
>
> As said, mine are named Record and RecordList, and I believe these are
> good names as we aren't dealing with lines or files. The only concern
> I have is that Record has many means, though I believe I am using it
> correctly for CSV row data.
>
> CsvRecord? DataRecord?
>
>> #387 The name is confusing - it's a class with struct in its name.
>
> I believe this is an implementation detail. I use csvText!CastTo(...)
> where CastTo can be a struct. It will be interesting to see what I do
> when I get heading support put in.
>
>> #582 We also need a way to read CSV files into string arrays in case the
>> user just wants to do the parsing and decide on typing later. Seemingly the
>> current design forces choice of type before parsing.
>
> Mine defaults to a slice of the original range, with the exception of
> quoted entries. A helper function to return a Range[][] could be made.
> But I think just leaving standard Range semantics is best.
>
>> Documentation review:
>>
>> * O(1) is a bit inaccurate - memory consumed is proportional to that of one
>> element. What you might have meant is that it does not depend on the number
>> of lines in a file or on the number of CSV elements in a line.
>
> I believe my new method has a better footprint then David's, but it no
> longer operates on an InputRange, but a range with slicing and
> appending. If there is no quoted data it just returns the slice of the
> needed data, otherwise it appends slices of the data to a new Range.
>
> It may be useful to have an implementation which doesn't make use of these.
>
>> * A few artifacts have no examples.
>
> I haven't done much for examples either.
>
>> * The example should compile. getCharRange() does not exist. FWIW your
>> design should work with byLine().
>
> As David said, you really can't do this. You can make it work, but it
> is more trouble then it is worth. I built my design off of the ideas
> found in Splitter.
>
> Feedback and suggestion are welcome, but it really hasn't seen the
> comment or code standard which it needs for easy understanding.
> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos


More information about the phobos mailing list