csvReader & specifying separator problems...

Jon Degenhardt jond at noreply.com
Fri Nov 15 03:25:19 UTC 2019


On Thursday, 14 November 2019 at 12:25:30 UTC, Robert M. Münch 
wrote:
> Just trying a very simple thing and it's pretty hard: "Read a 
> CSV file (raw_data) that has a ; separator so that I can 
> iterate over the lines and access the fields."
>
> 	csv_data = raw_data.byLine.joiner("\n")
>
> From the docs, which I find extremely hard to understand:
>
> auto csvReader(Contents = string, Malformed ErrorLevel = 
> Malformed.throwException, Range, Separator = char)(Range input, 
> Separator delimiter = ',', Separator quote = '"')
>
> So, let's see if I can decipher this, step-by-step, by trying 
> it out:
>
> 	csv_records = csv_data.csvReader();
>
> Would split the CSV data into iterable CSV records using the 
> ',' char as separator, via UFCS syntax. When running this I get:
>
> [...]

Side comment - This code looks like it was taken from the first 
example in the std.csv documentation. To me, that example is 
doing something that is not obvious at first glance and is 
potentially confusing.

In particular, 'byLine' is not reading individual CSV records. 
CSV fields can contain embedded newlines, which are marked by 
CSV quote/escape syntax, and 'byLine' knows nothing about that 
syntax. If there are embedded newlines, 'byLine' will read 
partial records. The .joiner("\n") step puts the newlines back, 
stitching fields and records back together again in the process.
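
As a quick illustration of the partial-record issue, here is an 
untested sketch that uses splitter('\n') as a stand-in for the 
line-at-a-time view 'byLine' produces:

==== byline_partial_records.d ====
import std.algorithm;
import std.stdio;

void main()
{
    // One CSV record whose middle field contains an embedded newline.
    // A line-oriented reader such as byLine sees two partial lines.
    auto raw = "abc,\"LINE 1\nLINE 2\",ghi";
    foreach (line; raw.splitter('\n'))
        writefln("line-at-a-time view: |%s|", line);
}

This prints |abc,"LINE 1| followed by |LINE 2",ghi|, neither of 
which is a complete CSV record.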

The effect of the 'byLine' plus joiner("\n") combination is to 
create an input range of characters representing the entire file, 
using 'byLine' to do buffered reads. This input range is then 
passed to csvReader.
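
For reference, here is roughly what the std.csv documentation 
example boils down to, adapted to the ';' separator from the 
original question. This is an untested sketch; reading from 
stdin and the three-column Tuple layout are just assumptions for 
illustration:

==== csv_by_line.d ====
import std.algorithm;
import std.csv;
import std.stdio;
import std.typecons;

void main()
{
    // byLine strips the newlines; joiner("\n") puts them back, producing
    // one input range of characters covering the whole input.
    auto csvData = stdin.byLine.joiner("\n");

    // The second run-time argument is the 'delimiter' parameter from the
    // csvReader signature quoted above; it defaults to ','.
    foreach (record; csvData.csvReader!(Tuple!(string, string, string))(';'))
    {
        writefln("Field 0: |%s|", record[0]);
        writefln("Field 1: |%s|", record[1]);
        writefln("Field 2: |%s|", record[2]);
    }
}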

The same thing can also be done using 'byChunk' and 'joiner' 
(with no separator). That approach uses a fixed-size buffer and 
does not search for newlines while reading, so it should be 
faster.

An example:

==== csv_by_chunk.d ====
import std.algorithm;
import std.csv;
import std.conv;
import std.stdio;
import std.typecons;
import std.utf;

void main()
{
    // Small buffer used to show it works. Normally would use a larger buffer.
    ubyte[16] buffer;
    auto stdinBytes = stdin.byChunk(buffer).joiner;
    auto stdinDChars = stdinBytes.map!((ubyte b) => cast(char) b).byDchar;

    writefln("--------------");
    foreach (record; stdinDChars.csvReader!(Tuple!(string, string, string)))
    {
        writefln("Field 0: |%s|", record[0]);
        writefln("Field 1: |%s|", record[1]);
        writefln("Field 2: |%s|", record[2]);
        writefln("--------------");
    }
}

Pass it CSV data without embedded newlines:

$ echo $'abc,def,ghi\njkl,mno,pqr' | ./csv_by_chunk
--------------
Field 0: |abc|
Field 1: |def|
Field 2: |ghi|
--------------
Field 0: |jkl|
Field 1: |mno|
Field 2: |pqr|
--------------

Pass it CSV data with embedded newlines:

$ echo $'abc,"LINE 1\nLINE 2",ghi\njkl,mno,pqr' | ./csv_by_chunk
--------------
Field 0: |abc|
Field 1: |LINE 1
LINE 2|
Field 2: |ghi|
--------------
Field 0: |jkl|
Field 1: |mno|
Field 2: |pqr|
--------------

An example like this may avoid the confusion about newlines. 
Unfortunately, the odd-looking conversion from ubyte to 
char/dchar is undesirable in a code example, and I haven't found 
a cleaner way to write it. If there's a nicer way, I'd 
appreciate hearing about it.

--Jon


