Using lazy code to process large files

Steven Schveighoffer via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Wed Aug 2 05:45:42 PDT 2017


On 8/2/17 7:44 AM, Martin Drašar via Digitalmars-d-learn wrote:
> Hi,
> 
> I am struggling to use lazy range-based code to process large text
> files. My task is simple, i.e., I could write non-range-based code in a
> really short time, but I wanted to try a different approach and I am
> hitting wall after wall.
> 
> Task: read a csv-like input, take only lines starting with some string,
> split by a comma, remove leading and trailing whitespace from the split
> elements, join by comma again and write to an output.
> 
> My attempt so far:
> 
> alias stringStripLeft = std.string.stripLeft;
> 
> auto input  = File("input.csv");
> auto output = File("output.csv");
> 
> auto result = input.byLine()
>                     .filter!(a => a.startsWith("..."))
>                     .map!(a => a.splitter(","))
>                     .stringStripleft // <-- errors start here
>                     .join(",");
> 
> output.write(result);
> 
> Needless to say, this does not compile. Basically, I don't know how to
> feed MapResults to splitter and then sensibly join it.

The problem is that you are two ranges deep once you apply splitter: the 
result of the map is a range of ranges.

So when you then apply stringStripLeft, you are applying it to the map 
result, not to the splitter result.

What you need is to bury the action on each string into the map:

.map!(a => a.splitter(",").map!(stringStripLeft).join(","))

The inner map is needed because stripLeft doesn't take a range of strings 
(which is what splitter returns); it takes a single string, i.e. a range 
of dchar (which is what each element of the splitter result is). So you 
use map to apply the function to every element.

Disclaimer: I haven't tested whether this works, but I think it should.
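
A minimal sketch of that per-line transformation, run on an in-memory 
string instead of a file. cleanLine is a hypothetical helper name, not 
something from the original code, and strip stands in for stripLeft on the 
assumption that both leading and trailing whitespace should go, as the 
task description says:

```d
import std.algorithm : map, splitter;
import std.array : join;
import std.string : strip;

// Hypothetical helper: trims every comma-separated field of one line.
string cleanLine(string line)
{
    return line.splitter(",")       // lazy range of fields (each a string)
               .map!(f => f.strip)  // strip is applied per field, not per range
               .join(",");          // eager: allocates the resulting string
}

void main()
{
    assert(cleanLine(" a , b,c ") == "a,b,c");
}
```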

Note that I kept your call to join, even though it is not actually lazy: 
it allocates a new array for every line. Use joiner to do it truly lazily.
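
For comparison, a sketch of the same per-line step with joiner in place of 
join. cleanLineLazy is a hypothetical name; the point is only that nothing 
is allocated here, and the result is a lazy range of characters that has 
to be compared with equal rather than ==:

```d
import std.algorithm : equal, joiner, map, splitter;
import std.string : strip;

// Hypothetical helper: like the join version, but returns a lazy range
// instead of building a new string.
auto cleanLineLazy(string line)
{
    return line.splitter(",")
               .map!(f => f.strip)
               .joiner(",");        // lazy: yields characters on demand
}

void main()
{
    // The result is a range, not a string, so use equal to compare.
    assert(cleanLineLazy(" a , b,c ").equal("a,b,c"));
}
```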

I will also note that the result is not going to look like what you 
expect, since writing a range prints it like this: [element, element, 
element, ...]

You could potentially output like this:

output.write(result.joiner("\n"));

Which I think will work. Again, no testing.
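
Putting the pieces together, here is a sketch of the whole pipeline. Some 
assumptions of mine: cleanLines is a hypothetical helper, the prefix "x" 
and the sample data are made up so the example can check itself, and strip 
is used so that both ends of each field are trimmed. Note also that 
File("output.csv") opens the file for reading by default, so the output 
file needs an explicit "w" mode:

```d
import std.algorithm : equal, filter, joiner, map, splitter, startsWith;
import std.file : readText, writeFile = write;
import std.stdio : File;
import std.string : strip;

// Hypothetical helper: keeps lines starting with `prefix`, trims every
// field, and lazily glues everything back together.
auto cleanLines(Lines)(Lines lines, string prefix)
{
    return lines
        .filter!(l => l.startsWith(prefix))
        .map!(l => l.splitter(",").map!(f => f.strip).joiner(","))
        .joiner("\n");              // newline between lines, none at the end
}

void main()
{
    // In-memory check of the pipeline.
    assert(["x, 1 ,2", "y,3", "x,4 , 5"].cleanLines("x").equal("x,1,2\nx,4,5"));

    // The same pipeline on files; the sample input is created here only
    // to keep the sketch self-contained.
    writeFile("input.csv", "x, 1 ,2\ny,3\nx,4 , 5\n");
    {
        auto output = File("output.csv", "w");   // "w": open for writing
        output.write(File("input.csv").byLine().cleanLines("x"));
    }   // output goes out of scope and is closed here
    assert(readText("output.csv") == "x,1,2\nx,4,5");
}
```

Because byLine reuses its buffer, each line has to be fully consumed 
before the next one is read; writing the joined range straight to the 
output does exactly that.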

I wouldn't expect good performance from this, as there is auto-decoding 
all over the place.

-Steve

