Processing a gzipped csv-file line by line

Laeeth Isharc via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Thu May 11 21:51:14 PDT 2017


On Friday, 12 May 2017 at 00:18:47 UTC, H. S. Teoh wrote:
> On Wed, May 10, 2017 at 11:40:08PM +0000, Jesse Phillips via 
> Digitalmars-d-learn wrote: [...]
>> H.S. Teoh mentioned fastcsv, but it requires all the data to be 
>> in memory.
>
> Or you could use std.mmfile.  But if it's decompressed data, 
> then it would still need to be small enough to fit in memory.  
> Well, in theory you *could* use an anonymous mapping for 
> std.mmfile as an OS-backed virtual memory buffer to decompress 
> into, but it's questionable whether that's really worth the 
> effort.
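
For reference, a rough sketch of that anonymous-mapping idea: passing a 
null filename to MmFile (on platforms that support it) gives a 
page-file/swap-backed scratch buffer, so the decompressed data doesn't 
have to live on the GC heap.  The size and usage below are made up for 
illustration:

import std.mmfile;

void main()
{
    enum bufSize = 1UL << 30;  // 1 GiB of virtual address space
    // With a null filename, MmFile creates an anonymous, OS-backed mapping.
    auto mm = new MmFile(null, MmFile.Mode.readWrite, bufSize, null);
    auto buf = cast(ubyte[]) mm[];
    // ... decompress into buf[], then slice rows/fields out of it ...
}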
>
>
>> If you can get the zip to decompress into a range of dchar, 
>> then std.csv will work with it. It is far from the fastest, 
>> though; much of the speed is lost because it supports input 
>> ranges and doesn't specialize on any other range type.
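
Something along those lines can be put together with std.zlib plus 
std.csv (a rough sketch; the file name and chunk size are made up, and 
a complete version would also run whatever dec.flush() returns through 
the same pipeline at the end):

import std.algorithm.iteration : cache, joiner, map;
import std.csv;
import std.stdio : File, write, writeln;
import std.zlib : HeaderFormat, UnCompress;

void main()
{
    auto dec = new UnCompress(HeaderFormat.gzip);
    // Decompress chunk by chunk.  joiner over const(char)[] blocks yields
    // a range of dchar (via autodecoding), which is what std.csv accepts;
    // cache ensures each chunk is decompressed exactly once, and .dup is
    // a defensive copy because byChunk reuses its buffer.
    auto text = File("data.csv.gz").byChunk(64 * 1024)
        .map!(chunk => cast(const(char)[]) dec.uncompress(chunk.dup))
        .cache
        .joiner;
    foreach (record; csvReader!string(text))
    {
        foreach (field; record)
            write(field, '\t');
        writeln();
    }
}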
>
> I actually spent some time today to look into whether fastcsv 
> can possibly be made to work with general input ranges as long 
> as they support slicing... and immediately ran into the 
> infamous autodecoding issue: strings are not random-access 
> ranges because of autodecoding, so it would require either 
> extensive code surgery to make it work, or ugly hacks to bypass 
> autodecoding.  I'm quite tempted to attempt the latter, in 
> fact, but not now since it's getting busier at work and I don't 
> have that much free time to spend on a major refactoring of 
> fastcsv.
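
The underlying issue is easy to demonstrate, and byCodeUnit / 
representation are the usual ways to sidestep it:

import std.range.primitives : isRandomAccessRange;
import std.string : representation;
import std.utf : byCodeUnit;

// Narrow strings autodecode to dchar when used as ranges, so they are
// deliberately excluded from the random-access range concept...
static assert(!isRandomAccessRange!string);
// ...whereas viewing the same data as code units or raw bytes restores
// O(1) indexing and slicing.
static assert(isRandomAccessRange!(typeof("".byCodeUnit)));
static assert(isRandomAccessRange!(typeof("".representation)));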
>
> Alternatively, I could possibly hack together a version of 
> fastcsv that took a range of const(char)[] as input (rather 
> than a single string), so that, in theory, it could handle 
> arbitrarily large input files as long as the caller can provide 
> a range of data blocks, e.g., File.byChunk, or in this 
> particular case, a range of decompressed data blocks from 
> whatever decompressor is used to extract the data.  As long as 
> you consume the individual rows without storing references to 
> them indefinitely (don't try to make an array of the entire 
> dataset), fastcsv's optimizations should still work, since 
> unreferenced blocks will eventually get cleaned up by the GC 
> when memory runs low.
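
On the caller side, the block range itself is easy to produce (the file 
name and chunk size below are just illustrative, and the byte count 
stands in for whatever a chunk-aware parser would do with each block):

import std.algorithm.iteration : map;
import std.stdio : File, writeln;

void main()
{
    // An input range of const(char)[] blocks, which is what such an
    // overload would consume.  Each block is copied out of byChunk's
    // reused buffer, since a zero-copy parser would hand out slices
    // into it.
    auto blocks = File("huge.csv").byChunk(1 << 20)
        .map!(chunk => cast(const(char)[]) chunk.idup);

    size_t total;
    foreach (block; blocks)
        total += block.length;
    writeln(total, " bytes supplied in blocks");
}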
>
>
> T

I hacked your code to work with std.experimental.allocator.  If I 
remember correctly it was a fair bit faster for my use.  Let me know if 
you would like me to tidy it up into a pull request.
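
For anyone curious, the general pattern is simply to replace GC 
allocations with an explicit allocator; a minimal sketch (not the 
actual patch):

import std.experimental.allocator : dispose, makeArray;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // Allocate the working buffer outside the GC heap and free it
    // deterministically instead of relying on a collection.
    char[] buf = Mallocator.instance.makeArray!char(64 * 1024);
    scope (exit) Mallocator.instance.dispose(buf);
    // ... fill and parse buf ...
}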

Thanks for the library.

Also - sent you an email.  Not sure if you got it.


Laeeth



