Processing a gzipped csv-file line by line
Laeeth Isharc via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Thu May 11 21:51:14 PDT 2017
On Friday, 12 May 2017 at 00:18:47 UTC, H. S. Teoh wrote:
> On Wed, May 10, 2017 at 11:40:08PM +0000, Jesse Phillips via
> Digitalmars-d-learn wrote: [...]
>> H.S. Teoh mentioned fastcsv, but it requires all the data to be
>> in memory.
>
> Or you could use std.mmfile. But if it's decompressed data,
> then it would still need to be small enough to fit in memory.
> Well, in theory you *could* use an anonymous mapping for
> std.mmfile as an OS-backed virtual memory buffer to decompress
> into, but it's questionable whether that's really worth the
> effort.
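For reference, an untested sketch of that anonymous-mapping idea - it leans on std.mmfile's documented behaviour that a null filename gives an OS-backed anonymous mapping, and the buffer size here is just a placeholder:

import std.mmfile : MmFile;

void main()
{
    // Untested sketch: an anonymous, OS-backed mapping used as a scratch
    // buffer to decompress into (MAP_ANON on Posix, pagefile-backed on
    // Windows).  Size it for the decompressed data.
    enum bufSize = 1_000_000_000UL;
    auto mmf = new MmFile(null, MmFile.Mode.readWriteNew, bufSize, null);
    auto buf = cast(ubyte[]) mmf[];   // the whole mapping as one slice
    // ... decompress into buf, then run the CSV parser over it in place ...
}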
>
>
>> If you can get the zip to decompress into a range of dchar,
>> then std.csv will work with it. It is far from the fastest, but
>> much of that speed is lost because it supports input ranges and
>> doesn't specialize on any other range type.
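For the gzipped case, something along these lines should hand std.csv the dchar range it wants - untested, it assumes std.zlib's UnCompress for the streaming decompression, the file name is made up, and a real program would also need to handle decomp.flush() for the tail of the stream:

import std.algorithm.iteration : joiner, map;
import std.csv : csvReader;
import std.stdio : File, writeln;
import std.zlib : HeaderFormat, UnCompress;

void main()
{
    auto decomp = new UnCompress(HeaderFormat.gzip);
    // byChunk reuses its buffer, so dup each chunk before passing it on.
    // joiner flattens the decompressed blocks into one character stream,
    // and autodecoding makes its elements dchar - which std.csv wants.
    auto text = File("data.csv.gz").byChunk(64 * 1024)
        .map!(chunk => cast(const(char)[]) decomp.uncompress(chunk.dup))
        .joiner;
    foreach (record; csvReader!string(text))
        writeln(record);
}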
>
> I actually spent some time today to look into whether fastcsv
> can possibly be made to work with general input ranges as long
> as they support slicing... and immediately ran into the
> infamous autodecoding issue: strings are not random-access
> ranges because of autodecoding, so it would require either
> extensive code surgery to make it work, or ugly hacks to bypass
> autodecoding. I'm quite tempted to attempt the latter, in
> fact, but not now since it's getting busier at work and I don't
> have that much free time to spend on a major refactoring of
> fastcsv.
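For anyone following along, the problem in a nutshell - this is just standard Phobos behaviour, nothing fastcsv-specific:

import std.range.primitives : hasSlicing, isRandomAccessRange;
import std.utf : byCodeUnit;

// Autodecoding: front on a string yields dchar, so a string is neither
// random-access nor sliceable as far as the range primitives go.
static assert(!isRandomAccessRange!string);
static assert(!hasSlicing!string);

// byCodeUnit exposes the raw code units and restores both properties
// (one way to bypass autodecoding).
static assert(isRandomAccessRange!(typeof("".byCodeUnit)));
static assert(hasSlicing!(typeof("".byCodeUnit)));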
>
> Alternatively, I could possibly hack together a version of
> fastcsv that took a range of const(char)[] as input (rather
> than a single string), so that, in theory, it could handle
> arbitrarily large input files as long as the caller can provide
> a range of data blocks, e.g., File.byChunk, or in this
> particular case, a range of decompressed data blocks from
> whatever decompressor is used to extract the data. As long as
> you consume the individual rows without storing references to
> them indefinitely (don't try to make an array of the entire
> dataset), fastcsv's optimizations should still work, since
> unreferenced blocks will eventually get cleaned up by the GC
> when memory runs low.
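The producing side of that is easy to put together already - the same byChunk/UnCompress chain as in the std.csv sketch above, just left as a range of blocks instead of joined into single characters (untested, with a hypothetical helper name, and flush() handling again omitted):

import std.algorithm.iteration : map;
import std.stdio : File;
import std.zlib : HeaderFormat, UnCompress;

// Untested sketch: a lazy range of decompressed const(char)[] blocks -
// the shape of input a block-based fastcsv variant could consume.
auto decompressedBlocks(string path)
{
    auto decomp = new UnCompress(HeaderFormat.gzip);
    return File(path)
        .byChunk(64 * 1024)   // compressed chunks straight off disk
        .map!(chunk => cast(const(char)[]) decomp.uncompress(chunk.dup));
}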
>
>
> T
I hacked your code to work with std.experimental.allocator. If I
remember correctly, it was a fair bit faster for my use case. Let me
know if you would like me to tidy it up into a pull request.
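(Not the actual patch - just to illustrate the general
std.experimental.allocator pattern I mean, e.g. taking a buffer from
Mallocator instead of the GC and freeing it deterministically:)

import std.experimental.allocator : dispose, makeArray;
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    // Illustration only: allocate a field-slice buffer with Mallocator
    // rather than the GC, and release it explicitly when done with it.
    auto fields = Mallocator.instance.makeArray!(const(char)[])(1024);
    scope(exit) Mallocator.instance.dispose(fields);

    // ... hand fields to the parser to fill in ...
}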
Thanks for the library.
Also - sent you an email. Not sure if you got it.
Laeeth