Splitting up large dirty file

Steven Schveighoffer schveiguy at yahoo.com
Tue May 15 21:19:16 UTC 2018


On 5/15/18 4:36 PM, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 GB would fit 
> but I get an out of memory error when using std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes in the 
> middle of lines)
> 
> I want to write a program that splits it up into multiple files, with 
> the splits happening every n lines. I keep encountering roadblocks though:
> 
> - You can't give Yes.useReplacementChar to `byLine` and `byLine` (or 
> `readln`) throws an Exception upon encountering an invalid character.
> - decodeFront doesn't work on inputRanges like `byChunk(4096).joiner`
> - std.algorithm.splitter doesn't work on inputRanges either
> - When you convert chunks to arrays, you have the risk of a split being 
> in the middle of a character with multiple code units
> 
> Is there a simple way to do this?
> 

Using iopipe, you can split on N lines (iopipe doesn't autodecode when 
searching for newlines), or split on a pre-determined chunk size (and 
ensure you don't split a code point).

Splitting on N lines:

import iopipe.bufpipe;
import iopipe.textpipe;

auto infile = openDev("filename").bufd.assumeText.byLine;

foreach(i; 0 .. N) infile.extend(0); // ensure N lines in the buffer

Splitting on pre-determined chunk size:

auto infile = openDev("filename")
     .bufd!(ubyte, chunkSize) // use chunkSize as minimum read size
     .assumeText // it's text, not ubyte
     .ensureDecodeable; // do not end in the middle of a codepoint

The output isn't as straightforward. Ideally you would simply create an 
output pipe that splits into multiple files, and process the whole thing 
at once. I haven't created such a thing yet, though (I will add an 
enhancement request to do so).

The easiest thing to do is to write the entire window of the input pipe 
into an output pipe, or cast it back to ubyte[] and write it directly to 
an output device.

e.g.:

auto infile = ... // one of the above ideas
    .encodeText; // convert to ubyte

auto outfile = openDev("outputFilename1", "w");
outfile.write(infile.window);
outfile.close;
infile.release(infile.window.length); // discard the data just written
... // refill the buffer using the chosen technique above.
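
For comparison, here is a minimal Phobos-only sketch of the N-lines split 
that does not use iopipe at all. It reads raw bytes with File.byChunk, 
counts 0x0A bytes, and never decodes, so invalid UTF-8 and null bytes are 
copied through untouched. Because 0x0A cannot occur inside a multi-byte 
UTF-8 sequence, cutting right after it does not split a code point. The 
function name splitEveryNLines, the numbered output file naming, and the 
64 KiB chunk size are assumptions made for illustration; none of them 
come from iopipe or from the original question.

```d
import std.conv : to;
import std.stdio : File;

/// Hypothetical helper (not part of Phobos or iopipe): copies `inputPath`
/// into files named outputPrefix ~ "0", outputPrefix ~ "1", ... holding at
/// most `linesPerFile` lines each. Returns the number of files written.
size_t splitEveryNLines(string inputPath, string outputPrefix,
                        size_t linesPerFile)
{
    assert(linesPerFile > 0, "linesPerFile must be positive");

    auto input = File(inputPath, "rb");
    File output;             // opened lazily, so no empty file is created
    size_t fileCount = 0;    // output files opened so far
    size_t linesInFile = 0;  // complete lines seen for the current file

    // Append bytes to the current output file, opening the next one if needed.
    void writeBytes(const(ubyte)[] bytes)
    {
        if (bytes.length == 0) return;
        if (!output.isOpen)
        {
            output = File(outputPrefix ~ fileCount.to!string, "wb");
            ++fileCount;
        }
        output.rawWrite(bytes);
    }

    // byChunk reuses its buffer, so each chunk is fully written before
    // the next one is read. 64 KiB is an arbitrary choice.
    foreach (ubyte[] chunk; input.byChunk(64 * 1024))
    {
        size_t start = 0;    // first byte of this chunk not yet written
        foreach (i, b; chunk)
        {
            if (b != 0x0A) continue;               // not a line feed
            if (++linesInFile < linesPerFile) continue;

            // The current file is full: write up to and including the
            // line feed, then close it so the next write opens a new file.
            writeBytes(chunk[start .. i + 1]);
            output.close();
            start = i + 1;
            linesInFile = 0;
        }
        // Whatever is left belongs to a file that is not yet full.
        writeBytes(chunk[start .. $]);
    }

    if (output.isOpen) output.close();
    return fileCount;
}
```

A call such as splitEveryNLines("filename", "outputFilename", 100_000) 
would then write outputFilename0, outputFilename1, and so on. Memory use 
stays at one chunk regardless of the input size, at the cost of doing the 
line counting by hand instead of through a pipe.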

-Steve

