Splitting up large dirty file

Jonathan M Davis newsgroup.d at jmdavisprog.com
Tue May 15 21:29:54 UTC 2018


On Tuesday, May 15, 2018 20:36:21 Dennis via Digitalmars-d-learn wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently; I thought 1.5 GB
> would fit, but I get an out-of-memory error when using
> std.file.read)
> - It is dirty (contains invalid Unicode characters and null bytes
> in the middle of lines)
>
> I want to write a program that splits it up into multiple files,
> with the splits happening every n lines. I keep encountering
> roadblocks though:
>
> - You can't pass Yes.useReplacementChar to `byLine`, and `byLine`
> (or `readln`) throws an Exception upon encountering an invalid
> character.
> - decodeFront doesn't work on inputRanges like
> `byChunk(4096).joiner`
> - std.algorithm.splitter doesn't work on inputRanges either
> - When you convert chunks to arrays, you risk a split landing in
> the middle of a character made of multiple code units
>
> Is there a simple way to do this?

If you're on a *nix system, and you're simply looking for a way to split
files rather than necessarily wanting to write one yourself, I'd suggest
trying the split utility:

https://linux.die.net/man/1/split
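For instance, splitting every n lines is just `split -l n`. A minimal sketch (the file and prefix names here are placeholders, not from the original post):

```shell
# Make a small sample file, then split it every 2 lines.
# split names its outputs part_aa, part_ab, part_ac, ...
printf 'one\ntwo\nthree\nfour\nfive\n' > sample.txt
split -l 2 sample.txt part_
# part_aa and part_ab get 2 lines each; part_ac gets the last line.
wc -l part_*
```

Because split copies raw bytes, invalid Unicode and embedded null bytes pass through untouched.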

If I had to write it in D, I'd probably just use std.mmfile and operate on
the file as a dynamic array of ubytes. If all you care about is '\n', that
can easily be searched for without any decoding, and using mmap avoids
having to chunk anything.
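A rough sketch of that approach (the function and output names here are made up for illustration; `std.mmfile.MmFile` and `std.algorithm.iteration.splitter` are real Phobos facilities):

```d
import std.algorithm.iteration : splitter;
import std.conv : to;
import std.mmfile : MmFile;
import std.stdio : File;

// Split `path` into files of `linesPerFile` lines each, treating the
// contents as raw bytes: no UTF decoding, so dirty data is harmless.
// Caveat: if the input ends in '\n', splitter yields a final empty
// token, which this sketch writes out as one extra blank line.
void splitByLines(string path, size_t linesPerFile)
{
    auto mm = new MmFile(path);      // memory-mapped; pages in lazily
    auto data = cast(ubyte[]) mm[];  // whole file as a ubyte slice

    size_t part, count;
    File sink;
    foreach (line; data.splitter(cast(ubyte) '\n'))
    {
        if (count++ % linesPerFile == 0)
            sink = File("part_" ~ to!string(part++), "wb");
        sink.rawWrite(line);
        sink.rawWrite("\n");
    }
}

void main()
{
    // Demo with a small file; the real input would be the 1.5 GB file.
    import std.file : write;
    write("demo.txt", "one\ntwo\nthree\nfour\nfive");
    splitByLines("demo.txt", 2);
}
```

Since everything stays as ubyte, invalid sequences and embedded nulls are copied through verbatim instead of triggering a UTFException.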

- Jonathan M Davis


