Splitting up large dirty file

Dennis dkorpel at gmail.com
Tue May 15 20:36:21 UTC 2018


I have a file with two problems:
- It's too big to fit in memory (apparently; I thought 1.5 GB 
would fit, but I get an out-of-memory error when using 
std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes in 
the middle of lines)

I want to write a program that splits it up into multiple files, 
with the splits happening every n lines. I keep encountering 
roadblocks though:

- You can't pass Yes.useReplacementDchar to `byLine`, and `byLine` 
(or `readln`) throws an exception upon encountering an invalid 
character.
- decodeFront doesn't work on inputRanges like 
`byChunk(4096).joiner`
- std.algorithm.splitter doesn't work on inputRanges either
- When you convert chunks to arrays, you risk a split landing in 
the middle of a character that spans multiple code units

Is there a simple way to do this?
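
One possible direction, as a minimal sketch rather than a definitive 
answer: since only line boundaries matter for the split, the data 
may not need decoding at all. In UTF-8 the byte 0x0A never occurs 
inside a multi-byte sequence, so counting raw '\n' bytes cannot cut 
a character in half, and invalid sequences or null bytes are copied 
through untouched. The sketch assumes the file is UTF-8 (or another 
encoding in which 0x0A only ever means a newline) and that lines end 
in '\n'. The function name `splitEveryNLines` and the output naming 
scheme (`prefix.0000`, `prefix.0001`, ...) are hypothetical.

```d
import std.format : format;
import std.stdio : File;

/// Copies `inputPath` into numbered files of at most `linesPerFile`
/// lines each, without decoding the contents.
/// Returns the number of output files written.
size_t splitEveryNLines(string inputPath, string prefix, size_t linesPerFile)
{
    assert(linesPerFile > 0);
    auto input = File(inputPath, "rb");
    File output;             // opened lazily, so no empty trailing file
    size_t parts = 0;        // number of output files created so far
    size_t linesInPart = 0;  // newlines written to the current file

    // Appends raw bytes to the current part, opening it if needed.
    void write(const(ubyte)[] data)
    {
        if (data.length == 0)
            return;
        if (!output.isOpen)
            output = File(format("%s.%04d", prefix, parts++), "wb");
        output.rawWrite(data);
    }

    // Only one 64 KiB chunk is held in memory at a time.
    foreach (ubyte[] chunk; input.byChunk(64 * 1024))
    {
        size_t start = 0;    // first byte of the chunk not yet written
        foreach (i, b; chunk)
        {
            if (b != '\n')
                continue;
            if (++linesInPart == linesPerFile)
            {
                // Flush up to and including this newline, then
                // start a new part at the next byte.
                write(chunk[start .. i + 1]);
                output.close();
                start = i + 1;
                linesInPart = 0;
            }
        }
        write(chunk[start .. $]);
    }
    return parts;
}
```

Calling `splitEveryNLines("big.txt", "part", 100_000)` would then 
write `part.0000`, `part.0001`, and so on (the file names here are 
placeholders). Cleaning up the invalid characters could be done 
afterwards, per output file, once each piece fits in memory.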


