Splitting up large dirty file

Dennis dkorpel at gmail.com
Wed May 16 08:57:10 UTC 2018


On Wednesday, 16 May 2018 at 08:20:06 UTC, drug wrote:
> What is the purpose of `.drop(4)`? I'm pretty sure this is the 
> reason of the exception.

The file in question is a .json database dump with an array 
"rows" of 10 million 8-line objects. The newlines in the string 
fields are escaped, but the fields still contain other invalid 
characters, which makes std.json reject the file.

The first 4 lines of the file are basically a "header" and the last 
2 lines are a closing ] and }, so I want to split at every line 
4 + 8*(10_000_000/amountOfFiles)*n, and also remove the trailing 
comma, add brackets, drop the last 2 lines, etc.

I thought it wouldn't be hard to crudely split this file using 
D's range functions and basic string manipulation, but the 
combination of being too large for a string and having invalid 
encoding seems to defeat most simple solutions. For now I decided 
to use Git Bash and do:
tail -n80000002 inputfile.json | split -l 8000000 - outputfile
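
The line arithmetic behind that command can be checked on a scaled-down stand-in. This is only a sketch: the file name input.json, the part_ prefix, the field names and the sizes (6 objects, 2 per output file) are made up for illustration. The tail count is 8 lines per object plus the 2 closing lines, which skips the 4 header lines; the split size is 8 lines times the number of objects per file.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

objects=6            # stand-in for the 10 million objects
per_file=2           # objects per output file (hypothetical choice)
lines_per_object=8

# Build a small file with the same layout:
# 4 header lines, N objects of 8 lines each, 2 closing lines.
# (Unlike real JSON, every object here ends in a comma, for simplicity.)
{
  printf '{\n"meta": 1,\n"count": %d,\n"rows": [\n' "$objects"
  i=1
  while [ "$i" -le "$objects" ]; do
    printf '{\n"a": %d,\n"b": 0,\n"c": 0,\n"d": 0,\n"e": 0,\n"f": 0\n},\n' "$i"
    i=$((i + 1))
  done
  printf ']\n}\n'
} > input.json

# Keep everything after the header: all object lines plus the 2 closing lines.
keep=$((objects * lines_per_object + 2))
# Each output file gets a whole number of objects.
chunk=$((per_file * lines_per_object))

tail -n "$keep" input.json | split -l "$chunk" - part_

# Show how many lines ended up in each piece.
wc -l part_*
```

With these numbers the result is three 16-line files (part_aa, part_ab, part_ac) and a final 2-line file (part_ad) holding only the closing ] and }. The real command behaves the same way: ten files of 8000000 lines and an eleventh with the two closing lines. Each piece still needs its trailing comma removed and brackets added before it is valid JSON.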

And now I have files that do fit in memory. I'm still interested 
in complete D solutions, though. Thanks for the iopipe and 
memory-mapped file suggestions, Steven and Jonathan. I will check 
those out.

