Splitting up large dirty file
Dennis
dkorpel at gmail.com
Wed May 16 08:57:10 UTC 2018
On Wednesday, 16 May 2018 at 08:20:06 UTC, drug wrote:
> What is the purpose of `.drop(4)`? I'm pretty sure this is the
> reason of the exception.
The file in question is a .json database dump with an array
"rows" of 10 million 8-line objects. The newlines in the string
fields are escaped, but the fields still contain other invalid
characters, which makes std.json reject the file.
The first 4 lines of the file are basically a "header" and the
last 2 lines are a closing ] and }, so I want to split at lines
4 + 8*(10_000_000/amountOfFiles)*n (for n = 1, 2, ...) and also
remove the trailing comma, add brackets, drop the last 2 lines,
etc.
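To spell out the line counts, here is a quick sketch of the
arithmetic. It assumes the dump has exactly 4 header lines,
10 million objects of 8 lines each and 2 closing lines; the
number of output files (10) is only an assumed example value.

```shell
# Line counts taken from the description of the dump.
header=4
footer=2
objects=10000000
lines_per_object=8
files=10   # assumed example value for amountOfFiles

# Total lines in the file, lines per output file, and the number
# of lines left once the header is skipped.
total=$((header + objects * lines_per_object + footer))
per_file=$((objects / files * lines_per_object))
after_header=$((total - header))

echo "total=$total per_file=$per_file after_header=$after_header"
# prints: total=80000006 per_file=8000000 after_header=80000002
```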
I thought it wouldn't be hard to crudely split this file using
D's range functions and basic string manipulation, but the
combination of being too large for a string and having invalid
encoding seems to defeat most simple solutions. For now I decided
to use Git Bash and do:
tail -n80000002 inputfile.json | split -l 8000000 - outputfile
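For anyone who wants to try the same idea at a small scale, here
is a rough sketch on a made-up sample file. The field names and
the file names (sample.json, part_) are hypothetical; only the
line layout (4 header lines, 8-line objects, 2 closing lines)
follows the real dump. It uses tail -n +5 to skip the header
without knowing the total line count, and adds one possible
cleanup step (wrapping each chunk in brackets and removing the
trailing comma) using sed. This is an illustration, not
necessarily the best way to do it.

```shell
#!/bin/sh
set -e
export LC_ALL=C          # treat the input as plain bytes, so invalid encoding is less likely to upset sed
workdir=$(mktemp -d)     # scratch directory for the generated files
cd "$workdir"

# Build a tiny stand-in for the dump: 4 header lines,
# 4 objects of 8 lines each, and 2 closing lines (38 lines total).
{
  printf '{\n"total_rows": 4,\n"offset": 0,\n"rows": [\n'
  for i in 1 2 3 4; do
    if [ "$i" -lt 4 ]; then sep=','; else sep=''; fi
    printf '{\n"id": %s,\n"a": "x",\n"b": "x",\n"c": "x",\n"d": "x",\n"e": "x"\n}%s\n' "$i" "$sep"
  done
  printf ']\n}\n'
} > sample.json

# Skip the 4 header lines (start at line 5) and cut a new file
# every 2 objects, i.e. every 16 lines.
tail -n +5 sample.json | split -l 16 - part_

# part_aa and part_ab hold the objects; part_ac only holds the
# closing "]" and "}" and can be thrown away.
for f in part_aa part_ab; do
  # Wrap the chunk in brackets and strip the comma (if any)
  # at the end of its last line.
  { echo '['; sed '$ s/,$//' "$f"; echo ']'; } > "$f.json"
done
```

With the real file, the 16 would become 8000000 and the loop
would run over every chunk except the last 2-line one.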
And now I have files that do fit in memory. I'm still interested
in complete D solutions though. Thanks for the iopipe and
memory-mapped file suggestions, Steven and Jonathan. I will check
those out.