Splitting up large dirty file
Jon Degenhardt
jond at noreply.com
Wed May 16 02:47:50 UTC 2018
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb
> would fit but I get an out of memory error when using
> std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes
> in the middle of lines)
>
> I want to write a program that splits it up into multiple
> files, with the splits happening every n lines. I keep
> encountering roadblocks though:
>
> - You can't give Yes.useReplacementChar to `byLine` and
> `byLine` (or `readln`) throws an Exception upon encountering an
> invalid character.
Can you show the program that throws when using byLine? I tried a
very simple program that reads its input and writes it back out
line by line, then fed it a file containing invalid UTF-8. I did
not see an exception. I created the invalid UTF-8 by taking part
of this file:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a
commonly used file of UTF-8 edge cases) and adding a number of
random bytes, including null.
The program I used:
int main(string[] args)
{
    import std.stdio;
    import std.conv : to;
    try
    {
        auto inputStream = (args.length < 2 || args[1] == "-") ?
            stdin : args[1].File;
        foreach (line; inputStream.byLine(KeepTerminator.yes))
            write(line);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
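
To make the test repeatable without my exact input file, here is a
rough sketch of a self-contained check. It is not the file I used:
the byte values, the file name "dirty_sample.txt", and the helper
names writeSample and countLines are all made up for illustration.
It writes a few lines containing invalid UTF-8 and a null byte, then
reads them back with byLine. If byLine passes the bytes through as I
observed, the loop should finish without throwing.

```d
import std.stdio;

// Hypothetical sample data: four newline-terminated lines, three of them "dirty".
immutable ubyte[] dirtyBytes = [
    0x6F, 0x6B, 0x0A,        // "ok\n": plain ASCII
    0xC3, 0x28, 0x0A,        // 0xC3 opens a 2-byte sequence, 0x28 is not a continuation byte
    0x61, 0x00, 0x62, 0x0A,  // null byte in the middle of a line
    0xFF, 0xFE, 0x0A,        // 0xFF and 0xFE never occur in valid UTF-8
];

// Writes the sample bytes to `path` unchanged.
void writeSample(string path)
{
    auto output = File(path, "wb");
    output.rawWrite(dirtyBytes);
    output.close();
}

// Counts the lines byLine yields for the file at `path`.
size_t countLines(string path)
{
    size_t count;
    auto input = File(path, "rb");
    foreach (line; input.byLine(KeepTerminator.yes))
        ++count;
    return count;
}

int main(string[] args)
{
    enum path = "dirty_sample.txt"; // hypothetical file name
    try
    {
        writeSample(path);
        writefln("byLine returned %s lines", countLines(path));
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
```

If this throws on your system, the exception message along with the
compiler version and platform would help narrow down where the
behavior differs.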