Splitting up large dirty file

Jon Degenhardt jond at noreply.com
Wed May 16 02:47:50 UTC 2018


On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb 
> would fit but I get an out of memory error when using 
> std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes 
> in the middle of lines)
>
> I want to write a program that splits it up into multiple 
> files, with the splits happening every n lines. I keep 
> encountering roadblocks though:
>
> - You can't give Yes.useReplacementChar to `byLine` and 
> `byLine` (or `readln`) throws an Exception upon encountering an 
> invalid character.

Can you show the program that throws when using byLine? I tried a 
very simple program that reads its input and writes it back out 
line by line, then fed it a file containing invalid UTF-8. I did 
not see an exception. The invalid UTF-8 was created by taking 
part of this file: 
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a 
commonly used file with UTF-8 edge cases) and adding a number of 
random byte values, including null.

The program I used:

int main(string[] args)
{
     import std.stdio;

     try
     {
         // Read from stdin when no file is given or the argument is "-".
         auto inputStream = (args.length < 2 || args[1] == "-")
             ? stdin : args[1].File;

         // Copy each line to stdout, keeping its line terminator.
         foreach (line; inputStream.byLine(KeepTerminator.yes))
             write(line);
     }
     catch (Exception e)
     {
         stderr.writefln("Error [%s]: %s", args[0], e.msg);
         return 1;
     }
     return 0;
}
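
For the splitting itself, here is a minimal sketch along the same 
lines, assuming byLine does not throw on your input either. The 
function name splitByLines, the prefix parameter and the numbered 
output file names are made up for illustration. It uses rawWrite 
so that each line's bytes are copied through without any decoding.

```d
import std.exception : enforce;
import std.format : format;
import std.stdio;

/// Copies `input` into numbered files holding at most `linesPerFile`
/// lines each. Returns the number of files written.
/// (Hypothetical helper; the naming scheme is an assumption.)
size_t splitByLines(File input, string prefix, size_t linesPerFile)
{
    enforce(linesPerFile > 0, "linesPerFile must be positive");

    size_t fileCount = 0;
    size_t linesInCurrent = 0;
    File output;

    foreach (line; input.byLine(KeepTerminator.yes))
    {
        // Start a new output file for the first line and whenever
        // the current file is full.
        if (fileCount == 0 || linesInCurrent == linesPerFile)
        {
            output = File(format("%s%04d.txt", prefix, fileCount), "wb");
            ++fileCount;
            linesInCurrent = 0;
        }

        // rawWrite copies the bytes as-is, including invalid UTF-8
        // sequences and null bytes.
        output.rawWrite(line);
        ++linesInCurrent;
    }
    return fileCount;
}
```

Called as splitByLines(File("big.txt", "rb"), "part_", 100_000), 
this would write part_0000.txt, part_0001.txt, and so on. The 
input name and line count there are only examples. Since only one 
line is held at a time, memory use should stay small regardless 
of the input size.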




