Splitting up large dirty file

Neia Neutuladh neia at ikeran.org
Thu May 17 21:40:56 UTC 2018


On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb 
> would fit but I get an out of memory error when using 
> std.file.read)

Memory mapping should work. That's in core.sys.posix.sys.mman for 
Posix systems, and the Windows equivalent is 
CreateFileMapping/MapViewOfFile. (But nobody uses Windows, 
right?) You rarely need either API directly, though: Phobos wraps 
both behind std.mmfile.MmFile.
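A minimal sketch of the MmFile route — a small throwaway file stands in here for the real 1.5 GB one, and the filename is made up for the demo:

```d
// Sketch: map the file instead of reading it; pages are faulted in on
// demand, so the file size doesn't matter on a 64-bit system.
// std.mmfile uses mmap on Posix and MapViewOfFile on Windows.
import std.file : write, remove;
import std.mmfile : MmFile;

void main()
{
    write("demo.bin", "big\ndirty\nfile\n"); // stand-in for the real file
    scope (exit) remove("demo.bin");

    auto mmf = new MmFile("demo.bin");      // read-only mapping
    auto data = cast(ubyte[]) mmf[];        // whole file, no copy made

    assert(data.length == 15);
    assert(data[3] == '\n');
}
```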

> - It is dirty (contains invalid Unicode characters, null bytes 
> in the middle of lines)

std.algorithm generally works with ranges of anything, not just 
strings. So memory-map the file, cast the mapping to ubyte[], and 
deal with it that way. A ubyte[] also side-steps the 
auto-decoding that string ranges do, so invalid UTF-8 and 
embedded null bytes can't throw a UTFException.
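For instance, splitting dirty bytes into lines works fine on a ubyte[] (the data here is a made-up sample with a stray 0xFF and an embedded null):

```d
// Sketch: with the data as ubyte[] rather than a string, Phobos range
// algorithms operate byte-by-byte and never try to decode UTF-8.
import std.algorithm : count, splitter;

void main()
{
    // The 0xFF byte and the '\0' would trip up string decoding,
    // but as raw bytes they are just elements.
    auto data = cast(ubyte[]) "a\n\xFF\0b\nc".dup;

    auto lines = data.splitter(cast(ubyte) '\n');
    assert(lines.count == 3); // "a", "\xFF\0b", "c"
}
```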

> - When you convert chunks to arrays, you have the risk of a 
> split being in the middle of a character with multiple code 
> units

It's straightforward to scan for the start of a UTF-8 character: 
you just skip past bytes where the highest bit is set and the 
next-highest is not (0b10xx_xxxx, the continuation bytes). A byte 
from 0b1100_0000 through 0b1111_0111 is the start of a multibyte 
character; 0b0000_0000 through 0b0111_1111 is a character by 
itself.

That said, you seem to only need to split on a newline character, 
so you might be able to ignore this entirely, even if you go by 
chunks: every byte of a multibyte character has its high bit set, 
so the byte 0x0A can never appear inside one, and splitting on 
'\n' is always safe.
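If you do go by chunks, only the bytes after the last '\n' of each chunk need to survive into the next one. A sketch with File.byChunk and a carry buffer — the filename and the tiny chunk size are just for the demo, to force splits mid-line:

```d
// Sketch: chunked line splitting with a carry buffer for the
// incomplete tail of each chunk.
import std.file : remove, write;
import std.stdio : File;

void main()
{
    write("demo.txt", "first\nsecond\nlast"); // note: no trailing '\n'
    scope (exit) remove("demo.txt");

    ubyte[] carry;        // partial line left over from the previous chunk
    size_t lines;
    auto file = File("demo.txt", "rb");
    foreach (ubyte[] chunk; file.byChunk(4)) // absurdly small on purpose
    {
        size_t start = 0;
        foreach (i, b; chunk)
            if (b == '\n')
            {
                auto line = carry ~ chunk[start .. i]; // one complete raw line
                carry = null;
                start = i + 1;
                ++lines;  // ...process `line` here...
            }
        carry ~= chunk[start .. $];          // keep the incomplete tail
    }
    if (carry.length)     // the file may not end with a newline
        ++lines;
    assert(lines == 3);
}
```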


More information about the Digitalmars-d-learn mailing list