Splitting up large dirty file
Neia Neutuladh
neia at ikeran.org
Thu May 17 21:40:56 UTC 2018
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb
> would fit but I get an out of memory error when using
> std.file.read)
Memory mapping should work. On Posix systems that's mmap in
core.sys.posix.sys.mman, and Windows has an equivalent
(CreateFileMapping / MapViewOfFile). Phobos wraps both portably
in std.mmfile.MmFile. (But nobody uses Windows, right?)
> - It is dirty (contains invalid Unicode characters, null bytes
> in the middle of lines)
std.algorithm should generally work with sequences of anything,
not just strings. So memory map, cast to ubyte[], and deal with
it that way?
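Something like this minimal sketch, using Phobos's std.mmfile wrapper
(the filename "big.log" is just a placeholder):

```d
import std.algorithm : count;
import std.mmfile : MmFile;
import std.stdio : writeln;

void main()
{
    // Map the whole file; the OS pages it in lazily, so a 1.5 GB
    // file never has to be resident in memory all at once.
    auto mmf = new MmFile("big.log");   // placeholder filename
    auto data = cast(ubyte[]) mmf[];    // the whole file as one byte slice

    // std.algorithm works on ubyte[] just as well as on strings,
    // and never complains about invalid UTF-8 or stray NUL bytes.
    writeln("newlines: ", data.count('\n'));
}
```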
> - When you convert chunks to arrays, you have the risk of a
> split being in the middle of a character with multiple code
> units
It's straightforward to scan for the start of a UTF-8 sequence:
skip past continuation bytes, i.e. bytes where the highest bit is
set and the next-highest is not (0b10xx_xxxx). Everything else
begins a sequence: 0b0xxx_xxxx is a single-byte (ASCII)
character, and 0b110x_xxxx, 0b1110_xxxx and 0b1111_0xxx start
two-, three- and four-byte sequences respectively.
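That scan is a couple of lines of D (the function name nextBoundary
is mine, not anything from Phobos):

```d
// Advance i to the next UTF-8 sequence boundary at or after i.
// Continuation bytes look like 0b10xx_xxxx: top bit set,
// next-highest clear. Anything else starts a new sequence.
size_t nextBoundary(const(ubyte)[] data, size_t i)
{
    while (i < data.length && (data[i] & 0b1100_0000) == 0b1000_0000)
        ++i;
    return i;
}
```

Call it on each chunk start and you never cut a code point in half.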
That said, you seem to need to split only on a newline character,
so you may be able to ignore this entirely, even if you go by
chunks.
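Splitting on '\n' is safe at the byte level: 0x0A can never appear
inside a multi-byte UTF-8 sequence, since lead and continuation
bytes are all >= 0x80. A small sketch with made-up dirty data:

```d
import std.algorithm.iteration : splitter;
import std.stdio : writeln;

void main()
{
    // Dirty input: a stray 0xFF and an embedded NUL, plus newlines.
    ubyte[] data = ['o', 'k', 0xFF, '\n', 'n', 0x00, 'o', '\n'];

    // Each `line` is a zero-copy ubyte[] slice; nothing is decoded,
    // so invalid UTF-8 and NUL bytes pass through untouched.
    foreach (line; data.splitter(cast(ubyte) '\n'))
        writeln(line);
}
```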
More information about the Digitalmars-d-learn
mailing list