Files and UTF

Thu Aug 6 07:19:37 UTC 2020

On Wednesday, 5 August 2020 at 17:39:36 UTC, Mike Surette wrote:
> In my efforts to learn D I am writing some code to read files 
> in different UTF encodings with the aim of having them end up 
> as UTF-8 internally. As a start I have the following code:
>
> import std.stdio;
> import std.file;
>
> void main(string[] args)
> {
>     if (args.length == 2)
>     {
>         if (args[1].exists && args[1].isFile)
>         {
>             auto f = File(args[1]);
>             writeln(args[1]);
>
>             for (auto i = 1; i <= 3; ++i)
>                 write(f.readln);
>         }
>     }
> }
>
> It works well outputting the file name and first three lines of 
> the file properly, without any regard to the encoding of the 
> file. The exception to this is if the file is UTF-16, with both 
> LE and BE encodings, two characters representing the BOM are 
> printed.
>
> I assume that write detects the encoding of the string returned 
> by readln and prints it correctly rather than readln reading in 
> as a consistent encoding. Is this correct?
>
> Is there a way to remove the BOM from the input buffer and 
> still know the encoding of the file?
>
> Is there a D idiomatic way to do what I want to do?
>
> Mike

All strings in D are _assumed_ to be UTF-8, so the code that reads 
your input has to make sure the data actually is UTF-8. 
File.readln does not do that, so for a UTF-16 file you end up with 
UTF-16 bytes in your string, not UTF-8 bytes.

What you are seeing through writeln is not actually correct: if you 
only test English characters, each character byte is accompanied by 
a null byte (0), which the console doesn't render.

You can verify this with this simple code:
import std.stdio, std.string; // File, writefln, representation
auto s = File("test.txt").readln;
writefln("%(%02x %)", s.representation);

result: ff fe 68 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00 0d 00 0a

You can see the UTF-16 LE BOM (ff fe) at the start, followed by the 
English characters, each stored as the character byte plus a null byte.
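If you want to convince yourself that the dump really is UTF-16 LE, you can go the other way and produce the bytes from a normal D string. This is only a sketch; `toUtf16leWithBom` is a name I made up for illustration, not something from Phobos.

```d
import std.bitmanip : nativeToLittleEndian;
import std.format : format;
import std.stdio : writeln;

/// Hypothetical helper: encodes text as UTF-16 LE bytes with a leading BOM.
ubyte[] toUtf16leWithBom(string text)
{
    ubyte[] result = [0xff, 0xfe]; // UTF-16 LE byte order mark
    // Iterating a string with a wchar loop variable yields UTF-16 code units.
    foreach (wchar unit; text)
    {
        ubyte[2] pair = nativeToLittleEndian(cast(ushort) unit);
        result ~= pair[];
    }
    return result;
}

void main()
{
    auto bytes = toUtf16leWithBom("hello world!\r\n");
    // Print in the same hex format as the dump above.
    writeln(format("%(%02x %)", bytes));
}
```

The output should match the dump above except for one extra 00 at the end: readln stops right after the 0a byte, so the second half of the UTF-16 line feed is left behind in the file.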

Basically, what you want is a function that converts the input 
encoding to UTF-8 so you can use the result as a D string. If you 
want to get into this yourself, std.encoding offers most of the 
functionality: 
https://dlang.org/phobos/std_encoding.html

You can use the getBOM function to try to determine the encoding 
from the BOM, and then convert from the source encoding to UTF-8 
using the `transcode` function or manually using the encoding classes.
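A minimal sketch of those two std.encoding calls, assuming the file contents are already in memory as bytes (the byte array here is made up for the example). Note that `transcode` works on typed strings such as wstring, so raw bytes still have to be turned into code units with the right endianness first.

```d
import std.encoding : BOM, getBOM, transcode;
import std.stdio : writeln;

void main()
{
    // Hypothetical in-memory input: UTF-16 LE BOM followed by "hi".
    immutable(ubyte)[] bytes = [0xff, 0xfe, 0x68, 0x00, 0x69, 0x00];

    // getBOM reports which BOM (if any) starts the data and how long it is.
    auto bom = getBOM(bytes);
    assert(bom.schema == BOM.utf16le);

    // Strip the BOM so only the encoded text remains.
    auto payload = bytes[bom.sequence.length .. $];
    assert(payload.length == 4);

    // transcode converts between typed strings, here UTF-16 to UTF-8.
    wstring utf16 = "hi"w;
    string utf8;
    transcode(utf16, utf8);
    assert(utf8 == "hi");

    writeln(bom.schema, ": ", utf8);
}
```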

If you don't have a BOM, you need some kind of algorithm to 
determine the encoding of your file. If you don't want to do that 
and just want to check whether it's UTF-8, use the `std.utf : 
validate` function, which throws if your string is not valid UTF-8 
and otherwise does nothing.
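As a small sketch of that check: `isValidUtf8` below is a hypothetical wrapper of my own that turns the exception from `validate` into a bool.

```d
import std.stdio : writeln;
import std.string : representation;
import std.utf : UTFException, validate;

/// Hypothetical helper: true if the bytes form valid UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate throws UTFException on the first invalid sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // Ordinary D string literals are UTF-8.
    assert(isValidUtf8("héllo".representation));

    // UTF-16 LE data: the BOM byte 0xff can never occur in UTF-8.
    immutable(ubyte)[] utf16 = [0xff, 0xfe, 0x68, 0x00];
    assert(!isValidUtf8(utf16));

    writeln("checks passed");
}
```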

If you only support UTF-8, UTF-16 and possibly UTF-32, you can 
get by with just the std.utf functions after determining the BOM. 
std.utf can also convert lazily without allocating memory (useful 
if you only go over your string linearly without going back); the 
example below uses toUTF8 for simplicity, which allocates the result:

import std;

void main() {
	// readln is actually rather unsafe for this! You should use std.file.read
	// or File.rawRead, or read byte chunks instead. For chunking you would need
	// to adjust the encodeUTF8 API, however, and probably make a helper struct
	// with a small buffer.
	string s = File("test.txt").readln;
	// readln stops right after the byte 0x0a, so drop that byte before decoding
	// (with UTF-16 input it is only one half of a code unit)
	if (s.length) s = s[0 .. $ - 1];
	string e = encodeUTF8(s.representation);

	writefln("%s\n(%(%02x %))", s, s.representation);
	writefln("%s\n(%(%02x %))", e, e.representation);
}

string encodeUTF8(immutable(ubyte)[] bytes) {
	auto bom = getBOM(bytes);
	bytes = bytes[bom.sequence.length .. $];

	switch (bom.schema) {
	// optionally we could validate, but we just trust because there is a UTF-8 BOM
	case BOM.utf8: return cast(string)bytes;
	case BOM.utf16le: return convertUTF!wchar(bytes, true);
	case BOM.utf16be: return convertUTF!wchar(bytes, false);
	case BOM.utf32le: return convertUTF!dchar(bytes, true);
	case BOM.utf32be: return convertUTF!dchar(bytes, false);
	default:
		string input = cast(string)bytes;
		validate(input);
		return input;
	}
}

private string convertUTF(T)(scope const(ubyte)[] bytes, bool littleEndian) {
	// T.sizeof expecting 2 or 4 (which divided maps to 0 and 1)
	enum name = ["UTF-16", "UTF-32"][T.sizeof / 4];
	alias Int = AliasSeq!(ushort, uint)[T.sizeof / 4];

	if (bytes.length % T.sizeof != 0)
		throw new Exception("File is " ~ name ~ ", but got "
			~ bytes.length.to!string ~ " bytes, which is not a multiple of "
			~ T.sizeof.to!string ~ "!");

	scope Int[] units = (cast(Int*) bytes.ptr)[0 .. bytes.length / T.sizeof];

	// swap mismatching endianness
	version (LittleEndian) // CPU is little endian, swap if file is big endian
		bool swap = !littleEndian;
	else // CPU is big endian, swap if file is little endian
		bool swap = littleEndian;

	if (swap) swapAllEndian(units);

	scope wstr = cast(const(T)[]) units;
	auto ret = wstr.toUTF8;
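	// because we are operating in-place, we need to revert the swap to keep
	// memory consistent; if you don't use the byte data anywhere else, you
	// could omit this
	// (note: the caller passes immutable data, so mutating it in place is
	// technically undefined behavior; swapping a .dup copy would avoid that)
	if (swap) swapAllEndian(units);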

	return ret;
}

private void swapAllEndian(T)(T[] data) {
	// TODO: could probably optimize this with SIMD instructions
	foreach (i; 0 .. data.length)
		data[i] = swapEndian(data[i]);
}
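For the lazy variant mentioned before the example, here is a rough sketch for UTF-16 only. It assembles the code units from byte pairs on the fly and feeds them through `std.utf.byUTF`, so nothing is swapped in place and no result string is allocated. `lazyUtf16ToUtf8` is a hypothetical name of mine, and it assumes the BOM has already been stripped.

```d
import std.algorithm : equal, map;
import std.exception : enforce;
import std.range : chunks;
import std.stdio : write, writeln;
import std.utf : byUTF;

/// Hypothetical helper: lazily yields UTF-8 code units for UTF-16 bytes
/// (BOM already removed). The input is neither copied nor modified.
auto lazyUtf16ToUtf8(bool littleEndian)(const(ubyte)[] payload)
{
    enforce(payload.length % 2 == 0, "UTF-16 data must have an even length");
    return payload
        .chunks(2)
        // assemble one 16-bit code unit from each byte pair
        .map!(p => cast(wchar)(littleEndian ? (p[0] | (p[1] << 8))
                                            : ((p[0] << 8) | p[1])))
        .byUTF!char;
}

void main()
{
    immutable(ubyte)[] le = [0x68, 0x00, 0x69, 0x00]; // "hi" in UTF-16 LE
    immutable(ubyte)[] be = [0x00, 0x68, 0x00, 0x69]; // "hi" in UTF-16 BE

    // Consume the range one UTF-8 code unit at a time.
    foreach (char c; lazyUtf16ToUtf8!true(le))
        write(c);
    writeln();

    assert(lazyUtf16ToUtf8!false(be).equal("hi"));
}
```

The endianness is a template parameter here so the lambda doesn't need to capture anything; with a runtime bool you would pick the right instantiation in the BOM switch.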

Example library if you want to guess the encoding without a BOM: 
https://code.dlang.org/packages/libguess-d

API docs used:
https://dlang.org/phobos/std_bitmanip.html#swapEndian <- swapping BE/LE for native encoding
https://dlang.org/phobos/std_encoding.html <- BOM detection, transcoding capabilities for encodings other than UTF
https://dlang.org/phobos/std_utf.html <- low level UTF-8 encoding/decoding, lazy decoding, validation