Files and UTF
WebFreak001
d.forum at webfreak.org
Thu Aug 6 07:19:37 UTC 2020
On Wednesday, 5 August 2020 at 17:39:36 UTC, Mike Surette wrote:
> In my efforts to learn D I am writing some code to read files
> in different UTF encodings with the aim of having them end up
> as UTF-8 internally. As a start I have the following code:
>
> import std.stdio;
> import std.file;
>
> void main(string[] args)
> {
>     if (args.length == 2)
>     {
>         if (args[1].exists && args[1].isFile)
>         {
>             auto f = File(args[1]);
>             writeln(args[1]);
>
>             for (auto i = 1; i <= 3; ++i)
>                 write(f.readln);
>         }
>     }
> }
>
> It works well outputting the file name and first three lines of
> the file properly, without any regard to the encoding of the
> file. The exception to this is if the file is UTF-16, with both
> LE and BE encodings, two characters representing the BOM are
> printed.
>
> I assume that write detects the encoding of the string returned
> by readln and prints it correctly rather than readln reading in
> as a consistent encoding. Is this correct?
>
> Is there a way to remove the BOM from the input buffer and
> still know the encoding of the file?
>
> Is there a D idiomatic way to do what I want to do?
>
> Mike
All strings in D are _assumed_ to be UTF-8, so your I/O reading
function needs to check that the data actually is UTF-8.
File/File.readln does not do that, so you are actually getting
UTF-16 bytes in your string, not UTF-8 bytes.
What you are seeing through writeln is not fully correct: if you
only test English characters, there is a null byte (0) next to each
English character byte, and the console doesn't render those.
You can verify this with this simple code:
auto s = File("test.txt").readln;
writefln("%(%02x %)", s.representation);
result: ff fe 68 00 65 00 6c 00 6c 00 6f 00 20 00 77 00 6f 00 72 00 6c 00 64 00 21 00 0d 00 0a
You can see there is a UTF-16 LE BOM (ff fe) in there, followed by
the English characters, each stored as the character byte plus a
null byte.
Basically what you want to do is write a function that converts
the input encoding to UTF-8 so you can use it in D strings. If
you want to get into this yourself, there is std.encoding which
offers you most of the functionality:
https://dlang.org/phobos/std_encoding.html
You can use the `getBOM` function to try to determine the encoding
from the BOM and can then convert from the source encoding to UTF-8
using the `transcode` function or manually using the encoding classes.
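As a rough sketch of the getBOM + transcode route: the helper name
utf16NativeToUTF8 is made up for illustration, and it only handles
UTF-16 data that is already in the CPU's own byte order (no byte
swapping), so treat it as a starting point rather than a complete
solution.

```d
import std.encoding : BOM, getBOM, transcode;

// Hypothetical helper (not part of Phobos): converts BOM-prefixed UTF-16
// bytes in the CPU's native byte order to a UTF-8 string.
string utf16NativeToUTF8(immutable(ubyte)[] bytes)
{
    auto bom = getBOM(bytes);

    version (LittleEndian)
        enum native = BOM.utf16le;
    else
        enum native = BOM.utf16be;

    if (bom.schema != native)
        throw new Exception("expected a native-endian UTF-16 BOM");

    // Drop the BOM itself; only the payload gets converted.
    auto payload = bytes[bom.sequence.length .. $];
    if (payload.length % 2 != 0)
        throw new Exception("UTF-16 data needs an even number of bytes");

    // Reinterpret the byte pairs as UTF-16 code units, then transcode.
    auto units = cast(wstring) payload;
    string result;
    transcode(units, result);
    return result;
}

unittest
{
    version (LittleEndian)
        immutable(ubyte)[] data = [0xff, 0xfe, 0x68, 0x00, 0x69, 0x00];
    else
        immutable(ubyte)[] data = [0xfe, 0xff, 0x00, 0x68, 0x00, 0x69];

    assert(utf16NativeToUTF8(data) == "hi");
}
```

The unittest block can be run with `dmd -unittest -main -run file.d`.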
If you don't have a BOM you need some kind of algorithm to
determine the encoding of your file. If you don't want to do that
and just want to check if it's UTF-8, use the `std.utf :
validate` function which throws if your string is not valid UTF-8
and otherwise does nothing.
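A minimal sketch of that check, wrapped in a helper whose name
(isValidUTF8) is just an assumption for this example:

```d
import std.utf : UTFException, validate;

// Hypothetical helper: true if the bytes form valid UTF-8, false otherwise.
bool isValidUTF8(const(ubyte)[] bytes)
{
    try
    {
        // validate throws a UTFException on the first invalid sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

unittest
{
    assert(isValidUTF8([0x68, 0x69]));              // "hi"
    assert(isValidUTF8([0xc3, 0xa9]));              // "é" as two UTF-8 bytes
    assert(!isValidUTF8([0xff, 0xfe, 0x68, 0x00])); // UTF-16 LE BOM + 'h'
}
```

Note that passing this check only tells you the bytes are well-formed
UTF-8; it does not prove the file was meant to be UTF-8.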
If you only support UTF-8, UTF-16 and possibly UTF-32, you can
use just the std.utf functions after detecting the BOM. They can
also convert lazily without allocating memory (useful if you only
go over your string linearly without going back); the example
here uses toUTF8 for simplicity, which does allocate the result:
import std;

void main() {
    // readln is actually rather unsafe for this! You should use std.file.read
    // or File.rawRead or read byte chunks instead. For chunking you need to
    // adjust the encode API however and probably make a helper struct with
    // a small buffer.
    string s = File("test.txt").readln;
    // need to remove the line terminator before encoding
    // (it's encoded in UTF-8, potentially after UTF-16)
    if (s.length) s = s[0 .. $ - 1];
    string e = encodeUTF8(s.representation);
    writefln("%s\n(%(%02x %))", s, s.representation);
    writefln("%s\n(%(%02x %))", e, e.representation);
}

string encodeUTF8(immutable(ubyte)[] bytes) {
    auto bom = getBOM(bytes);
    bytes = bytes[bom.sequence.length .. $];
    switch (bom.schema) {
        // optionally we could validate, but we just trust because there
        // is a UTF-8 BOM
        case BOM.utf8: return cast(string)bytes;
        case BOM.utf16le: return convertUTF!wchar(bytes, true);
        case BOM.utf16be: return convertUTF!wchar(bytes, false);
        case BOM.utf32le: return convertUTF!dchar(bytes, true);
        case BOM.utf32be: return convertUTF!dchar(bytes, false);
        default:
            // no supported BOM: assume UTF-8, but validate it
            string input = cast(string)bytes;
            validate(input);
            return input;
    }
}

private string convertUTF(T)(scope const(ubyte)[] bytes, bool littleEndian) {
    // T.sizeof expecting 2 or 4 (which divided maps to 0 and 1)
    enum name = ["UTF-16", "UTF-32"][T.sizeof / 4];
    alias Int = AliasSeq!(ushort, uint)[T.sizeof / 4];

    if (bytes.length % T.sizeof != 0)
        throw new Exception("File is " ~ name ~ ", but got "
            ~ bytes.length.to!string ~ " bytes, which is not a multiple of "
            ~ T.sizeof.to!string ~ "!");

    // note: this casts away const so the units can be swapped in place
    scope Int[] units = (cast(Int*) bytes.ptr)[0 .. bytes.length / T.sizeof];

    // swap mismatching endianness
    version (LittleEndian) // CPU is little endian, swap if file is big endian
        bool swap = !littleEndian;
    else // CPU is big endian, swap if file is little endian
        bool swap = littleEndian;

    if (swap) swapAllEndian(units);
    scope wstr = cast(const(T)[]) units;
    auto ret = wstr.toUTF8;

    // because we are operating in-place, we need to revert to keep memory
    // consistent
    // if you don't use the byte data anywhere else, you could omit this
    // (note this could be unsafe though)
    if (swap) swapAllEndian(units);
    return ret;
}

private void swapAllEndian(T)(T[] data) {
    // TODO: could probably optimize this with SIMD instructions
    foreach (i; 0 .. data.length)
        data[i] = swapEndian(data[i]);
}
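The listing above builds its result with toUTF8, which allocates a
new string. If you really only walk over the text once, std.utf.byChar
is one way to get UTF-8 code units lazily instead. A small sketch,
assuming the UTF-16 code units are already in native byte order (BOM
removed, bytes swapped if needed):

```d
import std.algorithm.comparison : equal;
import std.range : walkLength;
import std.utf : byChar;

unittest
{
    // Native-endian UTF-16 code units, e.g. what you have after BOM
    // removal and any byte swapping.
    const(wchar)[] units = "héllo"w;

    // byChar decodes and re-encodes one code point at a time while you
    // iterate; no UTF-8 string is built up front.
    auto lazyUtf8 = units.byChar;

    // 'é' takes two UTF-8 code units, so 5 characters become 6 chars.
    assert(lazyUtf8.walkLength == 6);
    assert(units.byChar.equal("héllo".byChar));
}
```

The trade-off is that you get a range rather than a string, so
anything that needs random access or a real `string` still has to
copy the data at some point.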
Example library if you want to guess the encoding without a BOM:
https://code.dlang.org/packages/libguess-d
API docs used:
https://dlang.org/phobos/std_bitmanip.html#swapEndian <- swapping BE/LE for native encoding
https://dlang.org/phobos/std_encoding.html <- BOM detection, transcoding capabilities for encodings other than UTF
https://dlang.org/phobos/std_utf.html <- low level UTF-8 encoding/decoding, lazy decoding, validation