Read non-UTF8 file
Stewart Gordon
smjg_1998 at yahoo.com
Mon Feb 21 08:55:56 PST 2011
What compiler version/platform are you using? I had to fix some errors before it would
compile on mine (1.066/2.051 Windows).
On 19/02/2011 13:42, Nrgyzer wrote:
<snip>
> Now... and with writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the following:
>
> [195, 131, 164]
> [195, 131, 182]
> [195, 131, 188]
It took a while for me to make sense of what's going on!
The expressions (0xC0 | (ch >> 6)) and (0x80 | (ch & 0x3F)) both have type int. It
appears that, in D2, if you append an int to a string then it treats the int as a Unicode
codepoint and automagically converts it to UTF-8. But why is it doing it on the first
byte and not the second? This looks like a bug.
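To show where the extra 131 comes from, here is a minimal sketch (not from the original code) that reproduces the first line of the quoted output. It assumes that appending a dchar to a char[] encodes it as UTF-8, and it uses an explicit dchar cast to stand in for whatever the compiler does with the int. The name reproduceDoubleEncoding is made up for illustration.

```d
import std.stdio;

// Hypothetical helper: rebuilds the bytes seen in the quoted output for 'ä'.
ubyte[] reproduceDoubleEncoding()
{
    char[] result;
    // Appended as the code point U+00C3, which UTF-8 encodes as 0xC3 0x83.
    result ~= cast(dchar) 0xC3;
    // Appended as a single raw code unit, 0xA4.
    result ~= cast(char) 0xA4;
    return cast(ubyte[]) result;
}

void main()
{
    writeln(reproduceDoubleEncoding()); // [195, 131, 164]
}
```

So the lead byte 0xC3 (195) is itself being encoded as a two-byte sequence, while the continuation byte goes in untouched.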
Casting each UTF-8 byte value to a char
    if (ch < 0x80) {
        result ~= cast(char) ch;
    } else {
        result ~= cast(char) (0xC0 | (ch >> 6));
        result ~= cast(char) (0x80 | (ch & 0x3F));
    }
gives the expected output
[195, 164]
[195, 182]
[195, 188]
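For reference, the casts above can be put into a complete function along these lines. This is only a sketch: it assumes the input is ISO-8859-1 (Latin-1), so that every byte is a code point below 0x100 and needs at most two UTF-8 bytes. The name latin1ToUtf8 and its signature are hypothetical and need not match your convertToUTF8.

```d
import std.stdio;

// Hypothetical helper: converts Latin-1 bytes to a UTF-8 string.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (ch; input)
    {
        if (ch < 0x80)
        {
            // ASCII range: one byte, unchanged.
            result ~= cast(char) ch;
        }
        else
        {
            // 0x80..0xFF: lead byte carries the top 2 bits,
            // continuation byte carries the low 6 bits.
            result ~= cast(char) (0xC0 | (ch >> 6));
            result ~= cast(char) (0x80 | (ch & 0x3F));
        }
    }
    return result.idup;
}

void main()
{
    ubyte[] latin1 = [0xE4, 0xF6, 0xFC]; // ä, ö, ü in Latin-1
    writeln(cast(const(ubyte)[]) latin1ToUtf8(latin1));
    // [195, 164, 195, 182, 195, 188]
}
```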
HTH
Stewart.