Read non-UTF8 file

Stewart Gordon smjg_1998 at yahoo.com
Mon Feb 21 08:55:56 PST 2011


What compiler version/platform are you using?  I had to fix some errors before it would 
compile on mine (DMD 1.066 and 2.051, Windows).

On 19/02/2011 13:42, Nrgyzer wrote:
<snip>
> Now... and with writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the following:
>
> [195, 131, 164]
> [195, 131, 182]
> [195, 131, 188]

It took a while for me to make sense of what's going on!

The expressions (0xC0 | (ch >> 6)) and (0x80 | (ch & 0x3F)) both have type int.  It 
appears that, in D2, if you append an int to a string then it treats the int as a Unicode 
code point and automagically converts it to UTF-8.  That would account for the stray 131: 
195 taken as the code point U+00C3 encodes as the two bytes 195, 131.  But why is it doing 
it on the first byte and not the second?  This looks like a bug.
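
To see the effect in isolation, here is a minimal sketch that makes both behaviours 
explicit with casts instead of relying on what the compiler does with a bare int.  The 
function name mixedAppend is made up for illustration, and I'm assuming ch is a single 
byte from the file (0xE4, 'ä' in Latin-1, for the first line of your output).

```d
import std.stdio;

// Hypothetical helper, not from the original code: appends the lead byte
// as a code point (dchar) and the continuation byte as a raw char.
immutable(ubyte)[] mixedAppend(ubyte ch)
{
    string s;
    // Appending a dchar makes D encode it as UTF-8: U+00C3 -> 0xC3 0x83.
    s ~= cast(dchar) (0xC0 | (ch >> 6));
    // Appending a char adds exactly one byte, unchanged.
    s ~= cast(char) (0x80 | (ch & 0x3F));
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    writeln(mixedAppend(0xE4));   // [195, 131, 164]
}
```

That reproduces your first line exactly, which is why I think the first append is being 
treated as a code point and the second as a plain byte.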

Casting each UTF-8 byte value to a char

     if (ch < 0x80) {
         result ~= cast(char) ch;
     } else {
         result ~= cast(char) (0xC0 | (ch >> 6));
         result ~= cast(char) (0x80 | (ch & 0x3F));
     }

gives the expected output

[195, 164]
[195, 182]
[195, 188]
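
For completeness, a self-contained sketch of the whole conversion with the casts in place. 
I haven't seen the rest of your convertToUTF8, so the name latin1ToUTF8, the signature and 
the assumption that the input is ISO-8859-1 (one byte per character) are all mine.

```d
import std.stdio;

// Hypothetical stand-in for convertToUTF8; assumes ISO-8859-1 input,
// so every byte maps directly to the code point of the same value.
string latin1ToUTF8(const(ubyte)[] input)
{
    string result;
    foreach (ch; input)
    {
        if (ch < 0x80)
        {
            // ASCII range: identical in Latin-1 and UTF-8.
            result ~= cast(char) ch;
        }
        else
        {
            // 0x80..0xFF: two-byte UTF-8 sequence, each byte cast to char
            // so that it is appended as-is rather than re-encoded.
            result ~= cast(char) (0xC0 | (ch >> 6));
            result ~= cast(char) (0x80 | (ch & 0x3F));
        }
    }
    return result;
}

void main()
{
    ubyte[] raw = [0xE4, 0xF6, 0xFC];   // ä ö ü in ISO-8859-1
    writeln(cast(const(ubyte)[]) latin1ToUTF8(raw));
    // [195, 164, 195, 182, 195, 188]
}
```

Note that the two-byte branch only covers code points below 0x800, which is all a 
single-byte encoding like Latin-1 can produce.  Other 8-bit code pages would need a lookup 
table first.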

HTH

Stewart.

