Read non-UTF8 file

Nrgyzer nrgyzer at gmail.com
Tue Feb 22 08:56:41 PST 2011


== Quote from Stewart Gordon (smjg_1998 at yahoo.com)'s article
> What compiler version/platform are you using?  I had to fix some errors before it would
> compile on mine (1.066/2.051 Windows).
> On 19/02/2011 13:42, Nrgyzer wrote:
> <snip>
> > Now... and with writefln("%s", cast(ubyte[]) convertToUTF8(f.readLine())); I get the following:
> >
> > [195, 131, 164]
> > [195, 131, 182]
> > [195, 131, 188]
> It took a while for me to make sense of what's going on!
> The expressions (0xC0 | (ch >> 6)) and (0x80 | (ch & 0x3F)) both have type int.  It
> appears that, in D2, if you append an int to a string then it treats the int as a Unicode
> codepoint and automagically converts it to UTF-8.  But why is it doing it on the first
> byte and not the second?  This looks like a bug.
> Casting each UTF-8 byte value to a char
>      if (ch < 0x80) {
>          result ~= cast(char) ch;
>      } else {
>          result ~= cast(char) (0xC0 | (ch >> 6));
>          result ~= cast(char) (0x80 | (ch & 0x3F));
>      }
> gives the expected output
> [195, 164]
> [195, 182]
> [195, 188]
> HTH
> Stewart.

I was puzzled too, because I used the same code in D1 and it worked without any problems. Anyway, thanks :)
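
For the archive, here is a minimal sketch of the whole conversion with Stewart's casts applied. The full body of convertToUTF8 was not posted in this message, so the signature and the loop below are an assumed reconstruction, and the sketch assumes the file is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point:

```d
import std.stdio;

// Hypothetical reconstruction: only the if/else body comes from the thread.
// The signature and the loop are assumptions. Input is assumed to be Latin-1.
string convertToUTF8(const(ubyte)[] latin1)
{
    char[] result;
    foreach (ubyte ch; latin1)
    {
        if (ch < 0x80)
        {
            // ASCII range: one byte, unchanged.
            result ~= cast(char) ch;
        }
        else
        {
            // 0x80..0xFF: two UTF-8 bytes. The casts keep the int-typed
            // expressions from being appended as code points.
            result ~= cast(char) (0xC0 | (ch >> 6));
            result ~= cast(char) (0x80 | (ch & 0x3F));
        }
    }
    return result.idup;
}

void main()
{
    // 0xE4, 0xF6, 0xFC are 'ä', 'ö', 'ü' in Latin-1.
    ubyte[] input = [0xE4, 0xF6, 0xFC];
    foreach (ubyte b; input)
        writefln("%s", cast(const(ubyte)[]) convertToUTF8([b]));
}
```

Working through the first byte by hand: 0xC0 | (0xE4 >> 6) = 0xC3 = 195 and 0x80 | (0xE4 & 0x3F) = 0xA4 = 164, which matches the [195, 164] above. The stray 131 in the earlier output is consistent with 195 being appended as code point U+00C3, whose UTF-8 encoding is the two bytes 195, 131.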


More information about the Digitalmars-d-learn mailing list