UTF-8 problems
Oskar Linde
oskar.lindeREM at OVEgmail.com
Mon Jun 12 07:45:53 PDT 2006
Deewiant wrote:
> import std.stream, std.cstream;
>
> // åäöΔ
>
> void main() {
>     Stream file = new File(__FILE__, FileMode.In);
>     // alternatively:
>     //Stream file = din;
>
>     while (!file.eof)
>         dout.writef("%s", file.getc);
> }
> --
>
> With the above UTF-8 code, I expect the program's source to be output, also in
> UTF-8. However, I get ASCII output, and on line three appears everyone's
> favourite "Error: 4invalid UTF-8 sequence".
>
> Furthermore, unless I use the "alternative" where std.cstream.din is used, the
> two line breaks after "std.cstream;" are not \r\n as they should be in the DOS
> encoding I use, they are \r\r\n. Converting the line breaks to just \n causes
> them to become \r\n in the output. Whence the extra \r?
>
> What's strange is if I use e.g. readLine instead of getc, everything is fine.
> Since readLine seems to use getc internally, I'm having trouble understanding
> why this is the case.
>
> A bug or two, or where am I going wrong?
I had a quick look at the std.stream sources, and it seems std.stream
isn't really Unicode-aware. getc() assumes the stream is UTF-8 and
returns a char, which means it returns a single UTF-8 code unit, not a
full character. getcw(), on the other hand, assumes the stream is UTF-16
and returns a single UTF-16 code unit as a wchar.
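
You are printing individual UTF-8 code units as if they were characters.
A non-ASCII character such as å is encoded as several code units, and
none of them is valid UTF-8 on its own, which triggers your error.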
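
To make the code unit versus character distinction concrete, here is a minimal sketch written against a current D compiler, not the 2006 std.stream. The helper names countCodeUnits and countCodePoints are made up for illustration. It relies only on the language rule that foreach over a string with a char variable visits each code unit, while a dchar variable makes the compiler decode UTF-8.

```d
import std.stdio : writeln;

// Hypothetical helper: iterating as `char` visits every UTF-8 code unit.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Hypothetical helper: iterating as `dchar` decodes UTF-8, so each step
// is one full code point.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "åäöΔ"; // the comment line from the original program
    writeln(countCodeUnits(s));  // each of these characters takes two bytes
    writeln(countCodePoints(s)); // four characters
}
```

Each value that getc() returns corresponds to one step of the first loop, so passing it to writef on its own hands the formatter half of a two-byte sequence.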

If D claims to have full Unicode support, std.stream ought to either
have decoding routines that return a dchar, or have a UTF-decoding
wrapper stream, in which case std.stream.getc() ought to return a ubyte,
not a char.
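
A rough sketch of what such a decoding wrapper could look like, assuming a current Phobos where std.utf.decode and std.string.representation are available. Utf8Reader, getByte and getDchar are hypothetical names and not part of std.stream, and a byte array stands in for the underlying stream.

```d
import std.stdio : writeln;
import std.string : representation;
import std.utf : decode;

// Hypothetical decoding wrapper; not part of std.stream.
struct Utf8Reader
{
    const(ubyte)[] source; // stands in for the underlying byte stream
    size_t pos;            // index of the next unread byte

    bool eof() const { return pos >= source.length; }

    // Raw access returns a ubyte and makes no claim that it is a character.
    ubyte getByte() { return source[pos++]; }

    // Decoded access returns one full code point and advances pos past
    // every byte of its sequence. Throws UTFException on invalid input.
    dchar getDchar()
    {
        auto text = cast(const(char)[]) source;
        return decode(text, pos);
    }
}

void main()
{
    auto reader = Utf8Reader("åäöΔ".representation);
    while (!reader.eof)
        writeln(reader.getDchar()); // one character per line
}
```

Keeping the raw accessor typed as ubyte means a caller cannot mistake a single byte for something printable, while getDchar is the only path that yields a character.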
/Oskar
More information about the Digitalmars-d-learn
mailing list