UTF-8 problems

Oskar Linde oskar.lindeREM at OVEgmail.com
Mon Jun 12 07:45:53 PDT 2006


Deewiant wrote:
> import std.stream, std.cstream;
> 
> // åäöΔ
> 
> void main() {
> 	Stream file = new File(__FILE__, FileMode.In);
> 	// alternatively:
> 	//Stream file = din;
> 
> 	while (!file.eof)
> 		dout.writef("%s", file.getc);
> }
> --
> 
> With the above UTF-8 code, I expect the program's source to be output, also in
> UTF-8. However, I get ASCII output, and on line three appears everyone's
> favourite "Error: 4invalid UTF-8 sequence".
> 
> Furthermore, unless I use the "alternative" where std.cstream.din is used, the
> two line breaks after "std.cstream;" are not \r\n as they should be in the DOS
> encoding I use, they are \r\r\n. Converting the line breaks to just \n causes
> them to become \r\n in the output. Whence the extra \r?
> 
> What's strange is if I use e.g. readLine instead of getc, everything is fine.
> Since readLine seems to use getc internally, I'm having trouble understanding
> why this is the case.
> 
> A bug or two, or where am I going wrong?

I had a quick look at the std.stream sources, and it seems std.stream 
isn't really Unicode-aware. getc() assumes the stream is UTF-8 and 
returns a char, which means it returns a single UTF-8 code unit, not a 
full character. getcw(), on the other hand, assumes the stream is UTF-16 
and returns a single UTF-16 code unit as a wchar.

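You are printing individual UTF-8 code units as if they were characters, 
which triggers your error. Each of å, ä, ö and Δ is encoded as two code 
units, and neither half is a valid UTF-8 sequence by itself.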
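
To make the code unit / character distinction concrete, here is a minimal 
sketch. It assumes a current Phobos and uses std.utf.decode on an in-memory 
string rather than the std.stream API discussed above; codePoints is a 
hypothetical helper name, not something from the library.

```d
import std.stdio : writeln;
import std.utf : decode;

// Hypothetical helper: decode a UTF-8 string into its code points.
dchar[] codePoints(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i); // decode advances i past the whole sequence
    return result;
}

void main()
{
    string text = "åäöΔ";
    writeln(text.length);             // 8: UTF-8 code units (what getc hands out)
    writeln(codePoints(text).length); // 4: actual characters
}
```

Formatting any one of those eight code units alone is what fails, while 
the four decoded dchar values can each be printed safely.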

If D claims to have full Unicode support, std.stream ought to either 
have decoding routines that return a dchar, or have a UTF-decoding 
wrapper stream, in which case std.stream.getc() ought to return a ubyte, 
not a char...
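
A rough sketch of what such a decoding wrapper could look like, under 
stated assumptions: Utf8Reader, getByte, getChar and sequenceLength are 
hypothetical names, the byte slice merely stands in for an underlying 
byte-oriented stream, and the code targets a current Phobos, not 
std.stream as it exists today.

```d
import std.stdio : write;
import std.string : representation;
import std.utf : decode;

// Number of code units in a UTF-8 sequence, judged from its lead byte.
size_t sequenceLength(char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    throw new Exception("invalid UTF-8 lead byte");
}

// Hypothetical wrapper: pulls raw bytes and hands out whole code points.
struct Utf8Reader
{
    const(ubyte)[] bytes; // stand-in for the underlying byte stream

    bool eof() const { return bytes.length == 0; }

    // Raw read, returning a ubyte as a byte-level getc() would.
    ubyte getByte()
    {
        ubyte b = bytes[0];
        bytes = bytes[1 .. $];
        return b;
    }

    // Read one complete UTF-8 sequence and decode it to a dchar.
    dchar getChar()
    {
        char[4] buf;
        buf[0] = cast(char) getByte();
        immutable len = sequenceLength(buf[0]);
        foreach (k; 1 .. len)
        {
            if (eof)
                throw new Exception("truncated UTF-8 sequence");
            buf[k] = cast(char) getByte();
        }
        const(char)[] seq = buf[0 .. len];
        size_t idx = 0;
        return decode(seq, idx); // also validates the continuation bytes
    }
}

void main()
{
    auto r = Utf8Reader("// åäöΔ\n".representation);
    while (!r.eof)
        write(r.getChar()); // each write receives a full character
}
```

With this split, the raw read only ever deals in bytes, and anything that 
is printed as a character has gone through decoding first.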

/Oskar


