UTF-8 problems

Deewiant deewiant.doesnotlike.spam at gmail.com
Mon Jun 12 08:52:12 PDT 2006


Oskar Linde wrote:
> Deewiant skrev:
>> import std.stream, std.cstream;
>>
>> // åäöΔ
>>
>> void main() {
>>     Stream file = new File(__FILE__, FileMode.In);
>>     // alternatively:
>>     //Stream file = din;
>>
>>     while (!file.eof)
>>         dout.writef("%s", file.getc);
>> }
>> -- 
>>
>> With the above UTF-8 code, I expect the program's source to be output,
>> also in UTF-8. However, I get ASCII output, and on line three appears
>> everyone's favourite "Error: 4invalid UTF-8 sequence".
>>
>> Furthermore, unless I use the "alternative" where std.cstream.din is used,
>> the two line breaks after "std.cstream;" are not \r\n as they should be in
>> the DOS encoding I use, they are \r\r\n. Converting the line breaks to
>> just \n causes them to become \r\n in the output. Whence the extra \r?
>>
>> What's strange is if I use e.g. readLine instead of getc, everything is
>> fine. Since readLine seems to use getc internally, I'm having trouble
>> understanding why this is the case.
>>
>> A bug or two, or where am I going wrong?
> 
> I had a quick look at the std.stream sources and it seems std.stream
> isn't really Unicode-aware. getc() assumes the stream to be in UTF-8 and
> returns a char, which means it returns a UTF-8 code unit, not a full
> character. getcw() on the other hand assumes the stream is in UTF-16 and
> returns a UTF-16 code unit as a wchar.
> 
> You are printing individual UTF-8 code units as characters, which
> triggers your error.
> 
> /Oskar

Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in these
matters to correct the problem.

So, for instance, the bytes "c3 a4" are the UTF-8 encoding of U+00E4, "ä". How
do I combine those two bytes into a single "char"?

Say I check if the char received from getc() is greater than 127 (outside ASCII)
and if it is, I store it and the following char in two ubytes. Now what? How do
I get a char?
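
For reference, here is a minimal sketch of the bit arithmetic I think is
involved, covering the two-byte case only. The helper name decodeTwoByte is
made up for illustration and is not part of Phobos. Since a D char holds a
single UTF-8 code unit, the combined value has to go into a dchar (or wchar)
rather than a char. The sketch assumes the first byte has the form 110xxxxx and
the second the form 10yyyyyy; longer sequences would need more cases.

```d
import std.stdio;

// Hypothetical helper (not part of Phobos): combines the two code units of a
// two-byte UTF-8 sequence into one code point.
// Assumes lead has the form 110xxxxx and trail the form 10yyyyyy.
dchar decodeTwoByte(ubyte lead, ubyte trail)
{
    assert((lead & 0xE0) == 0xC0, "not a two-byte lead byte");
    assert((trail & 0xC0) == 0x80, "not a continuation byte");

    // Keep the low 5 bits of the lead byte and the low 6 bits of the
    // continuation byte, then join them into an 11-bit value.
    return cast(dchar)(((lead & 0x1F) << 6) | (trail & 0x3F));
}

void main()
{
    dchar c = decodeTwoByte(0xC3, 0xA4);
    assert(c == 0x00E4); // U+00E4, "ä"
    writefln("U+%04X", cast(uint) c);
}
```

Phobos also has std.utf.decode, which handles sequences of any valid length, so
collecting the code units into a char[] and decoding that may be the simpler
route than doing the shifts by hand.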


