UTF-8 problems

Mon Jun 12 09:06:43 PDT 2006

Deewiant wrote:
> Oskar Linde wrote:
>> Deewiant wrote:
>>> import std.stream, std.cstream;
>>>
>>> // åäöΔ
>>>
>>> void main() {
>>>     Stream file = new File(__FILE__, FileMode.In);
>>>     // alternatively:
>>>     //Stream file = din;
>>>
>>>     while (!file.eof)
>>>         dout.writef("%s", file.getc);
>>> }
>>> -- 
>>>
>>> With the above UTF-8 code, I expect the program's source to be output, also
>>> in UTF-8. However, I get ASCII output, and on line three appears everyone's
>>> favourite "Error: 4invalid UTF-8 sequence".
>>>
>>> Furthermore, unless I use the "alternative" where std.cstream.din is used,
>>> the two line breaks after "std.cstream;" are not \r\n as they should be in
>>> the DOS encoding I use, they are \r\r\n. Converting the line breaks to just
>>> \n causes them to become \r\n in the output. Whence the extra \r?
>>>
>>> What's strange is if I use e.g. readLine instead of getc, everything is
>>> fine. Since readLine seems to use getc internally, I'm having trouble
>>> understanding why this is the case.
>>>
>>> A bug or two, or where am I going wrong?
>> I had a quick look at the std.stream sources and it seems std.stream
>> isn't really Unicode-aware. getc() assumes the stream to be in UTF-8 and
>> returns a char, which means it returns a UTF-8 code unit, not a full
>> character. getcw() on the other hand assumes the stream is in UTF-16 and
>> returns a UTF-16 code unit as a wchar.
>>
>> You are printing individual UTF-8 code units as characters, which
>> triggers your error.
>>
>> /Oskar
> 
> Thanks for the explanation. Unfortunately, I'm not knowledgeable enough in these
> matters to correct the problem.
> 
> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I
> combine the former two into a single "char"?
> 
> Say I check if the char received from getc() is greater than 127 (outside ASCII)
> and if it is, I store it and the following char in two ubytes. Now what? How do
> I get a char?

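A single char cannot hold "ä"; it is only one UTF-8 code unit. What you
want is a dchar, which you can get with:

dchar std.utf.decode(char[],int)

It can be quite clumsy to use, though. A hint: std.utf.UTF8stride[c] gives
the total number of bytes in the sequence that starts with the code unit c.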
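
To make that concrete, here is a minimal sketch of the decode step. It
assumes a present-day D compiler and Phobos, where std.utf.decode takes a
size_t index by reference and std.utf.stride is a function rather than the
UTF8stride table mentioned above. decodeAll is a made-up helper name, not a
library function, and the input is a hard-coded buffer rather than a stream.

```d
import std.stdio : writefln;
import std.utf : decode, stride;

// Hypothetical helper (not part of Phobos): turn a buffer of UTF-8 code
// units, e.g. collected one at a time from getc(), into full code points.
dchar[] decodeAll(const(char)[] units)
{
    dchar[] result;
    size_t i = 0;
    while (i < units.length)
    {
        // decode() reads one complete sequence starting at index i,
        // returns the code point, and advances i past that sequence.
        // It throws on an invalid or truncated sequence.
        result ~= decode(units, i);
    }
    return result;
}

void main()
{
    // 'a', then the two code units c3 a4 (U+00E4), then 'b'.
    string units = "a\xC3\xA4b";

    // stride() reports how many code units the sequence at index 1 spans,
    // which is how many chars to collect before decoding.
    writefln("sequence length at index 1: %s", stride(units, 1));

    foreach (dchar c; decodeAll(units))
        writefln("U+%04X", cast(uint) c);
}
```

With a stream, the same idea would be to read the lead code unit, look up
how many units the sequence needs, read the rest into a small char buffer,
and only then decode.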

/Oskar