UTF-8 problems

Deewiant deewiant.doesnotlike.spam at gmail.com
Mon Jun 12 11:02:05 PDT 2006


Oskar Linde wrote:
> Deewiant skrev:
>> Oskar Linde wrote:
>>> Deewiant skrev:
>>>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä".
>>>> How do I combine the former two into a single "char"?
>>>>
>>>> Say I check if the char received from getc() is greater than 127
>>>> (outside ASCII) and if it is, I store it and the following char in
>>>> two ubytes. Now what? How do I get a char?
>>> dchar std.utf.decode(char[],int)
>>>
>>> even if it can be quite clumsy. A hint is to use:
>>>
>>> std.utf.UTF8stride[c] to get the total number of bytes that are
>>> part of the starting token c.
>>>
>>> /Oskar
>>
>> Thanks, that works. What I did was write a short function looking like
>> this:
> 
> This only works for a small subset of Unicode...

Thanks for correcting it, I was unsure myself.

>> dchar myGetchar(Stream s) {
>>     char c = s.getc;
>>
>>     // ASCII
>>     if (c <= 127)
>>         return c;
>>     else {
>>         // UTF-8
>>         char[] str = new char[2];
>>         str[0] = c;
>>         str[1] = s.getc;
> 
> For a more general implementation, change the last 3 lines to:
> 
>         char[6] str;

6? Aren't 4 UTF-8 units enough for all of Unicode? I see that UTF8stride also
has 5 or 6 as some of its elements; why is that?
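
As far as I can tell, the 5 and 6 come from the original UTF-8 definition, which allowed sequences of up to six bytes; the current definition stops at four, which covers everything up to U+10FFFF. A minimal sketch of the lead-byte-to-length mapping follows. seqLength is a hypothetical helper of my own, not something in Phobos, and it returns 0 for bytes that cannot start a sequence, which is not necessarily how UTF8stride marks them.

```d
// Hypothetical helper, not part of Phobos: length of a UTF-8 sequence
// as implied by its lead byte, under the original 6-byte scheme.
// Returns 0 for a byte that cannot start a sequence.
uint seqLength(ubyte lead)
{
    if (lead < 0x80) return 1; // 0xxxxxxx: ASCII
    if (lead < 0xC0) return 0; // 10xxxxxx: continuation byte
    if (lead < 0xE0) return 2; // 110xxxxx
    if (lead < 0xF0) return 3; // 1110xxxx
    if (lead < 0xF8) return 4; // 11110xxx: enough for U+10FFFF
    if (lead < 0xFC) return 5; // 111110xx: legacy scheme only
    if (lead < 0xFE) return 6; // 1111110x: legacy scheme only
    return 0;                  // 0xFE and 0xFF are never valid
}
```

So a 4-element buffer should be enough for well-formed modern input, while 6 also leaves room for the legacy lead bytes.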

>         str[0] = c;
>         int n = std.utf.UTF8stride[c];
>         if (n == 0xff)
>             return cast(dchar)-1; // corrupt string
>         for (int i = 1; i < n; i++)
>             str[i] = s.getc;
> 
>>
>>         // dummy var, needed by decode
>>         size_t i = 0;
>>         return decode(str, i);
>>     }
>> }
>>
>> Using that in place of getc() pretty much does the trick.
>>
>> Unfortunately, when reading from files instead of stdin, I still run
>> into the problem of \r\n being converted to \r\r\n. I think I know why,
>> too: '\n' is being converted into \r\n because I'm on a Windows
>> platform. I use the following workaround:
> 
> Yes. This is further proof that std.stream is lacking functionality.
> Because of this conversion, it is clear that std.stream isn't a binary
> stream, and as such, it ought to be either a UTF-8, UTF-16 or UTF-32
> encoded text stream, and in those cases std.stream should have a
> getc-like function returning a dchar, just like the above code.
> 
> /Oskar

Yes, I agree wholeheartedly. It would appear that the std.stream classes are for
textual input, but currently some of the methods choke on UTF-x input.

In addition to a getcd() method to complement getc() and getcw(), a getb()
method returning a ubyte might also be handy, for when one really wants
byte-by-byte input. Perhaps getc()'s signature should actually be changed to
that, since that is all it currently seems to do.
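
To make the idea concrete, here is a rough sketch of what a getcd()-style read could look like when layered on a byte-wise source. It is only an illustration under my own assumptions: readCodePoint and the nextByte delegate are hypothetical names, the delegate stands in for a getb()-like call, and nothing here is an existing std.stream method. It handles sequences of at most four bytes and relies on std.utf.decode for the actual decoding.

```d
import std.utf : decode;

// Hypothetical sketch: nextByte stands in for a byte-wise read such as
// the getb() suggested above; it is not an existing std.stream method.
dchar readCodePoint(scope ubyte delegate() nextByte)
{
    char[4] buf;
    buf[0] = cast(char) nextByte();

    // Work out the sequence length from the lead byte (at most 4).
    size_t n;
    if (buf[0] < 0x80)
        n = 1;
    else if ((buf[0] & 0xE0) == 0xC0)
        n = 2;
    else if ((buf[0] & 0xF0) == 0xE0)
        n = 3;
    else if ((buf[0] & 0xF8) == 0xF0)
        n = 4;
    else
        return cast(dchar) -1; // not a valid lead byte: corrupt input

    // Read the continuation bytes that belong to this sequence.
    foreach (i; 1 .. n)
        buf[i] = cast(char) nextByte();

    // decode needs an index variable; it throws on malformed sequences.
    char[] s = buf[0 .. n];
    size_t idx = 0;
    return decode(s, idx);
}
```

Feeding it the bytes c3 a4 should give back U+00E4, and plain ASCII bytes come back unchanged.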


