UTF-8 problems
Oskar Linde
oskar.lindeREM at OVEgmail.com
Mon Jun 12 10:24:38 PDT 2006
Deewiant wrote:
> Oskar Linde wrote:
>> Deewiant wrote:
>>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I
>>> combine the former two into a single "char"?
>>>
>>> Say I check if the char received from getc() is greater than 127 (outside
>>> ASCII) and if it is, I store it and the following char in two ubytes. Now
>>> what? How do I get a char?
>> dchar std.utf.decode(char[],int)
>>
>> even if it can be quite clumsy. A hint is to use:
>>
>> std.utf.UTF8stride[c] to get the total number of bytes that are part of the
>> starting token c.
>>
>> /Oskar
>
> Thanks, that works. What I did was write a short function looking like this:
This only works for two-byte sequences (U+0080 through U+07FF), which is a small subset of Unicode...
> dchar myGetchar(Stream s) {
> char c = s.getc;
>
> // ASCII
> if (c <= 127)
> return c;
> else {
> // UTF-8
> char[] str = new char[2];
> str[0] = c;
> str[1] = s.getc;
For a more general implementation, replace the last three lines with:
    char[6] str;
    str[0] = c;
    int n = std.utf.UTF8stride[c]; // total bytes in the sequence starting with c
    if (n == 0xff)
        return cast(dchar)-1; // corrupt string
    for (int i = 1; i < n; i++)
        str[i] = s.getc;
>
> // dummy var, needed by decode
> size_t i = 0;
> return decode(str, i);
> }
> }
>
> Using that in place of getc() pretty much does the trick.
>
> Unfortunately, when reading from files instead of stdin, I still run into the
> problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is
> being converted into \r\n because I'm on a Windows platform. I use the following
> workaround:
Yes. This is further evidence that std.stream is lacking functionality.
Because of this conversion, std.stream clearly isn't a binary
stream. As such, it ought to be a UTF-8, UTF-16 or UTF-32
encoded text stream, and in that case std.stream should provide a
function that returns a dchar, just like the code above.
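
As a rough illustration of what such a function might look like, here is a
minimal sketch. It assumes a current D compiler, where std.utf.stride takes
over the role of the UTF8stride table, and it replaces s.getc with a
hypothetical nextByte callback so that it does not depend on std.stream.
Unlike the code above, it lets std.utf throw a UTFException on corrupt input
instead of returning cast(dchar)-1.

```d
import std.utf : decode, stride;

/// Reads one code point from a byte source.
/// `nextByte` is a hypothetical callback standing in for s.getc;
/// it must return the next raw byte of the stream.
dchar readCodePoint(scope char delegate() nextByte)
{
    char[4] buf;                 // a UTF-8 sequence is at most 4 bytes long
    buf[0] = nextByte();

    // stride inspects the lead byte and reports the sequence length (1 to 4).
    // It throws a UTFException if the byte cannot start a sequence.
    immutable n = stride(buf[], 0);
    foreach (i; 1 .. n)
        buf[i] = nextByte();

    size_t index = 0;            // decode advances this past the sequence
    return decode(buf[0 .. n], index);
}
```

Fed the bytes "c3 a4" from the start of this thread, the function should
return U+00E4. ASCII bytes take the same path, since stride reports a length
of 1 for them and the loop body never runs.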
/Oskar