UTF-8 problems
Deewiant
deewiant.doesnotlike.spam at gmail.com
Mon Jun 12 10:08:06 PDT 2006
Oskar Linde wrote:
> Deewiant skrev:
>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I
>> combine the former two into a single "char"?
>>
>> Say I check if the char received from getc() is greater than 127 (outside
>> ASCII) and if it is, I store it and the following char in two ubytes. Now
>> what? How do I get a char?
>
> dchar std.utf.decode(char[],int)
>
> even if it can be quite clumsy. A hint is to use:
>
> std.utf.UTF8stride[c] to get the total number of bytes that are part of the
> starting token c.
>
> /Oskar
Thanks, that works. What I did was write a short function looking like this:
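import std.stream;
import std.utf;

dchar myGetchar(Stream s) {
    char c = s.getc;
    // ASCII: the byte is the whole character
    if (c <= 127)
        return c;
    else {
        // UTF-8: assumes a two-byte sequence (U+0080 to U+07FF)
        char[] str = new char[2];
        str[0] = c;
        str[1] = s.getc;
        // index into str, required (and advanced) by decode
        size_t i = 0;
        return decode(str, i);
    }
}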
Using that in place of getc() pretty much does the trick.
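The function above always reads exactly one extra byte, so it covers only two-byte sequences. What follows is a minimal sketch of one way to extend the idea to sequences of up to four bytes. It is written against present-day D's std.utf.decode rather than the 2006 library. The names sequenceLength and readCodePoint are hypothetical, and the nextByte delegate is an assumed stand-in for s.getc.

```d
import std.utf : decode;

// Hypothetical helper: number of bytes in the UTF-8 sequence that
// starts with lead byte c, or 0 if c cannot start a sequence.
size_t sequenceLength(char c)
{
    if (c < 0x80) return 1;                // 0xxxxxxx: ASCII
    if (c >= 0xC2 && c <= 0xDF) return 2;  // 110xxxxx
    if (c >= 0xE0 && c <= 0xEF) return 3;  // 1110xxxx
    if (c >= 0xF0 && c <= 0xF4) return 4;  // 11110xxx
    return 0;                              // continuation or invalid byte
}

// Hypothetical variant of myGetchar; nextByte stands in for s.getc.
dchar readCodePoint(char delegate() nextByte)
{
    char[4] buf;
    buf[0] = nextByte();
    size_t len = sequenceLength(buf[0]);
    if (len == 0)
        len = 1; // pass the lone bad byte to decode, which rejects it
    // Read the continuation bytes announced by the lead byte.
    foreach (n; 1 .. len)
        buf[n] = nextByte();
    char[] seq = buf[0 .. len];
    size_t i = 0; // index into seq, advanced by decode
    return decode(seq, i);
}
```

Computing the length from the lead byte means the reader never consumes bytes that belong to the next character. Malformed input is left for decode to report instead of being handled by hand.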
Unfortunately, when reading from files instead of stdin, I still run into the
problem of \r\n being converted to \r\r\n. I think I know why, too: the \r is
passed through unchanged, and the '\n' after it is converted into \r\n on
output because I'm on a Windows platform. I use the following workaround:
if (c == '\r') {
    char d = s.getc;
    if (d == '\n')
        return '\n';
    else {
        s.ungetc(d);
        return c;
    }
}
More information about the Digitalmars-d-learn
mailing list