UTF-8 problems

Oskar Linde oskar.lindeREM at OVEgmail.com
Mon Jun 12 10:24:38 PDT 2006


Deewiant wrote:
> Oskar Linde wrote:
>> Deewiant wrote:
>>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I 
>>> combine the former two into a single "char"?
>>>
>>> Say I check if the char received from getc() is greater than 127 (outside
>>> ASCII) and if it is, I store it and the following char in two ubytes. Now 
>>> what? How do I get a char?
>> dchar std.utf.decode(char[] s, inout size_t idx)
>>
>> although it can be quite clumsy. A hint: use
>>
>> std.utf.UTF8stride[c] to get the total number of bytes in the sequence
>> that starts with the byte c.
>>
>> /Oskar
> 
> Thanks, that works. What I did was write a short function looking like this:

This only works for two-byte sequences, i.e. code points U+0080 through
U+07FF, which is a small subset of Unicode.

> dchar myGetchar(Stream s) {
> 	char c = s.getc;
> 
> 	// ASCII
> 	if (c <= 127)
> 		return c;
> 	else {
> 		// UTF-8
> 		char[] str = new char[2];
> 		str[0] = c;
> 		str[1] = s.getc;

For a more general implementation, change the last 3 lines to:

		char[6] str;
		str[0] = c;
		int n = std.utf.UTF8stride[c];
		if (n == 0xff)
			return cast(dchar)-1; // corrupt string
		for (int i = 1; i < n; i++)
			str[i] = s.getc;

> 
> 		// dummy var, needed by decode
> 		size_t i = 0;
> 		return decode(str, i);
> 	}
> }
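
To answer the original question of how "c3 a4" becomes U+00E4 more directly,
here is a minimal sketch of the bit manipulation involved, written without
Phobos. The names sequenceLength and combine are made up for illustration,
and the sketch assumes sequences of at most 4 bytes (current UTF-8), unlike
the 6-byte buffer above. It does not reject overlong forms or surrogates, as
a full decoder should, so treat it as an illustration rather than a
replacement for decode.

```d
/// Hypothetical helper (not in Phobos): length in bytes of the UTF-8
/// sequence that starts with lead byte c, or 0 if c cannot start one.
uint sequenceLength(ubyte c)
{
    if (c < 0x80) return 1;               // 0xxxxxxx: ASCII
    if (c >= 0xC2 && c <= 0xDF) return 2; // 110xxxxx
    if (c >= 0xE0 && c <= 0xEF) return 3; // 1110xxxx
    if (c >= 0xF0 && c <= 0xF4) return 4; // 11110xxx
    return 0;                             // continuation or invalid lead byte
}

/// Hypothetical helper: combines one complete sequence into a code point.
/// Returns false if the sequence is truncated or a continuation byte is bad.
/// Overlong forms and surrogates are NOT rejected in this sketch.
bool combine(const(ubyte)[] seq, out dchar result)
{
    if (seq.length == 0) return false;
    immutable n = sequenceLength(seq[0]);
    if (n == 0 || seq.length < n) return false;
    if (n == 1) { result = seq[0]; return true; }

    // Keep only the payload bits of the lead byte: 5, 4 or 3 of them.
    uint cp = seq[0] & (0x7F >> n);
    foreach (b; seq[1 .. n])
    {
        // Every continuation byte must look like 10xxxxxx.
        if ((b & 0xC0) != 0x80) return false;
        // Append its 6 payload bits.
        cp = (cp << 6) | (b & 0x3F);
    }
    result = cast(dchar) cp;
    return true;
}
```

With the bytes from the question, combine([0xC3, 0xA4], ch) returns true and
sets ch to U+00E4: the lead byte contributes 0x03, the continuation byte
0x24, and (0x03 << 6) | 0x24 = 0xE4.
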
> 
> Using that in place of getc() pretty much does the trick.
> 
> Unfortunately, when reading from files instead of stdin, I still run into the
> problem of \r\n being converted to \r\r\n. I think I know why, too: '\n' is
> being converted into \r\n because I'm on a Windows platform. I use the following
> workaround:

Yes. This is further evidence that std.stream is lacking functionality.
Because of this conversion, std.stream clearly isn't a binary stream, so
it ought to be a UTF-8, UTF-16 or UTF-32 encoded text stream, and in that
case std.stream should have a function returning a dchar, just like the
code above.
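
As a rough sketch of what I mean, assuming a present-day Phobos where
std.utf.decode takes the buffer and an index by reference: the Utf8Reader
struct below is hypothetical, reads from an in-memory buffer rather than a
real stream, and leaves newline translation out entirely.

```d
import std.utf : decode;

/// Hypothetical text reader (not part of std.stream): hands out one
/// code point per call from an in-memory UTF-8 buffer.
struct Utf8Reader
{
    const(char)[] buf; // the UTF-8 encoded input
    size_t pos;        // index of the next unread byte

    /// True once every byte has been consumed.
    bool empty() const { return pos >= buf.length; }

    /// Returns the next code point and advances pos past all of its bytes.
    /// std.utf.decode throws a UTFException on malformed input.
    dchar getc()
    {
        return decode(buf, pos);
    }
}
```

The caller never sees individual bytes: each getc() call returns a whole
character, whether it took one byte or several to encode.
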

/Oskar



More information about the Digitalmars-d-learn mailing list