UTF-8 problems

Mon Jun 12 10:08:06 PDT 2006

Oskar Linde wrote:
> Deewiant skrev:
>> So, for instance, "c3 a4" is the UTF-8 equivalent of U+00E4, "ä". How do I 
>> combine the former two into a single "char"?
>> 
>> Say I check if the char received from getc() is greater than 127 (outside
>> ASCII) and if it is, I store it and the following char in two ubytes. Now 
>> what? How do I get a char?
> 
> dchar std.utf.decode(char[],int)
> 
> even if it can be quite clumsy. A hint is to use:
> 
> std.utf.UTF8stride[c] to get the total number of bytes that are part of the
> starting token c.
> 
> /Oskar

Thanks, that works. What I did was write a short function looking like this:

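import std.stream;
import std.utf;

dchar myGetchar(Stream s) {
	char c = s.getc;

	// ASCII: the single byte is the whole code point
	if (c <= 127)
		return c;

	// UTF-8: the lead byte determines the length of the sequence, which
	// can be more than two bytes. UTF8stride gives 0xFF for a byte that
	// cannot start a sequence; treat it as length 1 so decode rejects it.
	uint n = UTF8stride[c];
	if (n == 0xFF)
		n = 1;

	char[] str = new char[n];
	str[0] = c;
	for (uint k = 1; k < n; k++)
		str[k] = s.getc;

	// index variable, required (and advanced) by decode
	size_t i = 0;
	return decode(str, i);
}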

Using that in place of getc() pretty much does the trick.
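
For anyone wondering what decode actually does with "c3 a4", here is a minimal sketch. It assumes a recent compiler and Phobos, where std.utf.stride(str, index) reports the sequence length. combineTwoBytes and decodeFirst are hypothetical names made up for this example, and the hand-written version does no validation at all.

```d
import std.utf : decode, stride;

// Hypothetical helper: combines a 2-byte UTF-8 sequence by hand.
// Layout is 110xxxxx 10yyyyyy; the code point is xxxxxyyyyyy.
// No validation is done here, unlike std.utf.decode.
dchar combineTwoBytes(char lead, char cont)
{
    return cast(dchar)(((lead & 0x1F) << 6) | (cont & 0x3F));
}

// Hypothetical helper: decodes the first code point of a buffer,
// whatever its length (1 to 4 bytes), using the library routine.
dchar decodeFirst(const(char)[] bytes)
{
    size_t i = 0; // decode advances this index past the sequence
    return decode(bytes, i);
}

void main()
{
    // "c3 a4" is the UTF-8 encoding of U+00E4
    char[] bytes = [cast(char) 0xC3, cast(char) 0xA4];

    assert(stride(bytes, 0) == 2);
    assert(combineTwoBytes(bytes[0], bytes[1]) == 0xE4);
    assert(decodeFirst(bytes) == 0xE4);

    // A 3-byte sequence (U+20AC, bytes e2 82 ac) goes through the same call
    assert(stride("\u20AC", 0) == 3);
    assert(decodeFirst("\u20AC") == 0x20AC);
}
```

By hand: (0xC3 & 0x1F) is 0x03, shifted left by 6 gives 0xC0, and (0xA4 & 0x3F) is 0x24, so the result is 0xC0 | 0x24 = 0xE4.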

Unfortunately, when reading from files instead of stdin, I still run into the
problem of \r\n turning into \r\r\n. I think I know why, too: I'm on Windows,
so the '\r' passes through as-is and the following '\n' is converted into \r\n.
I use the following workaround:

if (c == '\r') {
	char d = s.getc;
	if (d == '\n')
		return '\n';
	else {
		s.ungetc(d);
		return c;
	}
}
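
The same idea can be sketched on a whole buffer instead of a stream, which is easier to try out in isolation. This assumes the text is already in memory; collapseCrLf is a hypothetical name, not something from Phobos.

```d
// Hypothetical helper: collapses each "\r\n" pair to "\n" and keeps
// a lone '\r' untouched, mirroring the getc/ungetc workaround.
string collapseCrLf(const(char)[] input)
{
    char[] result;
    result.reserve(input.length);

    for (size_t i = 0; i < input.length; i++)
    {
        // Drop a '\r' only when a '\n' follows directly;
        // the '\n' itself is copied on the next iteration.
        if (input[i] == '\r' && i + 1 < input.length && input[i + 1] == '\n')
            continue;
        result ~= input[i];
    }
    return result.idup;
}

void main()
{
    assert(collapseCrLf("a\r\nb") == "a\nb");
    assert(collapseCrLf("a\rb") == "a\rb");   // lone '\r' is kept
    assert(collapseCrLf("\r\r\n") == "\r\n"); // only the pair collapses
}
```

Only '\r' and '\n' are inspected, and both are ASCII, so working byte by byte cannot split a multi-byte UTF-8 sequence.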