ASCII to UTF8 Conversion - is this right?

Pragma ericanderton at yahoo.removeme.com
Mon Dec 18 09:19:29 PST 2006


Here's something that came up recently.  As some of you may already 
know, I've been doing some work with forum data.

I wanted to move some old forum data, which was stored in ASCII, over to 
UTF-8 via D.  The problem is that some of the data has bytes in the 
0x80-0xFF range (so it is not strictly 7-bit ASCII), which causes 
UTF-BOM detection to fail.

So I rolled the following function to 'transcode' these characters:

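import std.stdio;

// Converts 8-bit text to UTF-8, treating each byte >= 0x80 as the
// Unicode code point of the same value (i.e. as ISO-8859-1).
char[] ASCII2UTF8(char[] value){
	char[] result;
	foreach(char ch; value){
		if(ch < 0x80){
			// 7-bit ASCII is already valid UTF-8
			result ~= ch;
		}
		else{
			// U+0080..U+00FF become two bytes: 110000xx 10xxxxxx
			result ~= cast(char)(0xC0 | (ch >> 6));
			result ~= cast(char)(0x80 | (ch & 0x3F));

			debug writefln("converted: %02X to %02X %02X", ch, result[$-2], result[$-1]);
		}
	}
	return result;
}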
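
As a quick sanity check of the bit arithmetic above, here is a minimal 
sketch that runs the same two-byte encoding over a known byte and then 
validates the result with std.utf.validate.  The helper name 
latin1ToUtf8 and the choice of ubyte[] for the raw input are assumptions 
made for this example; it only holds if the source bytes really are 
ISO-8859-1.

```d
import std.utf : validate;

// Hypothetical helper for illustration: treats every input byte as an
// ISO-8859-1 (Latin-1) code point and encodes it as UTF-8.
char[] latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        if (b < 0x80)
        {
            // 7-bit ASCII passes through unchanged
            result ~= cast(char) b;
        }
        else
        {
            // Code points 0x80..0xFF need two bytes: 110000xx 10xxxxxx
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result;
}

void main()
{
    // "caf" followed by 0xE9, which is e-acute in Latin-1
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9];
    char[] utf8 = latin1ToUtf8(raw);

    validate(utf8);                  // throws UTFException if malformed
    assert(utf8 == "caf\u00E9");     // 0xE9 became the bytes C3 A9
}
```

Taking ubyte[] rather than char[] for the input avoids holding bytes 
that are not valid UTF-8 in a type that D assumes to be UTF-8.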

So my question is: this conversion follows a literal interpretation of 
the UTF-8 spec, but is it the correct way to treat these characters?

Should I be taking user locale into account?  Are high-ASCII chars 
considered to be universal?
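
To make the locale question concrete, here is a minimal sketch showing 
that the same high byte can map to different Unicode code points 
depending on which 8-bit encoding the data was written in.  The 
functions fromLatin1 and fromCp1252 are hypothetical names for this 
example, the Windows-1252 table is deliberately partial, and whether the 
forum data uses either encoding is an assumption, not something known 
from the data.

```d
import std.utf : encode;

// Hypothetical: ISO-8859-1 maps each byte to the code point of equal value.
dchar fromLatin1(ubyte b)
{
    return cast(dchar) b;
}

// Hypothetical: Windows-1252 differs from Latin-1 only in 0x80..0x9F.
// Only some of those entries are listed; the rest fall through unchanged
// in this sketch, which a complete table would not do.
dchar fromCp1252(ubyte b)
{
    switch (b)
    {
        case 0x80: return '\u20AC'; // euro sign
        case 0x85: return '\u2026'; // horizontal ellipsis
        case 0x91: return '\u2018'; // left single quotation mark
        case 0x92: return '\u2019'; // right single quotation mark
        case 0x93: return '\u201C'; // left double quotation mark
        case 0x94: return '\u201D'; // right double quotation mark
        case 0x96: return '\u2013'; // en dash
        case 0x97: return '\u2014'; // em dash
        case 0x99: return '\u2122'; // trade mark sign
        default:   return cast(dchar) b; // 0xA0..0xFF match Latin-1
    }
}

void main()
{
    char[] asLatin1, asCp1252;
    encode(asLatin1, fromLatin1(0x80)); // U+0080, a C1 control character
    encode(asCp1252, fromCp1252(0x80)); // U+20AC, the euro sign

    assert(asLatin1 == "\u0080");       // two bytes: C2 80
    assert(asCp1252 == "\u20AC");       // three bytes: E2 82 AC
    assert(asLatin1 != asCp1252);
}
```

Under these assumptions, bytes in 0xA0-0xFF convert identically either 
way, so only bytes in 0x80-0x9F depend on knowing the original encoding.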

-- 
- EricAnderton at yahoo


More information about the Digitalmars-d-learn mailing list