ASCII to UTF8 Conversion - is this right?

Mon Dec 18 09:58:39 PST 2006

Pragma wrote:
> Here's something that came up recently.  As some of you may already 
> know, I've been doing some work with forum data recently.
> 
> I wanted to move some old forum data, which was stored in ASCII over to 
> UTF8 via D.  The problem is that some of the data has characters in the 
> 0x80-0xff range, which causes UTF-BOM detection to fail.
> 
> So I rolled the following function to 'transcode' these characters:
> 
> char[] ASCII2UTF8(char[] value){
>     char[] result;
>     for(uint i=0; i<value.length; i++){
>         char ch = value[i];
>         if(ch < 0x80){
>             result ~= ch;
>         }
>         else{
>             result ~= 0xC0  | (ch >> 6);
>             result ~= 0x80  | (ch & 0x3F);
>            
>             debug writefln("converted: %0.2X to %0.2X %0.2X",ch, 
> result[$-2], result[$-1]);
>         }
>     }
>     return result;
> }
> 
> So my question is, while this conversion is done against a literal 
> interpretation of the UTF-8 spec: is this the correct way to treat these 
> characters?

First, ASCII is a 7-bit encoding that only defines characters <= 0x7f. 
The meaning of the upper 128 byte values is locale dependent and cannot 
be called "ASCII". Numerous different encodings are in use for that 
upper range.
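
As a quick way to see whether a given blob of forum data really is plain 
ASCII, something along these lines should do. This is only a sketch in 
current D; the name isPureAscii is made up for illustration, and it 
assumes the data is held as raw ubyte[] rather than char[].

```d
/// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
bool isPureAscii(const(ubyte)[] data)
{
    foreach (b; data)
    {
        // Anything above 0x7F is not ASCII; its meaning depends on
        // whichever 8-bit encoding the text was written in.
        if (b > 0x7F)
            return false;
    }
    return true;
}
```

Data that passes this check is already valid UTF-8 and needs no 
conversion at all.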

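The above is correct if the source text is in Latin-1 (ISO-8859-1) 
encoding, where each byte value equals its Unicode code point. This is 
probably the most common single-byte encoding for Western Europe and 
the US. The standard Windows code page for English, 1252, matches 
Latin-1 except in the range 0x80-0x9f, where it defines printable 
characters (the euro sign, curly quotes and so on) in place of 
Latin-1's control codes.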
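
If the data turns out to come from Windows machines, the 0x80-0x9f 
range needs a lookup table rather than a straight copy. A rough sketch 
in current D follows; cp1252ToCodePoint is a hypothetical name, and 
mapping the five undefined bytes to U+FFFD is my own choice for the 
example, not something the code page mandates.

```d
/// Hypothetical helper: maps one Windows-1252 byte to a Unicode code point.
/// Bytes outside 0x80-0x9F map to themselves, exactly as in Latin-1.
dchar cp1252ToCodePoint(ubyte b)
{
    // Unicode values for bytes 0x80-0x9F. 0xFFFD marks the five bytes
    // that Windows-1252 leaves undefined (an assumption of this sketch).
    static immutable ushort[32] high = [
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
    ];

    if (b >= 0x80 && b <= 0x9F)
        return cast(dchar) high[b - 0x80];
    return cast(dchar) b;
}
```

For instance, byte 0x80 is the control code U+0080 under Latin-1 but 
the euro sign U+20AC under 1252, so the two-byte shortcut in the 
original function would give the wrong character for such data.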

> Should I be taking user locale into account?  Are high-ASCII chars 
> considered to be universal?

Rename the function Latin12UTF8 and you have something that behaves 
correctly according to spec. :)
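
For what it's worth, here is roughly how I would write the renamed 
function in current D. Treat it as a sketch: the name latin1ToUtf8 and 
the ubyte[] parameter are my own assumptions (raw Latin-1 bytes are not 
valid UTF-8, so char[] is a slightly misleading type for the input). 
The bit manipulation is the same as in your version.

```d
/// Hypothetical variant of the renamed function: treats every input
/// byte as an ISO-8859-1 code point and encodes it as UTF-8.
char[] latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length);

    foreach (b; input)
    {
        if (b < 0x80)
        {
            // ASCII range: one byte, copied unchanged.
            result ~= cast(char) b;
        }
        else
        {
            // U+0080..U+00FF: two bytes, 110000xx 10xxxxxx.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result;
}
```

With this, the Latin-1 bytes 63 61 66 E9 ("café") come out as 
63 61 66 C3 A9.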

Best regards,

/Oskar