ASCII to UTF8 Conversion - is this right?

Georg Wrede georg at nospam.org
Tue Dec 19 00:26:09 PST 2006


Pragma wrote:
> Oskar Linde wrote:
> 
>> Pragma wrote:
>>
>>> Here's something that came up recently.  As some of you may already 
>>> know, I've been doing some work with forum data.
>>>
>>> I wanted to move some old forum data, which was stored in ASCII, over 
>>> to UTF-8 via D.  The problem is that some of the data has characters 
>>> in the 0x80-0xff range, which causes UTF BOM detection to fail.
>>>
>>> So I rolled the following function to 'transcode' these characters:
>>>
>>> char[] ASCII2UTF8(char[] value){
>>>     char[] result;
>>>     for(uint i=0; i<value.length; i++){
>>>         char ch = value[i];
>>>         if(ch < 0x80){
>>>             // Plain ASCII: pass through unchanged.
>>>             result ~= ch;
>>>         }
>>>         else{
>>>             // Encode the byte as a two-byte UTF-8 sequence.
>>>             result ~= cast(char)(0xC0 | (ch >> 6));
>>>             result ~= cast(char)(0x80 | (ch & 0x3F));
>>>             debug writefln("converted: %02X to %02X %02X",
>>>                            ch, result[$-2], result[$-1]);
>>>         }
>>>     }
>>>     return result;
>>> }
>>>
>>> So my question is: while this conversion follows a literal 
>>> interpretation of the UTF-8 spec, is it the correct way to treat 
>>> these characters?
>>
>>
>> First, ASCII is a 7-bit encoding that only defines characters <= 0x7f. 
>> The encoding of the upper 128 byte values is locale dependent and 
>> cannot be called "ASCII". There are numerous different encodings in 
>> use for those upper 128 code points.
> 
> 
> Precisely the reason why I posted this. :) The 'ASCII2UTF8' name was 
> chosen for lack of a better one.  Admittedly, it's a misnomer. Same 
> goes for my use of "high-ASCII".
> 
>>
>> The above is correct if the source text is in Latin-1 (ISO-8859-1) 
>> encoding. This is probably the most common single-byte encoding for 
>> Western Europe and the US. The standard Windows charset for English, 
>> Windows-1252, is a superset of Latin-1, except that it assigns 
>> printable characters to the range 0x80-0x9f, which Latin-1 reserves 
>> for control codes.
>>
>>> Should I be taking user locale into account?  Are high-ASCII chars 
>>> considered to be universal?
>>
>>
>> Rename the function to Latin12UTF8 and you have something that 
>> behaves correctly according to spec. :)
> 
> 
> Makes sense to me.  If I can't find a way to determine which codepage 
> the forum's users were posting in for non-Latin-1 posts, I'll just try 
> Latin-1 and see what happens.
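
To sanity-check the bit math in the function above: Latin-1 0xE9 ('é') 
gives 0xC0 | (0xE9 >> 6) = 0xC3 and 0x80 | (0xE9 & 0x3F) = 0xA9, and 
0xC3 0xA9 is indeed the UTF-8 sequence for 'é'. But if the data turns 
out to be Windows-1252 rather than Latin-1, the bytes 0x80-0x9f need a 
lookup table instead, because they map to code points above U+00FF 
(0x80, for instance, is the euro sign U+20AC, which takes three UTF-8 
bytes). Here is a minimal sketch, assuming the input really is CP1252; 
the table fills in only a handful of its 32 slots for illustration, and 
the function names are made up:

// Partial Windows-1252 -> Unicode table for the bytes 0x80-0x9f.
// Only a few entries are filled in for illustration; unfilled slots
// default to dchar.init and are treated as unmapped below.
const dchar[32] cp1252High = [
    0x00: 0x20AC,  // 0x80  euro sign
    0x02: 0x201A,  // 0x82  single low-9 quotation mark
    0x05: 0x2026,  // 0x85  horizontal ellipsis
    0x13: 0x201C,  // 0x93  left double quotation mark
    0x14: 0x201D,  // 0x94  right double quotation mark
    0x19: 0x2122,  // 0x99  trade mark sign
];

// Return the UTF-8 encoding of a code point up to U+FFFF.
char[] utf8Encode(dchar c){
    char[] s;
    if(c < 0x80){
        s ~= cast(char)c;
    }
    else if(c < 0x800){
        s ~= cast(char)(0xC0 | (c >> 6));
        s ~= cast(char)(0x80 | (c & 0x3F));
    }
    else{
        s ~= cast(char)(0xE0 | (c >> 12));
        s ~= cast(char)(0x80 | ((c >> 6) & 0x3F));
        s ~= cast(char)(0x80 | (c & 0x3F));
    }
    return s;
}

char[] CP12522UTF8(char[] value){
    char[] result;
    foreach(char ch; value){
        if(ch >= 0x80 && ch <= 0x9F){
            // The range where CP1252 and Latin-1 disagree.
            dchar c = cp1252High[ch - 0x80];
            if(c == dchar.init)
                c = 0xFFFD;  // unmapped slot -> replacement character
            result ~= utf8Encode(c);
        }
        else{
            // 0x00-0x7f and 0xa0-0xff agree with Latin-1.
            result ~= utf8Encode(ch);
        }
    }
    return result;
}

The full table is in the Unicode consortium's mapping file for CP1252; 
five of the 32 slots (0x81, 0x8d, 0x8f, 0x90, 0x9d) are unassigned.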

You might also want to look at the message headers:

Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit

The Content-Type header in particular often tells you directly what the 
encoding is.
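
Here is a minimal sketch of pulling the charset parameter out of such a 
header and dispatching on it. It assumes the D1-era std.string names 
(tolower, find, strip); it is naive about quoting, whitespace, and 
other RFC 2045 details; and toUTF8 is a made-up name that leans on the 
Latin12UTF8 rename and the CP12522UTF8 sketch discussed above:

import std.string;

// Extract the charset parameter from a Content-Type header value,
// e.g. "text/plain; charset=ISO-8859-1; format=flowed" -> "iso-8859-1".
// Returns null when no charset parameter is present.
char[] charsetOf(char[] contentType){
    char[] lower = std.string.tolower(contentType);
    int i = std.string.find(lower, "charset=");
    if(i < 0)
        return null;
    char[] rest = lower[i + "charset=".length .. $];
    int j = std.string.find(rest, ";");
    return std.string.strip(j < 0 ? rest : rest[0 .. j]);
}

char[] toUTF8(char[] raw, char[] contentType){
    char[] cs = charsetOf(contentType);
    if(cs == "utf-8")
        return raw;               // already UTF-8, pass through
    if(cs == "windows-1252" || cs == "cp1252")
        return CP12522UTF8(raw);  // CP1252 sketch above
    return Latin12UTF8(raw);      // default guess: Latin-1
}

With the forum dump's headers available, something like 
toUTF8(messageBody, contentTypeHeader) should cover the Latin-1 and 
CP1252 cases and leave anything already in UTF-8 alone.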

