ASCII to UTF8 Conversion - is this right?

Pragma ericanderton at yahoo.removeme.com
Mon Dec 18 10:36:38 PST 2006


Oskar Linde wrote:
> Pragma wrote:
>> Here's something that came up recently.  As some of you may already 
>> know, I've been doing some work with forum data recently.
>>
>> I wanted to move some old forum data, which was stored in ASCII over 
>> to UTF8 via D.  The problem is that some of the data has characters in 
>> the 0x80-0xff range, which causes UTF-BOM detection to fail.
>>
>> So I rolled the following function to 'transcode' these characters:
>>
>> char[] ASCII2UTF8(char[] value){
>>     char[] result;
>>     for(uint i=0; i<value.length; i++){
>>         char ch = value[i];
>>         if(ch < 0x80){
>>             result ~= ch;
>>         }
>>         else{
>>             result ~= 0xC0  | (ch >> 6);
>>             result ~= 0x80  | (ch & 0x3F);
>>             debug writefln("converted: %0.2X to %0.2X %0.2X",ch, result[$-2], result[$-1]);
>>         }
>>     }
>>     return result;
>> }
>>
>> So my question is, while this conversion is done against a literal 
>> interpretation of the UTF-8 spec: is this the correct way to treat 
>> these characters?
> 
> First, ASCII is a 7-bit encoding that only defines characters <= 0x7f. 
> The encoding of the upper 128 byte values is locale dependent and cannot 
> be called "ASCII". There are numerous different encodings used for the 
> upper 128 code points.

Precisely the reason why I posted this. :) The 'ASCII2UTF8' name was 
taken for lack of a better title.  Admittedly, it's a misnomer. Same 
goes for my use of "high-ASCII".

> 
> The above is correct if the source text is in Latin1 (ISO-8859-1) 
> encoding. This is probably the most common single-byte encoding for 
> Western Europe and the US. The Windows English standard charset 1252 
> matches Latin1 except in the range 0x80-0x9f, where it assigns printable 
> characters instead of control codes.
> 
>> Should I be taking user locale into account?  Are high-ASCII chars 
>> considered to be universal?
> 
> Rename the function Latin12UTF8 and you have something that behaves 
> correctly according to spec. :)

Makes sense to me.  If I can't find a way to determine what codepage 
users are using in the forum for non-Latin1 posts, I'll just try Latin-1 
and see what happens.
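
For reference, here is a minimal sketch of what the renamed function might 
look like. It assumes the input really is ISO-8859-1 and is written against 
a current D compiler (D2 syntax). The name latin1ToUtf8 and the choice of 
ubyte[] for the input are my own, purely for illustration.

```d
/// Converts ISO-8859-1 (Latin-1) bytes to a UTF-8 string.
/// Hypothetical helper: the name and signature are illustrative only.
/// Assumption: every input byte is a Latin-1 code point, so byte value
/// and Unicode code point are identical (U+0000 to U+00FF).
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length);

    foreach (b; input)
    {
        if (b < 0x80)
        {
            // 7-bit ASCII maps to a single, unchanged UTF-8 byte.
            result ~= cast(char) b;
        }
        else
        {
            // U+0080 to U+00FF need two bytes: 110xxxxx 10xxxxxx.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

unittest
{
    // "café" in Latin-1: the final byte 0xE9 becomes 0xC3 0xA9.
    immutable ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    assert(latin1ToUtf8(latin1) == "caf\u00E9");
}
```

Taking ubyte[] rather than char[] keeps the not-yet-UTF-8 bytes out of a 
type that is supposed to hold valid UTF-8. Text that is actually in codepage 
1252 would still come out wrong for bytes in 0x80-0x9f, since this maps them 
to control codes.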

Thanks!

-- 
- EricAnderton at yahoo

