ASCII to UTF8 Conversion - is this right?
Georg Wrede
georg at nospam.org
Tue Dec 19 00:26:09 PST 2006
Pragma wrote:
> Oskar Linde wrote:
>
>> Pragma wrote:
>>
>>> Here's something that came up recently. As some of you may already
>>> know, I've been doing some work with forum data recently.
>>>
>>> I wanted to move some old forum data, which was stored in ASCII over
>>> to UTF8 via D. The problem is that some of the data has characters
>>> in the 0x80-0xff range, which causes UTF-BOM detection to fail.
>>>
>>> So I rolled the following function to 'transcode' these characters:
>>>
>>> char[] ASCII2UTF8(char[] value){
>>> char[] result;
>>> for(uint i=0; i<value.length; i++){
>>> char ch = value[i];
>>> if(ch < 0x80){
>>> result ~= ch;
>>> }
>>> else{
>>> result ~= 0xC0 | (ch >> 6);
>>> result ~= 0x80 | (ch & 0x3F);
>>> debug writefln("converted: %0.2X to %0.2X %0.2X",ch, result[$-2], result[$-1]);
>>> }
>>> }
>>> return result;
>>> }
>>>
>>> So my question is, while this conversion is done against a literal
>>> interpretation of the UTF-8 spec: is this the correct way to treat
>>> these characters?
>>
>>
>> First, ASCII is a 7 bit encoding that only defines characters <= 0x7f.
>> The encoding of the upper 128 byte values is locale dependent and cannot
>> be called "ASCII". There are numerous different encodings used for the
>> upper 128 code points.
>
>
> Precisely the reason why I posted this. :) The 'ASCII2UTF8' name was
> taken for lack of a better title. Admittedly, it's a misnomer. Same
> goes for my use of "high-ASCII".
>
>>
>> The above is correct if the source text is in Latin1 (ISO-8859-1)
>> coding. This is probably the most common single byte encoding for
>> Western Europe and the US. The windows english standard charset 1252
>> is a superset of latin1 and defines the range 0x80-0x9f differently.
>>
>>> Should I be taking user locale into account? Are high-ASCII chars
>>> considered to be universal?
>>
>>
>> Rename the function Latin12UTF8 and you have something that behaves
>> correctly according to spec. :)
>
>
> Makes sense to me. If I can't find a way to determine what codepage
> users are using in the forum for non-Latin1 posts, I'll just try Latin-1
> and see what happens.
You might also want to look at the message headers:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
The Content-Type header in particular often tells you directly what the
encoding is.
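As a rough illustration, here is a minimal sketch of pulling the charset
parameter out of a Content-Type value like the one above. It assumes a
current D compiler and Phobos rather than the 2006 dialect used earlier in
the thread, and extractCharset is a hypothetical helper name, not an
existing library function. It does not handle every corner of the MIME
parameter syntax (comments, continuation parameters and so on).

```d
import std.algorithm.iteration : splitter;
import std.string : indexOf, strip;
import std.uni : icmp;

/// Hypothetical helper: returns the value of the charset parameter in a
/// Content-Type header value, or null if no charset parameter is present.
string extractCharset(string contentType)
{
    // Parameters are separated by ';', e.g.
    // "text/plain; charset=ISO-8859-1; format=flowed".
    foreach (param; contentType.splitter(';'))
    {
        auto eq = param.indexOf('=');
        if (eq < 0)
            continue; // no '=': this is the media type, e.g. "text/plain"

        // Parameter names are case-insensitive, so compare with icmp.
        auto name = param[0 .. eq].strip();
        if (icmp(name, "charset") != 0)
            continue;

        auto value = param[eq + 1 .. $].strip();
        // The value may be quoted: charset="ISO-8859-1".
        if (value.length >= 2 && value[0] == '"' && value[$ - 1] == '"')
            value = value[1 .. $ - 1];
        return value;
    }
    return null;
}

void main()
{
    import std.stdio : writeln;

    writeln(extractCharset("text/plain; charset=ISO-8859-1; format=flowed"));
}
```

If the result is ISO-8859-1, the Latin-1 conversion discussed above
applies. For any other value, or when the header is missing, the bytes need
a different conversion or a guess.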