ASCII to UTF8 Conversion - is this right?
Pragma
ericanderton at yahoo.removeme.com
Mon Dec 18 10:36:38 PST 2006
Oskar Linde wrote:
> Pragma wrote:
>> Here's something that came up recently. As some of you may already
>> know, I've been doing some work with forum data recently.
>>
>> I wanted to move some old forum data, which was stored in ASCII over
>> to UTF8 via D. The problem is that some of the data has characters in
>> the 0x80-0xff range, which causes UTF-BOM detection to fail.
>>
>> So I rolled the following function to 'transcode' these characters:
>>
>> char[] ASCII2UTF8(char[] value){
>>     char[] result;
>>     for(uint i=0; i<value.length; i++){
>>         char ch = value[i];
>>         if(ch < 0x80){
>>             result ~= ch;
>>         }
>>         else{
>>             result ~= 0xC0 | (ch >> 6);
>>             result ~= 0x80 | (ch & 0x3F);
>>             debug writefln("converted: %0.2X to %0.2X %0.2X",
>>                 ch, result[$-2], result[$-1]);
>>         }
>>     }
>>     return result;
>> }
>>
>> So my question is, while this conversion is done against a literal
>> interpretation of the UTF-8 spec: is this the correct way to treat
>> these characters?
>
> First, ASCII is a 7-bit encoding that only defines characters <= 0x7f.
> The encoding of the upper 128 byte values is locale dependent and cannot
> be called "ASCII". There are numerous different encodings used for the
> upper 128 code points.
Precisely the reason why I posted this. :) The 'ASCII2UTF8' name was
taken for lack of a better title. Admittedly, it's a misnomer. Same
goes for my use of "high-ASCII".
>
> The above is correct if the source text is in Latin-1 (ISO-8859-1)
> encoding. This is probably the most common single-byte encoding for
> Western Europe and the US. Windows code page 1252, the standard charset
> on English-language Windows, agrees with Latin-1 everywhere except the
> range 0x80-0x9f, where it defines printable characters instead of
> control codes.
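To make the 0x80-0x9f difference concrete, here is a rough sketch in current D (not the D of this thread) of what decoding code page 1252 rather than Latin-1 could look like. The names cp1252High and cp1252ToUtf8 are made up for illustration. The table is the usual Windows-1252 assignment for that range; replacing the five unassigned bytes with U+FFFD is my own assumption, not something settled here.

```d
import std.utf : encode;

// Code points that Windows-1252 assigns to bytes 0x80..0x9F.
// 0 marks the five byte values that code page 1252 leaves undefined.
immutable uint[32] cp1252High = [
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178,
];

// Hypothetical helper: decode Windows-1252 bytes and re-encode as UTF-8.
string cp1252ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        // Outside 0x80..0x9F the byte value is the code point,
        // exactly as in Latin-1.
        dchar cp = cast(dchar) b;
        if (b >= 0x80 && b <= 0x9F)
        {
            cp = cast(dchar) cp1252High[b - 0x80];
            if (cp == 0)
                cp = cast(dchar) 0xFFFD; // assumption: undefined -> U+FFFD
        }
        encode(result, cp); // appends the UTF-8 encoding of cp
    }
    return result.idup;
}

void main()
{
    ubyte[] euro = [0x80];   // Euro sign in code page 1252
    ubyte[] eAcute = [0xE9]; // same meaning in Latin-1 and 1252
    assert(cp1252ToUtf8(euro) == "\u20AC");
    assert(cp1252ToUtf8(eAcute) == "\u00E9");
}
```

Treating the same 0x80 byte as Latin-1 would give the control code U+0080 rather than the Euro sign, which is the kind of mismatch to look for in the converted posts.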
>
>> Should I be taking user locale into account? Are high-ASCII chars
>> considered to be universal?
>
> Rename the function Latin12UTF8 and you have something that behaves
> correctly according to spec. :)
Makes sense to me. If I can't find a way to determine what codepage
users are using in the forum for non-Latin1 posts, I'll just try Latin-1
and see what happens.
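
For reference, a minimal sketch of how the renamed function might look in current D. It assumes the input arrives as raw bytes (ubyte) rather than char[], since the data is not valid UTF-8 yet; the name latin1ToUtf8 is just a placeholder. The bit manipulation is the same as in the function I posted.

```d
import std.stdio : writeln;

// Hypothetical helper: transcode ISO-8859-1 (Latin-1) bytes to UTF-8.
// Every Latin-1 byte value is equal to its Unicode code point.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length);
    foreach (b; input)
    {
        if (b < 0x80)
        {
            // 7-bit ASCII is copied through unchanged.
            result ~= cast(char) b;
        }
        else
        {
            // U+0080..U+00FF take two bytes: 110000xx 10xxxxxx.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

void main()
{
    // 0xE9 is e-acute in Latin-1; in UTF-8 it becomes the bytes C3 A9.
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9];
    string s = latin1ToUtf8(raw);
    assert(s == "caf\u00E9");
    writeln(s);
}
```

This only gives sensible text if the posts really are Latin-1; bytes in 0x80-0x9f come out as control codes.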
Thanks!
--
- EricAnderton at yahoo
More information about the Digitalmars-d-learn
mailing list