Implicit encoding conversion on string ~= int ?

Marco Leise Marco.Leise at gmx.de
Sun Jun 23 10:33:31 PDT 2013


Am Sun, 23 Jun 2013 19:12:21 +0200
schrieb Marco Leise <Marco.Leise at gmx.de>:

> Am Sun, 23 Jun 2013 18:37:16 +0200
> schrieb "bearophile" <bearophileHUGS at lycos.com>:
> 
> > Adam D. Ruppe:
> > 
> > > char[] a;
> > > int b = 1000;
> > > a ~= b;
> > >
> > > the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar 
> > > -> char means it may be multibyte encoded, going from utf-32 to 
> > > utf-8.
> 
> No no no, this is not what happens. In my case it was:
> string a;
> int b = 228;  // CP850 value for 'ä'. Note: fits in a single byte!
> a ~= b;
> 
> Maybe it goes as follows:
> o compiler sees ~= to a string and becomes "aware" of wchar and dchar
>   conversions to char
> o appended value is only checked for size (type and signedness are lost)
>   and maps int to dchar
> o this dchar value is now checked for Unicode conformity and fails the test
> o the dchar value is now assumed to be Latin-1, Windows-1252 or similar
>   and a conversion routine invoked
> o the dchar value is converted to utf-8 and...
> o appended as a multi-byte string to variable "a".
> 
> That still doesn't sound right to me though. What if the dchar value is
> not valid Unicode AND >= 256 ?

Actually you were 100% right, Adam. I was distracted by the
fact that the source was CP850.
UTF-32 maps all of Latin-1 1:1, and most of CP850 shares its
codes with Latin-1. So yes, all the compiler was doing was
appending a dchar value.
And with char/ubyte I do find it convenient to mix them
without casting, e.g. "if (someChar < 0x80)" and similar code.
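
To make that concrete, here is a minimal sketch of the behaviour
as I understand it now. The explicit cast(dchar) stands in for the
implicit conversion the compiler performed on "a ~= b"; exact rules
may differ between compiler versions, and the variable names (s,
someChar) are mine:

import std.stdio;

void main()
{
    // The case from this thread: 228 is the code point U+00E4 ('ä').
    string a;
    int b = 228;
    a ~= cast(dchar) b;     // what "a ~= b" effectively did
    assert(a.length == 2);  // UTF-8 encodes U+00E4 as 0xC3 0xA4
    assert(a == "ä");

    // Adam's example: 1000 is U+03E8, also two code units in UTF-8.
    string s;
    s ~= cast(dchar) 1000;
    assert(s.length == 2);  // 0xCF 0xA8

    // Mixing char and integer values without a cast, as mentioned above:
    char someChar = a[0];   // first code unit of "ä", 0xC3
    if (someChar < 0x80)
        writeln("single-byte (ASCII) code unit");
    else
        writeln("part of a multi-byte sequence");
}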

As confusing as it was for me, I agree with "WONT FIX".

-- 
Marco


