Implicit encoding conversion on string ~= int ?
Marco Leise
Marco.Leise at gmx.de
Sun Jun 23 10:33:31 PDT 2013
Am Sun, 23 Jun 2013 19:12:21 +0200
schrieb Marco Leise <Marco.Leise at gmx.de>:
> Am Sun, 23 Jun 2013 18:37:16 +0200
> schrieb "bearophile" <bearophileHUGS at lycos.com>:
>
> > Adam D. Ruppe:
> >
> > > char[] a;
> > > int b = 1000;
> > > a ~= b;
> > >
> > > the "a ~= b" is more like "a ~= cast(dchar) b", and then dchar
> > > -> char means it may be multibyte encoded, going from utf-32 to
> > > utf-8.
>
> No no no, this is not what happens. In my case it was:
> string a;
> int b = 228; // CP850 value for 'ä'. Note: fits in a single byte!
> a ~= b;
>
> Maybe it goes as follows:
> o compiler sees ~= to a string and becomes "aware" of wchar and dchar
> conversions to char
> > o the appended value is only checked for size (type and signedness are
> >   lost) and the int is mapped to dchar
> o this dchar value is now checked for Unicode conformity and fails the test
> > o the dchar value is now assumed to be Latin-1, Windows-1252 or similar,
> >   and a conversion routine is invoked
> o the dchar value is converted to utf-8 and...
> o appended as a multi-byte string to variable "a".
>
> That still doesn't sound right to me, though. What if the dchar value is
> not valid Unicode AND >= 256 ?
Actually you were 100% right, Adam. I was distracted by the
fact that the source was CP850.
UTF-32 maps all of Latin-1 in a 1:1 correspondence, and most of
CP850 shares its codes with Latin-1. So yes, all the compiler
was doing was appending a dchar value.
And with char/ubyte I do find it convenient to mix them
without casting. E.g. "if (someChar < 0x80)" and similar code.
As confusing as it was for me, I agree with "WONT FIX".
--
Marco