Implicit encoding conversion on string ~= int ?

Adam D. Ruppe destructionator at gmail.com
Sun Jun 23 10:32:29 PDT 2013


On Sunday, 23 June 2013 at 17:12:41 UTC, Marco Leise wrote:
> int b = 228;  // CP850 value for 'ä'. Note: fits in a single 
> byte!

228 (E4 in hex) is also the Unicode code point for ä, which 
becomes the two bytes [195, 164] when encoded as UTF-8. See: 
http://www.utf8-chartable.de/unicode-utf8-table.pl?number=512&utf8=dec
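
To check those two bytes yourself, here is a minimal sketch using 
std.utf.encode from Phobos. The helper name utf8Bytes is made up 
for this illustration; it is not part of any library.

```d
import std.stdio : writeln;
import std.utf : encode;

// Hypothetical helper (not a Phobos function): returns the UTF-8
// code units for a single code point as raw bytes.
ubyte[] utf8Bytes(dchar c)
{
    char[4] buf;                     // UTF-8 needs at most 4 code units
    immutable len = encode(buf, c);  // number of code units written
    return cast(ubyte[]) buf[0 .. len].dup;
}

void main()
{
    // Code point 228 (U+00E4, 'ä') takes two bytes in UTF-8.
    auto bytes = utf8Bytes(cast(dchar) 228);
    assert(bytes.length == 2);
    assert(bytes[0] == 195 && bytes[1] == 164);

    // An ASCII code point stays a single byte.
    assert(utf8Bytes('a').length == 1);

    writeln(bytes);
}
```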

While the number 228 would normally fit in a single byte, UTF-8 
reserves the high bit of each byte to mark it as part of a 
multibyte sequence (this is what keeps it ASCII compatible), so 
any code point > 127 is always encoded as a multibyte sequence in 
UTF-8. See: 
http://en.wikipedia.org/wiki/UTF-8#Description
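
The marker bits can be worked out by hand. The sketch below assumes 
the two-byte layout 110xxxxx 10xxxxxx from that description, which 
covers code points 0x80 through 0x7FF. The function encodeTwoByte 
is a made-up name for illustration only.

```d
import std.stdio : writeln;

// Hypothetical helper: hand-encodes a code point in the two-byte
// UTF-8 range (0x80 .. 0x7FF) as 110xxxxx 10xxxxxx.
ubyte[2] encodeTwoByte(dchar c)
{
    assert(c >= 0x80 && c <= 0x7FF, "only the two-byte range is handled");
    ubyte lead = cast(ubyte)(0xC0 | (c >> 6));    // 110 marker + top 5 bits
    ubyte cont = cast(ubyte)(0x80 | (c & 0x3F));  // 10 marker + low 6 bits
    return [lead, cont];
}

void main()
{
    // 228 is 0b11100100: top bits 00011, low six bits 100100.
    auto bytes = encodeTwoByte(cast(dchar) 228);
    assert(bytes[0] == 195);  // 0b11000011
    assert(bytes[1] == 164);  // 0b10100100
    writeln(bytes);
}
```

Both resulting bytes have their high bit set, which is why neither 
can be mistaken for an ASCII character.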

