Character is only first byte of an UTF-8 sequence
Nikita Kalaganov
riven-mage at id.ru
Mon Sep 3 17:57:57 PDT 2007
> http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD
IMHO, the solution is simple: the compiler and libraries must treat chars as
complete code points.
So "char" can represent code points 0x20-0xFF (Basic Latin and Latin-1
Supplement), "wchar" can represent code points 0x20-0xFFFF (the complete Basic
Multilingual Plane), and "dchar" can represent all code points (including the
supplementary planes).
If your program is 100% Latin, use char[]. For multi-language programs, use
wchar[]. Use dchar[] for exotics :)
Conversions from char[] to wchar[]/dchar[] and from wchar[] to dchar[] are
implicit. Reverse conversions are not always possible (*).
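
A rough sketch of why the reverse direction can fail under this proposal: a dchar[] can only be narrowed to the proposed one-code-point-per-char char[] if every code point is at most 0xFF. The helper name fitsInChar is hypothetical and is not part of D or Phobos; it only illustrates the check that a narrowing conversion would need.

```d
// Hypothetical helper (not in Phobos): true if every code point would fit
// into the proposed "char" range, i.e. is no larger than 0xFF.
bool fitsInChar(const(dchar)[] text)
{
    foreach (dchar c; text)
    {
        if (c > 0xFF)
            return false; // outside Latin-1, cannot be narrowed
    }
    return true;
}

void main()
{
    assert(fitsInChar("café"d)); // U+00E9 is within Latin-1
    assert(!fitsInChar("Ω"d));   // U+03A9 is not
}
```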
Main problems solved:
1. Slice-able strings.
2. The length property holds the real length of the string.
3. Printable.
4. Easy to understand :)
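
To make points 1 and 2 concrete, here is a small sketch of how D behaves as it stands, where char[] holds UTF-8 code units and dchar[] holds whole code points. It assumes a current D compiler and a UTF-8 source file; the literal "naïve" is just an illustrative choice.

```d
import std.stdio;

void main()
{
    string  s = "naïve";  // 'ï' (U+00EF) occupies two UTF-8 code units
    dstring d = "naïve"d; // one dchar per code point

    assert(s.length == 6); // counts code units, not characters
    assert(d.length == 5); // counts code points

    // Indexing the UTF-8 string yields only the first byte of 'ï'.
    assert(s[2] == 0xC3);
    // Indexing the dchar string yields the complete code point.
    assert(d[2] == 0x00EF);

    // Slicing a dstring always falls on code-point boundaries.
    writeln(d[0 .. 3]);
}
```

Under the proposal, char[] and wchar[] would behave the way dchar[] does here, within their respective ranges.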
All conversions to and from UTF-8, UTF-16 and UTF-32 should be explicit.
The price is (*).
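
For comparison, explicit conversion between the three encodings can already be written with the toUTF8, toUTF16 and toUTF32 functions from Phobos' std.utf. This is only a sketch of what "explicit" could look like, assuming a current Phobos; the sample text is arbitrary.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    string  utf8  = "Привет";      // six Cyrillic letters, two bytes each in UTF-8
    wstring utf16 = toUTF16(utf8); // explicit UTF-8 -> UTF-16
    dstring utf32 = toUTF32(utf8); // explicit UTF-8 -> UTF-32

    assert(utf8.length  == 12); // UTF-8 code units
    assert(utf16.length == 6);  // UTF-16 code units (all in the BMP)
    assert(utf32.length == 6);  // code points

    // Converting back is explicit as well and round-trips losslessly.
    assert(toUTF8(utf32) == utf8);
}
```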