Character is only first byte of an UTF-8 sequence

Nikita Kalaganov riven-mage at id.ru
Mon Sep 3 17:57:57 PDT 2007


> http://www.wikiservice.at/d/wiki.cgi?DanielKeep/TextInD

And, IMHO, the solution is simple: chars must be treated by the compiler
and the libraries as complete code points.

So "char" would represent code points 0x20-0xFF (Basic Latin and Latin-1
Supplement), "wchar" code points 0x20-0xFFFF (the complete Basic
Multilingual Plane), and "dchar" all code points (including the
supplementary planes).

If your program is 100% Latin, use char[]. For multi-language programs, use
wchar[]. Use dchar[] for exotics :)

Conversion from char[] to wchar[]/dchar[] and from wchar[] to dchar[] is
implicit. Reverse conversions are not always possible (*).
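
To illustrate (*), here is a rough sketch of what a narrowing conversion
would have to do under this proposal. tryNarrow is a hypothetical helper,
not an existing library function, and ubyte stands in for the proposed
Latin-1 "char", because char in today's D is a UTF-8 code unit.

```d
import std.stdio;

// Hypothetical helper: narrow code points to the proposed Latin-1 "char"
// range (ubyte is used as a stand-in). Returns false as soon as a code
// point does not fit, which is the (*) case.
bool tryNarrow(const(dchar)[] src, out ubyte[] dst)
{
    dst = new ubyte[src.length];
    foreach (i, c; src)
    {
        if (c > 0xFF)
            return false;       // not representable in one byte
        dst[i] = cast(ubyte) c;
    }
    return true;
}

void main()
{
    ubyte[] narrow;
    writeln(tryNarrow("caf\u00E9"d, narrow)); // true: U+00E9 fits
    writeln(tryNarrow("\u0436"d, narrow));    // false: U+0436 is Cyrillic
}
```

Widening never needs such a check, which is why it could be implicit.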

Main problems solved:
1. Slice-able strings.
2. The length property gives the real "length" of the string.
3. Printable.
4. Easy to understand :)
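
For points 1 and 2, a small sketch of the difference, assuming a current D
compiler where string is UTF-8 and dstring is UTF-32. The dstring half
shows the behaviour the proposal asks for from every character type.

```d
import std.stdio;

void main()
{
    // Today: a string is a sequence of UTF-8 code units,
    // and each of these Cyrillic letters takes two of them.
    string s = "жук";
    writeln(s.length);      // 6 code units, not 3 characters

    // Each dstring element is a whole code point.
    dstring d = "жук"d;
    writeln(d.length);      // 3
    writeln(d[0 .. 1]);     // ж (the slice falls on a code point boundary)
}
```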

All conversions to and from UTF-8, UTF-16 and UTF-32 should be explicit.
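
As a sketch of what "explicit" could look like, Phobos already has
std.utf.toUTF8, toUTF16 and toUTF32. The example assumes a current D
compiler; how these functions would fit the proposed char types is left
open here.

```d
import std.stdio;
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    dstring text = "жук"d;          // three code points

    string  u8  = toUTF8(text);     // explicit encode to UTF-8
    wstring u16 = toUTF16(text);    // explicit encode to UTF-16

    writeln(u8.length);             // 6 code units
    writeln(u16.length);            // 3 code units

    // Decoding back is explicit as well and round-trips.
    assert(toUTF32(u8) == text);
}
```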

The price is (*).


