What is the legal range of chars?

Jonathan M Davis jmdavisProg at gmx.com
Wed Jun 19 10:48:25 PDT 2013


On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> > Hum... well, that's true for UTF-8 strings, if the _codeunit_
> > 0xe7 appears, it is not 'ç'.
> > 
> > But when handling a 'char', there is no encoding, it "should"
> > be raw _codepoint_.
> 
> No, char is a UTF8 code unit.
> Code unit and code point become synonymous in UTF32, so dchar is
> a code point.

Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is 
the only case where a code unit is guaranteed to be a code point. For both 
char (UTF-8) and wchar (UTF-16), the number of code units per code point is 
variable, and in the case of UTF-8, any code point which isn't an ASCII 
character takes multiple code units. Wikipedia and TDPL both have a nice chart 
showing the valid values for UTF-8 and how many code units are in a code point 
for each set of values:

http://en.wikipedia.org/wiki/UTF-8#Description
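
For a quick way to check the numbers, here is a rough sketch. It is written in
Python rather than D, purely as an illustration, and the helper names
code_units and utf8_length_from_lead are made up for this example. It counts
how many code units one code point needs in each encoding, and shows what a
single UTF-8 code unit such as 0xE7 implies when it appears as a lead byte.

```python
# Illustration only (Python, not D): code units per code point.

def code_units(ch: str) -> dict:
    """Count the code units needed to encode one code point."""
    return {
        "utf8": len(ch.encode("utf-8")),            # 1 byte per code unit (D: char)
        "utf16": len(ch.encode("utf-16-le")) // 2,  # 2 bytes per code unit (D: wchar)
        "utf32": len(ch.encode("utf-32-le")) // 4,  # 4 bytes per code unit (D: dchar)
    }


def utf8_length_from_lead(byte: int) -> int:
    """Sequence length implied by a UTF-8 lead byte, or 0 if the byte
    cannot start a sequence."""
    if byte <= 0x7F:
        return 1  # ASCII: one code unit is the whole code point
    if 0xC2 <= byte <= 0xDF:
        return 2
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF4:
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF never
    # appear in valid UTF-8.
    return 0


if __name__ == "__main__":
    for ch in ("a", "\u00e7", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}", code_units(ch))
    # U+00E7 is encoded in UTF-8 as the two code units 0xC3 0xA7.
    print("\u00e7".encode("utf-8").hex())
    # The single code unit 0xE7, by contrast, starts a 3-unit sequence.
    print(utf8_length_from_lead(0xE7))
```

This matches the point made earlier in the thread: the code point U+00E7 fits
in a single wchar or dchar, but as a lone char the value 0xE7 is only the
start of a longer sequence.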

- Jonathan M Davis

