What is the legal range of chars?
Jonathan M Davis
jmdavisProg at gmx.com
Wed Jun 19 10:48:25 PDT 2013
On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
> > Hum... well, that's true for UTF-8 strings, if the _codeunit_
> > 0xe7 appears, it is not 'ç'.
> >
> > But when handling a 'char', there is no encoding, it "should"
> > be raw _codepoint_.
>
> No, char is a UTF8 code unit.
> Code unit and code point become synonymous in UTF32, so dchar is
> a code point.
Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is
the only case where a code unit is guaranteed to be a code point. For both
char (UTF-8) and wchar (UTF-16), the number of code units in a code point
varies: one to four for UTF-8 and one or two for UTF-16. In UTF-8, any code
point which isn't an ASCII character takes multiple code units. Wikipedia and
TDPL both have a nice chart showing the valid values for UTF-8 and how many
code units are in a code point for each set of values:
http://en.wikipedia.org/wiki/UTF-8#Description
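
To make the code unit vs. code point distinction concrete, here is a minimal
sketch (not from the original thread) that counts code units for the same code
points in each of the three string types. The choice of U+00E7 ('ç') follows
the example quoted above; U+1F600 is an arbitrary code point outside the BMP,
picked only for illustration. It assumes nothing beyond the built-in string
types.

```d
void main()
{
    // U+00E7 ('ç') is not ASCII, so UTF-8 needs two code units for it.
    string s = "\u00E7";
    assert(s.length == 2);                  // .length counts code units (chars)
    assert(s[0] == 0xC3 && s[1] == 0xA7);   // neither char is 0xE7

    // In UTF-16 and UTF-32 the same code point fits in one code unit.
    wstring w = "\u00E7"w;
    dstring d = "\u00E7"d;
    assert(w.length == 1);
    assert(d.length == 1 && d[0] == 0xE7);  // a dchar holds the code point itself

    // A code point outside the BMP: four chars, two wchars, one dchar.
    assert("\U0001F600".length == 4);
    assert("\U0001F600"w.length == 2);
    assert("\U0001F600"d.length == 1);
}
```

So a lone char with the value 0xE7 is not 'ç'; it is only a fragment of a
multi-byte UTF-8 sequence, and not a valid one on its own.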
- Jonathan M Davis