What is the legal range of chars?

monarch_dodra monarchdodra at gmail.com
Wed Jun 19 12:22:00 PDT 2013


On Wednesday, 19 June 2013 at 17:48:49 UTC, Jonathan M Davis wrote:
> On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
>> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra wrote:
>> > Hum... well, that's true for UTF-8 strings, if the _codeunit_
>> > 0xe7 appears, it is not 'ç'.
>> > 
>> > But when handling a 'char', there is no encoding, it "should"
>> > be raw _codepoint_.
>> 
>> No, char is a UTF8 code unit.
>> Code unit and code point become synonymous in UTF32, so dchar is
>> a code point.
>
> Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is
> the only case where a code unit is guaranteed to be a code point. For both
> char (UTF-8) and wchar (UTF-16), the number of code units in a code point is
> variable, and in the case of UTF-8, any code point which isn't an ASCII
> character is multiple code units. Wikipedia and TDPL both have a nice chart
> showing the valid values for UTF-8 and how many code units are in a code point
> for each set of values:
>
> http://en.wikipedia.org/wiki/UTF-8#Description
>
> - Jonathan M Davis

Well, there is still ambiguity when you have a standalone char: it 
may be holding a (partially truncated) code point value, or only 
part of a code point, i.e. one code unit of a longer sequence.

If I write:
     char  c = '\xDF'; // 0b11011111: lead byte of a 2-byte UTF-8 sequence
     wchar w = 'ß';    // 0b11011111: U+00DF
     assert(c == w);

The assert passes, because both sides compare as the raw value 
0xDF. Yet 'c' is only the lead byte of a 2-byte UTF-8 sequence, 
not 'ß'.
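
To make the distinction concrete, here is a minimal sketch (not from the 
earlier messages). It assumes Phobos' std.utf.validate, which throws 
UTFException on malformed input. The variable names are only for 
illustration.

```d
import std.utf : UTFException, validate;

void main()
{
    // U+00DF ('ß') encoded as UTF-8 takes two code units: 0xC3 0x9F.
    string s = "\u00DF";
    assert(s.length == 2);
    assert(s[0] == 0xC3 && s[1] == 0x9F);

    // In UTF-16 it is a single code unit whose value equals the code point.
    wstring ws = "\u00DF"w;
    assert(ws.length == 1 && ws[0] == 0x00DF);

    // A lone 0xDF is a 2-byte lead byte with no continuation byte,
    // so on its own it is not valid UTF-8.
    char[] lone = [cast(char) 0xDF];
    bool rejected = false;
    try
        validate(lone);
    catch (UTFException e)
        rejected = true;
    assert(rejected);
}
```

So the numeric equality above says nothing about the text the two values 
represent: the wchar is a complete 'ß', while the char is an incomplete 
sequence.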

In any case, this conversation gave me the answers I was looking 
for in the context of the original question.

