What is the legal range of chars?
monarch_dodra
monarchdodra at gmail.com
Wed Jun 19 12:22:00 PDT 2013
On Wednesday, 19 June 2013 at 17:48:49 UTC, Jonathan M Davis
wrote:
> On Wednesday, June 19, 2013 19:02:55 anonymous wrote:
>> On Wednesday, 19 June 2013 at 16:54:01 UTC, monarch_dodra
>> wrote:
>> > Hum... well, that's true for UTF-8 strings, if the _codeunit_
>> > 0xe7 appears, it is not 'ç'.
>> >
>> > But when handling a 'char', there is no encoding, it "should"
>> > be raw _codepoint_.
>>
>> No, char is a UTF8 code unit.
>> Code unit and code point become synonymous in UTF32, so dchar is
>> a code point.
>
> Exactly. char, wchar, and dchar are all code _units_, and dchar (UTF-32) is
> the only case where a code unit is guaranteed to be a code point. For both
> char (UTF-8) and wchar (UTF-16), the number of code units in a code point is
> variable, and in the case of UTF-8, any code point which isn't an ASCII
> character is multiple code units. Wikipedia and TDPL both have a nice chart
> showing the valid values for UTF-8 and how many code units are in a code point
> for each set of values:
>
> http://en.wikipedia.org/wiki/UTF-8#Description
>
> - Jonathan M Davis
Well, there is still ambiguity when you have a standalone char: it
could be holding a complete code point, or just one code unit of a
(partially truncated) multi-unit sequence.
If I write:
char c = '\xDF'; // 0b11011111: lead byte of a 2-byte UTF-8 sequence
wchar w = 'ß';   // 0b11011111: the code point U+00DF
assert(c == w);
The assert passes. Yet 'c' is just the lead byte of a 2-byte
sequence, and not 'ß'.
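To make the difference visible, here is a minimal sketch that compares the
lone code unit against the actual UTF-8 encoding of 'ß'. It assumes
std.utf.encode and std.utf.validate behave as documented in Phobos;
isValidUtf8 is a helper name I made up for illustration, not something
from the library.

```d
import std.utf : encode, validate, UTFException;

// Hypothetical helper (not part of Phobos): true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    char c = '\xDF'; // code unit 0xDF: lead byte of a 2-byte UTF-8 sequence
    wchar w = 'ß';   // code unit 0x00DF: the whole code point U+00DF in UTF-16

    // Both operands are compared by integer value, so this passes.
    assert(c == w);

    // Encoding U+00DF as UTF-8 takes two code units: 0xC3 0x9F.
    char[4] buf;
    immutable len = encode(buf, 'ß');
    assert(len == 2);
    assert(buf[0] == 0xC3 && buf[1] == 0x9F);

    // A lone 0xDF is a truncated sequence, so it is not valid UTF-8 by itself,
    // while the two-unit encoding is.
    char[1] lone = [c];
    assert(!isValidUtf8(lone[]));
    assert(isValidUtf8(buf[0 .. len]));
}
```

So the equality only says the two code units have the same numeric value,
not that they denote the same character.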
In any case, this conversation gave me the answers I was looking
for in the context of the original question.