Why string alias is invariant ?

Fri Feb 1 00:39:07 PST 2008

On Feb 1, 2008 7:50 AM, Janice Caron <caron800 at googlemail.com> wrote:
> So you could uppercase U+00DF (to
> itself) using simple-casing.

I'm obviously telling you stuff you already know; I apologise. I
would imagine that normalisation forms probably also complicate full
casing.

One interesting thing is that simple casing also works just fine for
wchar (that is, UTF-16). That's because nearly every letter of every
living language is found in the Basic Multilingual Plane (the range
U+0000 to U+FFFF). Codepoints outside this range are mostly symbols,
letters of dead languages, combining characters and the like, so it's
probably safe to leave all non-BMP codepoints unchanged when
case-changing. Every BMP codepoint occupies a single UTF-16 code
unit, so simple casing can map one code unit to one code unit.
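
A rough sketch of that idea in Python, working directly on UTF-16
code units. The helper names (utf16_units, units_to_str,
simple_upper_utf16) are made up for illustration. Python's
str.upper() does full casing, so the sketch only approximates simple
casing by accepting one-to-one results and leaving everything else
alone.

```python
import struct

def utf16_units(s):
    """Encode a str as a list of UTF-16 code units (ints)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def units_to_str(units):
    """Decode a list of UTF-16 code units back into a str."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")

def simple_upper_unit(u):
    # Surrogates (0xD800-0xDFFF) are halves of non-BMP codepoints:
    # leave them unchanged, as suggested above.
    if 0xD800 <= u <= 0xDFFF:
        return u
    up = chr(u).upper()
    # Assumption: treat a mapping as "simple" only if it is one-to-one
    # and stays in the BMP. U+00DF uppercases to "SS" under full
    # casing, so it is left as itself here.
    if len(up) == 1 and ord(up) <= 0xFFFF:
        return ord(up)
    return u

def simple_upper_utf16(units):
    """Uppercase code unit by code unit; the length never changes."""
    return [simple_upper_unit(u) for u in units]

print(units_to_str(simple_upper_utf16(utf16_units("straße"))))  # STRAßE
```

Because each code unit maps to exactly one code unit, the conversion
could be done in place on a mutable buffer. The price is that the few
cased scripts outside the BMP (Deseret, for instance) come through
unchanged.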

In many languages (e.g. Chinese), a UTF-16 string is likely to be
shorter than the corresponding UTF-8 string: a typical CJK character
takes three bytes in UTF-8 but only two in UTF-16. (ASCII-heavy text
goes the other way, at one byte against two.) This makes me suspect
that UTF-16 may well be the ideal choice for string representation in
the real world. (It's what Java went with.) Maybe UTF-16, not UTF-8,
should be the default kind of string?
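
The size comparison is easy to check. A small Python sketch (the
function name encoded_sizes and the sample strings are just for
illustration) that counts the bytes each encoding needs, without a
byte-order mark:

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

print(encoded_sizes("中文字符串"))  # (15, 10): five BMP CJK characters
print(encoded_sizes("hello"))      # (5, 10): five ASCII characters
```

So which encoding is more compact depends on the text: UTF-16 wins
for Chinese, UTF-8 wins for ASCII.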