The Case Against Autodecode
tsbockman via Digitalmars-d
digitalmars-d at puremagic.com
Thu Jun 2 14:00:17 PDT 2016
On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu
wrote:
> On 06/02/2016 04:47 PM, tsbockman wrote:
>> That doesn't sound like much of an endorsement for defaulting to only
>> level 1 support to me - "it does not handle more complex languages or
>> extensions to the Unicode Standard very well".
>
> Code point/Level 1 support sounds like a sweet spot between
> efficiency/complexity and conviviality. Level 2 is opt-in with
> byGrapheme. -- Andrei
Actually, according to the document Walter Bright linked, level 1
does NOT operate at the code point level:
> Level 1: Basic Unicode Support. At this level, the regular
> expression engine provides support for Unicode characters as
> basic 16-bit logical units. (This is independent of the actual
> serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or
> UTF-32.)
> ...
> Level 1 support works well in many circumstances. However, it
> does not handle more complex languages or extensions to the
> Unicode Standard very well. Particularly important cases are
> **surrogates** ...
So level 1 appears to operate on UTF-16 code units, not code
points. To work at the code point level it would have to recognize
surrogate pairs, which are specifically listed as not supported.
Level 2 skips straight to graphemes, so the document defines no
code point level at all.
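To make the three levels concrete, here is a rough D sketch that counts the same text as UTF-16 code units, as code points, and as graphemes. The helper names (codeUnits16, codePoints, graphemes) are made up for illustration. It only shows how the counts differ in Phobos, assuming the current autodecoding behaviour of ranges over wstring. It says nothing about what any particular regex engine does.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of UTF-16 code units, i.e. what an engine
// working on "basic 16-bit logical units" would see.
size_t codeUnits16(wstring s) { return s.length; }

// Hypothetical helper: number of code points. Range primitives on a
// wstring autodecode to dchar, so surrogate pairs are combined.
size_t codePoints(wstring s) { return s.walkLength; }

// Hypothetical helper: number of graphemes, the opt-in level via byGrapheme.
size_t graphemes(wstring s) { return s.byGrapheme.walkLength; }

void main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16
    // needs a surrogate pair to encode it.
    wstring astral = "\U0001D11E"w;
    assert(codeUnits16(astral) == 2); // two surrogates
    assert(codePoints(astral) == 1);  // one code point
    assert(graphemes(astral) == 1);

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points
    // that form a single user-perceived character.
    wstring combining = "e\u0301"w;
    assert(codeUnits16(combining) == 2);
    assert(codePoints(combining) == 2);
    assert(graphemes(combining) == 1);
}
```

The surrogate pair is where code units and code points disagree, and the combining sequence is where code points and graphemes disagree.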
However, this document is very old - from Unicode 3.0 and the
year 2000:
> While there are no surrogate characters in Unicode 3.0 (outside
> of private use characters), future versions of Unicode will
> contain them...
Perhaps level 1 has since been redefined?