The Case Against Autodecode

Thu Jun 2 14:00:17 PDT 2016

On Thursday, 2 June 2016 at 20:49:52 UTC, Andrei Alexandrescu 
wrote:
> On 06/02/2016 04:47 PM, tsbockman wrote:
>> That doesn't sound like much of an endorsement for defaulting to only
>> level 1 support to me - "it does not handle more complex languages or
>> extensions to the Unicode Standard very well".
>
> Code point/Level 1 support sounds like a sweet spot between 
> efficiency/complexity and conviviality. Level 2 is opt-in with 
> byGrapheme. -- Andrei

Actually, according to the document Walter Bright linked, level 1 
does NOT operate at the code point level:

> Level 1: Basic Unicode Support. At this level, the regular 
> expression engine provides support for Unicode characters as 
> basic 16-bit logical units. (This is independent of the actual 
> serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, or 
> UTF-32.)
> ...
> Level 1 support works well in many circumstances. However, it 
> does not handle more complex languages or extensions to the 
> Unicode Standard very well. Particularly important cases are 
> **surrogates** ...

So, level 1 appears to be UTF-16 code units, not code points. To 
do code points it would have to recognize surrogates, which are 
specifically mentioned as not supported.
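To make the distinction concrete, here is a rough sketch in Python (chosen only because it is easy to run; the helper names `utf16_code_units` and `code_points` are made up for illustration and do not come from the linked document). It compares the number of 16-bit units in the UTF-16 encoding of a string with its number of code points. The two counts agree inside the Basic Multilingual Plane and diverge as soon as a surrogate pair is needed:

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units in the UTF-16 encoding of `text`."""
    # "utf-16-le" emits no byte order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Count code points; a Python 3 str is a sequence of code points."""
    return len(text)


bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001F600"  # U+1F600, outside the BMP, needs a surrogate pair

for sample in (bmp, astral):
    print(f"U+{ord(sample):04X}:",
          utf16_code_units(sample), "code unit(s),",
          code_points(sample), "code point(s)")
```

An engine that treats text as "basic 16-bit logical units" would see the second sample as two separate items, which is consistent with reading level 1 as code units rather than code points.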

Level 2 then skips straight to graphemes, so the document defines 
no code point level in between.

However, this document is very old - from Unicode 3.0 and the 
year 2000:

> While there are no surrogate characters in Unicode 3.0 (outside 
> of private use characters), future versions of Unicode will 
> contain them...

Perhaps level 1 has since been redefined?