The Case Against Autodecode

Timon Gehr via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 12:20:19 PDT 2016


On 31.05.2016 20:53, Jonathan M Davis via Digitalmars-d wrote:
> On Tuesday, May 31, 2016 14:30:08 Andrei Alexandrescu via Digitalmars-d wrote:
>> On 5/31/16 2:11 PM, Jonathan M Davis via Digitalmars-d wrote:
>>> On Tuesday, May 31, 2016 13:21:57 Andrei Alexandrescu via Digitalmars-d wrote:
>>>> On 05/31/2016 01:15 PM, Jonathan M Davis via Digitalmars-d wrote:
>>>>> Saying that operating at the code point level - UTF-32 - is correct
>>>>> is like saying that operating at UTF-16 instead of UTF-8 is correct.
>>>>
>>>> Could you please substantiate that? My understanding is that code unit
>>>> is a higher-level Unicode notion independent of encoding, whereas code
>>>> point is an encoding-dependent representation detail. -- Andrei
>>>
>> Does walkLength yield the same number for all representations?
> walkLength treats a code point like it's a character. My point is that
> that's incorrect behavior. It will not result in correct string processing
> in the general case, because a code point is not guaranteed to be a
> full character.
> ...

What's "correct"? Maybe the user intended to count the number of code 
points in order to pre-allocate a dchar[] of the correct size.
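
A rough sketch of that use case, written in Python rather than D purely for illustration. The helper name count_code_points is made up for this example; it plays the role that walkLength plays on an autodecoded D string, and the array of unsigned ints stands in for a dchar[]. The count is not the number of user-perceived characters, but it is exactly the right size for the buffer:

```python
from array import array

def count_code_points(utf8: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes (0b10xxxxxx)."""
    return sum(1 for b in utf8 if (b & 0xC0) != 0x80)

# 'e' + combining acute accent, a space, then three CJK characters.
text = "e\u0301 日本語"
utf8 = text.encode("utf-8")

n_units = len(utf8)                 # UTF-8 code units (like .length on a D string)
n_points = count_code_points(utf8)  # code points (like walkLength with autodecoding)

# Pre-allocate one unsigned-int slot per code point, then fill it in.
buf = array("I", [0]) * n_points
for i, ch in enumerate(utf8.decode("utf-8")):
    buf[i] = ord(ch)

print(n_units, n_points, len(buf))
```

Here the first two code points form a single accented letter, so the code point count overstates the number of characters, yet it is still the number the allocation needs.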

Generally, I don't see how algorithms become magically "incorrect" when
applied to UTF code units.

> walkLength does not report the length of a character as one in all cases
> just like length does not report the length of a character as one in all
> cases. walkLength is counting bigger units than length, but it's still
> counting pieces of a character rather than counting full characters.
>

The 'length' of a character is not one in all contexts.
The following text takes six columns in my terminal:

日本語
123456
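
To make the three different "lengths" concrete, here is a minimal Python sketch. The function display_columns is a hypothetical helper, and it is only an approximation: it assumes a terminal that gives East Asian wide and fullwidth characters two columns and combining marks zero, which real terminals implement with their own width tables.

```python
import unicodedata

def display_columns(s: str) -> int:
    """Approximate terminal width: wide/fullwidth = 2, combining marks = 0, else 1."""
    cols = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks attach to the previous character
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols

text = "日本語"
code_units = len(text.encode("utf-8"))  # UTF-8 code units
code_points = len(text)                 # code points
columns = display_columns(text)         # terminal columns under the assumption above

print(code_units, code_points, columns)
```

Under these assumptions the same three characters measure 9, 3 or 6, depending on which question is being asked.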

