Unicode Normalization (and graphemes and locales)

Steven Schveighoffer via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 3 05:23:01 PDT 2016


On 6/3/16 8:06 AM, Jonathan M Davis via Digitalmars-d wrote:
> On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
>> But consider the case where you are searching the string: "cassé"
>>
>> for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
>> succeed when you should fail! However, it may be that you actually want
>> to find specifically any code points with 'e', including ones with
>> combining characters. This is why we really need more discretion from
>> Phobos, and less hand-holding.
>>
>> There are certainly searches that will be correct. For example,
>> searching for newline should always work in code-point space. Actually,
>> what happens when you use a combining character on newline? Is it an
>> invalid unicode sequence? Does it matter? :)
>>
>> A nice function to determine whether code points or graphemes are
>> required for comparison given a needle may be useful.
>
> Well, if you know that you're dealing with a grapheme that has that problem,
> you can just iterate by graphemes rather than code units like find would
> normally.

Yes, I agree. This is exactly the point. Don't assume anything; just
treat a type as it is written. And tell the user this!

If you are going to search a range of code points with a code point, you 
may not get what you expect. If you want to do a grapheme-aware search, 
change it to a range of graphemes, and do a grapheme search.

What I was trying to say with my example is that searching by code points,
even for graphemes that definitively fit into one code point, may still 
not be correct in all use cases.
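
To make the "cassé" case concrete, here is a minimal sketch in D of the two
kinds of search. The helper names hasCodePoint and hasGrapheme are
hypothetical, made up for illustration; only canFind and byGrapheme come
from Phobos. It assumes the é is stored decomposed, as 'e' followed by
U+0301.

```d
import std.algorithm.searching : canFind;
import std.uni : byGrapheme;

/// Hypothetical helper: code point search. True if `needle` occurs anywhere
/// in `haystack`, even as the base letter of a combining sequence.
bool hasCodePoint(string haystack, dchar needle)
{
    // Treated as a range, a string yields decoded code points (dchar).
    return haystack.canFind(needle);
}

/// Hypothetical helper: grapheme search. True only if some grapheme in
/// `haystack` consists of exactly the single code point `needle`.
bool hasGrapheme(string haystack, dchar needle)
{
    return haystack.byGrapheme.canFind!(g => g.length == 1 && g[0] == needle);
}

unittest
{
    // "cassé" with é written as 'e' + U+0301 (combining acute accent).
    string s = "casse\u0301";

    assert(hasCodePoint(s, 'e'));  // hits the base letter of the é
    assert(!hasGrapheme(s, 'e'));  // there is no standalone 'e' grapheme
    assert(hasGrapheme(s, 's'));   // plain letters are found either way
}
```

Neither answer is wrong in itself. Which one you want depends on whether
you are asking about code points or about user-perceived characters, so
the caller has to pick the range type deliberately.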

-Steve

