Unicode Normalization (and graphemes and locales)

Steven Schveighoffer via Digitalmars-d digitalmars-d at puremagic.com
Fri Jun 3 04:37:59 PDT 2016


On 6/3/16 2:24 AM, Jonathan M Davis via Digitalmars-d wrote:
> On Thursday, June 02, 2016 17:14:13 Walter Bright via Digitalmars-d wrote:
>> On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
>>  > How do you suggest that we handle the normalization issue? Should we just
>>  > assume NFC like std.uni.normalize does and provide an optional template
>>  > argument to indicate a different normalization (like normalize does)?
>>  > Since without providing a way to deal with the normalization, we're not
>>  > actually making the code fully correct, just faster.
>>
>> The short answer is, we don't.
>
> I generally agree. The main problem that I was concerned about was the
> cases like find where we're talking about encoding the needle to match the
> haystack so that we can compare with code units, and I was thinking that
> we'd be forced to pick a normalization scheme with that, and if that didn't
> match the normalization of the haystack, we'd be in trouble (hence the
> concern about being able to specify a normalization scheme). However,
> thinking about it further, that's not actually a problem. If the needle is a
> dchar, then code point normalization isn't an issue, because it's only ever
> one code point, and if the needle uses a different encoding (e.g. UTF-16
> instead of UTF-8), and we re-encode it with the encoding of the haystack,
> that doesn't change the normalization of the needle. Even if the code units
> have changed, the code points that they represent are the same. So, it
> doesn't even potentially make sense to try to do anything with the
> normalization when re-encoding the needle.
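
To make the re-encoding point concrete, here is a minimal sketch in Python (chosen only for illustration, it is not Phobos code). It assumes a needle stored in decomposed form and shows that switching between UTF-8 and UTF-16 changes the code units but leaves the code points, and therefore the normalization form, untouched:

```python
import unicodedata

# Needle in decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
needle = "e\u0301"

# Re-encoding changes the code units...
utf8_units = needle.encode("utf-8")        # bytes 65 CC 81
utf16_units = needle.encode("utf-16-le")   # bytes 65 00 01 03

# ...but decoding either form yields the same code points.
from_utf8 = utf8_units.decode("utf-8")
from_utf16 = utf16_units.decode("utf-16-le")
same_code_points = from_utf8 == from_utf16

# The re-encoded needle is still decomposed; nothing was normalized.
still_nfd = unicodedata.normalize("NFD", from_utf16) == from_utf16

# Only an explicit normalization step changes the code points.
composed = unicodedata.normalize("NFC", needle)   # single code point U+00E9
```

Normalization is a separate, explicit step; transcoding alone never performs it.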

But consider the case where you are searching the string "cassé" for the
letter 'e'. If é is encoded as 'e' + U+0301 (the combining acute accent),
then a code-point search will succeed when it should fail! However, it may
be that you actually want to find any code point 'e', including ones that
are followed by combining characters. This is why we really need more
discretion from Phobos, and less hand-holding.
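
A minimal sketch of that "cassé" case, in Python for illustration only. The `clusters` helper is hypothetical and deliberately simplified: it only glues combining marks onto the preceding code point, which is not the full Unicode grapheme cluster algorithm, but it is enough to show the two searches disagreeing:

```python
import unicodedata

def clusters(text):
    """Split text into runs of one base code point plus any combining marks.

    Simplified stand-in for real grapheme segmentation (assumption: only
    code points with a nonzero canonical combining class extend a cluster).
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch      # attach the mark to the previous cluster
        else:
            out.append(ch)     # start a new cluster
    return out

# "cassé" with é stored decomposed as 'e' + U+0301.
haystack = "casse\u0301"

# Code-point search: matches the bare 'e' that is really part of é.
found_by_code_point = "e" in haystack

# Cluster-level search: the last cluster is 'e\u0301', which is not 'e'.
found_by_cluster = "e" in clusters(haystack)
```

Here `found_by_code_point` is true while `found_by_cluster` is false. Which answer is "right" depends on what the caller meant, which is the argument for letting the caller choose.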

There are certainly searches that will be correct. For example, 
searching for newline should always work in code-point space. Actually, 
what happens when you use a combining character on newline? Is it an 
invalid Unicode sequence? Does it matter? :)

A nice function to determine whether code points or graphemes are 
required for comparison given a needle may be useful.
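
One possible shape for such a function, sketched in Python for illustration. The name `needs_grapheme_search` and the rule it applies are assumptions, not an existing API. The heuristic leans on the Unicode text segmentation rules (UAX #29), under which control characters and line/paragraph separators always form their own grapheme cluster, so a needle ending in one cannot be silently extended by a combining mark in the haystack:

```python
import unicodedata

def needs_grapheme_search(needle):
    """Heuristic: can a code-point match for `needle` land inside a grapheme?

    Hypothetical helper. Conservative: answers True unless the needle is
    known to be safe for plain code-point comparison.
    """
    if not needle:
        return False
    # A needle that starts with a mark would match in the middle of a cluster.
    if unicodedata.category(needle[0]).startswith("M"):
        return True
    # Controls (Cc) and line/paragraph separators (Zl, Zp) never combine with
    # a following mark, so the match cannot be extended in the haystack.
    if unicodedata.category(needle[-1]) in ("Cc", "Zl", "Zp"):
        return False
    # Anything else may be followed by a combining mark in the haystack.
    return True

newline_needs_graphemes = needs_grapheme_search("\n")   # code points suffice
letter_needs_graphemes = needs_grapheme_search("e")     # graphemes required
```

Under this heuristic a newline search stays in code-point space, while a search for 'e' is flagged as needing grapheme-aware comparison.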

-Steve

