Unicode Normalization (and graphemes and locales)

Thu Jun 2 23:24:47 PDT 2016

On Thursday, June 02, 2016 17:14:13 Walter Bright via Digitalmars-d wrote:
> On 6/2/2016 4:29 PM, Jonathan M Davis via Digitalmars-d wrote:
>  > How do you suggest that we handle the normalization issue? Should we just
>  > assume NFC like std.uni.normalize does and provide an optional template
>  > argument to indicate a different normalization (like normalize does)?
>  > Since without providing a way to deal with the normalization, we're not
>  > actually making the code fully correct, just faster.
>
> The short answer is, we don't.

I generally agree. My main concern was with cases like find, where we encode
the needle to match the haystack so that we can compare code units. I was
thinking that we'd be forced to pick a normalization scheme for that, and
that if it didn't match the normalization of the haystack, we'd be in trouble
(hence the concern about being able to specify a normalization scheme).
However, thinking about it further, that's not actually a problem. If the
needle is a dchar, then code point normalization isn't an issue, because it's
only ever one code point. And if the needle uses a different encoding (e.g.
UTF-16 instead of UTF-8) and we re-encode it with the encoding of the
haystack, that doesn't change the normalization of the needle: even though
the code units have changed, the code points that they represent are the
same. So, it doesn't make sense to try to do anything with the normalization
when re-encoding the needle.
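
To illustrate the re-encoding point, here is a minimal sketch. It is written
in Python rather than D, on the assumption that only the Unicode behaviour
matters for the argument. The helpers code_units, reencode and is_nfc are
names made up for this example; they are not Phobos functions and say nothing
about how find is or should be implemented.

```python
import unicodedata


def code_units(text, encoding):
    """Return the code units of `text` in a little-endian UTF encoding."""
    unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + unit_size], "little")
        for i in range(0, len(data), unit_size)
    ]


def reencode(data, source, target):
    """Re-encode raw bytes from one UTF encoding to another (hypothetical helper)."""
    return data.decode(source).encode(target)


def is_nfc(text):
    """True if `text` is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


needle_nfc = "\u00e9"    # "é" as a single code point (U+00E9)
needle_nfd = "e\u0301"   # "é" as "e" + combining acute accent (U+0065 U+0301)

for needle in (needle_nfc, needle_nfd):
    as_utf16 = needle.encode("utf-16-le")
    as_utf8 = reencode(as_utf16, "utf-16-le", "utf-8")
    # The code units differ between the two encodings...
    print(code_units(needle, "utf-16-le"), code_units(needle, "utf-8"))
    # ...but the code points, and so the normalization form, are unchanged.
    assert as_utf8.decode("utf-8") == needle
    assert is_nfc(as_utf8.decode("utf-8")) == is_nfc(needle)

# A haystack stored as UTF-8 in NFD: "cafe" followed by a combining acute.
haystack = "cafe\u0301".encode("utf-8")

# Re-encoding a UTF-16 needle to UTF-8 lets us compare code units directly.
nfd_hit = haystack.find(reencode(needle_nfd.encode("utf-16-le"), "utf-16-le", "utf-8"))
nfc_hit = haystack.find(reencode(needle_nfc.encode("utf-16-le"), "utf-16-le", "utf-8"))
```

In this sketch the NFD needle is found and the NFC needle is not, both before
and after re-encoding. Changing the encoding neither introduces nor repairs a
normalization mismatch between needle and haystack, so the re-encoding step
has no normalization scheme to choose.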

So, it looks like my concern was born from not thinking the issue through
thoroughly enough.

- Jonathan M Davis