Unicode Normalization (and graphemes and locales)

Fri Jun 3 09:56:04 PDT 2016

On Fri, Jun 03, 2016 at 05:06:33AM -0700, Jonathan M Davis via Digitalmars-d wrote:
> On Friday, June 03, 2016 07:37:59 Steven Schveighoffer via Digitalmars-d wrote:
> > But consider the case where you are searching the string: "cassé"
> >
> > for the letter 'e'. If é is encoded as 'e' + U+0301, then you will
> > succeed when you should fail! However, it may be that you actually
> > want to find specifically any code points with 'e', including ones
> > with combining characters. This is why we really need more
> > discretion from Phobos, and less hand-holding.
> >
> > There are certainly searches that will be correct. For example,
> > searching for newline should always work in code-point space.
> > Actually, what happens when you use a combining character on
> > newline? Is it an invalid unicode sequence? Does it matter? :)

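I'm guessing it's not strictly invalid: as far as I know, Unicode treats
a combining mark with no base character before it as a "defective"
combining sequence rather than an ill-formed one, and the grapheme rules
break right after a newline, so the mark wouldn't attach to it anyway.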

[...]
> Well, if you know that you're dealing with a grapheme that has that
> problem, you can just iterate by graphemes rather than code units like
> find would normally. Otherwise, what you probably end up doing is
> searching for the needle and then verifying that the resultant range
> starts with the right grapheme and not just the right code point and
> then call find again to search further into the range if it was just
> the right code point.
[...]

And this is a prime illustration of why defaulting to a particular
support level is not a good idea.  What if the programmer wants to count
every e in the string, with or without diacritics? Then iterating by
grapheme won't work, and you'd actually want to iterate by code point.
Actually, that wouldn't work either, because a precomposed é (U+00E9) is
a single code point that doesn't compare equal to 'e'; you'd have to
normalize to NFD first, then iterate by code point.  Whereas if the
programmer wanted to count e but not é, then you'd have to iterate by
grapheme. Or if you wanted to count é but not e, then you'd have to
normalize to NFC (so both encodings of é compare equal) and then iterate
by grapheme.
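
To make the three cases concrete, here's a rough sketch. It's in Python
rather than D, purely because the stdlib unicodedata module keeps it
self-contained. The graphemes() helper and the count_* names are made up
for illustration, and graphemes() is only an approximation (a base code
point plus any following combining marks), not the full UAX #29
segmentation that a real byGrapheme would implement.

```python
import unicodedata


def graphemes(s):
    """Approximate grapheme clusters (hypothetical helper).

    Assumption: a cluster is one code point plus any following code
    points with a non-zero canonical combining class. Real grapheme
    segmentation (UAX #29) also handles Hangul, emoji sequences, etc.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return clusters


def count_e_any(s):
    # Every e, with or without diacritics:
    # normalize to NFD, then count by code point.
    return unicodedata.normalize("NFD", s).count("e")


def count_plain_e(s):
    # e but not é: iterate by grapheme, so that 'e' + U+0301 is one
    # unit that does not compare equal to a bare "e".
    return sum(1 for g in graphemes(s) if g == "e")


def count_e_acute(s):
    # é but not e: normalize to NFC so both encodings of é become
    # U+00E9, then iterate by grapheme.
    nfc = unicodedata.normalize("NFC", s)
    return sum(1 for g in graphemes(nfc) if g == "\u00e9")


# Precomposed é, decomposed é (e + U+0301), and a plain e.
text = "cass\u00e9, casse\u0301, casse"

naive = text.count("e")        # plain code-point search, no normalization
any_e = count_e_any(text)
plain = count_plain_e(text)
acute = count_e_acute(text)
```

On that sample the unnormalized code-point count finds 2 (it misses the
precomposed é but hits the e inside the decomposed one), while the three
functions give 3, 1 and 2 respectively. Same string, four different
answers, all of them "right" for some question.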

There's no getting around learning how Unicode works, and it doesn't
help to have the standard library default to something arbitrary that
pretends to do the "right" thing but doesn't always do it.

T

-- 
Why have vacation when you can work?? -- EC