[Issue 12455] [uni][reg] Bad lowercase mapping for 'LATIN CAPITAL LETTER I WITH DOT ABOVE'

Fri Jul 4 13:20:20 PDT 2014

https://issues.dlang.org/show_bug.cgi?id=12455

Dmitry Olshansky <dmitry.olsh at gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh at gmail.com

--- Comment #3 from Dmitry Olshansky <dmitry.olsh at gmail.com> ---
(In reply to monarchdodra from comment #2)
> I toyed around. The issue (apparently) is that it *can* be converted as:
> 

Indeed, the key problem is that the simple case mapping and the full case
mapping differ for this character. It turns out there are also about 13
characters with a similar problem, though they are much less frequently used.

Secondly, Turkish makes it more confusing still: its tailoring makes both
mappings work as a simple case mapping (dropping the extra combining dot).

And last but not least, somebody introduced this bit of Turkish tailoring into
the original std.uni, probably Ali :)

> 
> Because uni "thinks" the lowercase doesn't fit in a single dchar, it simply
> does nothing (as documeted).
> 
> However, it's still wrong, as the standard (from what I read), is pretty
> clear on the fact that the lower case is simply 'i'.

In fact it's 2 code points. See the SpecialCasing.txt file, even though it
"looks" like one character in the web CLDR utility.

Here is the first relevant line:

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
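
To make the field layout concrete, here is a minimal Python sketch (not
std.uni code) that splits such an entry into its parts. The helper name
parse_special_casing is hypothetical; the assumed field order is
code; lower; title; upper; optional conditions; then a "#" comment.

```python
def parse_special_casing(line):
    """Split one SpecialCasing.txt entry into a dict (hypothetical helper)."""
    data, _, comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    # The trailing ';' before the comment leaves an empty last field; drop it.
    if fields and fields[-1] == "":
        fields.pop()

    def to_text(field):
        # A field holds zero or more space-separated hex code points.
        return "".join(chr(int(cp, 16)) for cp in field.split())

    return {
        "code": to_text(fields[0]),
        "lower": to_text(fields[1]),
        "title": to_text(fields[2]),
        "upper": to_text(fields[3]),
        # Optional fifth field: language tag and/or context, e.g. "tr After_I".
        "conditions": fields[4].split() if len(fields) > 4 else [],
        "name": comment.strip(),
    }


entry = parse_special_casing(
    "0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE"
)
# The full lowercase mapping consists of two code points.
lower_code_points = [f"{ord(c):04X}" for c in entry["lower"]]
```

Python's own str.lower() applies the full mapping, so "\u0130".lower() yields
the same two code points.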

Then at the end of the file:

# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will
# turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a dotless
# i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

# When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
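
The tr/az lowercasing rules above can be sketched in Python as follows. This
is only an illustration under simplifying assumptions: lower_turkic is a
hypothetical name, and the After_I / Not_Before_Dot contexts are reduced to
"the combining dot immediately follows the I", whereas the real conditions
also allow certain other combining marks in between.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I
COMBINING_DOT = "\u0307"      # COMBINING DOT ABOVE


def lower_turkic(text):
    """Lowercase text with a simplified Turkish/Azeri tailoring."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == DOTTED_CAPITAL_I:
            # 0130 -> 0069 under tr/az: a single code point, no extra dot.
            out.append("i")
        elif ch == "I":
            if i + 1 < len(text) and text[i + 1] == COMBINING_DOT:
                # I + dot_above -> i, and the dot is removed (After_I rule).
                out.append("i")
                i += 1
            else:
                # I not before a dot -> dotless i (Not_Before_Dot rule).
                out.append(DOTLESS_SMALL_I)
        else:
            # Everything else uses the untailored mapping.
            out.append(ch.lower())
        i += 1
    return "".join(out)
```

For example, lower_turkic("DİYARBAKIR") gives "diyarbakır", with the dotted
and dotless capitals mapped to different lowercase letters.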

> 
> Furthermore, "LATIN SMALL LETTER I + COMBINING DOT ABOVE" is pretty
> redundant...

Can't say much on this but it's also the result of NFD normalization.
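
One reading of the NFD remark, sketched in Python with the standard
unicodedata module: U+0130 decomposes canonically to I + COMBINING DOT ABOVE,
and lowercasing that sequence gives i + COMBINING DOT ABOVE, so the two-code
point mapping keeps lowercasing and normalization consistent. The variable
names are illustrative only.

```python
import unicodedata

i_dot = "\u0130"

# Canonical decomposition of U+0130 is U+0049 U+0307.
decomposed = unicodedata.normalize("NFD", i_dot)

# Lowercasing before or after decomposition should give the same sequence.
lower_then_nfd = unicodedata.normalize("NFD", i_dot.lower())
nfd_then_lower = decomposed.lower()
```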

The course of action is clear: make toLower with a dchar argument map it to
'i', and keep the current mapping in the string version.

Then, when processing Turkish text, the dot after the I may be removed as a
separate step.
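
A rough Python sketch of that split, purely to illustrate the idea and not
the std.uni implementation. All three function names are hypothetical, and
the dot-removal step is deliberately crude: it drops the combining dot after
every lowercase i, whether or not it came from an uppercase I.

```python
def to_lower_char(ch):
    """Single code point in, single code point out (simple mapping)."""
    lowered = ch.lower()
    if len(lowered) == 1:
        return lowered
    if ch == "\u0130":
        # Simple lowercase mapping of U+0130 is plain 'i'.
        return "i"
    return ch


def to_lower_string(text):
    """String version: keeps the full mapping, so U+0130 -> 'i' + U+0307."""
    return text.lower()


def drop_dot_after_i(text):
    """Separate Turkish post-processing step: remove the extra dot."""
    return text.replace("i\u0307", "i")
```

With these, to_lower_string("\u0130STANBUL") still contains the combining
dot, and drop_dot_after_i applied to that result yields "istanbul".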

--