Replacing tango.text.Ascii.isearch

Siarhei Siamashka siarhei.siamashka at gmail.com
Fri Oct 28 22:05:05 UTC 2022


On Wednesday, 26 October 2022 at 06:05:14 UTC, Ali Çehreli wrote:
> The problem with Unicode is its main aim of allowing characters 
> of multiple writing systems in the same text. When multiple 
> writing systems are in play, conflicts and ambiguities will 
> appear.

I personally don't think that this is a problem with Unicode 
itself. From what I can see, it looks like the individuals or 
committees responsible for mapping the Turkish alphabet to 
Unicode simply made a blunder.

For example, let's compare the Latin uppercase "B" and the 
Cyrillic uppercase "В". They look exactly the same, right? Would 
it be a smart idea for them to share the same code point in 
Unicode? But wait. What happens if we convert these letters to 
lowercase? The Latin "B" becomes "b" and the Cyrillic "В" becomes 
"в". Oops! So by having different code points for the Latin 
uppercase "B" and the Cyrillic uppercase "В", we dodged a whole 
bunch of nasty problems.
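
A minimal sketch of that point, using Python's built-in `str` 
methods only as a convenient way to poke at the Unicode data 
(this is not Phobos code, and the variable names are just for 
illustration):

```python
import unicodedata

latin_b = "\u0042"      # Latin uppercase B
cyrillic_ve = "\u0412"  # Cyrillic uppercase В

# The glyphs look alike, but the code points and lowercase forms differ.
for ch in (latin_b, cyrillic_ve):
    lower = ch.lower()
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"U+{ord(lower):04X} {unicodedata.name(lower)}")
```

Because the two letters are separate code points, each one can 
carry its own lowercase mapping without any knowledge of the 
surrounding language.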

Another example. Patrick Schluter mentioned the Greek letter 
sigma, and the [Wikipedia 
article](https://en.wikipedia.org/wiki/Sigma) says: "uppercase Σ, 
lowercase σ, lowercase in word-final position ς". Two lowercase 
forms share a single uppercase form, which makes case conversion 
rather problematic. Now let's compare this to the Belarusian 
language and its letter "у". The Belarusian "у" turns into "ў" 
depending on context; however, this transformation doesn't happen 
for the first letter of proper nouns or in acronyms (which 
theoretically makes an uppercase form of "ў" redundant). Just 
imagine an alternative Greek-inspired reality, where both "у" and 
"ў" uppercase to "У". And yet the uppercase "Ў" exists in 
Unicode, so luckily in our reality we don't have to deal with 
uppercase/lowercase round trip failures. This is 
computer-friendly. And as I already mentioned in an earlier 
comment, the Germans have also had the uppercase "ẞ" in Unicode 
since 2008 (better late than never).
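
The Greek and Belarusian cases can be compared with a small 
round trip check. This is only a sketch: it relies on Python's 
`str.upper()` and `str.lower()`, which as far as I know apply the 
locale-independent Unicode default mappings, and 
`survives_round_trip` is a hypothetical helper name, not anything 
from Phobos.

```python
def survives_round_trip(ch: str) -> bool:
    """Return True if lowercase -> uppercase -> lowercase gives ch back."""
    return ch.upper().lower() == ch

letters = {
    "sigma": "\u03C3",        # σ
    "final sigma": "\u03C2",  # ς
    "u": "\u0443",            # у
    "short u": "\u045E",      # ў
}

results = {name: survives_round_trip(ch) for name, ch in letters.items()}
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'round trip fails'}")
```

Only the word-final sigma fails here, because it shares its 
uppercase form with the regular sigma, while "ў" has an uppercase 
letter of its own.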

> I solved my problem by writing an Alphabet hierarchy in the 
> past. I don't like that code but it still works:
>
> [...]
>
> It's confusing but it seems to work. :) It doesn't matter. Life 
> is imperfect and things will somehow work in the end.

What's your opinion/conclusion? Is it fine the way it is? Do you 
think that some unique property of the Turkish language/alphabet 
made these difficulties unavoidable? Or do you think that it was 
a mistake, but now we have to live with it forever for 
compatibility reasons? Anything else?

And as for the D language and Phobos, should "ß" still uppercase 
to "SS"? Or can we change it to uppercase "ẞ" and remove German 
from the list of tricky languages at 
https://dlang.org/library/std/uni/to_upper.html ? Should Turkish 
be listed there?
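
For reference, here is how the default (locale-independent) 
Unicode mappings treat these letters. Again this is a hedged 
sketch in Python rather than D; I'm not claiming that `std.uni` 
returns exactly the same results, and the variable names are made 
up for the example.

```python
sharp_s = "\u00DF"          # ß
capital_sharp_s = "\u1E9E"  # ẞ

print(repr(sharp_s.upper()))          # default uppercasing of ß
print(repr(capital_sharp_s.lower()))  # lowercasing of ẞ

dotless_i = "\u0131"         # ı
dotted_capital_i = "\u0130"  # İ

# Without a Turkish locale, "i" and "ı" both uppercase to plain "I" ...
print("i".upper() == dotless_i.upper())
# ... and "I" lowercases to "i" rather than to "ı".
print("I".lower() == dotless_i)
# "İ" lowercases to "i" followed by U+0307 COMBINING DOT ABOVE.
print([f"U+{ord(c):04X}" for c in dotted_capital_i.lower()])
```

So with these default mappings "ẞ" lowercases to "ß", but "ß" 
still uppercases to "SS", and the Turkish dotted/dotless 
distinction is lost unless the conversion is told which language 
it is dealing with.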

