Replacing tango.text.Ascii.isearch
Siarhei Siamashka
siarhei.siamashka at gmail.com
Fri Oct 28 22:05:05 UTC 2022
On Wednesday, 26 October 2022 at 06:05:14 UTC, Ali Çehreli wrote:
> The problem with Unicode is its main aim of allowing characters
> of multiple writing systems in the same text. When multiple
> writing systems are in play, conflicts and ambiguities will
> appear.
I personally don't think that this is a problem with Unicode
itself. Based on what I can see, it looks like the individuals or
the committees responsible for mapping the Turkish alphabet to
Unicode just made a blunder.
For example, let's compare the Latin uppercase "B" and the
Cyrillic uppercase "В". They look exactly the same, right? Would it
be a smart idea for them to share the same code point in the
Unicode table? But wait. What happens if we convert these letters
to lowercase? The Latin "B" becomes "b" and the Cyrillic "В"
becomes "в". Oops! So by having different code points for the
Latin uppercase "B" and the Cyrillic uppercase "В", we dodged a
whole bunch of nasty problems.
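A quick way to check this is a small Python sketch. Python is used here only because its `str` methods apply the default Unicode case mappings directly; this says nothing about how Phobos behaves, and the variable names are purely illustrative.

```python
import unicodedata

latin_b = "\u0042"      # LATIN CAPITAL LETTER B
cyrillic_ve = "\u0412"  # CYRILLIC CAPITAL LETTER VE

# The glyphs look alike, but Unicode treats them as different characters.
same_code_point = latin_b == cyrillic_ve

# Each letter lowercases within its own script.
latin_lower = latin_b.lower()
cyrillic_lower = cyrillic_ve.lower()

print(unicodedata.name(latin_b), "->", unicodedata.name(latin_lower))
print(unicodedata.name(cyrillic_ve), "->", unicodedata.name(cyrillic_lower))
```

Because the two capitals are separate code points, a single one-to-one lowercase table is enough and no language context is needed.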
Another example. Patrick Schluter mentioned the Greek sigma
letter and the [Wikipedia
article](https://en.wikipedia.org/wiki/Sigma) says: "uppercase Σ,
lowercase σ, lowercase in word-final position ς", which makes
everything rather problematic. Now let's compare this to the
Belarusian language and its letter "у". The Belarusian "у"
transforms into "ў" depending on context; however, this
transformation doesn't happen for the first letter of proper
nouns or in acronyms (and this theoretically makes the uppercase
"ў" redundant). Just imagine an alternative Greek-inspired
reality, where both "у" and "ў" uppercase to "У". And yet the
uppercase "Ў" exists in Unicode, so luckily in our reality we
don't have to deal with uppercase/lowercase round trip failures.
This is computer-friendly. And as I already mentioned in an
earlier comment, the Germans have also had the uppercase "ẞ" in
Unicode since 2008 (better late than never).
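The round-trip difference between the Greek and the Belarusian case can be sketched in Python. The helper `survives_round_trip` is a hypothetical name introduced for illustration, and the result reflects Python's built-in, locale-independent case mappings rather than anything specific to D.

```python
def survives_round_trip(ch: str) -> bool:
    """Return True if lowercase -> uppercase -> lowercase gives back ch."""
    return ch.upper().lower() == ch

letters = {
    "greek small sigma": "\u03c3",        # σ
    "greek small final sigma": "\u03c2",  # ς
    "cyrillic small u": "\u0443",         # у
    "cyrillic small short u": "\u045e",   # ў
}

# Both sigmas uppercase to the same Σ, so one of them cannot come back.
# The two Belarusian letters have their own capitals (У and Ў), so both do.
round_trip = {name: survives_round_trip(ch) for name, ch in letters.items()}

for name, ok in round_trip.items():
    print(f"{name}: {'ok' if ok else 'lost'}")
```

Only the final sigma is lost here, which is exactly the problem that a dedicated uppercase "Ў" avoids.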
> I solved my problem by writing an Alphabet hierarchy in the
> past. I don't like that code but it still works:
>
> [...]
>
> It's confusing but it seems to work. :) It doesn't matter. Life
> is imperfect and things will somehow work in the end.
What's your opinion/conclusion? Is it fine the way it is? Do you
think that some unique property of the Turkish language/alphabet
made these difficulties unavoidable? Or do you think that it was
a mistake, but now we have to live with it forever for
compatibility reasons? Anything else?
And as for the D language and Phobos, should "ß" still uppercase
to "SS"? Or can we change it to uppercase "ẞ" and remove German
from the list of tricky languages at
https://dlang.org/library/std/uni/to_upper.html ? Should Turkish
be listed there?
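For reference, here is how the default (language-neutral) Unicode mappings treat these letters, sketched in Python. This only shows what Python's `str.upper` and `str.lower` do; I'm not claiming that `std.uni.toUpper` behaves identically, and the variable names are illustrative.

```python
sharp_s = "\u00df"           # ß  LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"   # ẞ  LATIN CAPITAL LETTER SHARP S
dotless_i = "\u0131"         # ı  LATIN SMALL LETTER DOTLESS I
dotted_capital_i = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE

# German: ß still uppercases to two letters, although ẞ lowercases to ß.
sharp_s_upper = sharp_s.upper()
capital_sharp_s_lower = capital_sharp_s.lower()

# Turkish: without locale information, ı and i share the capital "I",
# so the dotless ı does not survive a round trip.
dotless_i_round_trip = dotless_i.upper().lower()

# İ lowercases to "i" followed by U+0307 COMBINING DOT ABOVE.
dotted_capital_i_lower = dotted_capital_i.lower()

print(repr(sharp_s_upper), repr(capital_sharp_s_lower))
print(repr(dotless_i_round_trip), [hex(ord(c)) for c in dotted_capital_i_lower])
```

Under these default mappings both German and Turkish have letters that do not round-trip, which is the behaviour the questions above are about.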