Converting a character to upper case in string

Patrick Schluter Patrick.Schluter at bbox.fr
Sat Sep 22 21:04:54 UTC 2018


On Saturday, 22 September 2018 at 06:01:20 UTC, Vladimir 
Panteleev wrote:
> On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
>> How can I properly convert a character, say, first one to 
>> upper case in a unicode correct manner?
>
> That would depend on how you'd define correctness. If your 
> application needs to support "all" languages, then (depending 
> how you interpret it) the task may not be meaningful, as some 
> languages don't have the notion of "upper-case" or even 
> "character" (as an individual glyph). Some languages do have 
> those notions, but they serve a specific purpose that doesn't 
> align with the one in English (e.g. Lojban).

There are other traps in the question of uppercase/lowercase 
which make it very difficult to handle correctly unless we 
define what "correctly" means.
Examples:
- It may be necessary to know the locale, i.e. the language of 
the string to uppercase. In Turkish, the uppercase of i is not I 
but İ, and the lowercase of I is ı (that was one reason for the 
calamitously low performance of toUpperCase/toLowerCase in Java, 
for example).
- Some uppercase forms depend on what they are used for. German ß 
should be uppercased as SS in normal text (note also, by the way, 
that 1 code point becomes 2 in uppercase), but for calligraphic 
work, road signs and other uses it can be the capital ẞ.
- Greek has a single uppercase form Σ but two lowercase forms, σ 
and ς, depending on the position in the word.
- While it becomes less and less relevant, Serbo-Croatian may use 
digraphs when transliterating the script from Cyrillic (Serbian) 
to Latin (Croatian); these digraphs have 2 uppercase forms 
(all capitals and title case):
   - dž -> DŽ or Dž
   - lj -> LJ or Lj
   - nj -> NJ or Nj
Normalization would normally take care of that case.
- Some languages may modify or remove diacritical signs when 
uppercasing. It is quite usual in French to not put accents on 
capitals.
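
To make a few of the cases above concrete, here is a small sketch 
in Python rather than D, chosen only because its built-in str 
methods apply Unicode's default, locale-independent case mappings; 
I am not claiming std.uni behaves identically. The upper_turkish 
helper is hypothetical (it is not a library function) and only 
illustrates what a Turkish tailoring would have to do.

```python
import unicodedata

# Python's str.upper()/lower()/title() apply Unicode's default,
# locale-independent case mappings.
sharp_s = "\u00df".upper()           # ß: one code point becomes two
capital_sharp_s = "\u1e9e".lower()   # capital ẞ lowercases to ß
sigma_word = "\u039f\u0394\u039f\u03a3".lower()  # ΟΔΟΣ: word-final Σ
dz_upper = "\u01c6".upper()          # single-code-point digraph ǆ, all capitals
dz_title = "\u01c6".title()          # same digraph, title case
dz_split = unicodedata.normalize("NFKC", "\u01c6")  # compatibility split

def upper_turkish(text):
    """Hypothetical helper: map Turkish dotted/dotless i first,
    then fall back to the default uppercase mapping."""
    return text.translate({ord("i"): "\u0130", ord("\u0131"): "I"}).upper()

print(sharp_s, capital_sharp_s, sigma_word)
print(dz_upper, dz_title, dz_split)
print("istanbul".upper(), upper_turkish("istanbul"))
```

Note that the default mapping never produces capital ẞ from ß, and 
never produces İ from i; both require knowing the intended usage 
or the language of the text.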

It is also clear that the operation of uppercasing is not 
symmetric with lowercasing.
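
A quick way to see the asymmetry is to uppercase and then 
lowercase a string and compare it with the original. The sketch 
below again uses Python's default Unicode mappings as a stand-in 
(an assumption on my part, not a statement about D's std.uni); 
round_trips is just an illustrative name.

```python
def round_trips(text):
    """Return True if upper-casing then lower-casing gives back text."""
    return text.upper().lower() == text

# Plain ASCII survives the round trip.
ascii_ok = round_trips("abc")

# ß uppercases to SS, which lowercases to ss, so the ß is lost.
sharp_s_ok = round_trips("stra\u00dfe")

# Final sigma ς uppercases to Σ; a lone Σ lowercases to σ.
final_sigma_ok = round_trips("\u03c2")

# Dotless ı uppercases to I, which lowercases to dotted i.
dotless_i_ok = round_trips("\u0131")

print(ascii_ok, sharp_s_ok, final_sigma_ok, dotless_i_ok)
```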

>
>> In which code level I should be working on? Grapheme? Or maybe 
>> code point is sufficient?
>
> Using graphemes is necessary if you need to support e.g. 
> combining marks (e.g. ̏◌ + S = ̏S).
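
As a rough illustration of why code points are not enough here, 
the following Python sketch uppercases the first base character 
together with the combining marks that follow it. This is only an 
approximation of grapheme clusters (real segmentation follows the 
Unicode text segmentation rules, which in D would be 
std.uni.byGrapheme); split_first_cluster and upper_first are 
hypothetical names.

```python
import unicodedata

def split_first_cluster(text):
    """Split off the first character plus any combining marks after it.
    Approximation only: it looks at canonical combining classes, not at
    the full grapheme cluster rules."""
    end = 1
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:end], text[end:]

def upper_first(text):
    """Uppercase the first (approximate) cluster, leave the rest as is."""
    if not text:
        return text
    head, tail = split_first_cluster(text)
    return head.upper() + tail

word = "s\u030fa"   # s + COMBINING DOUBLE GRAVE ACCENT + a
head, tail = split_first_cluster(word)
print(len(word), len(head), upper_first(word))
```

Here the string has three code points but the first user-perceived 
character spans two of them, so taking only the first code point 
would separate the letter from its mark.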



