More Unicode fun

spir denis.spir at gmail.com
Mon Jan 17 05:51:37 PST 2011


On 01/14/2011 11:29 PM, foobar wrote:
> So it's definitely possible in Hebrew to have more than one combining mark on the same base letter. When comparing such letters the order of the combining marks should not matter and I think there's a default normalized order in such cases.

Unicode defines a standard order for combining marks _of different 
kinds_ inside a given "grapheme". "Different kinds" mainly means how 
they are supposed to be placed relative to the base character by 
rendering engines (formally, their canonical combining class). 
Combining marks of the same kind in a different order are supposed to 
describe a different character: for instance, <e>+<acute accent 
above>+<grave accent above> is not the same character for Unicode as 
<e>+<grave accent above>+<acute accent above> (there is a subtle 
placement difference). But <e>+<acute accent above>+<grave accent below> 
is equal to <e>+<grave accent below>+<acute accent above>: here 
reordering will happen.
This order is not imposed on users or on any text-producing software, 
so an ordering phase is necessary at the end of any normalisation 
process, at least if the goal is to produce a unique character 
representation allowing direct comparison.
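
To make the two cases above concrete, here is a small sketch in Python 
(not D, only because its standard unicodedata module makes this easy to 
check). It assumes U+0301 and U+0300 for the accents above and U+0316 
for the grave accent below; the variable names are mine, for 
illustration only.

```python
import unicodedata

ACUTE_ABOVE = "\u0301"  # COMBINING ACUTE ACCENT
GRAVE_ABOVE = "\u0300"  # COMBINING GRAVE ACCENT
GRAVE_BELOW = "\u0316"  # COMBINING GRAVE ACCENT BELOW


def nfd(text):
    """Canonical decomposition, which includes the reordering phase."""
    return unicodedata.normalize("NFD", text)


# Marks of the same kind (both placed above): their order is significant,
# so normalisation must leave it alone.
same_kind_a = "e" + ACUTE_ABOVE + GRAVE_ABOVE
same_kind_b = "e" + GRAVE_ABOVE + ACUTE_ABOVE
same_kind_equal = nfd(same_kind_a) == nfd(same_kind_b)

# Marks of different kinds (one above, one below): normalisation sorts
# them into one standard order, so both spellings compare equal.
diff_kind_a = "e" + ACUTE_ABOVE + GRAVE_BELOW
diff_kind_b = "e" + GRAVE_BELOW + ACUTE_ABOVE
diff_kind_equal = nfd(diff_kind_a) == nfd(diff_kind_b)

# The "kind" used for sorting is the canonical combining class.
classes = [unicodedata.combining(c)
           for c in (ACUTE_ABOVE, GRAVE_ABOVE, GRAVE_BELOW)]

print(same_kind_equal, diff_kind_equal, classes)
```

The two accents above share one combining class while the accent below 
has a lower one, which is why only the second pair gets reordered.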

> 2. case depends on locale. In Turkish for instance, they have two 'i' letters, one with a dot and one without. Therefore the Turkish upper case of i is a capital 'i' with a dot, different from English.

Casing issues are very complicated and, as you say, language-specific. 
But not only that: in French, for instance, there is no single rule 
consistently applied for uppercasing accented letters (even in official 
texts or newspapers). This is why I think casing simply doesn't belong 
in a general-purpose text manipulation type. Instead, tools to help 
define language-, script-, or culture-specific casing algorithms (or 
app- or domain-specific ones) should be made available in a Unicode 
toolkit library.
But that's just me.
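
As a rough illustration of what I mean by language-specific casing 
tools, here is a sketch in Python. The table SPECIAL_UPPER and the 
function upper_for are hypothetical names of mine, not any existing 
API, and the only rule included is the Turkish dotted/dotless 'i' 
mentioned above.

```python
# Hypothetical per-language overrides, applied before the default
# (language-neutral) uppercasing. Only Turkish is sketched here.
SPECIAL_UPPER = {
    "tr": {
        "i": "\u0130",  # dotted i  -> LATIN CAPITAL LETTER I WITH DOT ABOVE
        "\u0131": "I",  # dotless i -> plain capital I
    },
}


def upper_for(text, lang=None):
    """Uppercase text, first applying the overrides for lang, if any."""
    overrides = str.maketrans(SPECIAL_UPPER.get(lang, {}))
    return text.translate(overrides).upper()


default_result = upper_for("istanbul")        # language-neutral rule
turkish_result = upper_for("istanbul", "tr")  # Turkish rule
print(default_result, turkish_result)
```

The point of the design is that the general-purpose operation stays 
simple, and whoever knows the language (or the application) supplies 
the table.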

Denis
_________________
vita es estrany
spir.wikidot.com


