VLERange: a range in between BidirectionalRange and RandomAccessRange

Wed Jan 19 01:15:50 PST 2011

On 01/18/2011 06:11 AM, Ali Çehreli wrote:
> Thanks to all that has contributed, I am also following this thread with
> great interest. :)
>
> Michel Fortin wrote:
>  > I mean, a grapheme is a slice of a string, can have multiple code points
>  > (like a string), can be appended the same way as a string, can be
>  > composed or decomposed using canonical normalization or compatibility
>  > normalization (like a string), and should be sorted, uppercased, and
>  > lowercased according to Unicode rules (like a string). Basically, a
>  > grapheme is just a string that happens to contain only one grapheme.
>
> I would like to stress the fact that Unicode knows nothing about
> sorting, uppercasing, or lowercasing.
>
> Those operations are tied to the alphabet (or writing system) that a
> certain grapheme happens to belong to at a given time. For example, we
> cannot uppercase the letter i without knowing what alphabet we are
> dealing with. Two possibilities: I and İ (I dot above).
>
> It is the same issue with sorting.

This is true and false ;-)

You are right, indeed, that issues like sorting are language-specific 
and, beyond that, use-case-specific. The Turkish case is a good example. 
For another one, in French I do not even know whether there is an 
official rule! Anyway, whatever the answer, even e.g. famous newspapers 
and official documents have used different rules. Most of them drop 
accents on uppercase letters, possibly because of computer limitations; 
there is a recent move (back) toward accented uppercase.
This is very annoying: "Hélène" has 2 uppercase versions, "HÉLÈNE" and 
"HELENE", each consistent and in actual use. Conversely, how is software 
supposed to guess the lowercase version of "HELENE"?

As for Unicode, it still defines standards for casing and for so-called 
collation (comparison, for sorting) algorithms. I don't know much more; 
I have never applied them personally, for reasons like the ones above. 
The full list of its technical docs can be found at 
http://unicode.org/reports/. See in particular 
http://unicode.org/reports/tr10/ for collation. 
(Unfortunately, case mapping is now part of the core standard doc, so 
it's hard to find.)
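
As a rough illustration of why a separate collation algorithm exists at 
all, here is a small Python sketch comparing plain code point order with 
an accent-blind sort key. The key function accent_blind_key is my own 
hypothetical helper and is NOT the algorithm described in tr10; it only 
shows the kind of problem that algorithm addresses. The last lines show 
that Python's built-in, locale-independent case mapping can even change 
the length of a string.

```python
import unicodedata

def accent_blind_key(word):
    """Illustrative sort key: base letters first, original text as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), word)

# "zèbre", "école", "eau", written with precomposed accented letters
words = ["z\u00e8bre", "\u00e9cole", "eau"]

# Plain comparison works on code points, so U+00E9 sorts after "z".
by_code_point = sorted(words)

# Comparing base letters first puts "école" between "eau" and "zèbre".
by_base_letter = sorted(words, key=accent_blind_key)

# Default case mapping is not always one character to one character.
german_upper = "stra\u00dfe".upper()         # "straße"
```

Whether accents should be ignored, treated as a tie-breaker, or ordered 
in some language-specific way is exactly the kind of choice that 
depends on the language and the use case.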

Denis
_________________
vita es estrany
spir.wikidot.com