VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer schveiguy at yahoo.com
Fri Jan 14 05:37:56 PST 2011


On Fri, 14 Jan 2011 08:14:02 -0500, spir <denis.spir at gmail.com> wrote:

> On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:
>
>>> That's forgetting that most of the time people care about graphemes
>>> (user-perceived characters), not code points.
>>
>> I'm not so sure about that. What do you base this assessment on? Denis
>> wrote a library that according to him does grapheme-related stuff nobody
>> else does. So apparently graphemes is not what people care about
>> (although it might be what they should care about).
>
> I'm aware of that, and I have no definitive answer to the question. The  
> issue *does* exist --as shown even by trivial examples such as Michel's  
> below, not corner cases. The actual question is _not_ whether code or  
> "grapheme" is the proper level of abstraction. To this, the answer is  
> clear: codes are simply meaningless in 99% of cases. (All historic software
> deals with chars, conceptually, but they happen to be coded with single
> codes.)
> (And what about Objective-C? Why did its designers even bother with  
> that?).
>
> The question is rather: why do we nearly all happily go on ignoring the  
> issue? My present guess is a combination of factors:
>
> * The issue is masked by the misleading use of "abstract character" in  
> unicode literature. "Abstract" is very correct, but they should have
> found a term other than "character", say "abstract scripting mark". Their
> deceptive terminological choice lets most programmers believe that
> codepoints code characters, like in historic charsets.
> (Even worse: some doc explicitly states that ICU's notion of character
> matches the programming notion of character.)
> * Unicode added precomposed codes for a bunch of characters, supposedly for
> backward compatibility with said charsets. (But where is the gain? We need
> to decode them anyway...) The consequence is, at the pedagogical level,  
> very bad: most text-producing software (like editors) use such  
> precomposed codes when available for a given character. So that  
> programmers can happily go on believing in the code=character myth.  
> (Note: the gain in space is ridiculous for western text.)
> * Most characters that appear in western texts (at least "official"  
> characters of natural languages) have precomposed forms.
> * Programmers can very easily be unaware their code is incorrect: how do  
> you even notice it in test output?

* I don't even know how to make a grapheme that is more than one  
code-unit, let alone more than one code-point :)  Every time I try, I get  
'invalid utf sequence'.
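
For reference, one way to build such a grapheme is to write the code points
as escape sequences instead of raw bytes. The sketch below uses Python rather
than D, purely for illustration; the helper names utf8_units and utf16_units
are made up for this example. It builds "é" as 'e' plus U+0301 (combining
acute accent), counts code points and code units, and compares the result
with the precomposed U+00E9.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
# character (grapheme) made of two code points.
decomposed = "e\u0301"
# The same grapheme as a single precomposed code point.
precomposed = "\u00e9"

def utf8_units(s):
    """Number of UTF-8 code units (bytes) needed to encode s."""
    return len(s.encode("utf-8"))

def utf16_units(s):
    """Number of UTF-16 code units (16-bit words) needed to encode s."""
    return len(s.encode("utf-16-le")) // 2

# len() on a Python str counts code points, not graphemes.
print(len(decomposed), utf8_units(decomposed), utf16_units(decomposed))
print(len(precomposed), utf8_units(precomposed), utf16_units(precomposed))

# The two strings render the same but do not compare equal...
print(decomposed == precomposed)
# ...unless both are brought to the same normalization form first.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings display as a single character, which is why a mismatch like the
one in the third print is easy to miss in test output.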

I feel significantly ignorant on this issue, and I'm slowly getting enough  
knowledge to join the discussion, but being a dumb American who only  
speaks English, I have a hard time grasping how this shit all works.

-Steve

