VLERange: a range in between BidirectionalRange and RandomAccessRange

spir denis.spir at gmail.com
Fri Jan 14 05:14:02 PST 2011


On 01/14/2011 05:23 AM, Andrei Alexandrescu wrote:

>> That's forgetting that most of the time people care about graphemes
>> (user-perceived characters), not code points.
>
> I'm not so sure about that. What do you base this assessment on? Denis
> wrote a library that according to him does grapheme-related stuff nobody
> else does. So apparently graphemes is not what people care about
> (although it might be what they should care about).

I'm aware of that, and I have no definitive answer to the question. The
issue *does* exist -- as shown even by trivial examples such as Michel's
below, which are not corner cases. The actual question is _not_ whether
code or "grapheme" is the proper level of abstraction. To this, the
answer is clear: codes are simply meaningless in 99% of cases. (All
historic software deals with chars, conceptually, but they happen to be
coded with single codes.)
(And what about Objective-C? Why did its designers even bother with that?)

The question is rather: why do we nearly all happily go on ignoring the 
issue? My present guess is a combination of factors:

* The issue is masked by the misleading use of "abstract character" in
Unicode literature. "Abstract" is very correct, but they should have
found a term other than "character", say "abstract scripting mark".
Their deceptive terminological choice lets most programmers believe that
codepoints code characters, like in historic charsets.
(Even worse: some docs explicitly state that ICU's notion of character
matches the programming notion of character.)
* Unicode added precomposed codes for a bunch of characters, supposedly
for backward compatibility with said charsets. (But where is the gain?
We need to decode them anyway...) The consequence is, at the pedagogical
level, very bad: most text-producing software (like editors) uses such
precomposed codes when available for a given character, so programmers
can happily go on believing in the code=character myth.
(Note: the gain in space is negligible for western text.)
* Most characters that appear in western texts (at least "official" 
characters of natural languages) have precomposed forms.
* Programmers can very easily be unaware their code is incorrect: how do 
you even notice it in test output?
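
To make the last two points concrete, here is a minimal sketch. It
assumes a Phobos release recent enough to provide std.uni.normalize;
the helper name sameText is hypothetical and is not taken from any
library discussed in this thread. It shows the same user-perceived
character in precomposed and decomposed form: the two strings look
alike when printed, yet they only compare equal after normalization.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : NFC, normalize;

// Hypothetical helper: compares two strings after bringing both
// to the same normalization form (NFC).
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9";   // 'e' with acute, one code point
    string decomposed  = "e\u0301";  // 'e' + combining acute accent

    // Identical on screen, different at the code level.
    assert(precomposed != decomposed);
    assert(precomposed.walkLength == 1); // one code point
    assert(decomposed.walkLength == 2);  // two code points

    // Only a normalization-aware comparison treats them as equal.
    assert(sameText(precomposed, decomposed));

    // Printing both gives no visual hint that they differ.
    writeln(precomposed, " ", decomposed);
}
```

A test that only eyeballs the printed output passes with either form,
which is one way such bugs can stay unnoticed.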

Thus, practically, programmers can (1) simply not know about the issue,
(2) have code that really works in typical use cases for their software,
or (3) not notice that their code runs incorrectly.
There is also an intermediate situation between (2) & (3), similar to
old problems with ASCII-only apps: they work wrongly when used in a
non-English environment, but what can users do, concretely? Most often,
they just have to cope with incorrectness, reinterpret outputs
differently, and/or find workarounds by cheating with the interface.

The responsibility of designers of tools for programmers is, imo,
important. We should first make the issue clear (very difficult, it's
a ubiquitous myth to break down), and then propose services that run
correctly in situations where said issue is relevant, here manipulation
of universal text, even if they are not very efficient at first.
On my side, and about D, I wish that most D programmers (1) are aware of
the problem, (2) understand its whys & hows, and (3) know there is a
correct solution. Then, (4) whether they actually use it is their choice
(and I don't care whether or not they do).

>>>> It also supports this:
>>>>
>>>> foreach(i, d; s)
>>>> {
>>>> writeln("The character in position ", i, " is ", d);
>>>> }
>>>>
>>>> where i is the index (might not be sequential)
>>>
>>> Well string supports that too, albeit with the nit that you need to
>>> specify dchar.
>>
>> Except it breaks with combining characters. For instance, take the
>> string "t̃", which is two code points -- 't' followed by combining tilde
>> (U+0303) -- and you'll get the following output:
>>
>> The character in position 0 is t
>> The character in position 1 is ̃
>>
>> (Note that the tilde becomes combined with the preceding space
>> character.)
>>
>> The conception of character that normal people have does not match the
>> notion of code points when combining characters enters the equation.
>
> This might be a good time to see whether we need to address graphemes
> systematically. Could you please post a few links that would educate me
> and others in the mysteries of combining characters?

Beware: this is a far too long text.
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction
(the directory above contains the current rough implementation of Text, 
plus a bit of its brother package DUnicode)
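
For a much shorter illustration of Michel's "t̃" example quoted above,
the sketch below counts the same string at three levels: code units,
code points and graphemes. It assumes a Phobos release that ships
std.uni.byGrapheme, which is not the Text package linked here; the
helper name graphemeCount is hypothetical.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "t\u0303"; // 't' followed by U+0303 COMBINING TILDE

    assert(s.length == 3);        // UTF-8 code units
    assert(s.walkLength == 2);    // code points (what foreach over dchar sees)
    assert(graphemeCount(s) == 1); // what a reader sees

    // Iterating by grapheme keeps the base letter and its mark together.
    foreach (g; s.byGrapheme)
        writeln("grapheme made of ", g.length, " code point(s)");
}
```

Only the last count matches what "normal people" would call the number
of characters in that string.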

> Thanks,
>
> Andrei

Denis
_________________
vita es estrany
spir.wikidot.com


