VLERange: a range in between BidirectionalRange and RandomAccessRange

Fri Jan 14 15:21:00 PST 2011

On 2011-01-14 17:04:08 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 1/14/11 7:50 AM, Michel Fortin wrote:
>> On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> said:
>> 
>>> On 1/13/11 7:09 PM, Michel Fortin wrote:
>>>> That's forgetting that most of the time people care about graphemes
>>>> (user-perceived characters), not code points.
>>> 
>>> I'm not so sure about that. What do you base this assessment on? Denis
>>> wrote a library that according to him does grapheme-related stuff
>>> nobody else does. So apparently graphemes is not what people care
>>> about (although it might be what they should care about).
>> 
>> Apple implemented all these things in the NSString class in Cocoa. They
>> did all this work on Unicode at the beginning of Mac OS X, at a time
>> when making such changes wouldn't break anything.
>> 
>> It's a hard thing to change later when you have code that depends on the
>> old behaviour. It's a complicated matter and not so many people will
>> understand the issues, so it's no wonder many languages just deal with
>> code points.
> 
> That's a strong indicator, but we shouldn't get ahead of ourselves.
> 
> D took a certain risk by defaulting to Unicode at a time when the 
> dominant extant systems languages left the decision to more or less 
> exotic libraries, Java used UTF16 de jure but UCS2 de facto, and other 
> languages were just starting to adopt Unicode.
> 
> I think that risk was justified because the relative loss in speed was 
> often acceptable and the gains were there. Even so, there are people 
> here who protest against the loss in efficiency and argue that life is 
> harder for ASCII users.

Then perhaps it's time we find a way to handle non-Unicode encodings 
too. We can get away with treating ASCII strings as Unicode strings 
because of a useful property of UTF-8 (every ASCII byte means the same 
thing in UTF-8), but should we really do this?
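
As a rough illustration of that UTF-8 property (a minimal sketch in 
Python rather than D; the helper name is_ascii and the sample strings 
are made up for this example): bytes below 0x80 decode to the same 
characters under ASCII and UTF-8, while a byte from a non-Unicode 
encoding such as Latin-1 is generally not valid UTF-8.

```python
def is_ascii(data: bytes) -> bool:
    """Hypothetical helper: True when every byte is 7-bit ASCII."""
    return all(b < 0x80 for b in data)

ascii_bytes = "plain ASCII text".encode("ascii")

# The same bytes decode to the same text whether read as ASCII or UTF-8.
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# A Latin-1 encoded string is a different story: 0xE9 on its own is
# not a complete UTF-8 sequence, so decoding it as UTF-8 fails.
latin1_bytes = "caf\u00e9".encode("latin-1")  # b'caf\xe9'
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

So the "free ride" only covers ASCII; any other legacy encoding needs 
an explicit conversion step before it can be treated as a Unicode 
string.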

Also, it'd really help this discussion to have some hard numbers about 
the cost of decoding graphemes.

> Switching to variable-length representation of graphemes as bundles of 
> dchars and committing to that through and through will bring with it a 
> larger hit in efficiency and an increased difficulty in usage. I agree 
> that at a level that's the "right" thing to do, but I don't have yet 
> the feeling that combining characters are a widely-adopted winner. For 
> the most part, fonts don't support combining characters, and as a font 
> dilettante I can tell that putting arbitrary sets of diacritics on top 
> of characters is not what one should be doing as it'll look terrible. 
> Unicode is begrudgingly acknowledging combining characters. Only a 
> handful of libraries deal with them. I don't know how many applications 
> need or care for them, versus how many applications do fine with 
> precombined characters. I have trouble getting combining characters to 
> combine on this machine in any of the applications I use - and this is 
> a Mac.

I'm using the character palette (Edit menu > Special Characters...). 
From there you can insert arbitrary code points: use the palette's 
search function to find code points with "combining" in their names, 
then click the big character box on the lower left to insert them. 
Have fun!
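
For anyone who would rather experiment in code than in the palette, 
here is a small sketch (Python, standard unicodedata module) of the 
difference between a precombined character and a base letter followed 
by a combining mark. The function count_graphemes_approx is a 
hypothetical helper that only skips combining marks; it is an 
approximation and not the full grapheme cluster segmentation defined 
by Unicode.

```python
import unicodedata

precombined = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combined = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

def count_graphemes_approx(s: str) -> int:
    """Hypothetical helper: count code points that are not combining
    marks (canonical combining class 0). Only an approximation of
    user-perceived characters."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

# Same user-perceived character, different code point counts.
code_points = (len(precombined), len(combined))

# The two strings compare unequal until they are normalized.
equal_raw = precombined == combined
equal_nfc = unicodedata.normalize("NFC", combined) == precombined

graphemes = (count_graphemes_approx(precombined),
             count_graphemes_approx(combined))
```

Here the code point counts differ (1 versus 2) while the approximate 
grapheme count is 1 for both, which is the gap between code points and 
user-perceived characters that this thread is about.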

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/