VLERange: a range in between BidirectionalRange and RandomAccessRange

Sat Jan 15 21:24:33 PST 2011

On 2011-01-15 23:58:30 -0500, Jonathan M Davis <jmdavisProg at gmx.com> said:

> On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
>> On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg at gmx.com> said:
>>> On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
>>>> I have my idea.
>>>> 
>>>> I think it'd be a good idea to improve upon Andrei's first idea --
>>>> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
>>>> elements -- by changing the element type to be the same as the string.
>>>> For instance, iterating on a char[] would give you slices of char[],
>>>> each having one grapheme.
>>>> 
>>>> The second component would be to make the string equality operator
>>>> (==) for strings compare them in their normalized form, so that ("e"
>>>> with combining acute accent) == (pre-combined "é"). I think this would
>>>> make D support for Unicode much more intuitive.
>>>> 
>>>> This implies some semantic changes, mainly that everywhere you write a
>>>> "character" you must use double-quotes (string "a") instead of single
>>>> quote (code point 'a'), but from the user's point of view that's pretty
>>>> much all there is to change.
>>>> 
>>>> There'll still be plenty of room for specialized algorithms, but their
>>>> purpose would be limited to optimization. Correctness would be taken
>>>> care of by the basic range interface, and foreach should follow suit
>>>> and iterate by grapheme by default.
>>>> 
>>>> I wrote this example (or something similar) earlier in this thread:
>>>> 	foreach (grapheme; "exposé")
>>>> 		if (grapheme == "é")
>>>> 			break;
>>>> 
>>>> In this example, even if one of these two strings uses the pre-combined
>>>> form of "é" and the other uses a combining acute accent, the equality
>>>> would still hold since foreach iterates on full graphemes and
>>>> compares using normalization.
>>> 
>>> I think that that would cause definite problems. Having the element
>>> type of the range be the same type as the range seems like it could
>>> cause a lot of problems in std.algorithm and the like, and it's
>>> _definitely_ going to confuse programmers. I'd expect it to be highly
>>> bug-prone. They _need_ to be separate types.
>> 
>> I remember that someone already complained about this issue because he
>> had a tree of ranges, and Andrei said he would take a look at this
>> problem eventually. Perhaps now would be a good time.
>> 
>>> Now, given that dchar can't actually work completely as an element
>>> type, you'd either need the string type to be a new type or the element
>>> type to be a new type. So, either the string type has char[], wchar[],
>>> or dchar[] for its element type, or char[], wchar[], and dchar[] have
>>> something like uchar as their element type, where uchar is a struct
>>> which contains a char[], wchar[], or dchar[] that holds a single
>>> grapheme.
>> 
>> Having a new type for grapheme would work too. My preference still goes
>> to reusing the string type because it makes the semantics simpler to
>> understand, especially when comparing graphemes with literals.
> 
> If a character literal actually became a grapheme instead of a dchar, then
> that would likely solve that issue. But I fear that the semantics of
> having a range be its own element type actually make understanding it
> _harder_, not simpler. Being forced to compare string literals against
> what should be a character would definitely confuse programmers.

Character literals are treated as simple numbers by the language. By 
that I mean that you can write 'b' - 'a' == 1 and it'll be true. 
Arithmetic makes absolutely no sense for graphemes. If you want a 
special literal for graphemes, I'm afraid you'll have to invent 
something new. And at this point, why not use a string?
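
To illustrate the point, here is a minimal sketch in D. The variable name
`decomposed` is only illustrative. It shows that arithmetic on character
literals compiles because they are integral values, while a single
user-perceived character may already span several code points, for which
such arithmetic has no obvious meaning.

```d
void main()
{
    // A character literal is an integral value, so subtraction compiles
    // and yields the distance between the two code points.
    static assert('b' - 'a' == 1);

    // "e" followed by U+0301 (combining acute accent) is one
    // user-perceived character, but it is made of two code points.
    string decomposed = "e\u0301";

    // In UTF-8 that is three code units: one for 'e', two for U+0301.
    assert(decomposed.length == 3);
}
```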

> Making a new character or grapheme type which represented a grapheme 
> would be _far_ simpler to understand IMO. However, making it work 
> really well would likely require that the compiler know about the 
> grapheme type like it knows about dchar.

I'm looking for a simple solution. One that doesn't involve inventing a 
new grapheme literal syntax or adding new types the compiler must know 
about. I'm not really opposed to any of this, but the more complicated 
the solution is, the less likely it is to be adopted.

All I'm asking is that Unicode strings behave as Unicode strings should 
behave. Making iteration use graphemes by default and string comparison 
use the normalized form by default seems like a simple way to achieve 
that goal.
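
As a rough sketch of what that behaviour amounts to, the following assumes
a Phobos version whose std.uni provides `byGrapheme` and `normalize` (an
assumption, not something stated in this thread). The helpers
`normalizedEqual` and `graphemeCount` are hypothetical names made up for
the illustration; the proposal itself is that `==` and `foreach` would do
this by default, without any helper.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: compare two strings in their NFC-normalized form.
bool normalizedEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

// Hypothetical helper: count user-perceived characters (graphemes).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string precombined = "\u00E9";  // "é" as a single code point
    string combining   = "e\u0301"; // "e" + combining acute accent

    // Today's == compares code units, so the two spellings differ.
    assert(precombined != combining);

    // Comparing normalized forms treats them as the same text.
    assert(normalizedEqual(precombined, combining));

    // Iterating by grapheme yields one element for either spelling.
    assert(graphemeCount(precombined) == 1);
    assert(graphemeCount(combining) == 1);
}
```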

What matters most is not the implementation, but that the default 
behaviour be the right behaviour.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/