VLERange: a range in between BidirectionalRange and RandomAccessRange

Sat Jan 15 20:58:30 PST 2011

On Saturday 15 January 2011 20:45:53 Michel Fortin wrote:
> On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisProg at gmx.com> said:
> > On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
> >> I have my idea.
> >> 
> >> I think it'd be a good idea to improve upon Andrei's first idea --
> >> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> >> elements -- by changing the element type to be the same as the string.
> >> For instance, iterating on a char[] would give you slices of char[],
> >> each having one grapheme.
> >> 
> >> The second component would be to make the string equality operator (==)
> >> for strings compare them in their normalized form, so that ("e" with
> >> combining acute accent) == (pre-combined "é"). I think this would make
> >> D support for Unicode much more intuitive.
> >> 
> >> This implies some semantic changes, mainly that everywhere you write a
> >> "character" you must use double-quotes (string "a") instead of single
> >> quote (code point 'a'), but from the user's point of view that's pretty
> >> much all there is to change.
> >> 
> >> There'll still be plenty of room for specialized algorithms, but their
> >> purpose would be limited to optimization. Correctness would be taken
> >> care of by the basic range interface, and foreach should follow suit
> >> and iterate by grapheme by default.
> >> 
> >> I wrote this example (or something similar) earlier in this thread:
> >> 	foreach (grapheme; "exposé")
> >> 		if (grapheme == "é")
> >> 			break;
> >> 
> >> In this example, even if one of these two strings uses the pre-combined
> >> form of "é" and the other uses a combining acute accent, the equality
> >> would still hold since foreach iterates on full graphemes and ==
> >> compares using normalization.
> > 
> > I think that that would cause definite problems. Having the element
> > type of the range be the same type as the range seems like it could
> > cause a lot of problems in std.algorithm and the like, and it's
> > _definitely_ going to confuse programmers. I'd expect it to be highly
> > bug-prone. They _need_ to be separate types.
> 
> I remember that someone already complained about this issue because he
> had a tree of ranges, and Andrei said he would take a look at this
> problem eventually. Perhaps now would be a good time.
> 
> > Now, given that dchar can't actually work completely as an element
> > type, you'd either need the string type to be a new type or the element
> > type to be a new type. So, either the string type has char[], wchar[],
> > or dchar[] for its element type, or char[], wchar[], and dchar[] have
> > something like uchar as their element type, where uchar is a struct
> > > which contains a char[], wchar[], or dchar[] which holds a single
> > > grapheme.
> 
> Having a new type for grapheme would work too. My preference still goes
> to reusing the string type because it makes the semantics simpler to
> understand, especially when comparing graphemes with literals.

If a character literal actually became a grapheme instead of a dchar, then that 
would likely solve that issue. But I fear that the semantics of having a range 
be its own element type actually make understanding it _harder_, not simpler. 
Being forced to compare string literals against what should be a character 
would definitely confuse programmers. Making a new character or grapheme type 
which represented a grapheme would be _far_ simpler to understand IMO. However, 
making it work really well would likely require that the compiler know about the 
grapheme type like it knows about dchar.
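
To make the idea of a separate grapheme element type concrete, here is a minimal 
sketch. It assumes a Phobos whose std.uni provides byGrapheme, Grapheme, and 
normalize, none of which is referenced in this thread, and indexOfGrapheme is a 
hypothetical helper written only for this illustration. The range (string) and 
its element (Grapheme) are distinct types, and comparison goes through NFC 
normalization instead of the built-in ==.

```d
import std.algorithm.comparison : equal;
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper (not part of Phobos): index of the first grapheme in
/// `haystack` that equals `needle` once both are normalized to NFC, or -1.
ptrdiff_t indexOfGrapheme(string haystack, string needle)
{
    auto target = normalize!NFC(needle);
    ptrdiff_t i = 0;
    // The element type here is Grapheme, not string: the range and its
    // element are different types.
    foreach (g; normalize!NFC(haystack).byGrapheme)
    {
        // g[] yields the code points that make up this grapheme.
        if (equal(g[], target))
            return i;
        ++i;
    }
    return -1;
}

void main()
{
    string decomposed  = "expose\u0301"; // 'e' followed by combining acute accent
    string precomposed = "expos\u00E9";  // pre-combined "é"

    // The built-in == compares code units, so the two spellings differ.
    assert(decomposed != precomposed);

    // Seven code points, but six graphemes.
    assert(decomposed.walkLength == 7);
    assert(decomposed.byGrapheme.walkLength == 6);

    // After normalization, either spelling of "é" is found in either string.
    assert(indexOfGrapheme(decomposed, "\u00E9") == 5);
    assert(indexOfGrapheme(precomposed, "e\u0301") == 5);
}
```

This is a library-level approximation only. It does nothing about character 
literals, which is the part that would likely need compiler support.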

- Jonathan M Davis