VLERange: a range in between BidirectionalRange and RandomAccessRange

Sat Jan 15 17:49:00 PST 2011

On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
> On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn
> <lutger.blijdestijn at gmail.com> said:
> > Nick Sabalausky wrote:
> >> "Andrei Alexandrescu" <SeeWebsiteForEmail at erdani.org> wrote in message
> >> news:ignon1$2p4k$1 at digitalmars.com...
> >> 
> >>> This may sometimes not be what the user expected; most of the time
> >>> they'd care about the code points.
> >> 
> >> I dunno, spir has successfully convinced me that most of the time it's
> >> graphemes the user cares about, not code points. Using code points is
> >> just as misleading as using UTF-16 code units.
> > 
> > I agree. This is a very informative thread, thanks spir and everybody
> > else.
> > 
> > Going back to the topic, it seems to me that a unicode string is a
> > surprisingly complicated data structure that can be viewed from multiple
> > types of ranges. In the light of this thread, a dchar doesn't seem like
> > such a useful type anymore, it is still a low level abstraction for the
> > purpose of correctly dealing with text. Perhaps even less useful, since
> > it gives the illusion of correctness for those who are not in the know.
> > 
> > The algorithms in std.string can be upgraded to work correctly with all
> > the issues mentioned, but the generic ones in std.algorithm will just
> > subtly do the wrong thing when presented with dchar ranges. And, as I
> > understood it, the purpose of a VleRange was exactly to make generic
> > algorithms just work (tm) for strings.
> > 
> > Is it still possible to solve this problem or are we stuck with
> > specialized string algorithms? Would it work if VleRange of string was a
> > bidirectional range with string slices of graphemes as the ElementType
> > and indexing with code units? Often used string algorithms could be
> > specialized for performance, but if not, generic algorithms would still
> > work.
> 
> I have my idea.
> 
> I think it'd be a good idea to improve upon Andrei's first idea --
> which was to treat char[], wchar[], and dchar[] all as ranges of dchar
> elements -- by changing the element type to be the same as the string.
> For instance, iterating on a char[] would give you slices of char[],
> each having one grapheme.
> 
> The second component would be to make the string equality operator (==)
> for strings compare them in their normalized form, so that ("e" with
> combining acute accent) == (pre-combined "é"). I think this would make
> D support for Unicode much more intuitive.
> 
> This implies some semantic changes, mainly that everywhere you write a
> "character" you must use double-quotes (string "a") instead of single
> quote (code point 'a'), but from the user's point of view that's pretty
> much all there is to change.
> 
> There'll still be plenty of room for specialized algorithms, but their
> purpose would be limited to optimization. Correctness would be taken
> care of by the basic range interface, and foreach should follow suit
> and iterate by grapheme by default.
> 
> I wrote this example (or something similar) earlier in this thread:
> 
> 	foreach (grapheme; "exposé")
> 		if (grapheme == "é")
> 			break;
> 
> In this example, even if one of these two strings uses the pre-combined
> form of "é" and the other uses a combining acute accent, the equality
> would still hold since foreach iterates on full graphemes and ==
> compares using normalization.
> 
> The important thing to keep in mind here is that the grapheme-splitting
> algorithm should be optimized for the case where there is no combining
> character and the compare algorithm for the case where the string is
> already normalized, since most strings will exhibit these
> characteristics.
> 
> As for ASCII, we could make it easier to use ubyte[] for it by making
> string literals implicitly convert to ubyte[] if all their characters
> are in ASCII range.

I think that that would cause definite problems. Having the element type of the 
range be the same type as the range seems like it could cause a lot of problems 
in std.algorithm and the like, and it's _definitely_ going to confuse 
programmers. I'd expect it to be highly bug-prone. They _need_ to be separate 
types.

Now, given that dchar can't actually work completely as an element type, you'd 
either need the string type to be a new type or the element type to be a new 
type. So, either the string type has char[], wchar[], or dchar[] for its element 
type, or char[], wchar[], and dchar[] have something like uchar as their element 
type, where uchar is a struct which contains a char[], wchar[], or dchar[] which 
holds a single grapheme.
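
To make the second option concrete, here is a rough sketch of what such an
element type might look like. It is only an illustration under assumptions: the
struct is called UChar here (the post's uchar), ByUChar is a hypothetical range
wrapper, and graphemeStride and normalize are taken from std.uni as found in
current Phobos. Only the char[] (string) case is shown.

```d
import std.range.primitives : ElementType;
import std.uni : graphemeStride, NFC, normalize;

/// Hypothetical element type: wraps the code units of one grapheme.
struct UChar
{
    string data; // assumed to hold exactly one grapheme (not validated)

    /// Two elements are equal if their NFC-normalized forms match.
    bool opEquals(const UChar rhs) const
    {
        return normalize!NFC(data) == normalize!NFC(rhs.data);
    }
}

/// Hypothetical input range that walks a string one grapheme at a time.
struct ByUChar
{
    string rest;

    @property bool empty() { return rest.length == 0; }

    @property UChar front()
    {
        // graphemeStride gives the number of code units in the first grapheme.
        return UChar(rest[0 .. graphemeStride(rest, 0)]);
    }

    void popFront() { rest = rest[graphemeStride(rest, 0) .. $]; }
}

// The element type is distinct from the string type it was sliced from.
static assert(is(ElementType!ByUChar == UChar));
static assert(!is(ElementType!ByUChar == string));

void main()
{
    size_t count;
    UChar last;
    foreach (u; ByUChar("expose\u0301")) // "e" + combining acute accent at the end
    {
        last = u;
        ++count;
    }
    assert(count == 6);
    assert(last == UChar("\u00E9")); // equal to pre-combined "é" after normalization
    assert(last.data != "\u00E9");   // while the underlying code units differ
}
```

Because UChar is its own type, generic code can never confuse an element with
the range it came from, which is the separation argued for above.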

I think that it's a great idea that programmers try to use substrings and slices 
rather than dchar, but making the element type a slice of the original type 
sounds like it's really asking for trouble.

- Jonathan M Davis