VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin michel.fortin at michelf.com
Sat Jan 15 04:24:33 PST 2011


On 2011-01-15 05:03:20 -0500, Lutger Blijdestijn 
<lutger.blijdestijn at gmail.com> said:

> Nick Sabalausky wrote:
> 
>> "Andrei Alexandrescu" <SeeWebsiteForEmail at erdani.org> wrote in message
>> news:ignon1$2p4k$1 at digitalmars.com...
>>> 
>>> This may sometimes not be what the user expected; most of the time they'd
>>> care about the code points.
>>> 
>> 
>> I dunno, spir has successfully convinced me that most of the time it's
>> graphemes the user cares about, not code points. Using code points is just
>> as misleading as using UTF-16 code units.
> 
> I agree. This is a very informative thread, thanks spir and everybody else.
> 
> Going back to the topic, it seems to me that a unicode string is a
> surprisingly complicated data structure that can be viewed from multiple
> types of ranges. In the light of this thread, a dchar doesn't seem like such
> a useful type anymore, it is still a low level abstraction for the purpose
> of correctly dealing with text. Perhaps even less useful, since it gives the
> illusion of correctness for those who are not in the know.
> 
> The algorithms in std.string can be upgraded to work correctly with all the
> issues mentioned, but the generic ones in std.algorithm will just subtly do
> the wrong thing when presented with dchar ranges. And, as I understood it,
> the purpose of a VleRange was exactly to make generic algorithms just work
> (tm) for strings.
> 
> Is it still possible to solve this problem or are we stuck with specialized
> string algorithms? Would it work if VleRange of string was a bidirectional
> range with string slices of graphemes as the ElementType and indexing with
> code units? Often used string algorithms could be specialized for
> performance, but if not, generic algorithms would still work.

Here's my idea.

I think it'd be a good idea to improve upon Andrei's first idea -- 
which was to treat char[], wchar[], and dchar[] all as ranges of dchar 
elements -- by changing the element type to match the string's own 
type. For instance, iterating over a char[] would give you char[] 
slices, each holding exactly one grapheme.
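To sketch the mechanics of that iteration (in Python, since the D library support discussed here doesn't exist yet): a simplified splitter that groups each base character with its trailing combining marks. Full grapheme segmentation follows Unicode's UAX #29 and handles more cases (Hangul jamo, ZWJ sequences, etc.); the function name and simplification are mine.

```python
import unicodedata

def simple_graphemes(s):
    """Split s into simplified grapheme clusters: each base character
    together with any immediately following combining marks.  This is
    only a sketch of the idea, not full UAX #29 segmentation."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch          # attach combining mark to its base
        else:
            clusters.append(ch)         # start a new cluster
    return clusters

# "expose" followed by "e" + combining acute accent (U+0301):
# the last slice holds two code points but one grapheme.
print(simple_graphemes("expose\u0301"))
```

The important property is that each element the range yields is a slice of the original string, not a single code point, so a combining sequence is never split in half.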

The second component would be to make the equality operator (==) 
compare strings in their normalized form, so that ("e" with a 
combining acute accent) == (pre-combined "é"). I think this would make 
D's support for Unicode much more intuitive.
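The normalization idea itself is easy to demonstrate with Python's unicodedata module (the helper name is mine; NFC is used here, but any canonical form would do):

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by a combining acute accent
precomposed = "\u00e9"   # pre-combined "é"

# The two spellings differ code point by code point...
assert decomposed != precomposed

# ...but compare equal once both sides are brought to a canonical form.
def normalized_eq(a, b):
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(normalized_eq(decomposed, precomposed))  # True
```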

This implies some semantic changes, mainly that everywhere you would 
write a "character" you must now use double quotes (the string "a") 
instead of single quotes (the code point 'a'), but from the user's 
point of view that's pretty much all there is to change.

There'll still be plenty of room for specialized algorithms, but their 
purpose would be limited to optimization. Correctness would be taken 
care of by the basic range interface, and foreach should follow suit 
and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:

	foreach (grapheme; "exposé")
		if (grapheme == "é")
			break;

In this example, even if one of these two strings uses the pre-combined 
form of "é" and the other uses a combining acute accent, the equality 
would still hold, since foreach iterates over full graphemes and == 
compares using normalization.
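Mirroring what that D loop is meant to do, in Python (the helper names are mine; the grapheme splitter is a deliberate simplification of UAX #29 that only groups trailing combining marks):

```python
import unicodedata

def graphemes(s):
    """Simplified clustering: base character + following combining marks."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def eq(a, b):
    """Equality under canonical normalization, as proposed for ==."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Decomposed haystack ("e" + U+0301), precomposed needle ("é"):
# the search still succeeds because comparison is normalized.
found = any(eq(g, "\u00e9") for g in graphemes("expose\u0301"))
print(found)  # True
```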

The important thing to keep in mind here is that the grapheme-splitting 
algorithm should be optimized for the case where there is no combining 
character and the compare algorithm for the case where the string is 
already normalized, since most strings will exhibit these 
characteristics.
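One cheap probe for that fast path (again a Python sketch, with a helper name of my choosing): if no code point in the string is a combining mark, every code point is its own grapheme and no recombination work is needed, so the per-code-point path can be taken directly.

```python
import unicodedata

def has_combining(s):
    """Fast-path probe: True if any code point is a combining mark.
    When this returns False, each code point is already a full
    grapheme, so splitting and comparison can skip the slow path."""
    return any(unicodedata.combining(ch) for ch in s)

print(has_combining("hello"))         # False: pure fast path
print(has_combining("expose\u0301"))  # True: needs the slow path
```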

As for ASCII, we could make it easier to use ubyte[] for it by making 
string literals implicitly convert to ubyte[] when all their characters 
are in the ASCII range.
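The rule being proposed is simple enough to sketch (in Python; the function name is mine): the conversion is allowed only when every character is below 0x80, so the resulting bytes carry no multi-byte UTF-8 sequences.

```python
def ascii_bytes(s):
    """Sketch of the proposed rule: a string literal converts to raw
    bytes only if every character is in the ASCII range (< 0x80)."""
    if all(ord(ch) < 0x80 for ch in s):
        return s.encode("ascii")
    raise ValueError("literal contains non-ASCII characters")

print(ascii_bytes("hello"))  # b'hello'
```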

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


