VLERange: a range in between BidirectionalRange and RandomAccessRange

Fri Jan 14 09:01:42 PST 2011

On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer" 
<schveiguy at yahoo.com> said:

> On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir at gmail.com> wrote:
> 
>> The point is not playing like that with Unicode flexibility. Rather
>> that composite characters are just normal thingies in most languages
>> of the world. Actually, on this point, English is a rare exception
>> (discarding letters imported from foreign languages like French 'à');
>> to the point of being, I guess, the only western language without any
>> diacritic.
> 
> Is it common to have multiple modifiers on a single character?

Not to my knowledge. But I rarely deal with non-Latin texts; there are 
probably some scripts out there that take advantage of this.

> The problem I see with using decomposed canonical form for strings is
> that we would have to return a dchar[] for each 'element', which
> severely complicates code that, for instance, only expects to handle
> English.

Actually, returning a sliced char[] or wchar[] could also be valid. 
User-perceived characters are basically a substring of one or more code 
points. I'm not sure it complicates the semantics of the language that 
much -- what's complicated about writing str.front == "a" instead 
of str.front == 'a'? -- although it probably would complicate the 
generated code and make it a little slower.
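
To make the slicing idea concrete, here is a minimal sketch, assuming a 
present-day Phobos in which std.uni.graphemeStride is available. 
GraphemeSlices is a hypothetical name used only for illustration, not an 
existing type; its front returns a slice of the original string rather 
than a single char.

```d
import std.uni : graphemeStride;

/// Hypothetical range (not part of Phobos): each element is a slice of
/// the original string covering one user-perceived character.
struct GraphemeSlices
{
    string src;

    @property bool empty() const { return src.length == 0; }

    @property string front()
    {
        // graphemeStride gives the number of code units occupied by the
        // grapheme cluster that starts at the given index.
        return src[0 .. graphemeStride(src, 0)];
    }

    void popFront()
    {
        src = src[graphemeStride(src, 0) .. $];
    }
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'x':
    // three code points, two graphemes.
    auto r = GraphemeSlices("e\u0301x");
    assert(r.front == "e\u0301"); // the element is a string slice
    r.popFront();
    assert(r.front == "x");       // compared against "x", not 'x'
    r.popFront();
    assert(r.empty);
}
```

No copying is involved here: every element points into the original 
buffer, so the extra cost is in finding the cluster boundaries.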

In the case of NSString in Cocoa, you can only access the 'characters' 
in their UTF-16 form. But everything from comparison to substring 
search is done using graphemes. It's as if they implemented 
specialized Unicode-aware algorithms for these functions. There's 
nothing generic about how it handles graphemes.

I'm not sure yet about what would be the right approach for D.

> I was hoping to lazily transform a string into its composed canonical
> form, allowing the (hopefully rare) exception when a composed character
> does not exist. My thinking was that this at least gives a useful
> string representation for 90% of usages, leaving the remaining 10% of
> usages to find a more complex representation (like your Text type).
> If we only get like 20% or 30% there by making dchar the element type,
> then we haven't made it useful enough.
>
> Either way, we need a string type that can be compared canonically for
> things like searches or opEquals.

I wonder if normalized string comparison shouldn't be built directly 
into the char[], wchar[], and dchar[] types instead. Add to that the 
idea above that iterating on a string would yield graphemes as char[], 
and this code would work perfectly irrespective of whether you used 
combining characters:

	foreach (grapheme; "exposé") {
		if (grapheme == "é")
			break;
	}

I think a good standard to evaluate our handling of Unicode is to see 
how easy it is to do things the right way. In the above, foreach would 
slice the string grapheme by grapheme, and the == operator would 
perform a normalized comparison. While it works correctly, it's 
probably not the most efficient way to do things, however.
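
As a rough illustration of those semantics written out by hand, here is 
a sketch assuming a present-day Phobos with std.uni.normalize and 
std.uni.graphemeStride. The helpers sameText and containsGrapheme are 
hypothetical names; the language itself does none of this for the 
built-in == or foreach.

```d
import std.uni : NFC, graphemeStride, normalize;

/// Hypothetical helper: equality after bringing both sides to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Hypothetical helper: walks `haystack` grapheme by grapheme and
/// compares each slice to `needle` with a normalized comparison.
bool containsGrapheme(string haystack, string needle)
{
    while (haystack.length)
    {
        immutable n = graphemeStride(haystack, 0);
        if (sameText(haystack[0 .. n], needle))
            return true;
        haystack = haystack[n .. $];
    }
    return false;
}

void main()
{
    string precomposed = "expos\u00E9";  // é as the single code point U+00E9
    string decomposed  = "expose\u0301"; // 'e' followed by U+0301

    assert(precomposed != decomposed);         // built-in == compares code units
    assert(sameText(precomposed, decomposed)); // normalized comparison agrees

    // The search succeeds whichever form either side uses.
    assert(containsGrapheme(decomposed, "\u00E9"));
    assert(containsGrapheme(precomposed, "e\u0301"));

    // A bare 'e' is not the same grapheme as 'e' plus an accent.
    assert(!containsGrapheme("e\u0301", "e"));
}
```

Normalizing both operands on every comparison is the inefficiency 
mentioned above; a real implementation would presumably compare code 
points lazily instead of allocating normalized copies.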

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/