VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer schveiguy at yahoo.com
Sat Jan 15 08:59:04 PST 2011


On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  
<michel.fortin at michelf.com> wrote:

> On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer"  
> <schveiguy at yahoo.com> said:
>
>> On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir at gmail.com> wrote:
>>
>>> The point is not playing like that with Unicode flexibility. Rather
>>> that composite characters are just normal thingies in most languages
>>> of the world. Actually, on this point, English is a rare exception
>>> (discarding letters imported from foreign languages like French 'à');
>>> to the point of being, I guess, the only western language without
>>> any diacritic.
>>
>> Is it common to have multiple modifiers on a single character?
>
> Not to my knowledge. But I rarely deal with non-Latin texts; there are
> probably some scripts out there that take advantage of this.
>
>
>> The problem I see with using decomposed canonical form for strings is
>> that we would have to return a dchar[] for each 'element', which
>> severely complicates code that, for instance, only expects to handle
>> English.
>
> Actually, returning a sliced char[] or wchar[] could also be valid.
> User-perceived characters are basically a substring of one or more code
> points. I'm not sure it complicates the semantics of the language that
> much -- what's complicated about writing str.front == "a" instead of
> str.front == 'a'? -- although it probably would complicate the generated
> code and make it a little slower.

Hm... this pushes the normalization outside the type, and into the  
algorithms (such as find).  I was hoping to avoid that.  I think I can  
come up with an algorithm that normalizes into canonical form as it  
iterates.  It just might return part of a grapheme if the grapheme cannot  
be composed.

I do think that we could make a byGrapheme member to aid in this:

foreach(grapheme; s.byGrapheme) // grapheme is a substring that contains
                                // one composed grapheme.
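
A minimal sketch of what such iteration looks like, assuming the std.uni module in current Phobos (which provides a byGrapheme range; it is not the member proposed here and may behave differently). The helper name codePointsPerGrapheme is hypothetical, chosen for illustration only:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: for each grapheme in s, record how many
// code points it is made of.
size_t[] codePointsPerGrapheme(string s)
{
    size_t[] sizes;
    foreach (g; s.byGrapheme)
        sizes ~= g.length; // Grapheme.length counts code points
    return sizes;
}

void main()
{
    // "expose" followed by U+0301 COMBINING ACUTE ACCENT,
    // i.e. "exposé" in decomposed form.
    string s = "expose\u0301";

    assert(s.length == 8);     // UTF-8 code units
    assert(s.walkLength == 7); // code points

    // Six graphemes; the last one spans two code points.
    size_t[] expected = [1, 1, 1, 1, 1, 2];
    assert(codePointsPerGrapheme(s) == expected);
}
```

Note that this only groups code points into graphemes; it does not compose them, so the last element here is still 'e' plus a combining accent rather than a single 'é'.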

>
> In the case of NSString in Cocoa, you can only access the 'characters'  
> in their UTF-16 form. But everything from comparison to search for  
> substring is done using graphemes. It's like they implemented  
> specialized Unicode-aware algorithms for these functions. There's no  
> genericness about how it handles graphemes.
>
> I'm not sure yet about what would be the right approach for D.

I hope we can use generic versions, so the type itself handles the  
conversions.  That makes any algorithm using the string range correct.

>> I was hoping to lazily transform a string into its composed canonical
>> form, allowing the (hopefully rare) exception when a composed character
>> does not exist.  My thinking was that this at least gives a useful
>> string representation for 90% of usages, leaving the remaining 10% of
>> usages to find a more complex representation (like your Text type).
>> If we only get like 20% or 30% there by making dchar the element type,
>> then we haven't made it useful enough.
>>
>> Either way, we need a string type that can be compared canonically
>> for things like searches or opEquals.
>
> I wonder if normalized string comparison shouldn't be built directly into
> the char[], wchar[] and dchar[] types instead.

No, in my vision of how strings should be typed, char[] is an array, not a  
string.  It should be treated like an array of code-units, where two forms  
that create the same grapheme are considered different.
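
To illustrate the distinction, here is a rough sketch assuming the normalize function and NFC form from std.uni in current Phobos. The function canonicallyEqual is a hypothetical name standing in for what a string type's opEquals might do; it normalizes eagerly rather than lazily, so it is not the iteration scheme described earlier:

```d
import std.uni : normalize, NFC;

// Hypothetical helper: compare two strings after bringing both
// into composed canonical form (NFC).
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed   = "expos\u00E9";  // precomposed é (U+00E9)
    string decomposed = "expose\u0301"; // e + combining acute accent

    // As arrays of code units, the two forms are different...
    assert(composed != decomposed);

    // ...but a canonical comparison treats them as the same text.
    assert(canonicallyEqual(composed, decomposed));
}
```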

> Also bring in the idea above that iterating on a string would yield
> graphemes as char[], and this code would work perfectly irrespective of
> whether you used combining characters:
>
> 	foreach (grapheme; "exposé") {
> 		if (grapheme == "é")
> 			break;
> 	}
>
> I think a good standard to evaluate our handling of Unicode is to see
> how easy it is to do things the right way. In the above, foreach would
> slice the string grapheme by grapheme, and the == operator would perform
> a normalized comparison. While it works correctly, it's probably not the
> most efficient way to do things, however.

I think this is a good alternative, but I'd rather not impose this on  
people like myself who deal mostly with English.  I think this should be  
possible to do with wrapper types or intermediate ranges which have  
graphemes as elements (per my suggestion above).

Does this sound reasonable?

-Steve

