VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin michel.fortin at michelf.com
Sat Jan 15 10:32:10 PST 2011


On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer" 
<schveiguy at yahoo.com> said:

> On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin  
> <michel.fortin at michelf.com> wrote:
> 
>> Actually, returning a sliced char[] or wchar[] could also be valid.  
>> User-perceived characters are basically a substring of one or more code 
>>  points. I'm not sure it complicates that much the semantics of the  
>> language -- what's complicated about writing str.front == "a" instead 
>> of  str.front == 'a'? -- although it probably would complicate the 
>> generated  code and make it a little slower.
> 
> Hm... this pushes the normalization outside the type, and into the  
> algorithms (such as find).
> 
> I was hoping to avoid that.

Not really. It pushes the normalization to the string comparison 
operator, as explained later.
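
To make that concrete, here is a rough sketch of what a normalizing
comparison would do. It is only an illustration: normalizedEquals is a
hypothetical helper name, and it assumes a library normalization
function such as the normalize!NFC found in Phobos's std.uni. A built-in
operator could well do this more lazily.

```d
import std.uni : normalize, NFC;

// Hypothetical helper (not an existing API): compare two strings after
// bringing both to the same normalization form (NFC here).
bool normalizedEquals(const(char)[] a, const(char)[] b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string composed = "\u00E9";    // é as a single code point
    string decomposed = "e\u0301"; // e followed by a combining acute accent

    // A plain array comparison sees two different code unit sequences.
    assert(composed != decomposed);

    // The normalizing comparison treats them as the same character.
    assert(normalizedEquals(composed, decomposed));
}
```

Note that find and the other algorithms stay unchanged here; only the
comparison knows about normalization.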


> I think I can  come up with an algorithm that normalizes into canonical 
> form as it  iterates.  It just might return part of a grapheme if the 
> grapheme cannot  be composed.

The problem with normalizing while iterating is that you lose
information about which code points actually make up the grapheme. If
you wanted to count the number of graphemes containing a particular
code point, you've lost that information.

Moreover, if all you want is to count the number of graphemes,
normalizing the characters is a waste of time.

I suggested in another post that we implement ranges that decompose
and recompose a string into its normalized form on the fly. That's
basically the same thing you suggest, but it'd have to be explicit
to avoid the problem above.
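
The two points above can be sketched as follows. This is an
illustration under assumptions, not a proposed design: it assumes a
grapheme range and a normalization function like the byGrapheme and
normalize!NFC found in Phobos's std.uni, and countGraphemesWith is a
hypothetical helper name.

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: count the graphemes of s that contain the
// code point cp.
size_t countGraphemesWith(string s, dchar cp)
{
    size_t n = 0;
    foreach (g; s.byGrapheme)
        if (g[].canFind(cp))
            ++n;
    return n;
}

void main()
{
    string decomposed = "expose\u0301"; // e followed by U+0301

    // Counting graphemes needs no normalization at all.
    assert(decomposed.byGrapheme.walkLength == 6);

    // The combining accent is visible in the original string...
    assert(countGraphemesWith(decomposed, '\u0301') == 1);

    // ...but gone once the string has been composed.
    string composed = normalize!NFC(decomposed);
    assert(composed.byGrapheme.walkLength == 6);
    assert(countGraphemesWith(composed, '\u0301') == 0);
}
```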


>> I wonder if normalized string comparison shouldn't be built directly in 
>>  the char[] wchar[] and dchar[] types instead.
> 
> No, in my vision of how strings should be typed, char[] is an array, 
> not a  string.  It should be treated like an array of code-units, where 
> two forms  that create the same grapheme are considered different.

Well, I agree there's a need for that sometimes. But if what you want is
just a dumb array of code units, why not use ubyte[], ushort[] and 
uint[] instead?

It seems to me that the whole point of having a different type for 
char[], wchar[], and dchar[] is that you know they are Unicode strings 
and can treat them as such. And if you treat them as Unicode strings, 
then perhaps the runtime and the compiler should too, for consistency's 
sake.


>> Also bring the idea above that iterating on a string would yield  
>> graphemes as char[] and this code would work perfectly irrespective of  
>> whether you used combining characters:
>> 
>> 	foreach (grapheme; "exposé") {
>> 		if (grapheme == "é")
>> 			break;
>> 	}
>> 
>> I think a good standard to evaluate our handling of Unicode is to see  
>> how easy it is to do things the right way. In the above, foreach would  
>> slice the string grapheme by grapheme, and the == operator would 
>> perform  a normalized comparison. While it works correctly, it's 
>> probably not the  most efficient way to do thing however.
> 
> I think this is a good alternative, but I'd rather not impose this on  
> people like myself who deal mostly with English.

I'm not suggesting we impose it, just that we make it the default. If 
you want to iterate by dchar, wchar, or char, just write:

	foreach (dchar c; "exposé") {}
	foreach (wchar c; "exposé") {}
	foreach (char c; "exposé") {}
	// or
	foreach (dchar c; "exposé".by!dchar()) {}
	foreach (wchar c; "exposé".by!wchar()) {}
	foreach (char c; "exposé".by!char()) {}

and it'll work. But the default would be a slice containing the 
grapheme, because this is the right way to represent a Unicode 
character.
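
To see why none of the three explicit element types lines up with what
the user perceives, here is a small sketch that counts loop iterations
for each of them. It uses only the typed foreach shown above; countAs
is a hypothetical helper name.

```d
import std.stdio : writeln;

// Hypothetical helper: count how many elements a typed foreach yields
// when iterating over s as char, wchar, or dchar.
size_t countAs(T)(string s)
{
    size_t n = 0;
    foreach (T c; s)
        ++n;
    return n;
}

void main()
{
    string composed = "expos\u00E9";    // é as one code point (U+00E9)
    string decomposed = "expose\u0301"; // e followed by U+0301

    // Both strings display as the same six characters.
    writeln(countAs!char(composed), " ", countAs!wchar(composed), " ",
            countAs!dchar(composed));   // prints "7 6 6"
    writeln(countAs!char(decomposed), " ", countAs!wchar(decomposed), " ",
            countAs!dchar(decomposed)); // prints "8 7 7"
}
```

Even iterating by dchar gives seven elements for the decomposed form,
while a grapheme slice would give six for both.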


> I think this should be  possible to do with wrapper types or 
> intermediate ranges which have  graphemes as elements (per my 
> suggestion above).

I think it should be the reverse. If you want your code to break when
it encounters multi-code-point graphemes, that's your choice, but you
should have to make that choice explicit. The default should be to
handle strings correctly.


-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


