VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer schveiguy at yahoo.com
Sat Jan 15 12:20:08 PST 2011


On Sat, 15 Jan 2011 13:32:10 -0500, Michel Fortin  
<michel.fortin at michelf.com> wrote:

> On 2011-01-15 11:59:04 -0500, "Steven Schveighoffer"  
> <schveiguy at yahoo.com> said:
>
>> On Fri, 14 Jan 2011 12:01:42 -0500, Michel Fortin   
>> <michel.fortin at michelf.com> wrote:
>>
>>> Actually, returning a sliced char[] or wchar[] could also be valid.
>>> User-perceived characters are basically a substring of one or more
>>> code points. I'm not sure it complicates the semantics of the
>>> language that much -- what's complicated about writing str.front == "a"
>>> instead of str.front == 'a'? -- although it probably would complicate
>>> the generated code and make it a little slower.
>>  Hm... this pushes the normalization outside the type, and into the   
>> algorithms (such as find).
>>  I was hoping to avoid that.
>
> Not really. It pushes the normalization to the string comparison  
> operator, as explained later.
>
>
>> I think I can come up with an algorithm that normalizes into canonical
>> form as it iterates.  It just might return part of a grapheme if the
>> grapheme cannot be composed.
>
> The problem with normalizing while iterating is that you lose
> information about which code points were actually part of the grapheme.
> If you wanted to count the number of graphemes containing a particular
> code point, you've lost that information.

Are these common requirements?  I thought users mostly care about
graphemes, not code points.  I'm asking in the dark here, since I have
next to zero experience with Unicode strings.

>
> Moreover, if all you want is to count the number of graphemes,
> normalizing the characters is a waste of time.

This is true.  I can see this being a common need.
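
For illustration, here's a minimal sketch of the grapheme/code-point
distinction using std.uni's normalize and byGrapheme (Phobos additions
that postdate this thread, so take the exact names as assumptions):

    import std.range : walkLength;
    import std.uni : NFC, byGrapheme, normalize;

    void main()
    {
        string precomposed = "\u00E9";  // "é" as one code point (U+00E9)
        string decomposed = "e\u0301";  // "é" as 'e' + combining acute (U+0301)

        // Different bits, but canonically equivalent:
        assert(precomposed != decomposed);
        assert(normalize!NFC(decomposed) == precomposed);

        // Both are a single grapheme...
        assert(precomposed.byGrapheme.walkLength == 1);
        assert(decomposed.byGrapheme.walkLength == 1);

        // ...but they contain different numbers of code points, which is
        // exactly the information an implicit normalization would destroy.
        assert(precomposed.walkLength == 1);
        assert(decomposed.walkLength == 2);
    }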

>
> I suggested in another post that we implement ranges for decomposing
> and recomposing a string on the fly into its normalized form. That's
> basically the same thing as you suggest, but it'd have to be explicit
> to avoid the problem above.

OK, I see your point.
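
That explicit decompose/recompose step can be sketched with std.uni's
compose and decompose (again later Phobos functions, named here only
for illustration):

    import std.uni : compose, decompose;

    void main()
    {
        // Recompose 'e' + combining acute into the precomposed form, if any:
        assert(compose('e', '\u0301') == '\u00E9');

        // Decompose the precomposed form back into its canonical parts,
        // preserving the individual code points a lazy range could yield:
        auto parts = decompose('\u00E9');
        assert(parts.length == 2);
        assert(parts[0] == 'e' && parts[1] == '\u0301');
    }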

>
>
>>> I wonder if normalized string comparison shouldn't be built directly
>>> into the char[], wchar[], and dchar[] types instead.
>>  No, in my vision of how strings should be typed, char[] is an array,
>> not a string.  It should be treated like an array of code units, where
>> two forms that create the same grapheme are considered different.
>
> Well, I agree there's a need for that sometimes. But if what you want
> is just a dumb array of code units, why not use ubyte[], ushort[], and
> uint[] instead?

Because ubyte[], ushort[], and uint[] do not say that their data is
Unicode text.  The point is, if I want to write a function that takes
UTF-8, taking ubyte[] opens it up to any data, not just UTF-8 data.  But
if we have a method of iterating code units as you specify below, then I
think we are OK.
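
In today's Phobos that distinction can be expressed explicitly
(representation and byCodeUnit are later additions, used here just to
sketch the idea):

    import std.string : representation;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "exposé";  // assuming the literal's 'é' is precomposed

        // Drop down to the dumb array view when you really want raw bytes:
        immutable(ubyte)[] raw = s.representation;
        assert(raw.length == 7);  // 'é' occupies two UTF-8 code units

        // Or iterate code units while keeping the "this is UTF-8" typing:
        foreach (char c; s.byCodeUnit) { /* ... */ }
    }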

> It seems to me that the whole point of having a different type for  
> char[], wchar[], and dchar[] is that you know they are Unicode strings  
> and can treat them as such. And if you treat them as Unicode strings,  
> then perhaps the runtime and the compiler should too, for consistency's  
> sake.

I'd agree with you, but then there's that pesky [] after it indicating  
it's an array.  For consistency's sake, I'd say the compiler should treat  
T[] as an array of T's.

>>> Also, bring in the idea above that iterating on a string would yield
>>> graphemes as char[], and this code would work perfectly irrespective
>>> of whether you used combining characters:
>>>  	foreach (grapheme; "exposé") {
>>> 		if (grapheme == "é")
>>> 			break;
>>> 	}
>>>  I think a good standard for evaluating our handling of Unicode is to
>>> see how easy it is to do things the right way. In the above, foreach
>>> would slice the string grapheme by grapheme, and the == operator
>>> would perform a normalized comparison. While it works correctly, it's
>>> probably not the most efficient way to do things, however.
>>  I think this is a good alternative, but I'd rather not impose this on   
>> people like myself who deal mostly with English.
>
> I'm not suggesting we impose it, just that we make it the default. If  
> you want to iterate by dchar, wchar, or char, just write:
>
> 	foreach (dchar c; "exposé") {}
> 	foreach (wchar c; "exposé") {}
> 	foreach (char c; "exposé") {}
> 	// or
> 	foreach (dchar c; "exposé".by!dchar()) {}
> 	foreach (wchar c; "exposé".by!wchar()) {}
> 	foreach (char c; "exposé".by!char()) {}
>
> and it'll work. But the default would be a slice containing the  
> grapheme, because this is the right way to represent a Unicode character.

I think this is a good idea.  I was previously nervous about it, but I'm
not sure it makes a huge difference.  Returning a char[] is certainly
less work than normalizing a grapheme into one or more code points and
then returning them.  All it takes is detecting all the code points
within the grapheme.  Normalization can be done if needed, but it would
probably have to output another char[], since a normalized grapheme can
occupy more than one dchar.
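
Today's Phobos approximates this with byGrapheme, though the comparison
below is a plain code-point comparison via equal(), not the normalized
== proposed here (a sketch of the iteration, not of the proposal itself):

    import std.algorithm.comparison : equal;
    import std.uni : byGrapheme;

    void main()
    {
        foreach (g; "exposé".byGrapheme)
        {
            // g[] is a range of the grapheme's code points; equal() compares
            // them one by one, *without* normalizing first.
            if (equal(g[], "é"))
                break;
        }
    }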

What if I modified my proposed string_t type to return T[] as its element
type, as you say, and string literals were typed as string_t!(whatever)?
In addition, the restrictions I imposed on slicing through a code point
would instead be imposed on slicing through a grapheme.  That is, it
would be illegal to substring a string_t in a way that slices through a
grapheme (and by deduction, through a code point).
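
The boundary such a slicing rule would enforce is computable; std.uni's
graphemeStride (a later addition, assumed here for illustration) reports
where the next legal cut is:

    import std.uni : graphemeStride;

    void main()
    {
        // First grapheme is 'e' + combining acute: three UTF-8 code units.
        string s = "e\u0301xpos\u00E9";

        // graphemeStride gives the code-unit length of the grapheme starting
        // at a given index, i.e. the nearest boundary a string_t could allow:
        size_t boundary = graphemeStride(s, 0);
        assert(boundary == 3);

        auto rest = s[boundary .. $];  // fine; s[0 .. 1] would cut the grapheme
    }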

Actually, we would need a grapheme to be its own type, because having two
char[]'s that don't contain equivalent bits compare equal violates the
expectation that char[] is an array.

So the string_t!char would return a grapheme_t!char (names to be  
discussed) as its element type.
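
A minimal sketch of that grapheme_t idea (the name and design come from
this discussion, not from Phobos; normalize is assumed from std.uni, and
a real version would also need to override toHash):

    import std.uni : NFC, normalize;

    // One user-perceived character, compared by canonical equivalence
    // rather than by bits, so it doesn't pretend to be a plain array.
    struct grapheme_t
    {
        string slice;  // the code units making up this grapheme

        bool opEquals(const grapheme_t rhs) const
        {
            return normalize!NFC(slice) == normalize!NFC(rhs.slice);
        }
    }

    unittest
    {
        // Precomposed vs decomposed "é": different bits, equal graphemes.
        assert(grapheme_t("\u00E9") == grapheme_t("e\u0301"));
        assert("\u00E9" != "e\u0301");
    }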

>
>
>> I think this should be  possible to do with wrapper types or  
>> intermediate ranges which have  graphemes as elements (per my  
>> suggestion above).
>
> I think it should be the reverse. If you want your code to break when it  
> encounters multi-code-point graphemes then it's your choice, but you  
> should have to make your choice explicit. The default should be to  
> handle strings correctly.

You are probably right.

-Steve

