VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin michel.fortin at michelf.com
Sat Jan 15 14:45:37 PST 2011


On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" 
<schveiguy at yahoo.com> said:

> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
> <michel.fortin at michelf.com> wrote:
> 
>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
>> <schveiguy at yahoo.com> said:
>> 
>>>> I'm not suggesting we impose it, just that we make it the default. If   
>>>> you want to iterate by dchar, wchar, or char, just write:
>>>>  	foreach (dchar c; "exposé") {}
>>>> 	foreach (wchar c; "exposé") {}
>>>> 	foreach (char c; "exposé") {}
>>>> 	// or
>>>> 	foreach (dchar c; "exposé".by!dchar()) {}
>>>> 	foreach (wchar c; "exposé".by!wchar()) {}
>>>> 	foreach (char c; "exposé".by!char()) {}
>>>>  and it'll work. But the default would be a slice containing the   
>>>> grapheme, because this is the right way to represent a Unicode  
>>>> character.
>>>  I think this is a good idea.  I previously was nervous about it, but  
>>> I'm  not sure it makes a huge difference.  Returning a char[] is  
>>> certainly less  work than normalizing a grapheme into one or more code  
>>> points, and then  returning them.  All that it takes is to detect all  
>>> the code points within  the grapheme.  Normalization can be done if  
>>> needed, but would probably  have to output another char[], since a  
>>> normalized grapheme can occupy more  than one dchar.
>> 
>> I'm glad we agree on that now.
> 
> It's a matter of me slowly wrapping my brain around unicode and how 
> it's  used.  It seems like it's a typical committee defined standard 
> where there  are 10 ways to do everything, I was trying to weed out the 
> lesser used (or  so I perceived) pieces to allow a more implementable 
> library.  It's doubly  hard for me since I have limited experience with 
> other languages, and I've  never tried to write them with a computer 
> (my language classes in high  school were back in the days of actually 
> writing stuff down on paper).

Actually, I don't think Unicode was so badly designed. It's just that 
nobody had an idea of the real scope of the problem they had in hand at 
first, and so they had to add a lot of things but wanted to keep things 
backward-compatible. We're at Unicode 6.0 now; can you name one other 
standard that evolved enough to get 6 major versions? I'm surprised 
it's not worse given all that it must support.

That said, I'm sure if someone could redesign Unicode by breaking 
backward-compatibility we'd have something simpler. You could probably 
get rid of pre-combined characters and reduce the number of 
normalization forms. But would you be able to get rid of normalization 
entirely? I don't think so. Reinventing Unicode is probably not worth 
it.


>> I'm not opposed to that on principle. I'm a little uneasy about having  
>> so many types representing a string however. Some other raw comments:
>> 
>> I agree that things would be more coherent if char[], wchar[], and  
>> dchar[] behaved like other arrays, but I can't really see a  
>> justification for those types to be in the language if there's nothing  
>> special about them (why not a library type?).
> 
> I would not be opposed to getting rid of those types.  But I am very  
> opposed to char[] not being an array.  If you want a string to be  
> something other than an array, make it have a different syntax.  We 
> also  have to consider C compatibility.
> 
> However, we are in radical-change mode then, and this is probably 
> pushed  to D3 ;)  If we can find some way to fix the situation without  
> invalidating TDPL, we should strive for that first IMO.

Indeed, the change would probably be too radical for D2.

I think we agree that the default type should behave as a Unicode 
string, not an array of characters. I understand your opposition to 
conflating arrays of char with strings, and I agree with you to a 
certain extent that it could have been done better. But we can't really 
change the type of string literals, can we? The only thing we can 
change (I hope) at this point is how iterating on strings works.

Walter said earlier that he opposes changing foreach's default element 
type to dchar for char[] and wchar[] (as Andrei did for ranges) on the 
grounds that it would silently break D1 compatibility. This is a valid 
point in my opinion.
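
For reference, here is a small sketch of the mismatch being discussed: 
foreach with no explicit type walks a string by code unit, while the 
range primitives see the same string as a range of dchar. The function 
names countDefault and countByDchar are made up for this illustration.

```d
import std.range : ElementType;

// Illustrative helper: foreach with an inferred element type
// walks a string one UTF-8 code unit at a time.
size_t countDefault(string s)
{
    size_t n;
    foreach (c; s)
    {
        static assert(is(typeof(c) : char)); // a code unit, not a dchar
        ++n;
    }
    return n;
}

// Illustrative helper: asking for dchar makes foreach decode
// whole code points instead.
size_t countByDchar(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

// The range primitives already treat a string as a range of dchar.
static assert(is(ElementType!string == dchar));

void main()
{
    string s = "expos\u00E9"; // precomposed é: two UTF-8 code units
    assert(countDefault(s) == 7);
    assert(countByDchar(s) == 6);
}
```

Changing what the first loop yields from char to dchar would still 
compile in most code, which is why that change is a silent one.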

I think you're right when you say that not treating char[] as an array 
of characters breaks, to a certain extent, C compatibility. Another 
valid point.

That said, I want to emphasize that iterating by grapheme, contrary to 
iterating by dchar, does not break any code *silently*. The compiler 
will complain loudly that you're comparing a string to a char, so 
you'll have to change your code somewhere if you want things to 
compile. You'll have to look at the code and decide what to do.
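
A minimal sketch of that behaviour, assuming a Phobos recent enough to 
provide std.uni.graphemeStride. GraphemeSlices and graphemes are 
hypothetical names invented for this example, not an existing or 
proposed API; they only stand in for "iteration yields string slices".

```d
import std.uni : graphemeStride;

// Hypothetical range: each element is the slice of the source
// string that covers exactly one grapheme.
struct GraphemeSlices
{
    string rest;

    @property bool empty() { return rest.length == 0; }
    @property string front() { return rest[0 .. graphemeStride(rest, 0)]; }
    void popFront() { rest = rest[graphemeStride(rest, 0) .. $]; }
}

// Hypothetical helper: collect the grapheme slices into an array.
string[] graphemes(string s)
{
    string[] result;
    foreach (g; GraphemeSlices(s))
        result ~= g;
    return result;
}

void main()
{
    // 'e' followed by a combining acute accent (U+0301).
    auto parts = graphemes("expose\u0301");
    assert(parts.length == 6);
    assert(parts[5] == "e\u0301"); // both code points stay together

    // Code written for char elements is rejected by the compiler
    // instead of silently changing meaning.
    static assert(!__traits(compiles, parts[0] == 'e'));

    // The fix is explicit: compare against a string.
    assert(parts[0] == "e");
}
```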

One more thing:

NSString in Cocoa is in essence the same thing as I'm proposing here: 
an array of UTF-16 code units, but with string behaviour. It supports 
by-code-unit indexing, but appending, comparing, searching for 
substrings, etc. all behave correctly as a Unicode string. Again, I 
agree that it's probably not the best design, but I can tell you it 
works well in practice. In fact, NSString doesn't even expose the 
concept of graphemes; it just uses them internally, and you're pretty 
much limited to the built-in operations. I think what we have here in 
concept is much better... even if it somewhat conflates code-unit 
arrays and strings.


>> Or you could make a grapheme a string_t. ;-)
> 
> I'm a little uneasy having a range return itself as its element type.  
> For  all intents and purposes, a grapheme is a string of one 'element', 
> so it  could potentially be a string_t.
> 
> It does seem daunting to have so many types, but at the same time, 
> types  convey relationships at compile time that can make coding 
> impossible to  get wrong, or make things actually possible when having 
> a single type  doesn't.
> 
> I'll give you an example from a previous life:
> 
> [...]
> I feel that making extra types when the relationship between them is  
> important is worth the possible repetition of functionality.  Catching  
> bugs during compilation is soooo much better than experiencing them 
> during  runtime.

I can understand the utility of a separate type in your DateTime 
example, but in this case I fail to see any advantage.

I mean, a grapheme is a slice of a string, can have multiple code 
points (like a string), can be appended the same way as a string, can 
be composed or decomposed using canonical normalization or 
compatibility normalization (like a string), and should be sorted, 
uppercased, and lowercased according to Unicode rules (like a string). 
Basically, a grapheme is just a string that happens to contain only one 
grapheme. What would a custom type do differently than a string?

Also, grapheme == "a" is easy to understand because both are strings. 
But if a grapheme is a separate type, what would a grapheme literal 
look like?

So in the end I don't think a grapheme needs a specific type, at least 
not for general purpose text processing. If I split a string on 
whitespace, do I get a range where elements are of type "word"? No, 
just sliced strings.
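
To make the comparison concrete, here is a short sketch. It assumes 
std.array.split and a Phobos that provides std.uni.graphemeStride; 
firstGrapheme is a hypothetical helper written only for this example.

```d
import std.array : split;
import std.uni : graphemeStride, toUpper;

// Hypothetical helper: the first grapheme of `s`, returned as a
// plain slice of `s` rather than as a dedicated type.
string firstGrapheme(string s)
{
    return s[0 .. graphemeStride(s, 0)];
}

void main()
{
    // Splitting on whitespace yields ordinary strings, not "words".
    auto words = "one grapheme".split();
    static assert(is(typeof(words) == string[]));
    assert(words == ["one", "grapheme"]);

    // A grapheme slice supports the usual string operations.
    string g = firstGrapheme("abc");
    assert(g == "a");            // compared against a string literal
    assert((g ~ "b") == "ab");   // appended like any string
    assert(toUpper(g) == "A");   // case-mapped like any string

    // A base letter plus combining mark is still a single slice.
    assert(firstGrapheme("e\u0301cole") == "e\u0301");
}
```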

That said, I'm much less concerned by the type used to represent a 
grapheme than by Unicode correctness. I'm not opposed to a separate 
type, I just don't really see the point.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
