VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer schveiguy at yahoo.com
Mon Jan 17 04:33:35 PST 2011


On Sat, 15 Jan 2011 17:45:37 -0500, Michel Fortin  
<michel.fortin at michelf.com> wrote:

> On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"  
> <schveiguy at yahoo.com> said:
>
>> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin   
>> <michel.fortin at michelf.com> wrote:
>>
>>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"   
>>> <schveiguy at yahoo.com> said:
>>>
>>>>> I'm not suggesting we impose it, just that we make it the default.
>>>>> If you want to iterate by dchar, wchar, or char, just write:
>>>>>
>>>>> 	foreach (dchar c; "exposé") {}
>>>>> 	foreach (wchar c; "exposé") {}
>>>>> 	foreach (char c; "exposé") {}
>>>>> 	// or
>>>>> 	foreach (dchar c; "exposé".by!dchar()) {}
>>>>> 	foreach (wchar c; "exposé".by!wchar()) {}
>>>>> 	foreach (char c; "exposé".by!char()) {}
>>>>>
>>>>> and it'll work. But the default would be a slice containing the
>>>>> grapheme, because this is the right way to represent a Unicode
>>>>> character.
>>>>  I think this is a good idea.  I previously was nervous about it, but
>>>> I'm not sure it makes a huge difference.  Returning a char[] is
>>>> certainly less work than normalizing a grapheme into one or more code
>>>> points, and then returning them.  All that it takes is to detect all
>>>> the code points within the grapheme.  Normalization can be done if
>>>> needed, but would probably have to output another char[], since a
>>>> normalized grapheme can occupy more than one dchar.
>>>  I'm glad we agree on that now.
>>  It's a matter of me slowly wrapping my brain around Unicode and how
>> it's used.  It seems like a typical committee-defined standard where
>> there are 10 ways to do everything; I was trying to weed out the
>> lesser-used (or so I perceived) pieces to allow a more implementable
>> library.  It's doubly hard for me since I have limited experience with
>> other languages, and I've never tried to write them with a computer (my
>> language classes in high school were back in the days of actually
>> writing stuff down on paper).
>
> Actually, I don't think Unicode was so badly designed. It's just that
> nobody had an idea of the real scope of the problem they had on their
> hands at first, and so they had to add a lot of things but wanted to
> keep things backward-compatible. We're at Unicode 6.0 now; can you name
> one other standard that has evolved through 6 major versions? I'm
> surprised it's not worse given all that it must support.

I didn't read the standard; all I understand about Unicode is from this NG
;)  What I meant was that having more than one way to do everything is the
mark of a committee-designed standard.  Usually with one of those, you
have one party who "absolutely needs" one way of doing things (most likely
because all their code is based on it), and other parties who want it done
a different way.  When compromises occur, the end result is a standard
that's unnecessarily difficult to implement.

> Indeed, the change would probably be too radical for D2.
>
> I think we agree that the default type should behave as a Unicode
> string, not an array of characters. I understand your opposition to
> conflating arrays of char with strings, and I agree with you to a
> certain extent that it could have been done better. But we can't really
> change the type of string literals, can we? The only thing we can change
> (I hope) at this point is how iterating on strings works.

I was hoping to change the string literal types.  If we don't do that, we
have a half-assed solution.  I don't think it's going to be impossible,
because string, wstring, and dstring are all aliases.

In fact, with my current proposed type, this already works:

mystring s = "hello";

But this doesn't:

auto s = "hello"; // still typed as immutable(char)[]

This isn't so bad; just require one to specify the type, right?  Well, it
fails miserably here:

void foo(mystring s) {...}
foo("hello"); // fails to match

In order to have a string type, string literals have to be typed as that  
type.
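To illustrate, here is a minimal sketch of the problem (the internals of
mystring here are placeholders, not my actual proposed type):

struct mystring
{
    immutable(char)[] data;
    this(immutable(char)[] s) { data = s; }  // enables: mystring s = "hello";
}

void foo(mystring s) {}

void main()
{
    mystring s = "hello";  // ok, the constructor is invoked
    auto t = "hello";      // t is immutable(char)[], not mystring
    foo("hello");          // error: no implicit construction of a mystring
                           // from a literal at a call site
}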

> Walter said earlier that he opposes changing foreach's default element
> type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
> grounds that it would silently break D1 compatibility. This is a valid
> point in my opinion.
>
> I think you're right when you say that not treating char[] as an array
> of characters breaks, to a certain extent, C compatibility. Another
> valid point.
>
> That said, I want to emphasize that iterating by grapheme, contrary to  
> iterating by dchar, does not break any code *silently*. The compiler  
> will complain loudly that you're comparing a string to a char, so you'll  
> have to change your code somewhere if you want things to compile. You'll  
> have to look at the code and decide what to do.

Changing iteration but not indexing is not going to fix the mess we have
right now.
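To see why, consider what happens if iteration changes but indexing
doesn't (a sketch of the hypothetical grapheme-by-default scheme, not
current D behavior):

string s = "exposé";

foreach (g; s)
{
    // hypothetically, g would be a char[] slice covering one grapheme
}

auto c = s[5];  // but indexing still yields a single char (a UTF-8 code
                // unit), here the first byte of the two-byte encoding of 'é'

The same string yields two incompatible element types depending on how you
access it.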

> One more thing:
>
> NSString in Cocoa is in essence the same thing as I'm proposing here: an
> array of UTF-16 code units, but with string behaviour. It supports
> by-code-unit indexing, but appending, comparing, searching for
> substrings, etc. all behave correctly as a Unicode string. Again, I
> agree that it's probably not the best design, but I can tell you it
> works well in practice. In fact, NSString doesn't even expose the
> concept of grapheme, it just uses them internally, and you're pretty
> much limited to the built-in operations. I think what we have here in
> concept is much better... even if it somewhat conflates code-unit arrays
> and strings.

But is NSString typed the *exact same* as an array, or is it a wrapper for
an array?  Looking at the docs, it appears it is not typed the same as an
array.

>>> Or you could make a grapheme a string_t. ;-)
>>
>> I'm a little uneasy having a range return itself as its element type.
>> For all intents and purposes, a grapheme is a string of one 'element',
>> so it could potentially be a string_t.
>>
>> It does seem daunting to have so many types, but at the same time,
>> types convey relationships at compile time that can make coding
>> impossible to get wrong, or make things actually possible when having
>> a single type doesn't.
>>
>> I'll give you an example from a previous life:
>>
>> [...]
>>
>> I feel that making extra types when the relationship between them is
>> important is worth the possible repetition of functionality.  Catching
>> bugs during compilation is soooo much better than experiencing them
>> during runtime.
>
> I can understand the utility of a separate type in your DateTime  
> example, but in this case I fail to see any advantage.
>
> I mean, a grapheme is a slice of a string, can have multiple code points  
> (like a string), can be appended the same way as a string, can be  
> composed or decomposed using canonical normalization or compatibility  
> normalization (like a string), and should be sorted, uppercased, and  
> lowercased according to Unicode rules (like a string). Basically, a  
> grapheme is just a string that happens to contain only one grapheme.  
> What would a custom type do differently than a string?

A grapheme type would not be a range; it would be an element of the string
range.  You could not append to it (otherwise, that would make it a
string).

In all other respects, it should act similarly to a string (as you say:
printing, upper-casing, comparison, etc.).
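Roughly, something like this (just a sketch of the shape I have in mind,
not the actual candidate type):

struct Grapheme
{
    private immutable(char)[] data;  // the code units of a single grapheme

    // comparable against string literals: g == "é"
    bool opEquals(immutable(char)[] s) const { return data == s; }

    string toString() const { return data; }  // for printing

    // deliberately no ~ or ~= operators: appending would turn it into a
    // string, not a grapheme
}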

>
> Also, grapheme == "a" is easy to understand because both are strings.  
> But if a grapheme is a separate type, what would a grapheme literal look  
> like?

A grapheme should be comparable to a string literal.  It should be
assignable from a string literal.  The drawback is we would need a runtime
check to ensure the string literal was actually one grapheme.  Some
compiler help in this regard would be useful, but I'm not sure how the
mechanics would work (you couldn't exactly type a literal differently
based on its contents).  Another possibility is to come up with a
different syntax to denote grapheme literals.
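Continuing the sketch above, assignment from a string could do the runtime
check.  graphemeStride below is a hypothetical helper that returns the
code-unit length of the first grapheme in its argument (assume we write
one; nothing like it exists in Phobos today):

import std.exception : enforce;

// member of the hypothetical Grapheme struct from before
void opAssign(immutable(char)[] s)
{
    // reject anything that isn't exactly one grapheme
    enforce(s.length && graphemeStride(s) == s.length,
            "string is not a single grapheme");
    data = s;
}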

> So in the end I don't think a grapheme needs a specific type, at least  
> not for general purpose text processing. If I split a string on  
> whitespace, do I get a range where elements are of type "word"? No, just  
> sliced strings.

It is not clear that using a separate type is the "right answer."  It may
be that an element of a string should itself be a string; this does work
in other languages that have no concept of a character.  An extra type,
however, gives us more concrete compile-time guarantees to work with.

> That said, I'm much less concerned by the type used to represent a  
> grapheme than by the Unicode correctness. I'm not opposed to a separate  
> type, I just don't really see the point.

I will try to explain better by making an actual candidate type.

-Steve

