VLERange: a range in between BidirectionalRange and RandomAccessRange

Jonathan M Davis jmdavisProg at gmx.com
Sat Jan 15 19:25:47 PST 2011


On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
> On 1/15/11 4:45 PM, Michel Fortin wrote:
> > On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer"
> > 
> > <schveiguy at yahoo.com> said:
> >> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
> >> 
> >> <michel.fortin at michelf.com> wrote:
> >>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
> >>> 
> >>> <schveiguy at yahoo.com> said:
> >>>>> I'm not suggesting we impose it, just that we make it the default.
> >>>>> If you want to iterate by dchar, wchar, or char, just write:
> >>>>> foreach (dchar c; "exposé") {}
> >>>>> foreach (wchar c; "exposé") {}
> >>>>> foreach (char c; "exposé") {}
> >>>>> // or
> >>>>> foreach (dchar c; "exposé".by!dchar()) {}
> >>>>> foreach (wchar c; "exposé".by!wchar()) {}
> >>>>> foreach (char c; "exposé".by!char()) {}
> >>>>> and it'll work. But the default would be a slice containing the
> >>>>> grapheme, because this is the right way to represent a Unicode
> >>>>> character.
> >>>> 
> >>>> I think this is a good idea. I previously was nervous about it, but
> >>>> I'm not sure it makes a huge difference. Returning a char[] is
> >>>> certainly less work than normalizing a grapheme into one or more
> >>>> code points, and then returning them. All that it takes is to detect
> >>>> all the code points within the grapheme. Normalization can be done
> >>>> if needed, but would probably have to output another char[], since a
> >>>> normalized grapheme can occupy more than one dchar.
> >>> 
> >>> I'm glad we agree on that now.
> >> 
> >> It's a matter of me slowly wrapping my brain around unicode and how
> >> it's used. It seems like a typical committee-defined standard where
> >> there are 10 ways to do everything, so I was trying to weed out the
> >> lesser-used (or so I perceived) pieces to allow a more implementable
> >> library. It's doubly hard for me since I have limited experience with
> >> other languages, and I've never tried to write them with a computer
> >> (my language classes in high school were back in the days of actually
> >> writing stuff down on paper).
> > 
> > Actually, I don't think Unicode was so badly designed. It's just that
> > nobody had an idea of the real scope of the problem they had on their
> > hands at first, and so they had to add a lot of things but wanted to
> > keep things backward-compatible. We're at Unicode 6.0 now; can you name
> > one other standard that evolved enough to get 6 major versions? I'm
> > surprised it's not worse given all that it must support.
> > 
> > That said, I'm sure if someone could redesign Unicode by breaking
> > backward-compatibility we'd have something simpler. You could probably
> > get rid of pre-combined characters and reduce the number of
> > normalization forms. But would you be able to get rid of normalization
> > entirely? I don't think so. Reinventing Unicode is probably not worth it.
> > 
> >>> I'm not opposed to that on principle. I'm a little uneasy about
> >>> having so many types representing a string however. Some other raw
> >>> comments:
> >>> 
> >>> I agree that things would be more coherent if char[], wchar[], and
> >>> dchar[] behaved like other arrays, but I can't really see a
> >>> justification for those types to be in the language if there's
> >>> nothing special about them (why not a library type?).
> >> 
> >> I would not be opposed to getting rid of those types. But I am very
> >> opposed to char[] not being an array. If you want a string to be
> >> something other than an array, make it have a different syntax. We
> >> also have to consider C compatibility.
> >> 
> >> However, we are in radical-change mode then, and this is probably
> >> pushed to D3 ;) If we can find some way to fix the situation without
> >> invalidating TDPL, we should strive for that first IMO.
> > 
> > Indeed, the change would probably be too radical for D2.
> > 
> > I think we agree that the default type should behave as a Unicode
> > string, not an array of characters. I understand your opposition to
> > conflating arrays of char with strings, and I agree with you to a
> > certain extent that it could have been done better. But we can't really
> > change the type of string literals, can we? The only thing we can change
> > (I hope) at this point is how iterating on strings works.
> > 
> > Walter said earlier that he opposes changing foreach's default element
> > type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
> > grounds that it would silently break D1 compatibility. This is a valid
> > point in my opinion.
> > 
> > I think you're right when you say that not treating char[] as an array
> > of characters breaks, to a certain extent, C compatibility. Another valid
> > point.
> > 
> > That said, I want to emphasize that iterating by grapheme, contrary to
> > iterating by dchar, does not break any code *silently*. The compiler
> > will complain loudly that you're comparing a string to a char, so you'll
> > have to change your code somewhere if you want things to compile. You'll
> > have to look at the code and decide what to do.
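
(To make that concrete - a contrived snippet of my own, with the grapheme
represented simply as a slice of char:)

void main()
{
    const(char)[] g = "é";   // one grapheme, represented as a slice

    // if (g == 'é') {}      // error: incompatible types - the code must be updated
    if (g == "é") {}         // comparing against a string compiles and works
}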
> > 
> > One more thing:
> > 
> > NSString in Cocoa is in essence the same thing as I'm proposing here: an
> > array of UTF-16 code units, but with string behaviour. It supports
> > by-code-unit indexing, but appending, comparing, searching for
> > substrings, etc. all behave correctly as a Unicode string. Again, I
> > agree that it's probably not the best design, but I can tell you it
> > works well in practice. In fact, NSString doesn't even expose the
> > concept of a grapheme; it just uses them internally, and you're pretty
> > much limited to the built-in operations. I think what we have here in
> > concept is much better... even if it somewhat conflates code-unit arrays
> > and strings.
> 
> I'm unclear on where this is converging to. At this point the commitment
> of the language and its standard library to (a) UTF array representation
> and (b) code-point conceptualization is quite strong. Changing that
> would be quite difficult and disruptive, and the benefits are virtually
> nonexistent for most of D's user base.
> 
> It may be more realistic to consider using what we have as a back-end for
> grapheme-oriented processing. For example:
> 
> struct Grapheme(Char) if (isSomeChar!Char)
> {
>      private const Char[] rep;
>      ...
> }
> 
> auto byGrapheme(S)(S s) if (isSomeString!S)
> {
>     ...
> }
> 
> string s = "Hello";
> foreach (g; byGrapheme(s))
> {
>      ...
> }

Considering that strings are already dealt with specially in order to have an
element type of dchar, I wouldn't think that it would be all that disruptive to
make it so that they had an element type of Grapheme instead. Wouldn't that then
fix all of std.algorithm and the like without really disrupting anything?
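
Something along those lines could even be prototyped today as a library range -
a very rough sketch on my part, reusing Andrei's Grapheme/byGrapheme names.
graphemeLength is a hypothetical helper, stubbed here to treat each code point
as its own grapheme; a real version would need the Unicode grapheme-break rules:

import std.utf : stride;

// One grapheme, kept as a slice of the original string.
struct Grapheme
{
    string rep;
}

// Input range over the graphemes of a string.
struct ByGrapheme
{
    private string str;

    @property bool empty() { return str.length == 0; }

    @property Grapheme front()
    {
        return Grapheme(str[0 .. graphemeLength(str)]);
    }

    void popFront() { str = str[graphemeLength(str) .. $]; }
}

ByGrapheme byGrapheme(string s) { return ByGrapheme(s); }

// Hypothetical helper: number of code units in the first grapheme of s.
// Stubbed as one code point per grapheme; a real version would also
// consume any following combining characters.
size_t graphemeLength(string s) { return stride(s, 0); }

void main()
{
    foreach (g; byGrapheme("exposé"))
    {
        // g.rep is a string slice holding one full grapheme
    }
}

Since it provides empty/front/popFront, anything in std.algorithm that only
needs an input range would then see graphemes rather than code points.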

The issue of foreach remains, but unless we're willing to change what foreach
defaults to, we can't really fix it - though I'd suggest that we at least make
it a warning to iterate over strings without specifying the element type. And if
foreach were made to understand Grapheme like it understands dchar, then you
could do

foreach(Grapheme g; str) { ... }

and have the compiler warn about

foreach(g; str) { ... }

and tell you to use Grapheme if you want to be comparing actual characters. 
Regardless, by making strings ranges of Grapheme rather than dchar, I would 
think that we would solve most of the problem. At minimum, we'd have pretty much 
the same problems that we have right now with char and wchar arrays, but we'd 
get rid of a whole class of Unicode problems. So, nothing would be worse, but 
some of it would be better.
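
As a small, self-contained illustration of the class of problems that would go
away (the grapheme-level answer is only in a comment, since the proposed range
doesn't exist yet):

import std.range : walkLength;

void main()
{
    // "é" written as 'e' plus a combining acute accent: one grapheme,
    // two code points, three UTF-8 code units.
    string s = "e\u0301";

    assert(s.length == 3);       // code units - what char[] gives you
    assert(walkLength(s) == 2);  // code points - what string ranges give you today
    // A range of Grapheme, as proposed, would report a length of 1 here.
}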

- Jonathan M Davis

