VLERange: a range in between BidirectionalRange and RandomAccessRange

Sat Jan 15 20:18:48 PST 2011

On Saturday 15 January 2011 19:25:47 Jonathan M Davis wrote:
> On Saturday 15 January 2011 15:59:27 Andrei Alexandrescu wrote:
> > On 1/15/11 4:45 PM, Michel Fortin wrote:
> > > On 2011-01-15 16:29:47 -0500, "Steven Schveighoffer" <schveiguy at yahoo.com> said:
> > >> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin <michel.fortin at michelf.com> wrote:
> > >>> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer" <schveiguy at yahoo.com> said:
> > >>>>> I'm not suggesting we impose it, just that we make it the default.
> > >>>>> If you want to iterate by dchar, wchar, or char, just write:
> > >>>>> foreach (dchar c; "exposé") {}
> > >>>>> foreach (wchar c; "exposé") {}
> > >>>>> foreach (char c; "exposé") {}
> > >>>>> // or
> > >>>>> foreach (dchar c; "exposé".by!dchar()) {}
> > >>>>> foreach (wchar c; "exposé".by!wchar()) {}
> > >>>>> foreach (char c; "exposé".by!char()) {}
> > >>>>> and it'll work. But the default would be a slice containing the
> > >>>>> grapheme, because this is the right way to represent a Unicode
> > >>>>> character.
> > >>>> 
> > >>>> I think this is a good idea. I previously was nervous about it, but
> > >>>> I'm not sure it makes a huge difference. Returning a char[] is
> > >>>> certainly less work than normalizing a grapheme into one or more
> > >>>> code points, and then returning them. All that it takes is to detect
> > >>>> all the code points within the grapheme. Normalization can be done
> > >>>> if needed, but would probably have to output another char[], since a
> > >>>> normalized grapheme can occupy more than one dchar.
> > >>> 
> > >>> I'm glad we agree on that now.
> > >> 
> > > >> It's a matter of me slowly wrapping my brain around Unicode and how
> > > >> it's used. It seems like it's a typical committee-defined standard
> > > >> where there are 10 ways to do everything. I was trying to weed out the
> > >> lesser used (or so I perceived) pieces to allow a more implementable
> > >> library. It's doubly hard for me since I have limited experience with
> > >> other languages, and I've never tried to write them with a computer
> > >> (my language classes in high school were back in the days of actually
> > >> writing stuff down on paper).
> > > 
> > > Actually, I don't think Unicode was so badly designed. It's just that
> > > > nobody had an idea of the real scope of the problem they had in hand at
> > > first, and so they had to add a lot of things but wanted to keep things
> > > backward-compatible. We're at Unicode 6.0 now, can you name one other
> > > standard that evolved enough to get 6 major versions? I'm surprised
> > > it's not worse given all that it must support.
> > > 
> > > That said, I'm sure if someone could redesign Unicode by breaking
> > > backward-compatibility we'd have something simpler. You could probably
> > > get rid of pre-combined characters and reduce the number of
> > > normalization forms. But would you be able to get rid of normalization
> > > entirely? I don't think so. Reinventing Unicode is probably not worth
> > > it.
> > > 
> > >>> I'm not opposed to that on principle. I'm a little uneasy about
> > >>> having so many types representing a string however. Some other raw
> > >>> comments:
> > >>> 
> > >>> I agree that things would be more coherent if char[], wchar[], and
> > >>> dchar[] behaved like other arrays, but I can't really see a
> > >>> justification for those types to be in the language if there's
> > >>> nothing special about them (why not a library type?).
> > >> 
> > >> I would not be opposed to getting rid of those types. But I am very
> > >> opposed to char[] not being an array. If you want a string to be
> > >> something other than an array, make it have a different syntax. We
> > >> also have to consider C compatibility.
> > >> 
> > >> However, we are in radical-change mode then, and this is probably
> > >> pushed to D3 ;) If we can find some way to fix the situation without
> > >> invalidating TDPL, we should strive for that first IMO.
> > > 
> > > Indeed, the change would probably be too radical for D2.
> > > 
> > > I think we agree that the default type should behave as a Unicode
> > > string, not an array of characters. I understand your opposition to
> > > conflating arrays of char with strings, and I agree with you to a
> > > certain extent that it could have been done better. But we can't really
> > > > change the type of string literals, can we? The only thing we can
> > > > change (I hope) at this point is how iterating on strings works.
> > > 
> > > > Walter said earlier that he opposes changing foreach's default element
> > > type to dchar for char[] and wchar[] (as Andrei did for ranges) on the
> > > ground that it would silently break D1 compatibility. This is a valid
> > > point in my opinion.
> > > 
> > > I think you're right when you say that not treating char[] as an array
> > > > of characters breaks, to a certain extent, C compatibility. Another
> > > valid point.
> > > 
> > > That said, I want to emphasize that iterating by grapheme, contrary to
> > > iterating by dchar, does not break any code *silently*. The compiler
> > > will complain loudly that you're comparing a string to a char, so
> > > you'll have to change your code somewhere if you want things to
> > > compile. You'll have to look at the code and decide what to do.
> > > 
> > > One more thing:
> > > 
> > > NSString in Cocoa is in essence the same thing as I'm proposing here:
> > > > an array of UTF-16 code units, but with string behaviour. It supports
> > > by-code-unit indexing, but appending, comparing, searching for
> > > substrings, etc. all behave correctly as a Unicode string. Again, I
> > > agree that it's probably not the best design, but I can tell you it
> > > works well in practice. In fact, NSString doesn't even expose the
> > > concept of grapheme, it just uses them internally, and you're pretty
> > > > much limited to the built-in operations. I think what we have here in
> > > concept is much better... even if it somewhat conflates code-unit
> > > arrays and strings.
> > 
> > I'm unclear on where this is converging to. At this point the commitment
> > > of the language and its standard library to (a) UTF array representation
> > > and (b) code point conceptualization is quite strong. Changing that
> > would be quite difficult and disruptive, and the benefits are virtually
> > nonexistent for most of D's user base.
> > 
> > It may be more realistic to consider using what we have as back-end for
> > grapheme-oriented processing. For example:
> > 
> > > struct Grapheme(Char) if (isSomeChar!Char)
> > > {
> > >     private const Char[] rep;
> > >     ...
> > > }
> > > 
> > > auto byGrapheme(S)(S s) if (isSomeString!S)
> > > {
> > >     ...
> > > }
> > > 
> > > string s = "Hello";
> > > foreach (g; byGrapheme(s))
> > > {
> > >     ...
> > > }
> 
> Considering that strings are already dealt with specially in order to have
> an element of dchar, I wouldn't think that it would be all that
> disruptive to make it so that they had an element type of Grapheme
> instead. Wouldn't that then fix all of std.algorithm and the like without
> really disrupting anything?
> 
> The issue of foreach remains, but without being willing to change what
> foreach defaults to, you can't really fix it - though I'd suggest that we
> at least make it a warning to iterate over strings without specifying the
> type. And if foreach were made to understand Grapheme like it understands
> dchar, then you could do
> 
> foreach(Grapheme g; str) { ... }
> 
> and have the compiler warn about
> 
> foreach(g; str) { ... }
> 
> and tell you to use Grapheme if you want to be comparing actual characters.
> Regardless, by making strings ranges of Grapheme rather than dchar, I would
> think that we would solve most of the problem. At minimum, we'd have pretty
> much the same problems that we have right now with char and wchar arrays,
> but we'd get rid of a whole class of Unicode problems. So, nothing would
> be worse, but some of it would be better.

I suppose that the one major omission, though, is that string comparisons
would still be by code unit, not by grapheme, which would be a problem. ==
could be made to use graphemes instead, but then you couldn't compare strings
by code units or code points unless you cast to ubyte[], ushort[], or
uint[]... It would still probably be worth making == use graphemes, though.
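
To make the comparison problem concrete, here is a minimal sketch. It assumes a Phobos release whose std.uni provides byGrapheme and normalize, which is not something this thread can take for granted. The helpers sameText and graphemeCount are hypothetical names made up for the illustration, and comparing NFC-normalized strings is only one possible approximation of what a grapheme-aware == might do, not a settled design.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: treat two strings as equal when their NFC
// normalizations match, i.e. when they are canonically equivalent.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

// Hypothetical helper: count user-perceived characters (graphemes).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string precomposed = "expos\u00E9";  // é as a single code point
    string decomposed = "expose\u0301";  // e followed by a combining acute accent

    // The built-in == compares code units, so these two spellings differ.
    assert(precomposed != decomposed);

    // After normalization they compare equal.
    assert(sameText(precomposed, decomposed));

    // Both spellings contain six graphemes, although the decomposed one
    // has seven code points.
    assert(graphemeCount(precomposed) == 6);
    assert(graphemeCount(decomposed) == 6);
    assert(decomposed.walkLength == 7);
}
```

Code that really wants code-unit equality would keep using the current == (or an explicit cast, as above), while something like sameText would be the opt-in for text comparison.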

- Jonathan M Davis