VLERange: a range in between BidirectionalRange and

Sat Jan 15 14:19:48 PST 2011

Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
> <michel.fortin at michelf.com> wrote:
> 
> > On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
> > <schveiguy at yahoo.com> said:
> >
> >>> I'm not suggesting we impose it, just that we make it the default. If   
> >>> you want to iterate by dchar, wchar, or char, just write:
> >>>  	foreach (dchar c; "exposé") {}
> >>> 	foreach (wchar c; "exposé") {}
> >>> 	foreach (char c; "exposé") {}
> >>> 	// or
> >>> 	foreach (dchar c; "exposé".by!dchar()) {}
> >>> 	foreach (wchar c; "exposé".by!wchar()) {}
> >>> 	foreach (char c; "exposé".by!char()) {}
> >>>  and it'll work. But the default would be a slice containing the   
> >>> grapheme, because this is the right way to represent a Unicode  
> >>> character.
> >>  I think this is a good idea.  I previously was nervous about it, but  
> >> I'm  not sure it makes a huge difference.  Returning a char[] is  
> >> certainly less  work than normalizing a grapheme into one or more code  
> >> points, and then  returning them.  All that it takes is to detect all  
> >> the code points within  the grapheme.  Normalization can be done if  
> >> needed, but would probably  have to output another char[], since a  
> >> normalized grapheme can occupy more  than one dchar.
> >
> > I'm glad we agree on that now.
> 
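
For reference, the three levels of iteration under discussion can be counted with what Phobos offers today. This is only a sketch: `std.uni.byGrapheme` is the current library range and yields `Grapheme` values rather than the slices Michel proposes, and the `by!dchar()` syntax quoted above remains hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "exposé" spelled with a combining accent: 'e' followed by U+0301.
    string s = "expose\u0301";

    size_t units, points;
    foreach (char c; s) ++units;    // UTF-8 code units
    foreach (dchar c; s) ++points;  // code points (foreach decodes)

    assert(units == 8);                    // 6 ASCII bytes + 2 bytes for U+0301
    assert(points == 7);                   // 6 letters + 1 combining mark
    assert(s.byGrapheme.walkLength == 6);  // user-perceived characters
}
```

The last 'e' and its accent are two code points but one grapheme, which is why iterating by dchar is not enough to get "characters".
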
> It's a matter of me slowly wrapping my brain around Unicode and how it's  
> used.  It seems like a typical committee-defined standard where there  
> are 10 ways to do everything; I was trying to weed out the lesser-used (or  
> so I perceived) pieces to allow a more implementable library.  It's doubly  
> hard for me since I have limited experience with other languages, and I've  
> never tried to write them with a computer (my language classes in high  
> school were back in the days of actually writing stuff down on paper).
> 
> I once told a colleague who was on a standards committee that their  
> proposed KLV standard (key length value) was ridiculous.  The wise  
> committee had decided that in order to avoid future issues, the length  
> would be encoded as a single byte if < 128, or 128 + length of the length  
> field for anything higher.  This means you could potentially have to parse  
> and process a 127-byte integer!
> 
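
The length rule described above can be sketched as a small decoder. The function name `decodeLength` is made up for illustration, and two things are assumptions not stated in the post: the multi-byte length is read big-endian, and anything wider than 8 bytes is rejected instead of being parsed as a 127-byte integer.

```d
import std.exception : enforce;

/// Decodes one length field and advances `data` past it.
/// Hypothetical helper, not part of any KLV library.
ulong decodeLength(ref const(ubyte)[] data)
{
    enforce(data.length > 0, "empty input");
    const ubyte first = data[0];
    data = data[1 .. $];

    if (first < 128)
        return first;                 // short form: the byte is the length

    // Long form: the low 7 bits give the size of the length field itself.
    const size_t count = first & 0x7F;
    enforce(count <= ulong.sizeof, "length field wider than 64 bits");
    enforce(data.length >= count, "truncated length field");

    ulong result = 0;
    foreach (b; data[0 .. count])     // assumed big-endian byte order
        result = (result << 8) | b;
    data = data[count .. $];
    return result;
}

void main()
{
    const(ubyte)[] buf = [0x82, 0x01, 0x00, 0xAA]; // 0x82: two length bytes follow
    assert(decodeLength(buf) == 256);
    assert(buf.length == 1 && buf[0] == 0xAA);     // only the payload byte is left
}
```
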
> >
> >
> >> What if I modified my proposed string_t type to return T[] as its  
> >> element  type, as you say, and string literals are typed as  
> >> string_t!(whatever)?   In addition, the restrictions I imposed on  
> >> slicing a code point actually  get imposed on slicing a grapheme.  That  
> >> is, it is illegal to substring a  string_t in a way that slices through  
> >> a grapheme (and by deduction, a code  point)?
> >
> > I'm not opposed to that on principle. I'm a little uneasy about having  
> > so many types representing a string however. Some other raw comments:
> >
> > I agree that things would be more coherent if char[], wchar[], and  
> > dchar[] behaved like other arrays, but I can't really see a  
> > justification for those types to be in the language if there's nothing  
> > special about them (why not a library type?).
> 
> I would not be opposed to getting rid of those types.  But I am very  
> opposed to char[] not being an array.  If you want a string to be  
> something other than an array, make it have a different syntax.  We also  
> have to consider C compatibility.
> 
> However, we are in radical-change mode then, and this is probably pushed  
> to D3 ;)  If we can find some way to fix the situation without  
> invalidating TDPL, we should strive for that first IMO.
> 
> > If strings and arrays of code units are distinct, slicing in the middle  
> > of a grapheme or in the middle of a code point could throw an error, but  
> > for performance reasons it should probably check for that only when  
> > array bounds checking is turned on (that would require compiler support  
> > however).
> 
> Not really, it could use assert, but that throws an assert error instead  
> of a RangeError.  Of course, both are errors and will abort the program.   
> I do wish there was a version(noboundscheck) to do this kind of stuff  
> with...
> 
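
A rough sketch of the check being discussed, limited to UTF-8 code point boundaries (a grapheme check would need more Unicode data). `checkedSlice` and `onCodePointBoundary` are hypothetical names. Current D compilers predefine the `D_NoBoundsChecks` version identifier when bounds checking is switched off, which is roughly the `version(noboundscheck)` wished for above; the sketch uses it so the check disappears together with ordinary bounds checks and throws a `RangeError` rather than an assert error.

```d
import core.exception : RangeError;

/// True if index `i` does not point into the middle of a UTF-8 sequence.
bool onCodePointBoundary(const(char)[] s, size_t i)
{
    // Continuation bytes have the bit pattern 10xxxxxx.
    return i == s.length || (s[i] & 0xC0) != 0x80;
}

/// Hypothetical checked slice: refuses to cut through a code point.
string checkedSlice(string s, size_t lo, size_t hi)
{
    version (D_NoBoundsChecks) {}
    else
    {
        if (!onCodePointBoundary(s, lo) || !onCodePointBoundary(s, hi))
            throw new RangeError(__FILE__, __LINE__);
    }
    return s[lo .. hi];
}

void main()
{
    string s = "expos\u00E9";              // 5 ASCII bytes + 2 bytes for é
    assert(checkedSlice(s, 0, 5) == "expos");
    assert(checkedSlice(s, 5, 7) == "\u00E9");
    assert(!onCodePointBoundary(s, 6));    // index 6 is inside é
}
```
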
> >> Actually, we would need a grapheme to be its own type, because  
> >> comparing  two char[]'s that don't contain equivalent bits and having  
> >> them be equal,  violates the expectation that char[] is an array.
> >>  So the string_t!char would return a grapheme_t!char (names to be   
> >> discussed) as its element type.
> >
> > Or you could make a grapheme a string_t. ;-)
> 
> I'm a little uneasy having a range return itself as its element type.  For  
> all intents and purposes, a grapheme is a string of one 'element', so it  
> could potentially be a string_t.
> 
> It does seem daunting to have so many types, but at the same time, types  
> convey relationships at compile time that can make coding impossible to  
> get wrong, or make things actually possible when having a single type  
> doesn't.
> 
> I'll give you an example from a previous life:
> 
> Tango had a type called DateTime.  This type represented *either* a point  
> in time, or a span of time (depending on how you used it).  But I proposed  
> we switch to two distinct types, one for a point in time, one for a span  
> of time.  It was argued that both were so similar, why couldn't we just  
> keep one type?  The answer is simple -- having them be separate types  
> allows me to express relationships that the compiler enforces.  For  
> example, you can add two time spans together, but you can't add two points  
> in time together.  Or maybe you want a function to accept a time span  
> (like a sleep operation).  If there was only one type, then  
> sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)
> 
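
The point/span split can be illustrated with two small structs. The names `TimePoint`, `TimeSpan` and `sleepFor` are invented for this sketch and are not Tango's actual API; the `static assert` lines show the compiler rejecting the nonsensical operations.

```d
/// A length of time, in arbitrary ticks.
struct TimeSpan
{
    long ticks;

    // span + span = span
    TimeSpan opBinary(string op : "+")(TimeSpan rhs) const
    {
        return TimeSpan(ticks + rhs.ticks);
    }
}

/// A position on the time line, in ticks since some epoch.
struct TimePoint
{
    long ticks;

    // point + span = point
    TimePoint opBinary(string op : "+")(TimeSpan rhs) const
    {
        return TimePoint(ticks + rhs.ticks);
    }

    // point - point = span
    TimeSpan opBinary(string op : "-")(TimePoint rhs) const
    {
        return TimeSpan(ticks - rhs.ticks);
    }
}

/// Stand-in for a sleep operation: it only accepts a span.
void sleepFor(TimeSpan span) {}

// Neither of these compiles, so the mistakes are caught before runtime.
static assert(!__traits(compiles, TimePoint(1) + TimePoint(2)));
static assert(!__traits(compiles, sleepFor(TimePoint(1))));

void main()
{
    auto start = TimePoint(100);
    auto later = start + TimeSpan(50);
    assert((later - start).ticks == 50);
    sleepFor(later - start);   // fine: the difference of two points is a span
}
```
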
> I feel that making extra types when the relationship between them is  
> important is worth the possible repetition of functionality.  Catching  
> bugs during compilation is soooo much better than experiencing them during  
> runtime.
> 
> -Steve

I like Michel's proposed semantics and I also agree with you that it should be a distinct string type and not break consistency of regular arrays. 

Regarding your last point: do you mean that a grapheme would be a sub-type of string (a specialization where the string represents a single element)? If so, then it sounds good to me.
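
To make the question concrete, here is a minimal sketch of a grapheme as its own type wrapping a slice of code units. `GraphemeSlice` is a hypothetical name, and comparing through NFC normalization with today's `std.uni.normalize` is just one assumed way to define equality; it is not something the proposal specifies.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical single-grapheme type: holds a slice of code units and
/// compares by canonical equivalence instead of bit for bit.
struct GraphemeSlice
{
    string units;

    this(string s)
    {
        // Only checked in non-release builds, like the slicing check.
        assert(s.byGrapheme.walkLength == 1, "expected exactly one grapheme");
        units = s;
    }

    bool opEquals(const GraphemeSlice other) const
    {
        return normalize!NFC(units) == normalize!NFC(other.units);
    }
}

void main()
{
    auto precomposed = GraphemeSlice("\u00E9");   // é as a single code point
    auto combining   = GraphemeSlice("e\u0301");  // e + combining acute accent

    assert(precomposed.units != combining.units); // the arrays differ
    assert(precomposed == combining);             // the graphemes are equal
}
```

Keeping the raw `units` array separate from the comparison is what lets char[] stay an ordinary array while the grapheme type carries the Unicode-aware equality.
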