VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer schveiguy at yahoo.com
Sat Jan 15 13:29:47 PST 2011


On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin  
<michel.fortin at michelf.com> wrote:

> On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"  
> <schveiguy at yahoo.com> said:
>
>>> I'm not suggesting we impose it, just that we make it the default. If   
>>> you want to iterate by dchar, wchar, or char, just write:
>>>  	foreach (dchar c; "exposé") {}
>>> 	foreach (wchar c; "exposé") {}
>>> 	foreach (char c; "exposé") {}
>>> 	// or
>>> 	foreach (dchar c; "exposé".by!dchar()) {}
>>> 	foreach (wchar c; "exposé".by!wchar()) {}
>>> 	foreach (char c; "exposé".by!char()) {}
>>>  and it'll work. But the default would be a slice containing the   
>>> grapheme, because this is the right way to represent a Unicode  
>>> character.
>>  I think this is a good idea.  I previously was nervous about it, but  
>> I'm  not sure it makes a huge difference.  Returning a char[] is  
>> certainly less  work than normalizing a grapheme into one or more code  
>> points, and then  returning them.  All that it takes is to detect all  
>> the code points within  the grapheme.  Normalization can be done if  
>> needed, but would probably  have to output another char[], since a  
>> normalized grapheme can occupy more  than one dchar.
>
> I'm glad we agree on that now.

It's a matter of me slowly wrapping my brain around Unicode and how it's  
used.  It seems like a typical committee-defined standard where there are  
10 ways to do everything, so I was trying to weed out the lesser-used (or  
so I perceived) pieces to allow a more implementable library.  It's doubly  
hard for me since I have limited experience with other languages, and I've  
never tried to write them with a computer (my language classes in high  
school were back in the days of actually writing stuff down on paper).

I once told a colleague who was on a standards committee that their  
proposed KLV (key-length-value) standard was ridiculous.  The wise  
committee had decided that, in order to avoid future issues, the length  
would be encoded as a single byte if < 128, or as 128 + the length of the  
length field for anything higher.  This means you could potentially have  
to parse and process a 127-byte integer!
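(For the curious: that scheme is essentially BER definite-length encoding.  
A minimal C++ sketch of decoding such a length field -- function name and  
the refuse-over-8-bytes policy are mine, not the standard's -- might look  
like this:)

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Decode a length field as described above: a single byte if the value
// is < 128; otherwise the first byte is 128 + N, where N is the number
// of following big-endian bytes that hold the actual length.
// Advances `pos` past the length field and returns the decoded length.
std::uint64_t parseLength(const std::vector<unsigned char>& buf,
                          std::size_t& pos) {
    unsigned char first = buf.at(pos++);
    if (first < 128)
        return first;                  // short form: the byte is the length
    std::size_t n = first - 128;       // long form: N bytes of length follow
    if (n > 8)                         // a 127-byte length is legal on the
        throw std::runtime_error(      // wire, but we refuse anything that
            "length field too long");  // won't fit in 64 bits
    std::uint64_t len = 0;
    for (std::size_t i = 0; i < n; ++i)
        len = (len << 8) | buf.at(pos++);
    return len;
}
```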

>
>
>> What if I modified my proposed string_t type to return T[] as its  
>> element  type, as you say, and string literals are typed as  
>> string_t!(whatever)?   In addition, the restrictions I imposed on  
>> slicing a code point actually  get imposed on slicing a grapheme.  That  
>> is, it is illegal to substring a  string_t in a way that slices through  
>> a grapheme (and by deduction, a code  point)?
>
> I'm not opposed to that on principle. I'm a little uneasy about having  
> so many types representing a string however. Some other raw comments:
>
> I agree that things would be more coherent if char[], wchar[], and  
> dchar[] behaved like other arrays, but I can't really see a  
> justification for those types to be in the language if there's nothing  
> special about them (why not a library type?).

I would not be opposed to getting rid of those types.  But I am very  
opposed to char[] not being an array.  If you want a string to be  
something other than an array, make it have a different syntax.  We also  
have to consider C compatibility.

However, that would put us in radical-change mode, and it would probably  
be pushed to D3 ;)  If we can find some way to fix the situation without  
invalidating TDPL, we should strive for that first IMO.

> If strings and arrays of code units are distinct, slicing in the middle  
> of a grapheme or in the middle of a code point could throw an error, but  
> for performance reasons it should probably check for that only when  
> array bounds checking is turned on (that would require compiler support  
> however).

Not really, it could use assert, but that throws an AssertError instead  
of a RangeError.  Of course, both are errors and will abort the program.   
I do wish there were a version(noboundscheck) to do this kind of stuff  
with...
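(As a sketch of the cheaper code-point half of that check, assuming UTF-8  
-- the helper names here are illustrative, not any proposed Phobos API: a  
slice index is valid only if the byte at that index is not a continuation  
byte.)

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// In UTF-8, continuation bytes have the bit pattern 10xxxxxx, so a byte
// index falls on a code-point boundary iff the byte there is NOT of that
// form.  (Grapheme boundaries need full Unicode tables; this check only
// catches slicing through a code point.)
bool onCodePointBoundary(const std::string& s, std::size_t i) {
    if (i >= s.size()) return i == s.size();
    return (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
}

std::string checkedSlice(const std::string& s,
                         std::size_t lo, std::size_t hi) {
    // assert rather than throw, mirroring the trade-off above: the
    // check disappears entirely when asserts are compiled out
    assert(onCodePointBoundary(s, lo) && onCodePointBoundary(s, hi));
    return s.substr(lo, hi - lo);
}
```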

>> Actually, we would need a grapheme to be its own type, because  
>> comparing  two char[]'s that don't contain equivalent bits and having  
>> them be equal,  violates the expectation that char[] is an array.
>>  So the string_t!char would return a grapheme_t!char (names to be   
>> discussed) as its element type.
>
> Or you could make a grapheme a string_t. ;-)

I'm a little uneasy having a range return itself as its element type.  For  
all intents and purposes, a grapheme is a string of one 'element', so it  
could potentially be a string_t.

It does seem daunting to have so many types, but at the same time, types  
convey relationships at compile time that can make certain mistakes  
impossible to write, or make things possible that a single type can't  
express.

I'll give you an example from a previous life:

Tango had a type called DateTime.  This type represented *either* a point  
in time or a span of time (depending on how you used it).  But I proposed  
we switch to two distinct types, one for a point in time and one for a  
span of time.  It was argued that since both were so similar, why couldn't  
we just keep one type?  The answer is simple -- having them be separate  
types allows me to express relationships that the compiler enforces.  For  
example, you can add two time spans together, but you can't add two points  
in time together.  Or maybe you want a function to accept only a time span  
(like a sleep operation).  If there was only one type, then  
sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)
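(That distinction can be sketched in a few lines of C++; the names here  
-- TimeSpan, TimePoint, sleepFor -- are illustrative, not Tango's actual  
API:)

```cpp
#include <cassert>

// Hypothetical types, one for a point in time and one for a span of
// time, each just wrapping a tick count.
struct TimeSpan  { long ticks; };
struct TimePoint { long ticks; };

// Spans add to spans, and a span can offset a point...
TimeSpan  operator+(TimeSpan a, TimeSpan b)  { return {a.ticks + b.ticks}; }
TimePoint operator+(TimePoint p, TimeSpan s) { return {p.ticks + s.ticks}; }

// ...but there is deliberately no operator+(TimePoint, TimePoint), so
// adding two points in time is a compile error, and a function like
// sleepFor can only ever be handed a duration:
void sleepFor(TimeSpan) { /* block for the given span */ }
```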

I feel that making extra types, when the relationship between them is  
important, is worth the possible repetition of functionality.  Catching  
bugs during compilation is so much better than experiencing them at  
runtime.

-Steve

