VLERange: a range in between BidirectionalRange and RandomAccessRange

Thu Jan 13 14:01:35 PST 2011

On Thu, 13 Jan 2011 15:51:00 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail at erdani.org> wrote:

> On 1/13/11 11:35 AM, Steven Schveighoffer wrote:
>> On Thu, 13 Jan 2011 14:08:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> wrote:
>>> Let's take a look:
>>>
>>> // Incorrect string code
>>> void fun(string s) {
>>>     foreach (i; 0 .. s.length) {
>>>         writeln("The character in position ", i, " is ", s[i]);
>>>     }
>>> }
>>>
>>> // Incorrect string_t code
>>> void fun(string_t!char s) {
>>>     foreach (i; 0 .. s.codeUnits) {
>>>         writeln("The character in position ", i, " is ", s[i]);
>>>     }
>>> }
>>>
>>> Both functions are incorrect, albeit in different ways. The only
>>> improvement I'm seeing is that the user needs to write codeUnits
>>> instead of length, which may make her think twice. Clearly, however,
>>> copiously incorrect code can be written with the proposed interface
>>> because it tries to hide the reality that underneath a variable-length
>>> encoding is being used, but doesn't hide it completely (albeit for
>>> good efficiency-related reasons).
>>
>> You might be looking at my previous version. The new version (recently
>> posted) will throw an exception for that code if a multi-code-unit
>> code-point is found.
>
> I was looking at your latest. It's code that compiles and runs, but  
> dynamically fails on some inputs. I agree that it's often better to fail  
> noisily instead of silently, but in a manner of speaking the  
> string-based code doesn't fail at all - it correctly iterates the code  
> units of a string. This may sometimes not be what the user expected;  
> most of the time they'd care about the code points.

Iterating the code units is still possible by accessing the array data,
i.e. you could do:

foreach(i, c; s.data)

if you want the code-units.

That is the point of having a separate type.  Using string_t tells the  
library "I'm using this data as a string".  Using char[] tells the library  
"I'm using this data as an array."

The difference here is that you have to *specifically* ask for the code
units; the default is code points. All it really does is switch the
default.
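
To make "switching the default" concrete, here is a minimal sketch. It is
not the posted string_t code: CodePointView is a hypothetical name, the
only library call it relies on is std.utf.decode, and it assumes the input
is valid UTF-8. Plain iteration yields code points, while the raw code
units remain reachable through .data, but only if you ask for them.

```d
import std.utf : decode;

// Hypothetical wrapper, for illustration only.
struct CodePointView
{
    string data; // the underlying UTF-8 code units, exposed explicitly

    // Input range primitives: the element type is dchar (a code point).
    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i); // decodes the first code point
    }

    void popFront()
    {
        size_t i = 0;
        decode(data, i);      // advances i past one whole code point
        data = data[i .. $];  // drop its code units
    }
}

// Counts elements the way a plain foreach over the view sees them.
size_t countDefault(CodePointView v)
{
    size_t n = 0;
    foreach (d; v) // d is a dchar; no annotation needed
        ++n;
    return n;
}
```

With this shape, foreach (d; view) visits code points and
foreach (i, c; view.data) visits code units.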

>> It also supports this:
>>
>> foreach(i, d; s)
>> {
>>     writeln("The character in position ", i, " is ", d);
>> }
>>
>> where i is the index (might not be sequential)
>
> Well string supports that too, albeit with the nit that you need to  
> specify dchar.

This is not a small problem: leave out the dchar and the loop still
compiles, but it iterates code units.
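
A small sketch of the two loop forms on a built-in string, using only core
language features (the function names are made up for the example). It
also shows the point about the index not being sequential when decoding.

```d
// Element type is inferred as char: one iteration per code unit.
size_t iterationsInferred(string s)
{
    size_t n = 0;
    foreach (i, c; s)
        ++n;
    return n;
}

// Asking for dchar makes foreach decode. The index i is the position of
// the first code unit of each code point, so it can skip values.
size_t[] indicesDecoded(string s)
{
    size_t[] starts;
    foreach (i, dchar d; s)
        starts ~= i;
    return starts;
}
```

For "h\u00E9llo" (the second character takes two UTF-8 code units), the
first function runs six iterations; the second visits five code points and
never sees index 2.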

>> isRandomAccessRange requires hasLength (see here:
>> http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L532).
>> This is not a random access range per that definition.
>
> That's an interesting twist. By the way I specified length is required  
> then because I couldn't imagine having random access into something that  
> I can't tell the length of. Apparently I was wrong :o).

Yes, in fact, you could say that is specifically what defines a VLERange ;)
But there are actually two types of VLE ranges: those which can be randomly
accessed (where it is possible to determine the beginning of a code point
given a random index) and those which cannot (where decoding depends on the
exact order of the data). The latter would not be bidirectional ranges
anyway.
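
UTF-8 is an example of the first kind: every continuation code unit has the
bit pattern 10xxxxxx, so from any index you can step backwards to the start
of the enclosing code point. A minimal sketch, assuming valid UTF-8 input
(codePointStart is a name invented for this example):

```d
// Returns the index of the first code unit of the code point that
// contains index i. Assumes s is valid UTF-8 and i < s.length.
size_t codePointStart(string s, size_t i)
{
    assert(i < s.length);
    // Continuation units look like 10xxxxxx; skip back over them.
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}
```

The loop steps back at most three times, since a UTF-8 code point is at
most four code units long. An encoding whose units cannot be classified in
isolation would not allow this.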

>> But a string
>> isn't a random access range anyways (it's specifically disallowed by
>> std.range per that same reference).
>
> It isn't and it isn't supposed to be.

I agree with that assessment, which is why I omitted length.

-Steve