VLERange: a range in between BidirectionalRange and RandomAccessRange

Tue Jan 11 05:30:46 PST 2011

On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail at erdani.org> wrote:

> I've been thinking on how to better deal with Unicode strings. Currently  
> strings are formally bidirectional ranges with a surreptitious random  
> access interface. The random access interface accesses the support of  
> the string, which is understood to hold data in a variable-encoded  
> format. For as long as the programmer understands this relationship,  
> code for string manipulation can be written with relative ease. However,  
> there is still room for writing wrong code that looks legit.
>
> Sometimes the best way to tackle a hairy reality is to invite it to the  
> negotiation table and offer it promotion to first-class abstraction  
> status. Along that vein I was thinking of defining a new range:  
> VLERange, i.e. Variable Length Encoding Range. Such a range would have  
> the power somewhere in between bidirectional and random access.
>
> The primitives offered would include empty, access to front and back,  
> popFront and popBack (just like BidirectionalRange), and in addition  
> properties typical of random access ranges: indexing, slicing, and  
> length. Note that the result of the indexing operator is not the same as  
> the element type of the range, as it only represents the unit of  
> encoding.
>
> In addition to these (and connecting the two), a VLERange would offer  
> two additional primitives:
>
> 1. size_t stepSize(size_t offset) gives the length of the step needed to  
> skip to the next element.
>
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_  
> step that goes to the previous element.
>
> In both cases, offset is assumed to be at the beginning of a logical  
> element of the range.
>
> I suspect that a lot of functions in std.string can be written without  
> Unicode-specific knowledge just by relying on such an interface.  
> Moreover, algorithms can be generalized to other structures that use  
> variable-length encoding, such as those used in data compression. (In  
> that case, the support would be a bit array and the encoded type would  
> be ubyte.)
>
> Writing to such ranges is not addressed by this design. Ideas are  
> welcome.
>
> Adding VLERange would legitimize strings and would clarify their  
> handling, at the cost of adding one additional concept that needs to be  
> minded. Is the trade-off worthwhile?
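
To make the two proposed primitives concrete, here is a minimal sketch of how they might look for UTF-8. The free-function form and the explicit string parameter are assumptions for illustration; the proposal above only gives the signatures size_t stepSize(size_t offset) and size_t backstepSize(size_t offset). The sketch also assumes valid UTF-8, and that the offset passed to backstepSize is the position just past the element being stepped over.

```d
// Hypothetical free-function versions of the two proposed primitives,
// specialized for UTF-8. Both assume valid UTF-8 input and that
// `offset` lies on a code point boundary.

/// Number of code units in the code point that starts at `offset`.
size_t stepSize(const(char)[] s, size_t offset) pure nothrow @nogc @safe
{
    immutable uint lead = s[offset];
    if (lead < 0x80) return 1;   // 0xxxxxxx: single-unit (ASCII)
    if (lead < 0xE0) return 2;   // 110xxxxx: two-unit sequence
    if (lead < 0xF0) return 3;   // 1110xxxx: three-unit sequence
    return 4;                    // 11110xxx: four-unit sequence
}

/// Number of code units in the code point that ends just before `offset`.
size_t backstepSize(const(char)[] s, size_t offset) pure nothrow @nogc @safe
{
    assert(offset > 0 && offset <= s.length);
    size_t n = 1;
    // Continuation units have the form 10xxxxxx; walk back past them
    // until the lead unit is reached.
    while (n < offset && (s[offset - n] & 0xC0) == 0x80)
        ++n;
    return n;
}

unittest
{
    string s = "a\u00E9\u20AC";   // 1 + 2 + 3 code units
    assert(s.length == 6);
    assert(stepSize(s, 1) == 2);
    assert(backstepSize(s, 6) == 3);
}
```

Neither function decodes anything; each inspects only as many code units as it needs to find the next boundary, which is what would let generic algorithms skip elements without Unicode-specific knowledge.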

While this makes it possible to write algorithms that only accept  
VLERanges, I don't think it solves the major problem with strings -- they  
are treated as arrays by the compiler.
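
A small sketch of what "treated as arrays" means in practice, assuming current Phobos module names (std.range.primitives): the built-in length and indexing operate on UTF-8 code units, while the range primitives operate on decoded code points.

```d
import std.range.primitives : front, walkLength;

unittest
{
    string s = "\u00E9t\u00E9";   // 3 code points, 5 UTF-8 code units

    // Built-in array indexing yields a code unit...
    static assert(is(typeof(s[0]) == immutable(char)));
    // ...while the range interface yields a decoded code point.
    static assert(is(typeof(s.front) == dchar));

    assert(s.length == 5);        // array view: code units
    assert(s.walkLength == 3);    // range view: code points
    assert(s[0] == 0xC3);         // first unit of the two-unit sequence
    assert(s.front == '\u00E9');  // the whole first code point
}
```

Because both views are available on the same type, code that mixes them (for example, using s.length as an element count) compiles without complaint.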

I'd also rather see an indexing operation return the element type, and  
have a separate function to get the encoding unit.  This makes more sense  
for generic code IMO.
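
As a rough sketch of that alternative: the wrapper below is hypothetical (the names Utf8View and codeUnit are made up for illustration, and this is not the string type proposal mentioned in this thread). Indexing returns the element type, and a separate member returns the encoding unit. The index is assumed to be a code unit offset that lies on a code point boundary.

```d
import std.utf : decode;

// Hypothetical wrapper for illustration only; not part of Phobos.
struct Utf8View
{
    string data;

    /// Indexing yields the element type: the code point that starts
    /// at code unit offset `i`.
    dchar opIndex(size_t i)
    {
        size_t pos = i;            // decode advances its index argument
        return decode(data, pos);
    }

    /// Separate accessor for the raw encoding unit at offset `i`.
    char codeUnit(size_t i)
    {
        return data[i];
    }
}

unittest
{
    auto v = Utf8View("a\u00E9b");   // code units: 61 C3 A9 62
    assert(v[1] == '\u00E9');        // element type: dchar
    assert(v.codeUnit(1) == 0xC3);   // encoding unit: char
}
```

With this shape, generic code that writes r[i] always gets the same type as r.front, and only code that cares about the encoding asks for code units explicitly.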

I noticed you never commented on my proposed string type...

That reminds me, I should update with suggested changes and re-post it.

-Steve