VLERange: a range in between BidirectionalRange and RandomAccessRange

Tue Jan 11 04:41:43 PST 2011

On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> I've been thinking on how to better deal with Unicode strings. 
> Currently strings are formally bidirectional ranges with a 
> surreptitious random access interface. The random access interface 
> accesses the support of the string, which is understood to hold data in 
> a variable-encoded format. For as long as the programmer understands 
> this relationship, code for string manipulation can be written with 
> relative ease. However, there is still room for writing wrong code that 
> looks legit.
> 
> Sometimes the best way to tackle a hairy reality is to invite it to the 
> negotiation table and offer it promotion to first-class abstraction 
> status. Along that vein I was thinking of defining a new range: 
> VLERange, i.e. Variable Length Encoding Range. Such a range would have 
> the power somewhere in between bidirectional and random access.
> 
> The primitives offered would include empty, access to front and back, 
> popFront and popBack (just like BidirectionalRange), and in addition 
> properties typical of random access ranges: indexing, slicing, and 
> length. Note that the result of the indexing operator is not the same 
> as the element type of the range, as it only represents the unit of 
> encoding.

Seems like a good idea to define things formally.

> In addition to these (and connecting the two), a VLERange would offer 
> two additional primitives:
> 
> 1. size_t stepSize(size_t offset) gives the length of the step needed 
> to skip to the next element.
> 
> 2. size_t backstepSize(size_t offset) gives the size of the _backward_ 
> step that goes to the previous element.

I like the idea, but I'm not sure about this interface. What's the 
result of stepSize if your range must create two elements from one 
underlying unit? Perhaps in those cases the element type could be an 
array (to return more than one element from one iteration).

For instance, say we have a conversion range taking a Unicode string 
and converting it to ISO Latin 1. The best (lossy) conversion for "œ" 
is "oe" (one chararacter to two characters), in this case 'front' could 
simply return "oe" (two characters) in one iteration, with stepSize 
being the size of the "œ" code point. In the same conversion process, 
encountering "e" followed by a combining "´" would return pre-combined 
character "é" (two characters to one character).

> In both cases, offset is assumed to be at the beginning of a logical 
> element of the range.
> 
> I suspect that a lot of functions in std.string can be written without 
> Unicode-specific knowledge just by relying on such an interface. 
> Moreover, algorithms can be generalized to other structures that use 
> variable-length encoding, such as those used in data compression. (In 
> that case, the support would be a bit array and the encoded type would 
> be ubyte.)

Applicability to other problems seems like a valuable benefit.

> Writing to such ranges is not addressed by this design. Ideas are welcome.

Writing, as in assigning to 'front'? That's not really possible with 
variable-length units as it'd need to shift everything in case of a 
length difference. Or maybe you meant writing as in having an output 
range for variable-length elements... I'm not sure

> Adding VLERange would legitimize strings and would clarify their 
> handling, at the cost of adding one additional concept that needs to be 
> minded. Is the trade-off worthwhile?

In my opinion it's not a trade-off at all, it's a formalization of how 
strings are handled which is better in every regard than a "special 
case". I welcome this move very much.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/