VLERange: a range in between BidirectionalRange and RandomAccessRange

Tue Jan 11 11:09:29 PST 2011

On 1/11/11 9:09 AM, spir wrote:
> On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
>> On 1/11/11 4:41 AM, Michel Fortin wrote:
>>> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail at erdani.org> said:
>>>> In addition to these (and connecting the two), a VLERange would offer
>>>> two additional primitives:
>>>>
>>>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>>>> to skip to the next element.
>>>>
>>>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>>>> step that goes to the previous element.
>>>
>>> I like the idea, but I'm not sure about this interface. What's the
>>> result of stepSize if your range must create two elements from one
>>> underlying unit? Perhaps in those cases the element type could be an
>>> array (to return more than one element from one iteration).
>>>
>>> For instance, say we have a conversion range taking a Unicode string and
>>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>>> "oe" (one chararacter to two characters), in this case 'front' could
>>> simply return "oe" (two characters) in one iteration, with stepSize
>>> being the size of the "œ" code point. In the same conversion process,
>>> encountering "e" followed by a combining "´" would return pre-combined
>>> character "é" (two characters to one character).
>>
>> In the design as I thought of it, the effective length of one logical
>> element is one or more representation units. My understanding is that
>> you are referring to a fractional number of representation units for one
>> logical element.
>
> I think Michel is right. If I understand correctly, VLERange addresses
> the low-level and rather simple issue of each codepoint beeing encoding
> as a variable number of code units. Right?
> If yes, then what is the advantage of VLERange? D already has
> string/wstring/dstring, allowing to work with the most advatageous
> encoding according to given source data, and dstring abstracting from
> low-level encoding issues.

It' not about the data, it's about algorithms. Currently there are 
algorithms that ostensibly work for bidirectional ranges, but internally 
"cheat" by detecting that the input is actually a string, and use that 
knowledge for better implementations.

The benefit of VLERange would that that it legitimizes those algorithms. 
I wouldn't be surprised if an entire class of algorithms would in fact 
require VLERange (e.g. many of those that we commonly consider today 
"string" algorithms).

> The main (and massively ignored) issue when manipulating unicode text is
> rather that, unlike with legacy character sets, one codepoint does *not*
> represent a character in the common sense. In character sets like latin-1:
> * each code represents a character, in the common sense (eg "à")
> * each character representation has the same size (1 or 2 bytes)
> * each character has a single representation ("à" --> always 0xe0)
> All of this is wrong with unicode. And these are complicated and
> high-level issues, that appear _after_ decoding, on codepoint sequences.
>
> If VLERange is helpful is dealing with those problems, then I don't
> understand your presentation, sorry. Do you for instance mean such a
> range would, under the hood, group together codes belonging to the same
> character (thus making indexing meaningful), and/or normalise (decomp &
> order) (thus allowing to comp/find/count correctly).?

VLERange would offer automatic decoding in front, back, popFront, and 
popBack - just like BidirectionalRange does right now. It would also 
offer access to the representational support by means of indexing - also 
like char[] et al already do now. The difference is that VLERange being 
a formal concept, algorithms can specialize on it instead of (a) 
specializing for UTF strings or (b) specializing for BidirectionalRange 
and then manually detecting isSomeString inside. Conversely, when 
defining an algorithm you can specify VLARange as a requirement. 
Boyer-Moore is a perfect example - it doesn't work on bidirectional 
ranges, but it does work on VLARange. I suspect there are many like it.

Of course, it would help a lot if we figured other remarkable VLARanges. 
Here are a few that come to mind:

* Multibyte encodings other than UTF. Currently we have no special 
support for those beyond e.g. forward or bidirectional ranges.

* Huffman, RLE, LZ encoded buffers (and many other compressed formats)

* Vocabulary-based translation systems, e.g. associate each word with a 
number.

* Others...?

Some of these are forward-only (don't allow bidirectional access). Once 
we have a number of examples, it would be great to figure a number of 
remarkable algorithms operating on them.

Andrei