VLERange: a range in between BidirectionalRange and RandomAccessRange

spir denis.spir at gmail.com
Tue Jan 11 09:09:18 PST 2011


On 01/11/2011 05:36 PM, Andrei Alexandrescu wrote:
> On 1/11/11 4:41 AM, Michel Fortin wrote:
>> On 2011-01-10 22:57:36 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail at erdani.org> said:
>>> In addition to these (and connecting the two), a VLERange would offer
>>> two additional primitives:
>>>
>>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>>> to skip to the next element.
>>>
>>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>>> step that goes to the previous element.
>>
>> I like the idea, but I'm not sure about this interface. What's the
>> result of stepSize if your range must create two elements from one
>> underlying unit? Perhaps in those cases the element type could be an
>> array (to return more than one element from one iteration).
>>
>> For instance, say we have a conversion range taking a Unicode string and
>> converting it to ISO Latin 1. The best (lossy) conversion for "œ" is
>> "oe" (one character to two characters), in this case 'front' could
>> simply return "oe" (two characters) in one iteration, with stepSize
>> being the size of the "œ" code point. In the same conversion process,
>> encountering "e" followed by a combining "´" would return pre-combined
>> character "é" (two characters to one character).
>
> In the design as I thought of it, the effective length of one logical
> element is one or more representation units. My understanding is that
> you are referring to a fractional number of representation units for one
> logical element.

I think Michel is right. If I understand correctly, VLERange addresses
the low-level and rather simple issue of each codepoint being encoded
as a variable number of code units. Right?
If yes, then what is the advantage of VLERange? D already has
string/wstring/dstring, which let one work with the most advantageous
encoding for the given source data, with dstring abstracting away
low-level encoding issues.
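
To make the low-level issue concrete, here is a minimal sketch of stepping over UTF-8 code units one codepoint at a time. It is written in Python purely for illustration; `step_size` is a hypothetical helper that only echoes the proposed stepSize primitive and is not an existing API. It assumes valid UTF-8 input and an offset that points at a lead byte.

```python
def step_size(data: bytes, offset: int) -> int:
    """Number of UTF-8 code units (bytes) in the codepoint starting at offset.

    Hypothetical helper: assumes data is valid UTF-8 and that offset
    points at a lead byte, not at a continuation byte.
    """
    lead = data[offset]
    if lead < 0x80:
        return 1              # 0xxxxxxx: single-byte (ASCII) codepoint
    if lead >> 5 == 0b110:
        return 2              # 110xxxxx: two-byte sequence
    if lead >> 4 == 0b1110:
        return 3              # 1110xxxx: three-byte sequence
    if lead >> 3 == 0b11110:
        return 4              # 11110xxx: four-byte sequence
    raise ValueError("offset does not point at a UTF-8 lead byte")


def codepoint_sizes(text: str) -> list:
    """Walk the UTF-8 encoding of text, stepping one codepoint at a time."""
    data = text.encode("utf-8")
    sizes, offset = [], 0
    while offset < len(data):
        n = step_size(data, offset)
        sizes.append(n)
        offset += n
    return sizes
```

For example, `codepoint_sizes("aà€")` yields one size per codepoint, which is exactly the information a dstring already hides by storing one code unit per codepoint.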

The main (and massively ignored) issue when manipulating Unicode text is
rather that, unlike with legacy character sets, one codepoint does *not*
represent a character in the common sense. In character sets like Latin-1:
* each code represents a character, in the common sense (eg "à")
* each character representation has the same size (1 or 2 bytes)
* each character has a single representation ("à" --> always 0xe0)
None of this holds for Unicode. And these are complicated,
high-level issues that appear _after_ decoding, on codepoint sequences.
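
The third point can be illustrated with a short sketch (Python, for illustration only; the variable names are arbitrary). It uses the standard `unicodedata` module to show that "à" has two canonically equivalent codepoint sequences that differ in length and do not compare equal until they are normalised:

```python
import unicodedata

precomposed = "\u00e0"     # "à" as one codepoint (LATIN SMALL LETTER A WITH GRAVE)
decomposed = "a\u0300"     # "a" followed by COMBINING GRAVE ACCENT

# Codepoint counts differ, so indexing and length are not per-character.
lengths = (len(precomposed), len(decomposed))

# A plain codepoint-by-codepoint comparison fails...
raw_equal = precomposed == decomposed

# ...but the two forms agree once both are in the same normalisation form.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
```

Both sequences are already fully decoded, so the mismatch is independent of whether the text was stored as UTF-8, UTF-16 or UTF-32.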

If VLERange is helpful in dealing with those problems, then I don't
understand your presentation, sorry. Do you, for instance, mean that such a
range would, under the hood, group together the codes belonging to the same
character (thus making indexing meaningful), and/or normalise (decompose &
reorder) (thus allowing correct comparing/finding/counting)?
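
As a rough sketch of what such grouping and normalising could look like (Python, for illustration only; `group_characters` is a hypothetical name, and attaching combining marks to the preceding base is a simplification, not full Unicode grapheme segmentation):

```python
import unicodedata


def group_characters(text: str) -> list:
    """Normalise to NFD, then group each base codepoint with its combining marks.

    Simplifying assumption: a "character" is a base codepoint followed by
    codepoints whose canonical combining class is non-zero.
    """
    groups = []
    # NFD decomposes precomposed codes and puts combining marks in canonical order.
    for cp in unicodedata.normalize("NFD", text):
        if groups and unicodedata.combining(cp) != 0:
            groups[-1] += cp      # combining mark: attach to the current character
        else:
            groups.append(cp)     # base codepoint: start a new character
    return groups
```

With this, `group_characters(s)[i]` denotes the i-th character rather than the i-th codepoint, and counting "é" in a text gives the same result whether the input used the precomposed or the decomposed form.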


denis
_________________
vita es estrany
spir.wikidot.com


