VLERange: a range in between BidirectionalRange and RandomAccessRange

spir denis.spir at gmail.com
Tue Jan 11 16:46:23 PST 2011


On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
>> The main (and massively ignored) issue when manipulating unicode text is
>> rather that, unlike with legacy character sets, one codepoint does *not*
>> represent a character in the common sense. In character sets like
>> latin-1:
>> * each code represents a character, in the common sense (eg "à")
>> * each character representation has the same size (1 or 2 bytes)
>> * each character has a single representation ("à" --> always 0xe0)
>> All of this is wrong with unicode. And these are complicated and
>> high-level issues, that appear _after_ decoding, on codepoint sequences.
>>
>> If VLERange is helpful in dealing with those problems, then I don't
>> understand your presentation, sorry. Do you for instance mean such a
>> range would, under the hood, group together codes belonging to the same
>> character (thus making indexing meaningful), and/or normalise (decomp &
>> order) (thus allowing to comp/find/count correctly)?
>
> VLERange would offer automatic decoding in front, back, popFront, and
> popBack - just like BidirectionalRange does right now. It would also
> offer access to the representational support by means of indexing - also
> like char[] et al already do now.

IIUC, for the case of text, VLERange helps abstract away the annoying
fact that a codepoint is encoded as a variable number of code units.
What I meant is issues like:

     auto text = "a\u0302"d;
     writeln(text);                  // "â"
     auto range = VLERange(text);
     // extracts characters correctly?
     auto letter = range.front();    // "a" or "â"?
     // case yes: compares correctly?
     assert(range.front() == "â");   // fail or pass?

Both fail with all the Unicode-aware types I know of, because:
1. They do not recognise that a character may be represented by an
arbitrary number of codes (code _points_).
2. They do not use normalised forms for comparison, search, count, etc.
(while in Unicode a given character can have several representations).
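
To make the two points concrete, here is a minimal sketch of what "correct"
would look like for the example above. It assumes a Phobos whose std.uni
provides byGrapheme and normalize (later releases do); the helper names
graphemeCount and sameText are made up for illustration and are not part
of any proposed VLERange interface.

```d
import std.range : walkLength;
import std.uni;

/// Number of user-perceived characters (grapheme clusters) in `s`.
/// (Hypothetical helper, for illustration only.)
size_t graphemeCount(dstring s)
{
    return s.byGrapheme.walkLength;
}

/// Compares two texts after bringing both to the same normal form (NFC).
/// (Hypothetical helper, for illustration only.)
bool sameText(dstring a, dstring b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    dstring decomposed  = "a\u0302"d;  // 'a' + U+0302 COMBINING CIRCUMFLEX ACCENT
    dstring precomposed = "\u00E2"d;   // U+00E2, the precomposed form of the same letter

    // Point 1: two code points, but a single character.
    assert(decomposed.length == 2);
    assert(graphemeCount(decomposed) == 1);

    // Point 2: plain code-point comparison sees two different texts;
    // comparison after normalisation sees the same one.
    assert(decomposed != precomposed);
    assert(sameText(decomposed, precomposed));
}
```

Both steps operate on code points, that is, _after_ decoding, which is why
decoding alone does not address them.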

> The difference is that VLERange being
> a formal concept, algorithms can specialize on it instead of (a)
> specializing for UTF strings or (b) specializing for BidirectionalRange
> and then manually detecting isSomeString inside. Conversely, when
> defining an algorithm you can specify VLERange as a requirement.
> Boyer-Moore is a perfect example - it doesn't work on bidirectional
> ranges, but it does work on VLERange. I suspect there are many like it.
>
> Of course, it would help a lot if we figured other remarkable VLERanges.

I think I see the point, and the general usefulness of such an
abstraction. But it would certainly be more useful in fields other than
text manipulation, because text has far more annoying issues (ones that,
as in the example above, simply prevent code from being correct).

Denis
_________________
vita es estrany
spir.wikidot.com


