VLERange: a range in between BidirectionalRange and RandomAccessRange

Wed Jan 12 16:45:36 PST 2011

On 2011-01-12 14:57:58 -0500, spir <denis.spir at gmail.com> said:

> On 01/12/2011 08:28 PM, Don wrote:
>> I think the only problem that we really have, is that "char[]",
>> "dchar[]" implies that code points is always the appropriate level of
>> abstraction.
> 
> I'd like to know when it happens that codepoint is the appropriate 
> level of abstraction.

I agree with you. I don't see many uses for code points.

One of these uses is writing a parser for a format defined in terms of 
code points (XML, for instance). But beyond that, I don't see one.

> * If pieces of text are not manipulated, meaning just used in the 
> application, or just transferred via the application as is (from file / 
> input / literal to any kind of output), then any kind of encoding just 
> works. One can even concatenate, provided all pieces use the same 
> encoding. --> _lower_ level than codepoint is OK.
> * But any kind of manipulation (indexing, slicing, compare, search, count, 
> replace, not to speak about regex/parsing) requires operating at the 
> _higher_ level of characters (in the common sense). Just like with 
> historic character sets in which codes used to represent characters 
> (not lower-level thingies as in UCS). Else, one reads, compares, 
> changes meaningless bits of text.

Very true. In the same way that a code point can span multiple code 
units, a user-perceived character (grapheme) can span multiple code 
points.
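
To make the three levels concrete, here is a small sketch in Python 
(my choice for illustration; nothing here is D or Phobos API). The 
helper names are hypothetical, and the grapheme count is only a crude 
approximation that treats every code point with a non-zero combining 
class as part of the preceding character. It is not full Unicode text 
segmentation.

```python
import unicodedata

def utf8_code_units(s):
    # Number of 8-bit code units in the UTF-8 encoding.
    return len(s.encode("utf-8"))

def utf16_code_units(s):
    # Number of 16-bit code units in the UTF-16 encoding (no BOM).
    return len(s.encode("utf-16-le")) // 2

def code_points(s):
    # A Python str is a sequence of code points.
    return len(s)

def approx_graphemes(s):
    # Assumption: a grapheme is a base code point plus any following
    # combining marks. Real segmentation has more rules than this.
    return sum(1 for c in s if unicodedata.combining(c) == 0)

precomposed = "fortun\u00e9"   # last letter is one code point (U+00E9)
decomposed = "fortune\u0301"   # "e" followed by U+0301 combining acute

for s in (precomposed, decomposed):
    print(utf8_code_units(s), utf16_code_units(s),
          code_points(s), approx_graphemes(s))
```

Both strings display the same word, yet only the last count agrees 
between them.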

A funny exercise to make a fool of an algorithm working only with code 
points would be to replace the word "fortune" in a text containing the 
word "fortuné". If the last "é" is expressed as two code points, as "e" 
followed by a combining acute accent (this: é), replacing occurrences 
of "fortune" with "expose" would also turn "fortuné" into "exposé", 
because the combining acute accent remains as the code point following 
the replaced word. Quite amusing, but it doesn't really make sense for 
a replace to work like that.
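
The exercise is easy to reproduce. The sketch below uses Python, whose 
built-in str.replace works on code points. The function 
replace_whole_clusters is a hypothetical helper written for this 
example: it rejects a match when the next code point is a combining 
mark, which is a simplification of proper grapheme-aware matching and 
not a complete solution.

```python
import unicodedata

text = "a fortune, fortune\u0301"   # second word ends in e + U+0301

# Code-point-level replace: the accent is left behind and attaches
# itself to the final "e" of "expose".
naive = text.replace("fortune", "expose")

def replace_whole_clusters(text, old, new):
    # Hypothetical helper: replace `old` only where the match is not
    # immediately followed by a combining mark.
    if not old:
        return text
    out = []
    i = 0
    while True:
        j = text.find(old, i)
        if j == -1:
            out.append(text[i:])
            break
        end = j + len(old)
        if end < len(text) and unicodedata.combining(text[end]) != 0:
            # The match would cut a character in two: skip past its
            # first code point and keep searching.
            out.append(text[i:j + 1])
            i = j + 1
        else:
            out.append(text[i:j])
            out.append(new)
            i = end
    return "".join(out)

careful = replace_whole_clusters(text, "fortune", "expose")
```

Here naive ends in "expose" plus the stray accent, while careful 
leaves the accented word untouched.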

In the case of "é", we're lucky enough to also have a pre-combined 
character to encode it as a single code point, so encountering "é" 
written as two code points is quite rare. But not all combinations of 
marks and characters can be represented as a single code point. The 
correct thing to do is to treat "é" (single code point) and "é" ("e" + 
combining acute accent) as equivalent.
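
One way to get that equivalence is to normalize both strings before 
comparing them. A minimal sketch, again in Python and using the 
standard unicodedata module; canonically_equal is a hypothetical name 
for this example, and NFC is just one of the normalization forms that 
would work here.

```python
import unicodedata

def canonically_equal(a, b):
    # Compare after bringing both strings to the same normalization
    # form (NFC composes where a precomposed code point exists).
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "fortun\u00e9"   # single code point for the last letter
decomposed = "fortune\u0301"   # "e" + combining acute accent

same_code_points = precomposed == decomposed        # raw comparison
same_text = canonically_equal(precomposed, decomposed)

# Not every combination has a precomposed form: "x" + combining acute
# stays as two code points even after NFC.
x_acute = unicodedata.normalize("NFC", "x\u0301")
```

The raw comparison sees two different strings; the normalized one 
does not. The last line shows why normalizing alone does not remove 
the need to handle multi-code-point characters.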

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/