VLERange: a range in between BidirectionalRange and RandomAccessRange
Andrei Alexandrescu
SeeWebsiteForEmail at erdani.org
Tue Jan 11 17:22:13 PST 2011
On 1/11/11 4:46 PM, spir wrote:
> On 01/11/2011 08:09 PM, Andrei Alexandrescu wrote:
>>> The main (and massively ignored) issue when manipulating unicode text is
>>> rather that, unlike with legacy character sets, one codepoint does *not*
>>> represent a character in the common sense. In character sets like
>>> latin-1:
>>> * each code represents a character, in the common sense (eg "à")
>>> * each character representation has the same size (1 or 2 bytes)
>>> * each character has a single representation ("à" --> always 0xe0)
>>> All of this is wrong with unicode. And these are complicated and
>>> high-level issues, that appear _after_ decoding, on codepoint sequences.
>>>
>>> If VLERange is helpful in dealing with those problems, then I don't
>>> understand your presentation, sorry. Do you for instance mean such a
>>> range would, under the hood, group together codes belonging to the same
>>> character (thus making indexing meaningful), and/or normalise (decomp &
>>> order) (thus allowing to compare/find/count correctly)?
>>
>> VLERange would offer automatic decoding in front, back, popFront, and
>> popBack - just like BidirectionalRange does right now. It would also
>> offer access to the representational support by means of indexing - also
>> like char[] et al already do now.
>
> IIUC, for the case of text, VLERange helps abstracting from the annoying
> fact that a codepoint is encoded as a variable number of code units.
> What I meant is issues like:
>
> auto text = "a\u0302"d;
> writeln(text); // "â"
> auto range = VLERange(text);
> // extracts characters correctly?
> auto letter = range.front(); // "a" or "â"?
> // case yes: compares correctly?
> assert(range.front() == "â"); // fail or pass?
You should try text.front right now, you might be surprised :o).
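
For anyone who wants to run the experiment, here is a minimal sketch. It assumes a recent Phobos in which `front` for arrays lives in std.range.primitives and std.uni provides `byGrapheme` and `normalize` (both postdate this thread). `graphemeCount` and `sameText` are hypothetical helper names used only for illustration; they are not part of Phobos or of the proposed VLERange.

```d
import std.range.primitives : front, walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: number of user-perceived characters (grapheme clusters).
size_t graphemeCount(dstring s)
{
    return s.byGrapheme.walkLength;
}

// Hypothetical helper: compare two texts after normalizing both to NFC.
bool sameText(dstring a, dstring b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    // 'a' followed by U+0302 COMBINING CIRCUMFLEX ACCENT:
    // one visible character, two code points.
    auto text = "a\u0302"d;
    auto precomposed = "\u00E2"d; // the single code point for the same character

    // front yields the first code point only, not the whole character.
    assert(text.front == 'a');
    assert(text.length == 2);

    // Grouping code points into graphemes gives the "common sense" count.
    assert(graphemeCount(text) == 1);

    // The two spellings differ code point by code point,
    // and compare equal only after normalization.
    assert(text != precomposed);
    assert(sameText(text, precomposed));

    // On a UTF-8 string, front decodes a code point while
    // indexing exposes the underlying code units.
    string utf8 = "\u00E2";
    assert(utf8.length == 2);        // two UTF-8 code units
    assert(utf8.front == '\u00E2');  // one decoded code point
    assert(utf8[0] == 0xC3);         // first code unit
}
```

The sketch separates the two layers discussed above: decoding code units into code points (what `front` does) and grouping or normalizing code points into characters (what `byGrapheme` and `normalize` do).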
Andrei