VLERange: a range in between BidirectionalRange and RandomAccessRange

Tue Jan 18 08:38:45 PST 2011

On 1/18/11 7:17 AM, Michel Fortin wrote:
> On 2011-01-18 01:16:13 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> On 1/17/11 9:48 PM, Michel Fortin wrote:
>>> On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin at michelf.com>
>>> said:
>>>
>>>> More seriously, you have four choice:
>>>>
>>>> 1. code unit
>>>> 2. code point
>>>> 3. grapheme
>>>> 4. require the client to state explicitly which kind of 'character' he
>>>> wants; 'character' being an overloaded word, it's reasonable to ask
>>>> for disambiguation.
>>>
>>> This makes me think of what I did with my XML parser after you made code
>>> points the element type for strings. Basically, the parser now uses
>>> 'front' and 'popFront' whenever it needs to get the next code point, but
>>> most of the time it uses 'frontUnit' and 'popFrontUnit' instead (which I
>>> had to add) when testing for or skipping an ASCII character is
>>> sufficient. This way I avoid a lot of unnecessary decoding of code
>>> points.
>>>
>>> For this to work, the same range must let you skip either a unit or a
>>> code point. If I were using a separate range with a call to toDchar or
>>> toCodeUnit (or toGrapheme if I needed to check graphemes), it wouldn't
>>> have helped much because the new range would essentially become a new
>>> slice independent of the original, so you can't interleave "I want to
>>> advance by one unit" with "I want to advance by one code point".
>>>
>>> So perhaps the best interface for strings would be to provide multiple
>>> range-like interfaces that you can use at the level you want.
>>>
>>> I'm not sure if this is a good idea, but I thought I should at least
>>> share my experience.
>>
>> Very insightful. Thanks for sharing. Code it up and make a solid
>> proposal!
>
> What I use right now is this (see below). I'm not sure what would be a
> good name for it though. The expectation is that I'll get either an
> ASCII char or something out of ASCII range if it isn't ASCII.
>
> The abstraction doesn't seem very 'solid' to me, in the sense that I
> can't see how it'd apply to ranges other than strings, so it's only
> useful for strings (the character array kind), and it's only useful as a
> workaround since you made ElementType!(char[]) a dchar. Well, any range
> returning char,dchar,wchar could map frontUnit to front and popFrontUnit
> to popFront to keep things working, but it makes the optimization rather
> pointless. I don't really have an idea where to go from here.
[snip]

I was thinking along the lines of:

struct Grapheme
{
     private string support_;
     ...
}

struct ByGrapheme
{
     private string iteratee_;
     bool empty();
     Grapheme front();
     void popFront();
     // Additional funs
     dchar frontCodePoint();
     void popFrontCodePoint();
     char frontCodeUnit();
     void popFrontCodeUnit();
     ...
}

// helper function
ByGrapheme byGrapheme(string s);

// usage
string s = ...;
size_t i;
foreach (g; byGrapheme(s))
{
     writeln("Grapheme #", i, " is ", g);
}

We need this range in Phobos.

Andrei