VLERange: a range in between BidirectionalRange and RandomAccessRange

spir denis.spir at gmail.com
Tue Jan 18 11:19:11 PST 2011


On 01/18/2011 06:14 PM, Michel Fortin wrote:

> On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>> I was thinking along the lines of:
>>
>> struct Grapheme
>> {
>>     private string support_;
>>     ...
>> }
>>
>> struct ByGrapheme
>> {
>>     private string iteratee_;
>>     bool empty();
>>     Grapheme front();
>>     void popFront();
>>     // Additional funs
>>     dchar frontCodePoint();
>>     void popFrontCodePoint();
>>     char frontCodeUnit();
>>     void popFrontCodeUnit();
>>     ...
>> }
>>
>> // helper function
>> ByGrapheme byGrapheme(string s);
>>
>> // usage
>> string s = ...;
>> size_t i;
>> foreach (g; byGrapheme(s))
>> {
>>     writeln("Grapheme #", i++, " is ", g);
>> }
>>
>> We need this range in Phobos.
>
> Yes, we need a grapheme range.
>
> But that's not what my thing was about. It was about shortcutting code
> point decoding when it isn't necessary while still keeping the ability
> to decode to code points when iterating on the same range. For instance,
> here's a simple made up example:
>
> string s = "<hello>";
> if (!s.empty && s.frontUnit == '<')
>     s.popFrontUnit(); // skip
> while (!s.empty && s.frontUnit != '>')
>     s.popFront(); // do something with each code point
> if (!s.empty && s.frontUnit == '>')
>     s.popFrontUnit(); // skip
> assert(s.empty);
>
> Here, since I know I'm testing and skipping for '<', an ASCII character,
> decoding the code point is wasted time, so I skip that decoding. The
> problem is that this optimization can't happen with a range that
> abstracts things at the code point level. I can do it with strings
> because strings still allow you to access code units through the
> indexing operators, but this can't really apply to ranges of code points
> in general.
>
> And parsing with a range of code units would also be a pain, because even
> if I'm testing for '<' for the first character, sometimes I really need
> to advance by code point and test for code points.

This means a single string type that exposes several _synchronized_ range
levels (code unit, code point, grapheme), doesn't it? As opposed to
Andrei's approach, in which ranges are structures external to the string
type and therefore, IIUC, advance independently?
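
To make the question concrete, here is a minimal sketch of what I mean by
synchronized levels: one cursor over the UTF-8 data, advanced either by code
unit or by code point. The names UnitAwareString and insideBrackets are
hypothetical, made up for illustration; frontUnit and popFrontUnit are
borrowed from your example. I am assuming std.utf's decode and stride for the
code point level, and the grapheme level is left out.

```d
import std.utf : decode, stride;

/// Hypothetical wrapper: a single position shared by two access levels.
struct UnitAwareString
{
    private string data_;

    this(string s) { data_ = s; }

    @property bool empty() { return data_.length == 0; }

    // Code unit level: plain indexing and slicing, no decoding.
    @property char frontUnit() { return data_[0]; }
    void popFrontUnit() { data_ = data_[1 .. $]; }

    // Code point level: decodes the UTF-8 sequence at the cursor.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data_, i);
    }

    // Skips one whole code point (1 to 4 code units).
    void popFront() { data_ = data_[stride(data_, 0) .. $]; }
}

/// Hypothetical helper: collects the code points found between a
/// leading '<' and the next '>'.
dstring insideBrackets(string s)
{
    auto r = UnitAwareString(s);
    dstring result;

    // '<' is ASCII, so a code unit comparison is enough here.
    if (!r.empty && r.frontUnit == '<')
        r.popFrontUnit();

    // Test the unit, but advance by whole code points.
    while (!r.empty && r.frontUnit != '>')
    {
        result ~= r.front;
        r.popFront();
    }

    if (!r.empty && r.frontUnit == '>')
        r.popFrontUnit();
    return result;
}
```

The unit-level test is safe because, in UTF-8, a code unit below 0x80 never
occurs inside a multi-byte sequence. Both levels move the same slice, so they
cannot drift apart the way two independent external ranges could.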

> One thing that might be interesting is benchmarking my XML parser by
> replacing every instance of frontUnit and popFrontUnit with front and
> popFront. That won't change the results, but it'd give us an idea of
> the overhead of unnecessarily decoding code points.

Yes, would you have time to do it? I would be interested in such perf
measurements. (--> your idea about a Text variant, for which I would
like to know whether it is still worth decoding systematically.)
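
For a rough first idea, independent of your parser, something like the sketch
below could compare the two kinds of scan. It assumes the benchmark template
from std.datetime.stopwatch in current Phobos; countByUnit, countByCodePoint
and timeBoth are hypothetical names, and a simple character count is of
course only a stand-in for real parsing work.

```d
import core.time : Duration;
import std.datetime.stopwatch : benchmark;

// Scans raw UTF-8 code units; no decoding takes place.
// Only meaningful for ASCII targets.
size_t countByUnit(string s, char target)
{
    size_t n;
    foreach (char c; s)
        if (c == target) ++n;
    return n;
}

// Same scan, but iterating as dchar decodes every code point first.
size_t countByCodePoint(string s, dchar target)
{
    size_t n;
    foreach (dchar c; s)
        if (c == target) ++n;
    return n;
}

// Times both scans over the same text.
// result[0] is the code unit scan, result[1] the code point scan.
Duration[2] timeBoth(string text, uint runs)
{
    size_t sink; // accumulates counts so the calls are not trivially dead
    void byUnit() { sink += countByUnit(text, '<'); }
    void byCodePoint() { sink += countByCodePoint(text, '<'); }
    return benchmark!(byUnit, byCodePoint)(runs);
}
```

Calling timeBoth on a large document with a few thousand runs and comparing
the two durations would give a ballpark figure only; the numbers will depend
on the compiler, its flags, and how much non-ASCII text the input contains.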

Denis
_________________
vita es estrany
spir.wikidot.com



