VLERange: a range in between BidirectionalRange and RandomAccessRange

Mon Jan 17 10:23:46 PST 2011

On 01/17/2011 06:36 PM, Andrei Alexandrescu wrote:
> On 1/17/11 10:55 AM, spir wrote:
>> On 01/15/2011 12:21 AM, Michel Fortin wrote:
>>> Also, it'd really help this discussion to have some hard numbers about
>>> the cost of decoding graphemes.
>>
>> Text has a perf module that provides such numbers for the different
>> stages of Text object construction. The measured algorithms are not
>> yet stabilised, so the numbers change regularly, but in the right
>> direction ;-)
>> You can try the current version at
>> https://bitbucket.org/denispir/denispir-d/src (the perf module is called
>> chrono.d)
>>
>> For information, the cost of full text construction, i.e. decoding,
>> normalisation (both decomposition & ordering) and piling, was recently
>> about 5 times that of decoding alone, the heavy part (~ 70%) being
>> piling. But Stephan just informed me about a new gain in piling that I
>> have not yet tested.
>> This performance places our library in-between Windows native tools and
>> ICU in terms of speed. Which is imo rather good for a brand new tool
>> written in a still unstable language.
>>
>> I have carefully read your arguments that Text's approach of
>> systematically "piling" and normalising source texts is not the right
>> one from an efficiency point of view, even for strict use cases of
>> universal text manipulation (because the relative space cost would
>> indirectly cause time cost due to cache effects). Instead, you state we
>> should "pile" and/or normalise on the fly. But I am, like you,
>> rather doubtful on this point without any numbers available.
>> So, let us produce some benchmark results on both approaches if you like.
>
> Congrats on this great work. The initial numbers are in keeping with my
> expectation; UTF adds for certain primitives up to 3x overhead compared
> to ASCII, and I expect combining character handling to bring about as
> much on top of that.
>
> Your work and Steve's won't go to waste; one way or another we need to
> add grapheme-based processing to D. I think it would be great if later
> on a Phobos submission was made.

Andrei, would you have a look at Text's current state, mainly the
interface, when you have time for that (no hurry)? It is at
https://bitbucket.org/denispir/denispir-d/src
It is actually a bit more than just a string type that treats true
characters as its natural elements.
* It is a textual type providing a client interface of common text
manipulation methods similar to the ones in common high-level languages
(including the fact that a character is a singleton string).
* The repo also holds the main module (unicodedata) of Text's sister lib
(dunicode), providing access to various Unicode algorithms and data.
(We are about to merge the 2 libs into a new repository.)
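
To make the idea of "piling" and of a character being a singleton string
concrete, here is a minimal sketch. It is written in Python rather than
D and is not Text's actual code: the function name pile is hypothetical,
and grouping on a nonzero canonical combining class is a simplification
of full grapheme segmentation (it ignores, for instance, Hangul jamo and
zero-width joiner sequences).

```python
import unicodedata


def pile(text):
    """Return a list of 'piles', one string per user-perceived character.

    Assumption: a pile is a base code point plus the combining marks
    (nonzero canonical combining class) that directly follow it.
    """
    piles = []
    # NFD performs both decomposition and canonical ordering of marks.
    for ch in unicodedata.normalize("NFD", text):
        if piles and unicodedata.combining(ch) != 0:
            piles[-1] += ch      # attach the mark to the current pile
        else:
            piles.append(ch)     # start a new pile
    return piles


composed = "No\u00ebl"       # 'ë' as a single precomposed code point
decomposed = "Noe\u0308l"    # 'e' followed by U+0308 COMBINING DIAERESIS

# The two spellings differ at the code point level...
print(len(composed), len(decomposed))
# ...but yield the same piles, each pile being a (singleton) string.
print(pile(composed) == pile(decomposed), len(pile(decomposed)))
```

With this representation, indexing and comparison operate on piles, so
both spellings of the same text behave identically; the price is the
up-front normalisation and piling pass discussed above.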

Denis
_________________
vita es estrany
spir.wikidot.com