VLERange: a range in between BidirectionalRange and RandomAccessRange

spir denis.spir at gmail.com
Mon Jan 17 09:24:13 PST 2011


On 01/17/2011 01:44 PM, Steven Schveighoffer wrote:
> On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> wrote:
>
>> On 1/15/11 9:25 PM, Jonathan M Davis wrote:
>>> Considering that strings are already dealt with specially in order to
>>> have an
>>> element of dchar, I wouldn't think that it would be all that
>>> disruptive to make
>>> it so that they had an element type of Grapheme instead. Wouldn't
>>> that then fix
>>> all of std.algorithm and the like without really disrupting anything?
>>
>> It would make everything related a lot (a TON) slower, and it would
>> break all client code that uses dchar as the element type, or is
>> otherwise unprepared to use Graphemes explicitly. There is no question
>> there will be disruption.
>
> I would have agreed with you last week. Now I understand that using
> dchar is just as useless for unicode as using char.
>
> Will it be slower? Perhaps. A TON slower? Probably not.
>
> But it will be correct. Correct and slow is better than incorrect and
> fast. If I showed you a shortest-path algorithm that ran in O(V) time,
> but didn't always find the shortest path, would you call it a success?
>
> We need to get some real numbers together. I'll see what I can create
> for a type, but someone else needs to supply the input :) I'm on short
> supply of unicode data, and any attempts I've made to create some result
> in failure. I have one example of one composed character in this thread
> that I can cling to, but in order to supply some real numbers, we need a
> large amount of data.
>
> -Steve

Hello Steve & Andrei,


I see two questions: (1) should we provide Unicode correctness by 
default, together with the related issues of abstraction level and 
normalisation? (2) what is the best way to implement such correctness?
Let us put aside (1) for a while; nothing prevents us from 
experimenting while waiting for an agreement, and such an experiment 
would in fact feed the debate with real facts instead of "airy" ideas.

It seems there are two opposite approaches to Unicode correctness. 
Mine was to build a type that systematically abstracts away the 
issues created by the UCS: that real whole characters are coded by 
mini-arrays of codes I call "code piles", that those piles have 
variable lengths, _and_ that characters may even have several 
representations. Then, in my wild guesses, every text manipulation 
method should obviously be "flash fast", actually faster than any 
on-the-fly algorithm by several orders of magnitude. But Michel made 
me doubt that point.
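
A rough sketch of what I mean (PiledText is a toy name, not Text's 
actual layout; I use byGrapheme/Grapheme from std.uni as a stand-in 
for the piling step, so take it as an assumption, not as gospel):

import std.array : array;
import std.uni : Grapheme, byGrapheme;

// Toy "piled" text: all graphemes are decoded once, up front.
struct PiledText
{
    Grapheme[] piles;   // one pile per true (user-perceived) character

    this(string s)
    {
        piles = s.byGrapheme.array;   // pay the whole piling cost here
    }

    // constant-time access to whole characters afterwards
    Grapheme opIndex(size_t i) { return piles[i]; }

    @property size_t length() const { return piles.length; }
}

With such a layout, indexing and length are O(1); the bet is that 
paying the decoding cost once beats re-decoding on every operation.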

The other approach is precisely to provide the needed abstraction 
("piling" and normalisation) on the fly, as proposed by Michel, and 
as Objective-C does, IIUC. This way seems to me closer in spirit to a 
redesign of Steven's new String type and/or Andrei's VLERange.
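
Something like this, I guess (again only a sketch; byGrapheme here 
plays the role of Michel's lazy decoder):

import std.conv : to;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "e" + combining diaeresis: one character, two code points
    string s = "noe\u0308l";
    foreach (g; s.byGrapheme)       // one pile decoded per step
        writeln(g[].to!string);     // g[] = the pile's code points
}

Nothing is decoded before you ask for it: no intermediate array, just 
one pile per popFront.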

As you say, we need real timing numbers to decide. I think we should 
measure at least two routines (a benchmark skeleton follows the list):
* indexing (or better, iteration?), which only requires "piling"
* counting occurrences of a given character or slice, which requires 
both piling and normalisation
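
For the measurements themselves, the skeleton below could serve as a 
starting point. It is only a sketch under two assumptions: std.uni's 
byGrapheme/normalize stand in for the piling and normalisation steps, 
and "unicode.txt" is whatever sample you feed it:

import std.algorithm : count, equal;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.file : readText;
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : NFC, byGrapheme, normalize;

void main()
{
    string text = readText("unicode.txt");
    auto sw = StopWatch(AutoStart.yes);

    // routine 1: pure iteration, needs piling only
    auto n = text.byGrapheme.walkLength;
    writefln("iterated %s graphemes in %s", n, sw.peek);

    sw.reset();

    // routine 2: counting one character, needs piling AND
    // normalisation, so "ë" matches composed and decomposed forms
    auto needle = "ë".normalize!NFC.byGrapheme.front;
    auto hits = text.normalize!NFC.byGrapheme
                    .count!(g => g[].equal(needle[]));
    writefln("found %s occurrences in %s", hits, sw.peek);
}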

I do not feel like implementing such routines for the on-the-fly 
version, and have no time for this in the coming days; but if anyone 
volunteers, feel free to rip code and data from Text's current 
implementation if it may help.

As source text, we can use the one at 
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt 
(already my source for perf measures). Its only merit is being a text 
(about Unicode!) in twelve rather different languages.

[My intuitive guess is that Michel is wrong by orders of magnitude, 
but again, I know next to nothing about code performance.]


Denis
_________________
vita es estrany
spir.wikidot.com


