VLERange: a range in between BidirectionalRange and RandomAccessRange

Mon Jan 17 15:13:42 PST 2011

On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
> On 1/17/11 12:23 PM, spir wrote:
>> Andrei, would you have a look at Text's current state, mainly
>> the interface, when you have time for that (no hurry) at
>> https://bitbucket.org/denispir/denispir-d/src
>> It is actually a bit more than just a string type considering true
>> characters as natural elements.
>> * It is a textual type providing a client interface of common text
>> manipulation methods similar to ones in common high-level languages.
>> (including the fact that a character is a singleton string)
>> * The repo also holds the main module (unicodedata) of Text's sister lib
>> (dunicode), providing access to various unicode algos and data.
>> (We are about to merge the 2 libs into a new repository.)
>
> I think this is solid work that reveals good understanding of Unicode.
> That being said, there are a few things I disagree about and I don't
> think it can be integrated into Phobos.

We are exploring a new field. (Except for the work Objective-C designers 
did -- but we just discovered it.)

> One thing is that it looks a lot
> more like D1 code than D2. D2 code of this kind is automatically
> expected to play nice with the rest of Phobos (ranges and algorithms).
> As it is, the code is an island that implements its own algorithms
> (mostly by equivalent handwritten code).

Right. Initially we wanted precisely to let it play nicely with the rest
of the new Phobos. This mainly meant providing a range interface, which
also gives access to std.algorithm routines. But we were blocked by
current bugs related to ranges. I have posted about those issues (you may
remember having replied to this post).

> In detail:
>
> * Line 130: representing a text as a dchar[][] has its advantages but
> major efficiency issues. To be frank I think it's a disaster. I think a
> representation building on UTF strings directly is bound to be vastly
> better.

I don't understand your point. Where is the difference with D's builtin 
types, then?

Also, which efficiency issue do you mean? About text object
construction, we do agree, and I have given some data. But this happens
only once; it is an investment intended to provide correctness first,
and efficiency of _every_ operation on constructed text.
As for the speed of such methods / algorithms operating _correctly_ on
universal text: precisely since there is no alternative to Text (yet),
there are also no performance data available to judge.

(What about comparing Objective-C's NSString to Text's current 
performance for indexing, slicing, searching, counting,...? Even in its 
current experimental stage, I bet it would not be ridiculous, rather the 
opposite. But I may be completely wrong.)

> * 163: equality does what std.algorithm.equal does.
>
> * 174: equality also does what std.algorithm.equal does (possibly with a
> custom pred)

Right, these are unimportant tool funcs at the "pile" level. (Initially
introduced because builtin "==" showed strange inefficiency in our case.
May test again later.)

> * 189: TextException is unnecessary

Agreed.

> * 340: Unless properly motivated, iteration with opApply is archaic and
> inefficient.

See the range bug mentioned above. opApply is the only workaround AFAIK.
Also, ranges cannot yet provide indexed iteration like
	foreach(i, ch ; text) {...}
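
To make the indexed form concrete, here is a minimal sketch of how opApply
can supply both the index and the element. MiniText and its piles field are
hypothetical stand-ins invented for illustration, not Text's actual code:

```d
import std.stdio;

// Hypothetical stand-in for Text: each element is one "pile"
// (a base code point followed by its combining marks).
struct MiniText
{
    dstring[] piles;

    // Overload picked by foreach (i, pile; text). The delegate receives
    // the index and the element; a non-zero result means the loop body
    // executed a break or return, so iteration must stop.
    int opApply(scope int delegate(size_t, ref dstring) dg)
    {
        foreach (i, ref pile; piles)
        {
            if (auto result = dg(i, pile))
                return result;
        }
        return 0;
    }
}

void main()
{
    auto text = MiniText(["a"d, "e\u0301"d, "z"d]);
    foreach (i, pile; text)
        writeln(i, ": ", pile.length, " code point(s)");
}
```

The index here counts piles, not code units, which is the point of the
indexed form for a type like Text.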

> * 370: Why lose the information that the result is in fact a single Pile?

I don't know what information loss you mean.

Generally speaking, Pile is more or less an implementation detail used 
to internally represent a true character; while Text is the important thing.
At one time we had to choose whether to make Pile an obviously exposed
type as well, or not. I chose (after some exchange on the topic) not to
do it for a few reasons:
* Simplicity: one type does all the job well.
* Avoid confusion due to conflict with historic string types whose
elements (codes=characters) were atomic thingies. This was also a reason
not to name it simply "Character"; "Pile" for me was supposed to rather
evoke the technical side than the meaningful side.
* Lightness of the interface: if we expose Pile obviously, then we need 
to double all methods that may take or return a single character, like 
searching, counting, replacing etc... and also possibly indexing and 
iteration.

In fact, the resulting interface is more or less like a string type in 
high-level languages such as Python; with the motivating difference that 
it operates correctly on universal text.

Now, it seems you rather expect, maybe, the character/pile type to be 
the important thing and Text to just be a sequence of them? (possibly 
even unnecessary to be defined formally)

> * 430, 456, 474: contains, indexOf, count and probably others should use
> generic algorithms, not duplicate them.
>
> * 534: replace is std.array.replace

I had to write algos because most of them in std.algorithm require a
range interface, IIUC; and also for testing purposes.
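
For reference, the range interface in question is small. Below is a minimal
sketch, assuming a hypothetical PileRange wrapper over an array of piles
(the name and layout are illustrative only, not taken from Text), of what
std.algorithm needs in order to count or search over piles:

```d
import std.algorithm : canFind, count;
import std.range : isInputRange;

// Hypothetical wrapper exposing an array of piles as an input range.
struct PileRange
{
    dstring[] piles;

    @property bool empty() { return piles.length == 0; }
    @property dstring front() { return piles[0]; }
    void popFront() { piles = piles[1 .. $]; }
}

// Compile-time check that the three primitives above are sufficient.
static assert(isInputRange!PileRange);

void main()
{
    auto r = PileRange(["a"d, "e\u0301"d, "a"d]);

    // Generic algorithms compare whole piles, not single code points.
    assert(count(r, "a"d) == 2);
    assert(canFind(r, "e\u0301"d));
}
```

This only shows the shape of the interface; it says nothing about whether
the range bugs mentioned earlier would get in the way.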

> * 623: copy copies the piles shallowly (not sure if that's a problem)

I had the same question.

> As I mentioned before - why not focus on defining a Grapheme type (what
> you call Pile, but using UTF encoding) and defining a ByGrapheme range
> that iterates a UTF-encoded string by grapheme?

Dunno. This simply was not my approach. Seems to me Text as is provides
clients with an interface as simple and clear as possible, while
operating correctly in the background.

It seems if you just build a ByGrapheme iterator, then you have no other 
choice than abstracting on the fly (constructing piles on the fly for 
operations like indexing and normalising them in addition for searching, 
counting...).
As I said in other posts, this may be the right thing to do from an 
efficiency point of view, but this remains to be proven. I bet the 
opposite, in fact, that --with same implementation language and same 
investment in optimisation-- the approach defining a true textual type 
like Text is inevitably more efficient by orders of magnitude (*). Again,
Text construction initial cost is an investment. Prove me wrong (**).
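
The trade-off can be sketched as follows. This is an illustration under a
loud simplification: only U+0300..U+036F are treated as combining marks,
which is not real Unicode grapheme segmentation, and no normalisation is
done. isMark, buildPiles and pileAt are hypothetical names, not part of
Text or Phobos:

```d
import std.stdio;

// Simplification (assumption): only the Combining Diacritical Marks
// block counts as "combining".
bool isMark(dchar c) { return c >= 0x0300 && c <= 0x036F; }

// Text-like approach: group code points into piles once, up front.
dstring[] buildPiles(dstring s)
{
    dstring[] piles;
    size_t start = 0;
    foreach (i; 1 .. s.length + 1)
    {
        // A pile ends at the end of input or before the next base code point.
        if (i == s.length || !isMark(s[i]))
        {
            piles ~= s[start .. i];
            start = i;
        }
    }
    return piles;
}

// On-the-fly approach: rescan from the start for every lookup,
// so indexing costs time proportional to the position.
dstring pileAt(dstring s, size_t n)
{
    size_t start = 0, count = 0;
    foreach (i; 1 .. s.length + 1)
    {
        if (i == s.length || !isMark(s[i]))
        {
            if (count == n)
                return s[start .. i];
            ++count;
            start = i;
        }
    }
    assert(0, "pile index out of range");
}

void main()
{
    dstring s = "ae\u0301z"d;
    auto piles = buildPiles(s);        // cost paid once
    assert(piles[1] == pileAt(s, 1));  // same answer either way
    writeln(piles.length, " piles");
}
```

Both give the same piles; the difference is where the grouping cost is
paid. Which one wins in practice is exactly the open question above.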

> Andrei

Denis

(*) Except, probably, for the choice of making the ElementType a
singleton Text (seems costly).
(**) I'm now aware of the high speed loss Text certainly suffers from 
representing characters as mini-arrays, but I guess it is marginally 
relevant compared to the gain of not piling and normalising for every 
operation.
_________________
vita es estrany
spir.wikidot.com