VLERange: a range in between BidirectionalRange and RandomAccessRange
spir
denis.spir at gmail.com
Mon Jan 17 15:13:42 PST 2011
On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
> On 1/17/11 12:23 PM, spir wrote:
>> Andrei, would you have a look at Text's current state, mainly
>> theinterface, when you have time for that (no hurry) at
>> https://bitbucket.org/denispir/denispir-d/src
>> It is actually a bit more than just a string type considering true
>> characters as natural elements.
>> * It is a textual type providing a client interface of common text
>> manipulation methods similar to ones in common high-level languages.
>> (including the fact that a character is a singleton string)
>> * The repo also holds the main module (unicodedata) of Text's sister lib
>> (dunicode), providing access to various unicode algos and data.
>> (We are about to merge the 2 libs into a new repository.)
>
> I think this is solid work that reveals good understanding of Unicode.
> That being said, there are a few things I disagree about and I don't
> think it can be integrated into Phobos.
We are exploring a new field. (Except for the work Objective-C designers
did -- but we just discovered it.)
> One thing is that it looks a lot
> more like D1 code than D2. D2 code of this kind is automatically
> expected to play nice with the rest of Phobos (ranges and algorithms).
> As it is, the code is an island that implements its own algorithms
> (mostly by equivalent handwritten code).
Right. We precisely initially wanted to let it play nicely with the rest
of new Phobos. This meant mainly provide a range interface, which also
gives access to std.algorithm routines. But we were blocked by current
bugs related to ranges. I have posted about those issues (you may
remember having replied to this post).
> In detail:
>
> * Line 130: representing a text as a dchar[][] has its advantages but
> major efficiency issues. To be frank I think it's a disaster. I think a
> representation building on UTF strings directly is bound to be vastly
> better.
I don't understand your point. Where is the difference with D's builtin
types, then?
Also, which efficiency issue do you mention? Upon text object
construction, we do agree and I have given some data. But this happens
only once; it is an investment intended to provide correctness first,
and efficiency of _every_ operation on constructed text.
Upon speed ofsuch methods / algorithms operating _correctly_ on
universal text, precisely, since there is no alternative to Text (yet),
there are also no available performance data to judge.
(What about comparing Objective-C's NSString to Text's current
performance for indexing, slicing, searching, counting,...? Even in its
current experimental stage, I bet it would not be ridiculous, rather the
opposite. But I may be completely wrong.)
> * 163: equality does what std.algorithm.equal does.
>
> * 174: equality also does what std.algorithm.equal does (possibly with a
> custom pred)
Right, these are unimportant tool func at the "pile" level. (Initially
introduced because builtin "==" showed strange inefficency in our case.
May test again later.)
> * 189: TextException is unnecessary
Agreed.
> * 340: Unless properly motivate, iteration with opApply is archaic and
> inefficient.
See range bug evoked above. opApply is the only workaround AFAIK.
Also, ranges cannot yet provide indexed iteration like
foreach(i, char ; text) {...}
> * 370: Why lose the information that the result is in fact a single Pile?
I don't know what information loss you mean.
Generally speaking, Pile is more or less an implementation detail used
to internally represent a true character; while Text is the important thing.
At one time we had to chose whether make Pile an obviously exposed type
as well, or not. I chose (after some exchange on the topic) not to do it
for a few reasons:
* Simplicity: one type does all the job well.
* Avoid confusion due to conflict with historic string types which
elements (codes=characters) were atomic thingies. This was also a reason
not to name it simply "Character"; "Pile" for me was supposed to rather
evoke the technical side than the meaningful side.
* Lightness of the interface: if we expose Pile obviously, then we need
to double all methods that may take or return a single character, like
searching, counting, replacing etc... and also possibly indexing and
iteration.
In fact, the resulting interface is more or less like a string type in
high-level languages such as Python; with the motivating difference that
it operates correctly on universal text.
Now, it seems you rather expect, maybe, the character/pile type to be
the important thing and Text to just be a sequence of them? (possibly
even unnecessary to be defined formally)
> * 430, 456, 474: contains, indexOf, count and probably others should use
> generic algorithms, not duplicate them.
>
> * 534: replace is std.array.replace
I had to write algos because most of them in std.algorithm require a
range interface, IIUC; and also for testing purpose.
> * 623: copy copies the piles shallowly (not sure if that's a problem)
Had the same interrogation.
> As I mentioned before - why not focus on defining a Grapheme type (what
> you call Pile, but using UTF encoding) and defining a ByGrapheme range
> that iterates a UTF-encoded string by grapheme?
Dunno. This simply was not my approach. Seems to me Text as is provides
clients with an interface a simple and clear as possible, while
operating correctly in the backgroung.
It seems if you just build a ByGrapheme iterator, then you have no other
choice than abstracting on the fly (constructing piles on the fly for
operations like indexing and normalising them in addition for searching,
counting...).
As I said in other posts, this may be the right thing to do from an
efficiency point of view, but this remains to be proven. I bet the
opposite, in fact, that --with same implementation language and same
investment in optimisation-- the approach defining a true textual type
like Text is inevitbly more efficient by orders of magnitude (*). Again,
Text construction initial cost is an investment. Prove me wrong (**).
> Andrei
Denis
(*) Except, probably, for the choice of making the ElemenType a
singleton Text (seems costly).
(**) I'm now aware of the high speed loss Text certainly suffers from
representing characters as mini-arrays, but I guess it is marginally
relevant compared to the gain of not piling and normalising for every
operation.
_________________
vita es estrany
spir.wikidot.com
More information about the Digitalmars-d
mailing list