string repr & range levels [was: Re: VLERange: ...]

Wed Jan 19 02:38:43 PST 2011

On 01/19/2011 08:43 AM, Ali Çehreli wrote:
> Michel Fortin wrote:
>  > On 2011-01-17 17:54:04 -0500, Michel Fortin <michel.fortin at michelf.com>
>  > said:
>
>  > So perhaps the best interface for strings would be to provide multiple
>  > range-like interfaces that you can use at the level you want.
>
> That's what I've been thinking. The users can choose whether they want
> random access or not. A grapheme-aware string can provide random access
> at a space cost, or no random access for efficient space use.
>
> I see 5 layers in string processing. Layers 1 and 2 are currently
> handled by D, sometimes in an unclear way. e.g. char[] may be used as an
> array of code units or an array of code points depending on the type of
> iteration.

This is very good and helpful summary. But you do not list all relevant 
aspects of the question, I guess. Defining which codes belong to a given 
grapheme (what I call "piling") is necessary for true O(1) 
random-access, but not only. More importantly, all operations involving 
equality comparison (find, count, replace,...) require normalisation 
--in addition to piling.
A few notes:

> 1) Code units: This is what D provides with its string types
>
> This layers models RandomAccessRange

This level is pure implementation artifact that simply cannot make any 
sense. (from user and thus programmer points of view)
Any kind of text manipulation (slice, find, replace...) may lead to 
random incorrectness, except when source texts can be guaranteed to hold 
plain ASCII (which may be hard to prove).
Conversely, pieces of text only passed around by an app do not require 
any more costly representation, in terms of time (decoding) or space. In 
addition, concat works provided all pieces share the same encoding 
(ASCII beeing a subset of most historic charsets and of UTF-8).

> 2) Code points: This is what D and Phobos provide for example with
> foreach(d; stride(s, 1))
>
> dchar[] models RandomAccessRange at this layer
>
> char[] and wchar[] model ForwardRange at this layer
>
> (If I understand it correctly, Steven Schveighoffer is trying to provide
> a pseudo-RandomAccessRange to char[] and wchar[] with his string type.)

This level is also a kind of implementation artifact, compared to 
historic charsets, but actually based on a real fact of natural 
languages: they hold composite characters that can thus be coded by 
combining lower-level codes which represent "scripting marks" (base & 
combining ones).
For this reason, this level can have some sense. My latest guess is that 
apps that consider text as a study object (read linguistic apps), 
instead of a means, may regurarly need operating at this level, in 
addition to the next one.
Normalisation can be applied at this level --and is necessary for the 
above kind of use case. But using it for operations requiring compare 
will typically also require "piling", that is the next level, if only to 
determine what is to be compared.

> 3) Graphemes: This is what the string type that spir is working on.
> There could be at least two types:

This is the meaningful level for, probably, nearly all applications.

> 3a) RandomAccessGraphemeRange: Has random access but the data type is large

I guess this is Text's approach? Text is "flash fast" indeed for any 
operation benefiting from random-access. But not only: since it 
normalises its input, it should be far faster for any operation using 
compare (rough evaluations suggest a speed ratio of 1 to 2 orders of 
magnitude).
The cost is high in terms of space, which in turn certainly reduces its 
speed gain in the general case, because to cache (miss) effects. (Thank 
you Michel for making this clear.)

> 3b) ForwardGraphemeRange: space-efficient but does not provide random
> access

Is it what Andrei expects, namely a Grapheme type with a corresponding 
ByGrapheme iterator IIUC?
Time efficiency of operations?

3) metadata RandomAccessGraphemeRange

Michel Fortin suggested (off list) an alternative approach to Text: 
instead of actually "piling" at construction time, just store metadata 
upon grapheme bounds. The core benefit is indeed to keep "normal" text 
storage (meaning *char[], for modification): would this point please 
Andrei better?
I let you evaluate various consequences of this change (mostly positive, 
I guess). The same metadata principle could certainly be used for 
further optimisations, but this is another story.
I'm motivated to implement this variant, looke like best of both worlds 
tome. (support welcome ;-)

> I think the programmers would be happy to be able to choose.
>
> 4) Letters: Uses either 3a or 3b. This is the layer where the idea of a
> writing system enters the picture: lower/upper case transformations and
> sorting happen at this layer. (I have a library that tries to handle
> this layer but is ignorant of graphemes; I am waiting for spir's string
> type. ;))
>
> 4a) Models RandomAccessRange if based on a RandomAccessGraphemeRange
>
> 4b) Models ForwardRange if based on a ForwardGraphemeRange

I do not understand what this level means. For me, letters are, 
precisely, archetypical true characters, meaning level 3.

[Note: "grapheme", used by Unicode to denote the common sense of 
"character", is simply wrong: "sh" and "ti" are graphemes in english 
(for the same phoneme /ʃ/), not characters; and tab, §, or © are 
probalby not considered graphemes by linguists, while they are 
characters. This is the reason why I try to avoid this term and use 
"character", like ICU's doc, to avoid even more confusion.]

> 5) Text: Collection of Letters. This is where a name like "ali & tim" is
> correctly capitalized as "ALİ & TIM" because the text consists of two
> separate writing systems. (The same library that I mentioned in 4 tries
> to handle this layer as well.)

This is an immensely complicated field. Note that it has nothing to do 
with text & character representation issues: whatever the character set, 
one has to confront problems like uppercase of 'i', 'ss' vs 'ß', 
definiton of "letter" or "character", matching, sorting order...
Text does not even try to address natural language issues. Instead it 
deals onl,y but hopefully clearly & correctly, with restoring simple and 
safe representation for client apps.

> Ali

Denis
_________________
vita es estrany
spir.wikidot.com