[review] new string type

Fri Dec 3 15:14:36 PST 2010

On Fri, 03 Dec 2010 14:11:35 -0800
Ali Çehreli <acehreli at yahoo.com> wrote:

> spir wrote:
> 
>  > What I have in mind is a "UText" type that provides the right abstraction
>  > for text processing / string manipulation as one has when dealing with ASCII
>  > (in fact any legacy character set). All that is needed is having a true
>  > one-to-one mapping between characters (in the common sense) and elements of
>  > strings (what I call "code stacks"); one given stack unambiguously denotes
>  > one character. To reach this point, in addition to decoding (e.g. from utf8 to
>  > code points), we must:
> 
>  > * group codes into stacks
>  > * normalize (into 'NFD')
> 
> Are those operations independent of the context? Is stacking always desired?
> 
> I guess one would use one of the D string types when grouping or 
> normalization is not desired, right? Makes sense.

We can consider there are two kinds of uses for texts in most applications:
1. just input/output (where input is also from literals in code), possibly with some concatenation.
2. string manipulation / text processing
For the 1st case, there is no need for any sophisticated toolkit like the type I intend to write. We can just read in latin-x, for instance, join it, output it back, without any problems (as long as all pieces of text share the same encoding). Problems arise as soon as text is to be manipulated or processed in any other way: indexing, searching, counting, slicing, replacing, etc. All these routines require isolating _characters_, in the ordinary sense of the word, inside the string of code units or code points.
A true text type, usable like ASCII in the old days, would either provide routines that do that in the background, at a cost paid on every call, or first group codes into characters once only, after which every later operation is as cheap as possible. Normalising and sorting are also required so that each character has only one representation.
I intend to write a little article to explain the issue (& misunderstandings created by Unicode's use of "abstract character"), and possible solutions.
To say it again: if all one needs is text input/output, then using such a tool is overkill. Actually, even the string type Steven is implementing is not strictly necessary. But it would have the advantage, if I understand correctly, of presenting a cleaner interface.
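
To make the "group once, then work on characters" idea concrete, here is a minimal sketch in Python (chosen only because its standard library ships the Unicode data; this is not the planned D implementation). The helper name `to_stacks` is hypothetical, and the grouping rule is a deliberate simplification: a stack is one code point of combining class 0 plus the combining marks that follow it, which is coarser than Unicode's full text segmentation rules.

```python
import unicodedata

def to_stacks(text):
    """Decompose to NFD, then attach each combining mark to the preceding base.

    Hypothetical helper: a simplified notion of "stack", not full
    grapheme-cluster segmentation.
    """
    stacks = []
    for ch in unicodedata.normalize("NFD", text):
        if stacks and unicodedata.combining(ch) != 0:
            stacks[-1] += ch      # combining mark: joins the current stack
        else:
            stacks.append(ch)     # base code point: starts a new stack
    return stacks

# The same word, typed with precomposed and with decomposed code points.
word_a = to_stacks("g\u00e2t\u00e9")      # "gâté" with precomposed letters
word_b = to_stacks("ga\u0302te\u0301")    # "gâté" with combining accents

print(len(word_a))            # 4 characters
print(word_a == word_b)       # True: one representation per character
print(word_a[1] == "a\u0302") # True: plain indexing yields a whole character
```

The grouping cost is paid once in `to_stacks`; afterwards indexing and slicing are ordinary list operations on whole characters.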

>  > * sorts points in stacks
> 
> Ok, I see that it is possible with NFD. I am not experienced with 
> Unicode, but I think there will be issues with other types of Unicode 
> normalization. (Judging from your posts, I know that you know all these. 
> :) )

Yes, the algorithm comes with Unicode's docs about "canonicalisation".
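
As an illustration of that sorting step, here is a small Python sketch, assuming the input is already fully decomposed: within each run of combining marks, the marks are stable-sorted by their canonical combining class. The function name `canonical_order` is made up for the example; the result is compared against the standard library's own NFD.

```python
import unicodedata

def canonical_order(code_points):
    """Stable-sort every run of combining marks by canonical combining class.

    Hypothetical helper; assumes the input is already decomposed.
    """
    out, run = [], []
    for ch in code_points:
        if unicodedata.combining(ch) == 0:
            # A base code point ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# 'a' + acute accent (class 230) + dot below (class 220), in two orders.
one = "a\u0301\u0323"
two = "a\u0323\u0301"

print(one == two)                                      # False as raw code points
print(canonical_order(one) == canonical_order(two))    # True once sorted
print(canonical_order(one) == unicodedata.normalize("NFD", one))  # True
```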

>  > Then, we can for instance index or slice in O(1) as usual, and get a
>  > consistent substring of _characters_ [...] I do not want to deal with 
>  > anything related to script-, language-, locale-specific issues.
> 
> Is the concept of _character_ well defined in Unicode outside of the 
> context of an alphabet? (I think your "script" covers alphabet.)
> 
> It is an interesting decision when we actually want to see an array of 
> code points as characters. When would it be correct to do so? I think 
> the answer is when we start treating the string as a piece of text.
> 
> For a string to be considered as text, it must be based on an alphabet. 
> ASCII strings are pieces of text, because they are based on the 
> 26-letter alphabet.
> 
> I hope I don't sound like I am arguing against anything that you said. I am 
> also thinking about the other common operations that work on pieces of text:
> 
> - sorting (e.g. ç is between c and d in many alphabets)
> - lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)
> 
> As a part of the Turkish D community, we've played with the idea of such 
> a text type. It took advantage of D's support for Unicode encoded source 
> code, so it's fully in Turkish. Yay! :)
> 
> Here is the module that takes care of sorting, capitalization, and 
> producing the base forms of the letters of the alphabets:
> 
>      http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d
> 
> It is also based on dchar[], as you recommend elsewhere in this thread.
> 
> It is written with the older D2 operator overloading, doesn't support 
> ranges, etc. But it currently supports ten alphabets (including the 
> 26-letter English, and the Old Irish alphabet).
> 
> Going out of the context of this thread, we've also worked on a type 
> that contains pieces of text from different alphabets to make a "text", 
> where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".
> 
> I am thinking more than what you describe. But your string would be 
> useful for implementing ours, as we don't have normalization or stacking 
> support at all.

As said, I do not wish to enter the huge area of script-, natural language-, and culture-specific issues, because it is not general; I just target a general-purpose tool. My type wouldn't even have a default uppercase routine, for instance: first, it's script-specific; second, there is no general definition for that (*), even if Unicode provides such data. Sorting issues are even less decidable; it comes down to personal preferences (**). It's also too big, too diverse, too complicated an area.
But I guess the type I have in mind would provide a good basis for such a work (or rather, hundreds of language- and domain-specific tools or applications). At least, issues about grouping codes into characters, and multiple forms of characters, are solved: "ALİ" would always be represented the same way, and in a logical way; so that if you count its characters, you get 3 for sure, and if you search for 'İ' you find it for sure (which is not true today).
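
The "ALİ" case can be checked with a short Python sketch. It uses a hypothetical `to_stacks` helper (NFD decomposition, then each combining mark attached to the preceding base code point), which is a simplified stand-in for the type I have in mind, not its actual design.

```python
import unicodedata

def to_stacks(text):
    """Hypothetical helper: NFD-decompose, then group base + combining marks."""
    stacks = []
    for ch in unicodedata.normalize("NFD", text):
        if stacks and unicodedata.combining(ch) != 0:
            stacks[-1] += ch
        else:
            stacks.append(ch)
    return stacks

precomposed = "AL\u0130"    # İ as the single code point U+0130
decomposed = "ALI\u0307"    # I followed by COMBINING DOT ABOVE

# On raw code points, counting and searching depend on how the text was typed.
print(len(precomposed), len(decomposed))    # 3 4
print("\u0130" in decomposed)               # False

# On stacks, both spellings behave the same.
needle = to_stacks("\u0130")[0]
print(len(to_stacks(precomposed)), len(to_stacks(decomposed)))  # 3 3
print(needle in to_stacks(precomposed))     # True
print(needle in to_stacks(decomposed))      # True
```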

> Thanks,
> Ali

Thanks to you,

Denis

(*) Is the uppercase of "gâté" "GATE" or "GÂTÉ"?
(**) Search for instance threads about users complaining when KDE decided to sort file names according to supposed user-friendly order ("natural", lol!).
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com