[review] new string type

Wed Dec 1 12:50:29 PST 2010

On Wed, 01 Dec 2010 17:41:17 +0100
stephan <none at example.com> wrote:

> 
> >> There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
> >
> > This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)
> >
> 
> Hi Denis, you might want to consider helping us out.
> 
> We have got a feature-complete Unicode normalization, case-folding, and 
> concatenation implementation passing all test cases in 
> http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then 
> some) for all recent Unicode versions. This code was part of a bigger 
> project that we have stopped working on.
> 
> We feel that the Unicode normalization part might be useful to others. 
> Therefore we consider releasing them under an open source license. 
> Before we can do so, we have to clean up things a bit. Some open issues are
> 
> a)    The code still contains some TODOs and FIXMEs (bugs, 
> inefficiencies, some bigger issues like more efficient storing of data 
> etc.).
> 
> b)    No profiling and no benchmarking against the ICU implementation 
> (http://site.icu-project.org/) has been done yet (we expect surprises).
> 
> c)    Implementation of additional Unicode algorithms (e.g. full case 
> mapping, matching, collation).
> 
> Since we have stopped working on the bigger project, we haven’t made 
> much progress. Any help would be welcome. Let me know whether this would 
> be of interest to you.

Yes, of course it would be useful, in any case. Either you wish to go on with your project, and I may be of some help; or it would anyway be a useful base or example of how to implement Unicode algorithms. Maybe it's time to give some more information about what I intend to write. I have done it already (partially in Python, nearly completely in Lua).

What I have in mind is a "UText" type that provides the right abstraction for text processing / string manipulation, the same one has when dealing with ASCII (in fact, any legacy character set). All that is needed is a true one-to-one mapping between characters (in the common sense) and elements of strings (what I call "code stacks"); one given stack unambiguously denotes one character. To reach this point, in addition to decoding (e.g. from utf8 to code points), we must:
* group codes into stacks 
* normalize (into 'NFD')
* sort code points in stacks
That's the base.
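
To make the three steps above a bit more concrete, here is a minimal Python sketch. It is only an illustration, not the Lua code mentioned in this post. The function name to_stacks is made up for the example. It assumes that a "stack" is a base code point followed by its combining marks (non-zero canonical combining class), which is cruder than a full grapheme-cluster algorithm. It also relies on the standard library's unicodedata.normalize("NFD", ...), which both decomposes and puts combining marks in canonical order, so the "normalize" and "sort" steps happen in one call before grouping.

```python
import unicodedata

def to_stacks(text):
    """Return a list of 'code stacks', one per character (hypothetical helper)."""
    # NFD decomposes precomposed characters and applies canonical ordering
    # to combining marks, so equivalent inputs yield identical stacks.
    decomposed = unicodedata.normalize("NFD", text)
    stacks = []
    for ch in decomposed:
        # Assumption: a code point with a non-zero combining class belongs
        # to the stack opened by the preceding base code point.
        if stacks and unicodedata.combining(ch) != 0:
            stacks[-1] += ch
        else:
            stacks.append(ch)
    return stacks
```

With this, to_stacks("\u00e9") and to_stacks("e\u0301") give the same one-element list, which is the one-to-one mapping I am after.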

Then we can, for instance, index or slice in O(1) as usual, and get a consistent substring of _characters_ (not "abstract characters"). We can search for substrings by simple, direct comparisons. When dealing with utf32 strings (or worse, utf8), simple indexing or counting is O(n), or rather O(k.n), where k represents the (high) cost of "stacking", normalizing, and sorting on the fly -- it's not only traversing the whole string instead of random access, it's heavy computation all along the way.
From this base, all kinds of usual routines can be built without any more complexity. That's all I want to implement. I wish to write all the general-purpose ones (which excludes, for instance, anything like casing).
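
As a rough picture of what such a type could look like, here is a small Python sketch. The class and method names are invented for illustration, and the grouping rule (base code point plus following combining marks) is a simplifying assumption. Once the stacks are built at construction time, indexing is a plain list access and searching is element-by-element comparison, with no normalization on the fly. (In this Python version, slicing copies the selected stacks, so it costs time proportional to the slice length.)

```python
import unicodedata

class UText:
    """Hypothetical sketch: text stored as a list of code stacks."""

    def __init__(self, text=""):
        # Pay the decoding/normalizing/stacking cost once, up front.
        self.stacks = []
        for ch in unicodedata.normalize("NFD", text):
            if self.stacks and unicodedata.combining(ch) != 0:
                self.stacks[-1] += ch      # combining mark joins current stack
            else:
                self.stacks.append(ch)     # base code point opens a new stack

    def __len__(self):
        # Number of characters, not of code points or code units.
        return len(self.stacks)

    def __getitem__(self, index):
        if isinstance(index, slice):
            result = UText()
            result.stacks = self.stacks[index]
            return result
        return self.stacks[index]          # O(1) access to one character

    def __eq__(self, other):
        return isinstance(other, UText) and self.stacks == other.stacks

    def find(self, sub):
        """Index of the first occurrence of sub, or -1 (naive search)."""
        n, m = len(self.stacks), len(sub.stacks)
        for i in range(n - m + 1):
            if self.stacks[i:i + m] == sub.stacks:
                return i
        return -1

    def __str__(self):
        return "".join(self.stacks)
```

For example, UText("cafe\u0301 noir").find(UText("\u00e9 n")) finds the match although the two inputs spell the accented letter differently.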

Precisely, I do not want to deal with anything related to script-, language-, or locale-specific issues. That is a completely separate & independent topic. This indeed includes the "compatibility" normalisation forms of Unicode (which precisely do not provide a normal form...). It seems part of your project was to cope with such issues.

I would be happy to cooperate if you feel like going on (then, let us communicate off list). I still have the Lua code (which used to run); even if useless as help for implementation (the languages are too different), it could give a more concrete picture of what I have in mind. Also, it includes several test datasets, reprocessed for usability, from Unicode's online files.

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com