[review] new string type

Wed Dec 1 08:41:17 PST 2010

>> There's one other issue that should be considered at some stage: normalization and the fact that a single "character" can be constructed from several code points. (acutes and such)
>
> This is my next little project. May build on Steve's job. (But it's not necessary, dchar is enough as a base, I guess.)
>

Hi Denis, you might want to consider helping us out.

We have a feature-complete implementation of Unicode normalization, 
case folding, and concatenation. It passes all test cases in 
http://unicode.org/Public/6.0.0/ucd/NormalizationTest.txt (and then 
some) for all recent Unicode versions. This code was part of a bigger 
project that we have stopped working on.
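
For anyone who has not looked at that test file: each data line holds 
five semicolon-separated columns of hex code points (source, NFC, NFD, 
NFKC, NFKD), and the file header lists the invariants a conformant 
implementation must satisfy. Below is a minimal sketch of such a check. 
It uses Python's standard unicodedata module as a stand-in, not our 
code, and the helper names parse_line and check_columns are made up for 
illustration.

```python
import unicodedata


def parse_line(line):
    """Return the five columns of a data line as strings, or None.

    Comment lines ("# ...") and part headers ("@Part0 ...") carry no
    test data and yield None.
    """
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = data.split(";")[:5]
    # Each field is a space-separated list of hex code points.
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]


def check_columns(c1, c2, c3, c4, c5):
    """Check the conformance invariants stated in the test file header."""
    nf = unicodedata.normalize
    return (
        # c2 == NFC(c1) == NFC(c2) == NFC(c3); c4 == NFC(c4) == NFC(c5)
        all(nf("NFC", c) == c2 for c in (c1, c2, c3))
        and all(nf("NFC", c) == c4 for c in (c4, c5))
        # c3 == NFD(c1) == NFD(c2) == NFD(c3); c5 == NFD(c4) == NFD(c5)
        and all(nf("NFD", c) == c3 for c in (c1, c2, c3))
        and all(nf("NFD", c) == c5 for c in (c4, c5))
        # c4 == NFKC(c1..c5); c5 == NFKD(c1..c5)
        and all(nf("NFKC", c) == c4 for c in (c1, c2, c3, c4, c5))
        and all(nf("NFKD", c) == c5 for c in (c1, c2, c3, c4, c5))
    )


# One data line in the file's format: D WITH DOT ABOVE (U+1E0A) and its
# decomposition into "D" plus COMBINING DOT ABOVE (U+0307).
sample = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
columns = parse_line(sample)
ok = check_columns(*columns)
```

The sample line also shows the point raised above: the second and third 
columns are the same "character" to a reader, once as one code point and 
once as two.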

We feel that the Unicode normalization part might be useful to others. 
We are therefore considering releasing it under an open source license. 
Before we can do so, we have to clean things up a bit. Some open issues are:

a)    The code still contains some TODOs and FIXMEs (bugs, 
inefficiencies, and some bigger issues such as more efficient data 
storage).

b)    No profiling and no benchmarking against the ICU implementation 
(http://site.icu-project.org/) has been done yet (we expect surprises).

c)    Additional Unicode algorithms (e.g. full case mapping, matching, 
collation) are not implemented yet.
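
To make item c) a bit more concrete, here is a small sketch of how case 
folding, full case mapping, and matching relate. It again relies on 
Python's built-in str methods and unicodedata as a stand-in and says 
nothing about how our code does it; caseless_equal is a made-up name 
that follows the Unicode Standard's definition of a canonical caseless 
match.

```python
import unicodedata


def caseless_equal(a, b):
    """Canonical caseless match: NFD(casefold(NFD(x))) on both sides."""
    def key(s):
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return key(a) == key(b)


word = "Stra\u00dfe"       # "Strasse" spelled with U+00DF SHARP S
folded = word.casefold()   # case folding: U+00DF becomes "ss"
upper = word.upper()       # full case mapping: U+00DF becomes "SS"

# Matching needs normalization as well as folding: precomposed U+00C9
# should match "e" followed by COMBINING ACUTE ACCENT (U+0301).
same = caseless_equal("\u00c9", "e\u0301")
```

Note that the full uppercase mapping changes the length of the string, 
which a one-to-one (simple) mapping cannot express.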

Since we have stopped working on the bigger project, we haven’t made 
much progress. Any help would be welcome. Let me know whether this would 
be of interest to you.