Ready for review: new std.uni

Dmitry Olshansky dmitry.olsh at gmail.com
Fri Jan 11 11:31:11 PST 2013


It's long overdue to present the work I did during the last GSOC, as it 
was a Summer, not Winter, of Code after all. Unfortunately some compiler 
bugs, a new job :) and unrelated important events have postponed its 
release beyond measure.

Anyway it's polished and ready for the good old collective destruction 
called peer review. I'm looking for a review manager.

The code, including extra tests and a benchmark, is here:
https://github.com/blackwhale/gsoc-bench-2012

And documentation:
http://blackwhale.github.com/phobos/uni.html

And with that I'm asking: is there anything already in the queue ready 
to be reviewed?

Currently I prefer to keep it standalone, so read 'uni' everywhere as 
'std.uni'. On the plus side, this makes it easy to try the new 'uni' and 
compare it with the old one without recompiling Phobos.

To use it in place of std.uni, replace 'std.uni'->'uni' in your programs 
and compare the results. Just make sure both the uni and unicode_tables 
modules are linked in; usually rdmd can take care of this dependency.

The list of new functionality is quite large, so I'll point out the major 
sections and leave the rest to the documentation.

In general there are 3 angles to the new std.uni:

1) The same stuff, but better and faster. For one thing, the isXXX 
classification functions are brought up to date with Unicode 6.2 and are 
up to 5-6x faster on non-ASCII text.

2) The stuff commonly expected of any modern Unicode-aware language: 
normalization, grapheme decoding, composition/decomposition and 
case-insensitive comparison*.
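A small sketch of these, assuming the documented names (normalize, graphemeStride, icmp) behave as described in the docs linked above:

```d
import uni;

void main()
{
    // composition/decomposition via normalization forms
    assert(normalize!NFD("é") == "e\u0301");
    assert(normalize!NFC("e\u0301") == "é");

    // grapheme decoding: 'e' plus a combining acute accent is one
    // user-perceived character spanning 3 UTF-8 code units
    assert(graphemeStride("e\u0301", 0) == 3);

    // case-insensitive comparison; returns 0 on equality, like cmp
    assert(icmp("Unicode", "UNICODE") == 0);
}
```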

3) I've taken it as a crucial point to provide the end user with all of 
the tools used to build Unicode algorithms from the ground up.
Thus all of the generally useful data structures used to implement the 
library internals are accessible to 'mortals' too:
  - a type for manipulating sets of codepoints, with full set algebra
  - a construction for generating fast multi-stage lookup tables (Trie)
  - a ton of predefined sets to construct your own specific ones from
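Putting the three together, a sketch (assuming the documented CodepointSet, unicode and toTrie names from the linked docs):

```d
import uni;

void main()
{
    // full set algebra on predefined sets
    auto slavic = unicode.Cyrillic | unicode.Glagolitic;
    assert('ж' in slavic);

    // a custom set built from half-open intervals: ASCII letters
    auto letters = CodepointSet('a', 'z' + 1, 'A', 'Z' + 1);
    assert('q' in letters);
    assert('4' !in letters);

    // compile a set into a fast multi-stage lookup table (Trie)
    auto trie = toTrie!1(slavic);
    assert(trie['щ']);
}
```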

There is extra candy for meta-programming text-processing libraries: 
the set type can generate the source code of an unrolled, hard-coded 
binary search for a given set of codepoints.
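For example (a sketch using the documented toSourceCode method; the function name passed in is arbitrary):

```d
import uni;
import std.stdio;

void main()
{
    // emit D source for a hard-coded membership test over the set's
    // intervals; the generated function needs no runtime tables linked in
    auto set = unicode.Cyrillic;
    writeln(set.toSourceCode("isCyrillic"));
}
```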

Among other things, the entire collection of required data is generated 
automatically by downloading it from unicode.org. The generator relies on 
the same foundation (3), so for the most part this version of std.uni 
can be trivially updated to new versions of the standard (see the 
gen_uni.d script).

* The only missing 'big' thing is the collation algorithm. At this point 
I'm proposing to just move this large chunk of new std.uni into place; 
that way potential contributors would have the tools to implement the 
missing bits later on.

P.S. CodepointSet could easily be adjusted to serve as a generic integer 
set type, and Trie already supports far more than codepoint->value 
mappings. These should probably be enhanced and later adopted into 
std.container(2).

-- 
Dmitry Olshansky
