Ready for review: new std.uni
Dmitry Olshansky
dmitry.olsh at gmail.com
Fri Jan 11 11:31:11 PST 2013
Presenting the work I did during the last GSOC is long overdue; it was a Summer of Code, not a winter of code, after all. Unfortunately, compiler bugs, a new job :) and unrelated events of importance have postponed its release beyond measure.
Anyway, it's polished and ready for the good old collective destruction called peer review. I'm looking for a review manager.
The code, including extra tests and a benchmark, is here:
https://github.com/blackwhale/gsoc-bench-2012
And documentation:
http://blackwhale.github.com/phobos/uni.html
And with that, I'm asking whether there is anything else already queued up for review.
Currently I prefer to keep it standalone, so read 'uni' everywhere as 'std.uni'. On the plus side, this makes it easy to try the new 'uni' and compare it with the old one without recompiling Phobos.
To use it in place of std.uni, replace 'std.uni'->'uni' in your programs and compare the results. Just make sure both the uni and unicode_tables modules are linked in; usually rdmd can take care of this dependency.
The list of new functionality is quite large, so I'll point out the major sections and leave the rest to the documentation.
In general there are 3 angles to the new std.uni:
1) The same stuff, but better and faster. For one thing, the isXXX
classification functions are brought up to date with Unicode 6.2 and are
up to 5-6x faster on non-ASCII text.
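To illustrate what point 1 covers, here is a minimal sketch using the classification functions as shown in the linked documentation (read 'uni' as 'std.uni'; with stock Phobos, import std.uni instead):

```d
import std.uni;

void main()
{
    // Unicode-aware classification, not just ASCII:
    assert(isAlpha('л'));       // Cyrillic letter
    assert(!isAlpha('7'));
    assert(isNumber('७'));      // Devanagari digit seven
    assert(isWhite('\u2028'));  // LINE SEPARATOR counts as whitespace
}
```

The speedup claim applies exactly to calls like these on non-ASCII input, where the old implementation fell back to slow paths.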
2) The commonly expected stuff in any modern Unicode-aware language:
normalization, grapheme decoding, composition/decomposition and
case-insensitive comparison*.
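A sketch of the point-2 functionality, based on the linked documentation (again, read 'uni' as 'std.uni'):

```d
import std.uni;
import std.range : walkLength;

void main()
{
    // Normalization: precomposed "é" vs. 'e' + combining acute accent.
    assert(normalize!NFD("é") == "e\u0301");
    assert(normalize!NFC("e\u0301") == "é");

    // Grapheme decoding: one user-perceived character, two codepoints.
    assert("e\u0301".byGrapheme.walkLength == 1);

    // Case-insensitive comparison, Unicode-aware.
    assert(icmp("Привет", "привет") == 0);
}
```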
3) I've taken it as a crucial point to provide the end user with all of
the tools needed to build Unicode algorithms from the ground up.
Thus all generally useful data structures used to implement the library
internals are accessible to 'mortals' too:
- a type for manipulating sets of codepoints, with full set algebra
- a construction for generating fast multi-stage lookup tables (Trie)
- a ton of predefined sets to construct your own specific ones from
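A sketch of these building blocks in action; the names (unicode.Cyrillic, CodepointSet, toTrie) are taken from the linked documentation, and as before 'uni' stands in for 'std.uni':

```d
import std.uni;

void main()
{
    // Full set algebra on codepoint sets, starting from predefined ones.
    auto slavic = unicode.Cyrillic | unicode("Glagolitic");
    auto set = slavic - CodepointSet('Ъ', 'Ъ' + 1);
    assert(set['ж']);
    assert(!set['Ъ']);   // carved out above

    // Compile the set into a fast multi-stage lookup table (Trie).
    auto trie = toTrie!2(set);
    assert(trie['ж']);
    assert(!trie['z']);
}
```

The set is the flexible, mutable representation; the Trie is the frozen, lookup-optimized one you build once and query in hot loops.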
There is extra candy for meta-programming text-processing libraries:
a set type can generate the source code of an unrolled, hard-coded
binary search for a given set of codepoints.
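A sketch of that code-generation candy, per the documented toSourceCode member (the function name "isCyrillic" here is just an illustrative choice):

```d
import std.uni;
import std.stdio;

void main()
{
    auto set = unicode.Cyrillic;
    // Emits the text of something like:
    //   bool isCyrillic(dchar ch) { ...unrolled binary search... }
    string src = set.toSourceCode("isCyrillic");
    writeln(src); // paste into a module, or mix it in at compile time
}
```

The generated function tests membership with no table lookups at all, which is exactly what a compile-time regex or tokenizer generator wants to mix in.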
Among other things, the entire collection of required data is generated
automatically by downloading it from unicode.org. The tool relies on the
same foundation (3), and for the most part this version of std.uni should
be trivially updatable to new versions of the standard (see the gen_uni.d
script).
* The only missing 'big' thing is the collation algorithm. At this point
I'm proposing to just move this large chunk of new std.uni into place.
That way potential contributors will have the tools to implement the
missing bits later on.
P.S. CodepointSet could easily be adjusted to serve as a generic integer
set type, and Trie already supports far more than codepoint->value
mappings. These should probably be enhanced and later adopted into
std.container(2).
--
Dmitry Olshansky