[GSOC] New unicode module beta, with Grapheme support!

Dmitry Olshansky dmitry.olsh at gmail.com
Wed Aug 22 14:31:55 PDT 2012


Well, officially the final bell has rung, marking the end of GSOC.

Meaning it's about time to show the project to the community.
This time around I sadly have some unresolved issues. Some of these are
my fault; others are well-known bugs in Phobos or the compiler.

Still, there is a lot of cool stuff in there that I'd love to tell you
about:

  - all functions isXXX and toUpper/toLower of the old std.uni interface 
suddenly became faster and/or smarter

  - icmp function that does proper case-insensitive string comparison
and matches e.g. German ß (Sulzbacher form) as equal to 'ss' (full
casefolding rules)

  - performance maniacs can use the faster, simpler one: sicmp, which
maps only 1:1 codepoints (simple casefolding rules)

  - extended grapheme cluster support: a decode operation
(decodeGrapheme) & a slightly simpler one, a la std.utf.stride, that
only gets the length in code units (graphemeStride)

  - normalization: currently only NFD & NFKD, with some issues (see
below), and I still need to triple-check the correctness; NFC & NFKC are
coming soon

  - decomposition (and composition is coming): either Canonical or
Compatibility; it also yields a Grapheme with the decomposed codepoints
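
To make the distinctions in the list above concrete, here is a rough
sketch in Python using only its standard library. It is not this
module's D API: every function name below is hypothetical, the "simple"
folding is approximated by per-codepoint lowercasing, and the cluster
stride is a crude stand-in (a base codepoint plus trailing combining
marks) rather than the full extended grapheme cluster rules.

```python
import unicodedata


def full_fold_cmp(a, b):
    """icmp-like comparison: full case folding, so 1:N mappings apply."""
    x, y = a.casefold(), b.casefold()
    return (x > y) - (x < y)


def simple_fold(s):
    """Approximation of simple (1:1) folding (an assumption of this sketch):
    lowercase each codepoint, but keep the original codepoint whenever
    lowercasing would expand it to more than one codepoint."""
    out = []
    for ch in s:
        low = ch.lower()
        out.append(low if len(low) == 1 else ch)
    return "".join(out)


def simple_fold_cmp(a, b):
    """sicmp-like comparison: only 1:1 codepoint mappings are applied."""
    x, y = simple_fold(a), simple_fold(b)
    return (x > y) - (x < y)


def decompose(s, compatibility=False):
    """Canonical (NFD) or compatibility (NFKD) decomposition."""
    return unicodedata.normalize("NFKD" if compatibility else "NFD", s)


def cluster_stride(s, i=0):
    """Crude grapheme stride, counted in codepoints: one base codepoint
    plus any directly following combining marks. Real extended grapheme
    clusters need the segmentation rules of UAX #29."""
    if not 0 <= i < len(s):
        raise IndexError("index out of range")
    j = i + 1
    while j < len(s) and unicodedata.combining(s[j]) != 0:
        j += 1
    return j - i
```

With these definitions, full_fold_cmp("Ru\u00dfland", "RUSSLAND") treats
the two strings as equal because ß folds to "ss", while simple_fold_cmp
does not, since no 1:1 mapping can bridge the length difference.
Likewise decompose("\ufb01") leaves the "fi" ligature alone, whereas
decompose("\ufb01", compatibility=True) splits it into "f" and "i".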

And last but not least, library users get access to all the power
toys used to construct the above algorithms:
     1) codepoint sets with full & fast set ops
     2) highly customizable multi-stage lookup table (aka Trie) with 
easy helpers to construct optimal multi-level dchar-->bool tables
     3) a ton of predefined Unicode sets: see general property, block or 
script
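
For readers unfamiliar with multi-stage lookup tables, the idea behind
item 2 can be sketched as follows. This is a Python illustration under
stated assumptions, not the module's actual Trie: the names
build_two_stage and lookup are hypothetical, the table has exactly two
levels, and the page size of 256 codepoints is an arbitrary choice.

```python
import unicodedata

PAGE_BITS = 8                 # assumed page size: 256 codepoints
PAGE_SIZE = 1 << PAGE_BITS
MAX_CP = 0x110000             # one past the last Unicode codepoint


def build_two_stage(pred):
    """Build a two-level codepoint -> bool table for predicate `pred`.

    Stage 1 (`index`) has one entry per 256-codepoint block and points
    into stage 2 (`pages`), where identical blocks are stored only once.
    """
    pages = []        # unique pages of PAGE_SIZE booleans
    page_ids = {}     # page contents -> position in `pages`
    index = []        # stage 1: block number -> page number
    for start in range(0, MAX_CP, PAGE_SIZE):
        page = tuple(bool(pred(chr(cp)))
                     for cp in range(start, start + PAGE_SIZE))
        if page not in page_ids:
            page_ids[page] = len(pages)
            pages.append(page)
        index.append(page_ids[page])
    return index, pages


def lookup(table, ch):
    """Two array reads: high bits pick the page, low bits pick the slot."""
    index, pages = table
    cp = ord(ch)
    return pages[index[cp >> PAGE_BITS]][cp & (PAGE_SIZE - 1)]


def is_upper_letter(ch):
    """Example predicate: general category Lu (uppercase letter)."""
    return unicodedata.category(ch) == "Lu"
```

A table built with build_two_stage(is_upper_letter) answers
lookup(table, "A") with two indexing steps and no searching. Because
most 256-codepoint blocks contain no uppercase letters at all, they
share a single all-False page, which is where the space saving of the
multi-stage layout comes from.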

Caveats:
     - NFC & NFKC normalization is in the works; I'll try to get it done
sometime later this week.

     - more than that, normalization depends on a patched Phobos and
still often fails due to bug
http://d.puremagic.com/issues/show_bug.cgi?id=4584.

Patched Phobos is here: 
https://github.com/blackwhale/phobos/tree/stable-sort

     - no 64bit currently. Somehow I managed to break my _fresh_ 64bit
installation of dmd (it fails both on Phobos unit tests & anything in my
project), thus x64 lacks the bulk of the generated tables and is
unsupported right now. Any help is appreciated.

Grab sources + tests, benchmarks, tools and sample data from:
https://github.com/blackwhale/gsoc-bench-2012/zipball/beta

And the sketchy DDoc:
http://blackwhale.github.com/phobos/std_uni.html

The first step to using it is to write "import uni;" instead of
"import std.uni;" and to add uni.d to your command line.

Note: icmp may conflict with its brain-dead twin from std.algorithm (or
was that std.string?); use the usual tricks to disambiguate as necessary.

I'd enjoy some feedback, as way back in 2010 I recall a lot of
Unicode-aware people longing for grapheme support. A short list of Ali
Çehreli, Fawzi Mohamed and Michel Fortin comes to mind; maybe others will
chime in.

P.S. Consider it as "ready for comments" as opposed to "ready for review".

P.P.S. Volunteers who'd like to test x64 are welcome to run
  rdmd gen_uni.d
and report back (maybe it's my local setup problem).


-- 
Olshansky Dmitry

