Ready for review: new std.uni

H. S. Teoh hsteoh at quickfur.ath.cx
Fri Jan 11 12:35:42 PST 2013


On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
[...]
> Anyway it's polished and ready for the good old collective
> destruction called peer review. I'm looking for a review manager.
> 
> The code, including extra tests and a benchmark is here:
> https://github.com/blackwhale/gsoc-bench-2012
> 
> And documentation:
> http://blackwhale.github.com/phobos/uni.html

Excellent!! It looks pretty impressive. We should definitely try our
best to get this into Phobos.


[...]
> In general there are 3 angles to the new std.uni:
> 
> 1) The same stuff but better and faster. For one thing isXXX
> classification functions are brought up to date with Unicode 6.2 and
> are up to 5-6x times faster on non-ASCII text.

And there is a tool for auto-updating the data, correct? So in theory,
barring major structural changes in the standard, we should be able to
recompile Phobos (maybe after running the update tool) and it should be
automatically updated to the latest Unicode release?


> 2) The commonly expected stuff in any modern Unicode-aware language:
> normalization, grapheme decoding, composition/decomposition and
> case-insensitive comparison*.
>
> 3) I've taken it as a crucial point to provide all of the tools used
> to build Unicode algorithms from ground up to the end user.

Are there enough tools to implement line-breaking algorithms (TR14)?


> Thus all generally useful data structures used to implement the
> library internals are accessible for 'mortals' too:
>  - a type for manipulating sets of codepoints, with full set algebra
>  - a construction for generating fast multi-stage lookup tables (Trie)
>  - a ton of predefined sets to construct your own specific ones from

I looked over the examples in the docs. Looks awesome!
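
For instance, if I'm reading the docs right, set algebra and Trie
construction would look roughly like this (an untested sketch; the
`unicode.L` / `unicode.Greek` properties, the opIndex membership test,
and `toTrie` are my guesses at the API from the linked docs):

```d
import std.uni;

void main()
{
    // Set algebra on codepoint sets:
    auto letters = unicode.L;            // all letters (General_Category L*)
    auto greek   = unicode.Greek;        // the Greek script
    auto greekLetters = letters & greek; // intersection

    // Membership test on an InversionList:
    assert(greekLetters['λ']);
    assert(!greekLetters['5']);

    // A fast multi-stage lookup table built from the set:
    auto trie = toTrie!1(greekLetters);
    assert(trie['λ']);
}
```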

One question though: how do I check whether a character belongs to one
specific Unicode category (say Ps, but not Pd, Pe, etc.)? I didn't see a
function for each individual category, and I assume it would be overkill
to have one function per category, so I presume this should be possible
using the codepoint sets? If so, can you add an example of this to the
docs?
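
For what it's worth, my naive guess based on the docs would be
something like the following (untested; that `unicode.Ps` exposes the
General_Category set and that opIndex does membership testing are both
assumptions on my part):

```d
import std.uni;

void main()
{
    // Assuming General_Category abbreviations are exposed as properties:
    auto openPunct = unicode.Ps;   // Ps = open punctuation

    assert(openPunct['(']);        // U+0028 LEFT PARENTHESIS is Ps
    assert(!openPunct[')']);       // U+0029 is Pe (close punctuation), not Ps
}
```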


[...]
> Among other things the entire collection of data required is
> generated automatically by downloading from unicode.org. The tool
> relies on the same foundation (3) and for the most part this version
> of std.uni should be trivially updated to the new versions of the
> standard (see gen_uni.d script).

Excellent!


> * The only missing 'big' thing is the collation algorithm. At this
> point I'm proposing to just move the large chunk of new std.uni in
> place. This way potential contributors would have tools to implement
> missing bits later on.

But the aforementioned tools should be enough for someone to use as the
basis for implementing the collation algorithm, right?


> P.S. CodepointSet could be easily adjusted to serve as generic
> integer set type and Trie already supports far more the
> codepoints->values mappings. These should probably be enhanced and
> later adopted to std.container(2).
[...]

Yes, having an efficient integer set type and generic Trie type would be
a huge plus to add to std.container. Although, isn't Andrei supposed to
be working on a new std.container, with a better range-based API and
custom allocator scheme? If that's not going to happen for a while yet,
I say we should merge std.uni first, then worry about factoring Trie and
int set later (we can always just alias them in std.uni later when we
move them, that should prevent user code breakage).

Now, some nitpicks:

- InversionList:

   - I don't like using ~ to mean symmetric set difference, because ~
     already means concatenation in D, and it's confusing to overload it
     with an unrelated meaning. I propose using ^ instead, because
     symmetric set difference is analogous to XOR.

   - Why is the table of operators in opBinary's doc duplicated? Is the
     second table supposed to be something else?

- We should at least briefly describe what the various Unicode
  normalization forms mean (yes I know they can be looked up on
  unicode.org and various other online resources, but having them right
  there in the docs makes the docs so much more helpful).

- Why are the deprecated functions still in there? The deprecation
  messages say "It will be removed in August 2012", which is already
  past. So they should be removed by now.

- Is there a reason there's a composeJamo / hangulDecompose, but no
  other writing system specific functions (such as functions to deal
  with Thai consonant/vowel compositions)? Is this a peculiarity of the
  Unicode standard? Or should there be more general functions that
  handle composition/decomposition of compound characters?

- Why are many of the category checking functions @system rather than
  @safe? It would seem rather crippling to me if much of std.uni can't
  be used from @safe code!

- It would also be nice if these functions can be made pure (I'm not
  sure I understand why checking the category of a character should be
  impure). The lack of nothrow I can understand, since the input
  character may be illegal or otherwise malformed. But @safe pure seems
  to me to be necessary for all category-checking functions.
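
To illustrate, here's the kind of client code that should compile once
those attributes are in place (a sketch of what I'd *like* to be able to
write, not a claim about the current API):

```d
import std.uni : isAlpha;

// This only compiles if isAlpha itself is @safe and pure:
size_t countAlpha(const(dchar)[] s) @safe pure
{
    size_t n;
    foreach (dchar c; s)
        if (isAlpha(c))
            ++n;
    return n;
}
```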


T

-- 
The volume of a pizza of thickness a and radius z can be described by
the following formula: pi zz a. -- Wouter Verhelst


More information about the Digitalmars-d mailing list