Formal Review of std.uni

Sun May 12 13:06:19 PDT 2013

On 28-Apr-2013 20:56, Jesse Phillips wrote:
> This is a replacement module for the current std.uni by Dmitry
> Olshansky. The std.uni module provides an implementation of fundamental
> Unicode algorithms and data structures.
>
> To use this module, install 2.63 beta, import uni; and not std.uni,
> compile two files from the source uni.d unicode_tables.d
>
> Docs:
> http://blackwhale.github.io/phobos/uni.html
>
> Source:
> https://github.com/blackwhale/gsoc-bench-2012
>
> DMD Beta:
> http://forum.dlang.org/post/517C8552.7040704@digitalmars.com
>
> It should be noted that inclusion into Phobos may require addressing
> inter-dependencies, see "Reducing the inter-dependencies"
> http://forum.dlang.org/post/kl8hn8$bm3$1@digitalmars.com

We have only one week for review left so I'd like to sort out the last 
issues before we get to the voting.

First, an update on the latest developments.
With a bunch of ugly hacks I've managed to integrate the new std.uni into my 
Phobos fork, and it now passes the unittests for me (on win32 at least).

See it hanging there and waiting to be destroyed by the pull tester:
https://github.com/D-Programming-Language/phobos/pull/1289

Remaining issues that I'm aware of:
- proper toLower/toUpper (the current one is a simplified 
codepoint-for-codepoint mapping)
- clean up the debris after crash-landing back into Phobos, revert some 
unrelated changes, etc.
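
To illustrate why a codepoint-for-codepoint mapping falls short, here is a 
small sketch. It is written in Python rather than D and says nothing about 
the std.uni interface; it only relies on Python 3's str.upper(), which 
applies the full Unicode case mappings, so a single codepoint may expand to 
several. The variable names are made up for the example.

```python
# Illustration only (Python, not the std.uni API): full case mapping can
# change the number of codepoints, which a 1:1 mapping cannot express.
sharp_s = "\u00DF"              # LATIN SMALL LETTER SHARP S
upper_sharp_s = sharp_s.upper() # full mapping expands it to two letters

ligature_fi = "\uFB01"          # LATIN SMALL LIGATURE FI
upper_ligature = ligature_fi.upper()

for source, mapped in [(sharp_s, upper_sharp_s), (ligature_fi, upper_ligature)]:
    print(len(source), "codepoint ->", len(mapped), "codepoints:", mapped)
```

A proper toUpper therefore has to be able to return a result longer than its 
input.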

Please take the time to make that list grow, especially w.r.t. interface 
choices and the code itself.

Separately, I need to remove the rudimentary versions of the same 
data structures used in std.regex and rewire it to use the new std.uni.

There are a few bugs and issues uncovered during integration that I'd like 
to get feedback on.

std.string has a bogus test for toLower:
Of the very few tests being done, 2 cover a very special corner case around 
\u0130, which is capital I with dot above and is expected to be lowercased 
to a plain i. But it's *not* supposed to be: this conversion is specific to 
the Turkic locales (Turkish, Azeri), i.e. a tailoring. What should happen is 
unfolding it into the 2-codepoint sequence 'i' and 'dot-above' (this is in 
the works).
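
The untailored behaviour can be checked with a short sketch. This is Python, 
not D, and only shows what the default Unicode mapping produces; it is not 
how std.uni implements it.

```python
import unicodedata

# Illustration only: default (locale-independent) lowercasing of U+0130.
dotted_capital_i = "\u0130"       # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = dotted_capital_i.lower()

# Name each resulting codepoint to show the two-codepoint sequence.
names = [unicodedata.name(c) for c in lowered]
for codepoint, name in zip(lowered, names):
    print("U+%04X %s" % (ord(codepoint), name))
```

The result is 'i' followed by U+0307 COMBINING DOT ABOVE, not a lone 'i'.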

I just hope nobody depends on these particular conversions, and I wonder 
who put them there in the first place.

std.json is another thing: 0x7F is for some reason specifically tested as 
being accepted as part of a string literal. Yet the ECMAScript docs clearly 
state that Unicode control characters are to be stripped even before lexing 
(ignored even in literals).
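
As a data point for that discussion rather than a verdict, here is how 
another implementation treats the same input. The sketch uses Python's 
standard json module, not std.json; the rejects() helper is hypothetical 
and exists only for this example.

```python
import json
import unicodedata

DEL = "\x7f"
# General category of U+007F in the Unicode Character Database.
del_category = unicodedata.category(DEL)

def rejects(ch):
    """Return True if Python's json refuses `ch` raw inside a string literal."""
    try:
        json.loads('"a' + ch + 'b"')
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return True
    return False

print("category of 0x7F:", del_category)
print("raw 0x01 rejected:", rejects("\x01"))
print("raw 0x7F rejected:", rejects(DEL))
```

Python's parser refuses raw characters in the 0x00-0x1F range inside a 
string but lets 0x7F through, so the std.json test is at least not alone 
in accepting it.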

P.S. Someday I need to track down and file about 2 (or 3?) distinct 
compiler bugs (fwd-ref hell, private alias hijacking) that I worked 
around while getting there.
Another one has a fix already (thanks, Kenji):
http://d.puremagic.com/issues/show_bug.cgi?id=10067

-- 
Dmitry Olshansky