Ready for review: new std.uni

Dmitry Olshansky dmitry.olsh at gmail.com
Fri Jan 11 13:06:30 PST 2013


12-Jan-2013 00:35, H. S. Teoh wrote:
> On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
> [...]
>> Anyway it's polished and ready for the good old collective
>> destruction called peer review. I'm looking for a review manager.
>>
>> The code, including extra tests and a benchmark is here:
>> https://github.com/blackwhale/gsoc-bench-2012
>>
>> And documentation:
>> http://blackwhale.github.com/phobos/uni.html
>
> Excellent!! It looks pretty impressive. We should definitely try our
> best to get this into Phobos.
>
>
> [...]
>> In general there are 3 angles to the new std.uni:
>>
>> 1) The same stuff but better and faster. For one thing isXXX
>> classification functions are brought up to date with Unicode 6.2 and
>> are up to 5-6 times faster on non-ASCII text.
>
> And there is a tool for auto-updating the data, correct? So in theory,
> barring major structural changes in the standard, we should be able to
> recompile Phobos (maybe after running the update tool) and it should be
> automatically updated to the latest Unicode release?

Yup, except they tweak the algorithms all along. The normalization, for 
instance, is full of errata and minor notes accumulated over time. I'll 
probably do a write-up on how to implement it and where to find all the 
related info.

>
>
>> 2) The commonly expected stuff in any modern Unicode-aware language:
>> normalization, grapheme decoding, composition/decomposition and
>> case-insensitive comparison*.
>>
>> 3) I've taken it as a crucial point to provide all of the tools used
>> to build Unicode algorithms from ground up to the end user.
>
> Are there enough tools to implement line-breaking algorithms (TR14)?
>

Should be. AFAIK _word_ breaking seemed trivial to do; I've got to check 
line breaking.

>
>> Thus all generally useful data structures used to implement the
>> library internals are accessible for 'mortals' too:
>>   - a type for manipulating sets of codepoints, with full set algebra
>>   - a construction for generating fast multi-stage lookup tables (Trie)
>>   - a ton of predefined sets to construct your own specific ones from
>
> I looked over the examples in the docs. Looks awesome!
>
> One question though: how do I check a character in a specific unicode
> category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
> specific category, and I assume it's overkill to have one function per
> category, so I presume it should be possible to do this using the
> codepoint sets? Can you add an example of this to the docs, if this is
> possible?

auto psSet = unicode.Ps;
assert(psSet['(']); // '(' is in Ps (open punctuation), so this passes
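
Since the general categories are disjoint, unicode.Ps already contains 
only Ps codepoints; for custom combinations the set algebra works 
directly. A minimal sketch (operator spelling per the current docs, so 
treat the exact API as an assumption):

```d
import std.uni;

void main()
{
    // Ps minus anything also in Pd or Pe -- redundant for disjoint
    // general categories, but it demonstrates the set algebra:
    auto opensOnly = unicode.Ps - (unicode.Pd | unicode.Pe);
    assert(opensOnly['(']);  // '(' is Ps
    assert(!opensOnly['-']); // '-' is Pd, so it's not in the result
}
```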

>
>
> [...]
>> Among other things the entire collection of data required is
>> generated automatically by downloading from unicode.org. The tool
>> relies on the same foundation (3) and for the most part this version
>> of std.uni should be trivially updated to the new versions of the
>> standard (see gen_uni.d script).
>
> Excellent!
>
>
>> * The only missing 'big' thing is the collation algorithm. At this
>> point I'm proposing to just move the large chunk of new std.uni in
>> place. This way potential contributors would have tools to implement
>> missing bits later on.
>
> But the aforementioned tools should be enough for someone to use as the
> basis for implementing the collation algorithm right?
>
I sure hope so. I've been looking through all of the TRs and found 
nothing surprising beyond more and more tricky codepoint sets and more 
and more weird rules to use them.

>
>> P.S. CodepointSet could be easily adjusted to serve as a generic
>> integer set type, and Trie already supports far more than
>> codepoint->value mappings. These should probably be enhanced and
>> later adopted into std.container(2).
> [...]
>
> Yes, having an efficient integer set type and generic Trie type would be
> a huge plus to add to std.container. Although, isn't Andrei supposed to
> be working on a new std.container, with a better range-based API and
> custom allocator scheme? If that's not going to happen for a while yet,
> I say we should merge std.uni first, then worry about factoring Trie and
> int set later (we can always just alias them in std.uni later when we
> move them, that should prevent user code breakage).
>
> Now, some nitpicks:
>

Thanks for early feedback.

> - InversionList:
>
>     - I don't like using ~ to mean symmetric set difference, because ~
>       already means concatenation in D, and it's confusing to overload it
>       with an unrelated meaning. I propose using ^ instead, because
>       symmetric set difference is analogous to XOR.
>

Point taken; arguably '~' was a bad idea. I don't like ^ either, but it 
seems like the best candidate.
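
For the record, a sketch of the current semantics, with '~' as symmetric 
set difference (under the proposal this same operation would be spelled 
'^'):

```d
import std.uni;

void main()
{
    auto open = unicode.Ps;  // open punctuation
    auto close = unicode.Pe; // close punctuation
    // Symmetric difference: codepoints in exactly one of the two sets.
    // Ps and Pe are disjoint, so here it equals their union.
    auto symDiff = open ~ close;
    assert(symDiff['('] && symDiff[')']);
    // Intersection of disjoint sets is empty:
    assert((open & close).empty);
}
```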


>     - Why is the table of operators in opBinary's doc duplicated twice?
>       Is the second table supposed to be something else?
>

Beats me. Must be some glitch with the BOOKTABLE macro (or how I use 
it). I think I've hit it before.

> - We should at least briefly describe what the various Unicode
>    normalization forms mean (yes I know they can be looked up on
>    unicode.org and various other online resources, but having them right
>    there in the docs makes the docs so much more helpful).
>

> - Why are the deprecated functions still in there? The deprecation
>    messages say "It will be removed in August 2012", which is already
>    past. So they should be removed by now.
>

Will do, before Walter chimes in with his motto of keeping clutter 
around just in case.

> - Is there a reason there's a composeJamo / hangulDecompose, but no
>    other writing system specific functions (such as functions to deal
>    with Thai consonant/vowel compositions)? Is this a peculiarity of the
>    Unicode standard? Or should there be more general functions that
>    handle composition/decomposition of compound characters?
>
Hangul syllables are huge in number (11K+) and have a specific 
algorithmic decomposition rule. They are also a hell of a special case 
in the Unicode standard (mentioned in one chapter only!). Everything 
else is table-driven.
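
That algorithmic rule is pure arithmetic; a sketch of it, with the 
constants taken from the Unicode standard (this is an illustration, not 
std.uni's actual code):

```d
// Hangul syllable decomposition per the Unicode standard's
// conjoining-jamo arithmetic: each precomposed syllable S maps
// to a leading consonant L, vowel V, and optional trailing T.
enum dchar SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
enum LCount = 19, VCount = 21, TCount = 28;
enum NCount = VCount * TCount; // 588
enum SCount = LCount * NCount; // 11172 precomposed syllables

dchar[] decomposeHangul(dchar s)
{
    if (s < SBase || s >= SBase + SCount)
        return [s]; // not a precomposed Hangul syllable
    immutable sIndex = s - SBase;
    immutable l = cast(dchar)(LBase + sIndex / NCount);
    immutable v = cast(dchar)(VBase + (sIndex % NCount) / TCount);
    immutable t = cast(dchar)(TBase + sIndex % TCount);
    return t == TBase ? [l, v] : [l, v, t];
}

void main()
{
    // U+AC01 decomposes to U+1100, U+1161, U+11A8:
    assert(decomposeHangul('\uAC01') == "\u1100\u1161\u11A8"d);
    // U+AC00 has no trailing consonant:
    assert(decomposeHangul('\uAC00') == "\u1100\u1161"d);
}
```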

> - Why are many of the category checking functions @system rather than
>    @safe? It would seem rather crippling to me if much of std.uni can't
>    be used from @safe code!
>

@system? Well, these ought to be either @trusted or @safe.

> - It would also be nice if these functions can be made pure (I'm not
>    sure I understand why checking the category of a character should be
>    impure).

It's a sophisticated global immutable table. I think this can be pure, 
but the compiler might have disagreed.

>  The lack of nothrow I can understand, since the input
>    character may be illegal or otherwise malformed.

nothrow is probably doable, as any codepoint will do (and if it's > 
dchar.max then the problem is somewhere else).

TBH I killed these qualifiers early on as they prevented the code from 
compiling. I can't recall if there is still a problem with any of these.

> But nothrow pure
>    seems to me to be necessary for all category-checking functions.
>


-- 
Dmitry Olshansky

