Ready for review: new std.uni

Fri Jan 11 13:33:44 PST 2013

On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
> 12-Jan-2013 00:35, H. S. Teoh wrote:
> >On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
> >[...]
> >>Anyway it's polished and ready for the good old collective
> >>destruction called peer review. I'm looking for a review manager.
> >>
> >>The code, including extra tests and a benchmark is here:
> >>https://github.com/blackwhale/gsoc-bench-2012
> >>
> >>And documentation:
> >>http://blackwhale.github.com/phobos/uni.html
[...]
> >>2) The commonly expected stuff in any modern Unicode-aware language:
> >>normalization, grapheme decoding, composition/decomposition and
> >>case-insensitive comparison*.
> >>
> >>3) I've taken it as a crucial point to provide all of the tools used
> >>to build Unicode algorithms from ground up to the end user.
> >
> >Are there enough tools to implement line-breaking algorithms (TR14)?
> >
> 
> Should be. AFAIK _word_ breaking seemed trivial to do, got to check
> the line breaking.

Line-breaking is in TR14. I looked over it. It's definitely more
complicated than I expected it to be. Some alphabets may even require
stylistic rules for line-breaking! (Not to mention hyphenation.) So it's
probably too much to expect std.uni to do it. But at least the tools
necessary to implement it should be there.

[...]
> >One question though: how do I check a character in a specific unicode
> >category (say Ps, but not Pd, Pe, etc.)? I didn't see a function for the
> >specific category, and I assume it's overkill to have one function per
> >category, so I presume it should be possible to do this using the
> >codepoint sets? Can you add an example of this to the docs, if this is
> >possible?
> 
> auto psSet = unicode.Ps;
> assert(psSet['(']); //should pass

It would be nice to have a list of possible categories in the docs for
unicode.opDispatch/opCall, or at least the common categories, with a
hyperlink to unicode.org (or wherever) that provides the full list.
Currently it's unclear, beyond the single example given, what the
possibilities are.
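
For instance, something along these lines in the docs would already help.
This is only a sketch: it assumes the two-letter general category names (Ps,
Pe, Pd) all resolve through unicode.opDispatch the same way unicode.Ps does
in the example quoted above, and the helper names are made up for
illustration.

```d
import std.uni;

// Hypothetical helpers, not part of std.uni. Each one indexes a
// code point set obtained by category name.
bool isOpenPunct(dchar c)  { return unicode.Ps[c]; } // Open_Punctuation
bool isClosePunct(dchar c) { return unicode.Pe[c]; } // Close_Punctuation
bool isDashPunct(dchar c)  { return unicode.Pd[c]; } // Dash_Punctuation

unittest
{
    // '(' opens, ')' closes, '-' is a dash; the sets do not overlap.
    assert(isOpenPunct('(') && !isOpenPunct(')'));
    assert(isClosePunct(')') && !isClosePunct('('));
    assert(isDashPunct('-') && !isOpenPunct('-'));
}
```

This answers the "Ps, but not Pd, Pe" question directly, without needing
one library function per category.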

[...]
> >- We should at least briefly describe what the various Unicode
> >   normalization forms mean (yes I know they can be looked up on
> >   unicode.org and various other online resources, but having them right
> >   there in the docs makes the docs so much more helpful).
> >
> 
> >- Why are the deprecated functions still in there? The deprecation
> >   messages say "It will be removed in August 2012", which is already
> >   past. So they should be removed by now.
> >
> 
> Will do before Walter chimes in with his motto of keeping clutter
> just in case.

In that case, I would keep the functions (aliases?) but remove them from
DDoc. So old code will still silently work, but the functions will no
longer be officially supported, there will be no documentation for them
and they can disappear anytime.
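
Roughly what I have in mind, as a sketch with made-up names (neither
function below is the actual std.uni code): DDoc only generates output for
declarations that carry a doc comment, so an alias with a plain comment
keeps compiling but never shows up in the docs.

```d
/// Documented, supported name (hypothetical, for illustration only).
bool isAsciiLetter(dchar c) pure nothrow @safe
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Ordinary comment, not a doc comment, so DDoc emits nothing here.
// Hypothetical legacy name, kept only so that old code still compiles.
alias isAsciiAlphaOld = isAsciiLetter;

unittest
{
    // Old and new names behave identically.
    assert(isAsciiAlphaOld('x') == isAsciiLetter('x'));
    assert(!isAsciiAlphaOld('1'));
}
```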

> >- Is there a reason there's a composeJamo / hangulDecompose, but no
> >   other writing system specific functions (such as functions to deal
> >   with Thai consonant/vowel compositions)? Is this a peculiarity of
> >   the Unicode standard? Or should there be more general functions
> >   that handle composition/decomposition of compound characters?
> >
> Hangul are huge (11K+) and they have a specific algorithmic
> decomposition rule. Also they are hell of a special case in Unicode
> standard (that is mentioned in one chapter only!). Everything else
> is table-driven.

OK. Makes sense, I guess. Though it gave me the impression that std.uni
has some undue bias for Korean text. ;-)
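
For reference, the algorithmic rule is just arithmetic on the code point.
Below is a minimal sketch using the constants from the Unicode Standard's
Hangul syllable decomposition; decomposeSyllable is a name I made up for
illustration, and this is not how std.uni's hangulDecompose is implemented
or declared.

```d
// Constants from the Unicode Standard (Hangul syllable decomposition).
enum dchar SBase = 0xAC00; // first precomposed syllable
enum dchar LBase = 0x1100; // first leading consonant
enum dchar VBase = 0x1161; // first vowel
enum dchar TBase = 0x11A7; // one before the first trailing consonant
enum int VCount = 21;
enum int TCount = 28;
enum int NCount = VCount * TCount; // 588
enum int SCount = 19 * NCount;     // 11172 syllables, hence "11K+"

/// Hypothetical helper: splits one precomposed Hangul syllable into
/// its jamo; any other code point is returned unchanged.
dchar[] decomposeSyllable(dchar s) pure nothrow @safe
{
    if (s < SBase || s >= SBase + SCount)
        return [s];

    immutable int sIndex = cast(int)(s - SBase);
    dchar[] result = [
        cast(dchar)(LBase + sIndex / NCount),
        cast(dchar)(VBase + (sIndex % NCount) / TCount)
    ];
    immutable int t = sIndex % TCount;
    if (t != 0) // t == 0 means there is no trailing consonant
        result ~= cast(dchar)(TBase + t);
    return result;
}

unittest
{
    assert(decomposeSyllable('\uAC00') == "\u1100\u1161"d);
    assert(decomposeSyllable('\uD55C') == "\u1112\u1161\u11AB"d);
    assert(decomposeSyllable('a') == "a"d);
}
```

No table is needed for any of the 11172 syllables, which is presumably why
Hangul gets its own functions while everything else goes through tables.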

[...]
> >- It would also be nice if these functions can be made pure (I'm not
> >   sure I understand why checking the category of a character should
> >   be impure).
> 
> Global sophisticated immutable table. I think this can be pure but
> the compiler might have disagreed.

Hmm. This sounds like a bug in purity, maybe? It seems clear to me that
immutable values might as well be constants, and therefore using said
values should not make the code impure, even if they are implemented as
a global table. I think a bug should be filed for this.
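
A reduced test case might look something like the sketch below. The table
and function names are made up, and the table is a tiny stand-in for the
real generated ones; the point is only that a pure function reading a
module-level immutable array ought to be accepted.

```d
// Built at compile time via CTFE; a hypothetical stand-in for a
// generated Unicode property table.
bool[128] buildDigitTable()
{
    bool[128] t;
    foreach (i; 0 .. 10)
        t['0' + i] = true;
    return t;
}

immutable bool[128] digitTable = buildDigitTable();

// Reads only immutable global data, so it qualifies as pure. The
// range check makes out-of-table input return false rather than
// fail, so nothrow fits as well.
bool isAsciiDigit(dchar c) pure nothrow @safe
{
    return c < digitTable.length && digitTable[c];
}

unittest
{
    assert(isAsciiDigit('7'));
    assert(!isAsciiDigit('x'));
    assert(!isAsciiDigit('\U0010FFFF'));
}
```

If the compiler rejects the equivalent code against the real std.uni
tables, that reduced case is what should go into the bug report.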

> > The lack of nothrow I can understand, since the input
> >   character may be illegal or otherwise malformed.
> 
> nothrow is probably doable as any codepoint will do. (and if it's >
> dchar.max then it's a problem somewhere else).

True, an illegal character by definition does not belong to any
category, so all category functions should return false for it. No
exceptions need to be thrown.

> TBH I killed these qualifiers early on, as they prevented the thing
> from compiling. I can't recall if there is still a problem with any of
> these.
[...]

Before merging into Phobos, I think we should put back as many of the
qualifiers as possible, preferably everywhere they make sense, barring
compiler bugs.

T

-- 
The state pretends to pay us a salary, and we pretend to work.