Ready for review: new std.uni

Sun Jan 13 02:37:22 PST 2013

12-Jan-2013 01:33, H. S. Teoh wrote:
> On Sat, Jan 12, 2013 at 01:06:30AM +0400, Dmitry Olshansky wrote:
>> 12-Jan-2013 00:35, H. S. Teoh wrote:
>>> On Fri, Jan 11, 2013 at 11:31:11PM +0400, Dmitry Olshansky wrote:
>>> [...]
>>>> Anyway it's polished and ready for the good old collective
>>>> destruction called peer review. I'm looking for a review manager.
>>>>
>>>> The code, including extra tests and a benchmark is here:
>>>> https://github.com/blackwhale/gsoc-bench-2012
>>>>
>>>> And documentation:
>>>> http://blackwhale.github.com/phobos/uni.html
> [...]
>>>> 2) The commonly expected stuff in any modern Unicode-aware language:
>>>> normalization, grapheme decoding, composition/decomposition and
>>>> case-insensitive comparison*.
>>>>
>>>> 3) I've taken it as a crucial point to provide all of the tools used
>>>> to build Unicode algorithms from ground up to the end user.
>>>
>>> Are there enough tools to implement line-breaking algorithms (TR14)?
>>>
>>
>> Should be. AFAIK _word_ breaking seemed trivial to do, got to check
>> the line breaking.
>
> Line-breaking is in TR14. I looked over it. It's definitely more
> complicated than I expected it to be. Some alphabets may even require
> stylistic rules for line-breaking! (Not to mention hyphenation.) So it's
> probably too much to expect std.uni to do it. But at least the tools
> necessary to implement it should be there.
>

I gave it the practiced look of a tired Unicode implementor. It has a
lot of subtle points but should be doable.

I'm not about to implement it any time soon, but here are my notes on
how to do it:

1. Don't try to read all of the reasoning behind the algorithm listed
there (the reasons are good ones, but there are a ton of them). So at
first SKIP chapter 5, except maybe 5.3-5.4. The font problems listed
there are not important for getting it working.

2. Re-read the important chapters, that is 2, 6 and 7. The others can
pretty much be ignored until you start testing your implementation.
Chapter 7 is actually pretty nice.

3. The first implementation step is to find the character classes
(= sets) involved in the algorithm. These are listed in Table 1 and the
accompanying data file.

4. Process the file (see e.g. gen_uni.d) and create all of the sets.

5. The core part of the algorithm itself is, at the end of the day, not
that bad: section 6.1 lists all of the mandatory line breaking rules.
Then there is a bunch of tweakable ones that are a bit harder. Also see
chapter 7 for a (logically) pair-wise, table-driven solution (though the
rows/columns in the table mean "belongs to set X" rather than single
characters).
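
To make steps 3 to 5 a bit more concrete, here is a rough sketch of the
shape of it. It is in Python rather than D just to keep it short, and it
is NOT the TR14 algorithm: the sample data is a hand-written excerpt in
the style of the data file, the pair table covers only a handful of
classes, and the names (parse_classes, class_of, break_opportunities)
are made up for illustration.

```python
import bisect

# Hand-written excerpt in the "range;class # comment" style of the
# TR14 data file. The real file is far longer; this is an assumption
# made for illustration only.
SAMPLE = """\
# code point or range ; line break class
000A;LF # LINE FEED
0020;SP # SPACE
0030..0039;NU # DIGIT ZERO..DIGIT NINE
0041..005A;AL # LATIN CAPITAL LETTER A..Z
0061..007A;AL # LATIN SMALL LETTER A..Z
"""


def parse_classes(text):
    """Return a sorted list of (first, last, class_name) ranges."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        span, cls = (field.strip() for field in line.split(";")[:2])
        first, _, last = span.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), cls))
    ranges.sort()
    return ranges


def class_of(ranges, ch, default="XX"):
    """Look up the class of one character by binary search."""
    cp = ord(ch)
    # Index of the last range whose first code point is <= cp.
    i = bisect.bisect_right(ranges, (cp, 0x110000)) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default  # character not covered by the excerpt


# Toy pair table: class pairs between which a break is prohibited.
# Rows/columns are classes ("belongs to set X"), not characters.
PROHIBITED = {("AL", "AL"), ("AL", "NU"), ("NU", "AL"), ("NU", "NU")}


def break_opportunities(ranges, text):
    """Return indices i where a break is allowed before text[i]."""
    result = []
    for i in range(1, len(text)):
        before = class_of(ranges, text[i - 1])
        after = class_of(ranges, text[i])
        if before == "LF":
            result.append(i)          # always break after a line feed
        elif after in ("SP", "LF"):
            continue                  # never break before space or LF
        elif (before, after) in PROHIBITED:
            continue                  # table says "no break here"
        else:
            result.append(i)          # everything else: break allowed
    return result
```

With this toy setup, break_opportunities(parse_classes(SAMPLE),
"ab 12\ncd") gives [3, 6]: one opportunity after the space and one
after the line feed. A real implementation needs all of the classes
from Table 1 and the full rule set, of course.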

-- 
Dmitry Olshansky