Update #1 on new std.uni
H. S. Teoh
hsteoh at quickfur.ath.cx
Thu Jan 17 10:48:25 PST 2013
On Wed, Jan 16, 2013 at 02:48:30PM +0400, Dmitry Olshansky wrote:
> 11-Jan-2013 23:31, Dmitry Olshansky wrote:
> >
> >The code, including extra tests and a benchmark is here:
> >https://github.com/blackwhale/gsoc-bench-2012
> >
> >And documentation:
> >http://blackwhale.github.com/phobos/uni.html
> >
>
> First of all, @safe pure and nothrow is back. Let me know if
> something is still not.
>
> OK, I've made an extra pass through docs with these things in mind:
> - getting the introduction & terminology part right
> - more explanations and details where applicable
> (let me know if that's too much / too little / wrong)
> - hiding away the truly generic (and not easy to use) Trie from
> documentation
> - old deprecated stuff is hidden from docs to discourage its use
[...]
Looks much better now!
Some nitpicks:
- Under Overview:
[4th paragraph] "It's recognized that an application may need
further enhancements and extensions. It could be the need for
less commonly known algorithms or tailoring existing ones for
regional-specific needs. To help users with building any extra
functionality beyond the core primitives the module provides:"
The grammar nazi in me thinks a better wording might be (changes
delimited by {{}}):
"It's recognized that an application may need further
enhancements and extensions{{, such as}} less{{-}}commonly known
algorithms{{,}} or tailoring existing ones for
{{region}}-specific needs. To help users with building any extra
functionality beyond the core primitives{{,}} the module
provides:"
The second item in the subsequent list:
A way to construct optimal packed multi-stage tables also known
as a special case of Trie. {{The functions}} codepointTrie,
codepointSetTrie construct custom tries that map dchar to value.
The end result is {{a}} fast and predictable Ο(1) lookup that
powers functions like isAlpha {{and}} combiningClass{{,}} but
for user-defined data sets.
The last item in the list:
Access to the commonly{{-}}used predefined sets of code points.
The commonly{{-}}defined one{{s}} can be observed in the CLDR
utility, on {{the}} page property index. {{S}}upported ones
include Script, Block and General Category. See unicode for easy
{{}} compile-time checked queries.
- Under Terminology:
[[3rd paragraph]] "The minimal bit combination that can represent
a unit of encoded text for processing or interchange. Depending
on the encoding this could be: 8-bit code units in the UTF-8
(($D char)), [...]"
I think you transposed the "$(" here. :)
The last sentence in this section appears to be truncated. Maybe a
runaway DDoc macro somewhere earlier?
- Under Construction of lookup tables, the grammar nazi says:
[[1st sentence]] "{{The}} Unicode standard describes a set of
algorithms that {{}} depend on having {{the}} ability to quickly
look{{ }}up various properties of a code point. Given the {{}}
codespace of about 1 million code points, it is not a trivial
task to {{provide}} a space{{-}}efficient solution for the {{}}
multitude of properties."
[[2nd paragraph]] "[...] Hash-tables {{have}} enormous memory
footprint and binary search over intervals is not fast enough
for some heavy-duty algorithms."
[[3rd paragraph]] "{{(P }}The recommended solution (see Unicode
Implementation Guidelines) {{}} is using multi-stage tables{{,}}
that is{{,}} {{an instance}} of Trie with integer keys and {{a}}
fixed number of stages. For the {{remainder}} of {{this}}
section {{it will be}} called {{a}} fixed trie. The following
describes a particular implementation that is aimed {{at}} {{}}
speed of access at the expense of ideal size savings."
[[4th paragraph]] "[...] Split {{the}} number of bits in a key
(code point, 21 bits) {{into}} 2 components (e.g. {{13[?]}} and 8). The
first is the number of bits in the index of {{the}} trie and the
other is {{the}} number of bits {{in each}} page of {{the}}
trie. The layout of {{the}} trie is then an array of size
2^^bits-of-index followed {{by}} an array of memory chunks of size
2^^bits-of-page/size-of-element."
[[5th paragraph]] "[...] {{The}} slots of {{the}} index all have
to contain {{the same[?] number of pages}}. The lookup is then
just a couple of operations - slice {{the}} upper bits, {{then}}
look{{ }}up {{the}} index for these{{.}} The pseudo-code is:"
[[Following the code example]] "[...] Where if the elemsPerPage
is a power of 2 the whole process is a handful of simple
instructions and 2 array reads. {{Subsequent}} levels of
{{the}} trie are introduced by recursing {{}} this notion - the
index array is treated as values. The number of bits in {{the}}
index is then again split into 2 parts, with pages over
'current-index' and {{the}} new 'upper-index'."
[[Next paragraph]] "For completeness the level 1 trie is simply
an array. {{The}} current implementation takes advantage of
bit-packing values when the range is known to be limited in {{}}
advance (such as bool){{.}} {{S}}ee also BitPacked for enforcing
it manually. [...]"
[[Last paragraph]] "The process of construction of a trie is
more involved and is hidden from the user in {{the}} form of
{{convenience}} functions: codepointTrie, codepointSetTrie and
even more convenient toTrie. In general a set or built-in AA
with dchar type can be turned into a trie. The trie object in
this module is {{}} read-only (immutable){{;}} it's effectively
frozen after construction."
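For readers who want to see the two-stage layout from the quoted 4th
and 5th paragraphs in concrete form, here is a minimal sketch. It is
not the std.uni implementation: the names TwoStage, build and
isAsciiLower are hypothetical, the 13/8 split of the 21-bit key is
my assumption, and values are stored one per byte with no
bit-packing.

```d
import std.stdio : writeln;

// Assumed split of a 21-bit code point: 13 index bits + 8 page bits.
enum pageBits = 8;
enum indexBits = 21 - pageBits;
enum elemsPerPage = 1 << pageBits; // 256 values per page

// Hypothetical two-stage table, not a std.uni type.
struct TwoStage
{
    ushort[] index; // 2^^indexBits slots, each holding a page number
    ubyte[] pages;  // the unique pages, stored back to back

    // Slice the upper bits, read the index, then read the page.
    ubyte opIndex(dchar ch) const pure nothrow @safe
    {
        immutable page = index[ch >> pageBits];
        return pages[page * elemsPerPage + (ch & (elemsPerPage - 1))];
    }
}

// Fills the table from `value`, storing identical pages only once.
TwoStage build(ubyte function(dchar) value)
{
    TwoStage t;
    t.index = new ushort[1 << indexBits];
    ushort[immutable(ubyte)[]] seen; // page contents -> page number

    foreach (slot; 0 .. t.index.length)
    {
        ubyte[elemsPerPage] page; // zero-initialized on every pass
        foreach (off; 0 .. elemsPerPage)
        {
            immutable cp = slot * elemsPerPage + off;
            if (cp <= 0x10FFFF) // beyond the codespace stays 0
                page[off] = value(cast(dchar) cp);
        }

        auto key = page[].idup;
        if (auto known = key in seen)
        {
            t.index[slot] = *known; // reuse the existing page
        }
        else
        {
            immutable id = cast(ushort)(t.pages.length / elemsPerPage);
            seen[key] = id;
            t.pages ~= page[];
            t.index[slot] = id;
        }
    }
    return t;
}

// Example property used only for this demonstration.
ubyte isAsciiLower(dchar c)
{
    return cast(ubyte)(c >= 'a' && c <= 'z');
}

void main()
{
    auto table = build(&isAsciiLower);
    assert(table['q'] == 1);
    assert(table['Q'] == 0);
    assert(table['\U0001F600'] == 0);
    // One page holds a-z, every other slot shares the all-zero page.
    assert(table.pages.length == 2 * elemsPerPage);
    writeln(table.pages.length / elemsPerPage, " unique pages");
}
```

Because elemsPerPage is a power of 2, the lookup reduces to a shift,
a mask, a multiply and two array reads, which matches the "handful of
simple instructions" wording in the docs. Sharing identical pages is
where the space saving comes from in this sketch.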
The grammar nazi has run out of steam, so no more grammar nitpicks for
now. ;-) But there are still the following questions:
- Why is isControl() not pure nothrow?
- Why are the isX() functions @system? I would have expected they should
be at least @trusted? (Or are there technical problems / compiler bugs
preventing this?)
That's all for now. I hope you don't mind me allowing the grammar nazi
to take over for a bit. I want Phobos documentation to be professional
quality. :)
T
--
The trouble with TCP jokes is that it's like hearing the same joke over and over.