Proposed Phobos equivalent of wcswidth()
H. S. Teoh
hsteoh at quickfur.ath.cx
Sat Jan 13 17:26:52 UTC 2018
This past week, while reviewing Phobos PR #6008, I started experimenting
with an optimized D equivalent of wcswidth().
For more details, see:
https://issues.dlang.org/show_bug.cgi?id=7054
https://issues.dlang.org/show_bug.cgi?id=17810
as well as the discussion on:
https://github.com/dlang/phobos/pull/6008
Anyway, the TL;DR summary is this: given a format() spec like "%20s", in
order to insert the correct number of spaces to pad the string to 20
characters (or rather, 20 spaces in the output), we need to compute the
displayed length of the string in monospace font. Unfortunately, given
the complexities of Unicode, this is far from trivial:
- In C, the C library doesn't even pretend to know Unicode, so the
padding is just based on the number of bytes the string occupies.
Obviously, for anything non-ASCII the output will be wrong
(misaligned).
- In D, in the original naïve implementation, we try to be a little
smarter by counting the number of dchars. Unfortunately, this is also
wrong, because of combining diacritics like U+0301 which modify the
preceding character and do not advance the cursor.
- In Phobos PR #6008, we improved this to count grapheme clusters
instead. However, this is *still* wrong, because of the existence of
zero-width characters (don't you just love Unicode?!), and also
because of "wide" or "full-width" East Asian block characters as
specified by Unicode TR11 (and scarily enough, the new Emoji blocks
are included in this "wide" category), which on a text console
generally occupies 2 positions per grapheme rather than 1.
Eventually, the solution boils down to implementing the equivalent of
Posix wcswidth().
But a naïve implementation of this is extremely inefficient, because
segmenting a Unicode string by grapheme and *then* computing its width
is non-trivial. So inefficient that it's just too slow to use in
format(), especially if most strings you'd pass to format() are
ASCII-only or mostly ASCII.
Thankfully, std.uni provides (some of) the tools to optimize this. The
basic idea is this: we don't actually care to segment graphemes; all we
want to do is to know, given some string s, how many display positions
it will occupy, so that we can insert the right number of spaces. The
actual grapheme segmentation and typesetting is the terminal's job, and
none of format()'s business. So we can cut some corners while still
producing the right results.
Basically my current solution consists of:
- Parsing EastAsianWidth.txt published by the Unicode Consortium to
precompute a table of wide/full-width characters (W and F) -- this is
not done at runtime or compile-time, but as a separate step to
generate the source code of the table, since otherwise it's either too
slow at runtime or would slow down Phobos compilation too much, plus
it depends on an external file which is not practical;
- Combining this table with Unicode category Grapheme_extend, plus a
bunch of hand-coded zero-width characters to produce a mapping of
every dchar to 0, 1, or 2. All characters that extend a grapheme, like
a combining diacritic, maps to 0. All characters designated as Wide or
Full-width (excluding grapheme extenders) map to 2. Everything else
maps to 1.
- Compiling this table into a 3-level Trie (std.uni.Trie) for O(1)
runtime lookup per dchar.
- Computing the display width, then, is just a matter of iterating over
dchars in the string and summing the values looked up in the trie.
Of course, no matter how optimized a width lookup is, it's still pretty
slow for an ASCII-only string, which is 90% of the use cases of
format(). So to improve this common case, the additional optimization is
to scan the string for ASCII-only bytes, and just incrementing the width
since we know ASCII characters are always 1 column wide. Only when we
encounter a non-ASCII byte that we bother with UTF-8 decoding and the
table lookup.
Here's my current implementation:
https://github.com/quickfur/strwidth
Here's my current benchmark results:
- walkLength is literally passing the string to std.range.walkLength,
which is basically counting the number of code points in the string.
As mentioned before, this does not produce the correct width.
- byGraphemeWalk is the next step up, to count the number of graphemes
using std.uni.byGrapheme. Unfortunately, this is still not fully
correct.
- graphemeStrideWalk is a slight optimization of byGraphemeWalk, by not
actually decoding the grapheme, but just computing the stride. It also
has the virtue of being usable in CTFE. Performance-wise, it's not
that much different from byGraphemeWalk.
- width0 is the first "correct" string width computation, but with a
naïve, slow implementation. It serves as a baseline to compare the
next implementations.
- width1 is the trie-optimized version of width0. It shows significant
improvement over width0, but is still very slow for ASCII strings
compared with walkLength.
- width2 includes the optimization of skipping ASCII-only parts of the
string, bypassing decoding and trie lookup.
- width3 is a variant of width2 that also skips trie lookup for
characters below U+300, the first block that contains widths not equal
to 1 (combining diacritics). The benchmark results don't show
significant performance differences with width2, however.
[walkLength] (100000 iterations):
ASCII strings of 32 bytes: 24 ms, 483 μs, and 4 hnsecs
Unicode strings of 32 bytes: 16 ms, 829 μs, and 2 hnsecs
ASCII strings of 128 bytes: 94 ms, 538 μs, and 9 hnsecs
Unicode strings of 128 bytes: 59 ms, 395 μs, and 6 hnsecs
ASCII strings of 1024 bytes: 672 ms, 73 μs, and 7 hnsecs
Unicode strings of 1024 bytes: 376 ms, 634 μs, and 7 hnsecs
[byGraphemeWalk] (100000 iterations):
ASCII strings of 32 bytes: 953 ms, 353 μs, and 5 hnsecs
Unicode strings of 32 bytes: 353 ms, 934 μs, and 2 hnsecs
ASCII strings of 128 bytes: 3 secs, 862 ms, 747 μs, and 7 hnsecs
Unicode strings of 128 bytes: 1 sec, 302 ms, 128 μs, and 6 hnsecs
ASCII strings of 1024 bytes: 30 secs, 993 ms, 404 μs, and 1 hnsec
Unicode strings of 1024 bytes: 9 secs, 900 ms, 579 μs, and 1 hnsec
[graphemeStrideWalk] (100000 iterations):
ASCII strings of 32 bytes: 816 ms, 190 μs, and 9 hnsecs
Unicode strings of 32 bytes: 307 ms, 141 μs, and 9 hnsecs
ASCII strings of 128 bytes: 3 secs, 317 ms, 661 μs, and 7 hnsecs
Unicode strings of 128 bytes: 1 sec, 138 ms, 172 μs, and 7 hnsecs
ASCII strings of 1024 bytes: 26 secs, 903 ms, and 898 μs
Unicode strings of 1024 bytes: 8 secs, 804 ms, 48 μs, and 1 hnsec
[width0] (100000 iterations):
ASCII strings of 32 bytes: 1 sec, 62 ms, 163 μs, and 5 hnsecs
Unicode strings of 32 bytes: 397 ms and 517 μs
ASCII strings of 128 bytes: 4 secs, 228 ms, 269 μs, and 2 hnsecs
Unicode strings of 128 bytes: 1 sec, 448 ms, 926 μs, and 5 hnsecs
ASCII strings of 1024 bytes: 33 secs, 797 ms, 690 μs, and 5 hnsecs
Unicode strings of 1024 bytes: 11 secs, 206 ms, 915 μs, and 8 hnsecs
[width1] (100000 iterations):
ASCII strings of 32 bytes: 173 ms, 509 μs, and 8 hnsecs
Unicode strings of 32 bytes: 80 ms, 730 μs, and 9 hnsecs
ASCII strings of 128 bytes: 692 ms and 117 μs
Unicode strings of 128 bytes: 296 ms, 89 μs, and 1 hnsec
ASCII strings of 1024 bytes: 5 secs, 486 ms, 328 μs, and 3 hnsecs
Unicode strings of 1024 bytes: 2 secs, 208 ms, 846 μs, and 9 hnsecs
[width2] (100000 iterations):
ASCII strings of 32 bytes: 12 ms, 791 μs, and 9 hnsecs
Unicode strings of 32 bytes: 69 ms, 51 μs, and 5 hnsecs
ASCII strings of 128 bytes: 50 ms, 643 μs, and 3 hnsecs
Unicode strings of 128 bytes: 283 ms, 572 μs, and 3 hnsecs
ASCII strings of 1024 bytes: 319 ms, 527 μs, and 4 hnsecs
Unicode strings of 1024 bytes: 2 secs, 200 ms, 473 μs, and 3 hnsecs
[width3] (100000 iterations):
ASCII strings of 32 bytes: 12 ms, 927 μs, and 5 hnsecs
Unicode strings of 32 bytes: 67 ms, 952 μs, and 9 hnsecs
ASCII strings of 128 bytes: 50 ms, 322 μs, and 7 hnsecs
Unicode strings of 128 bytes: 283 ms, 628 μs, and 6 hnsecs
ASCII strings of 1024 bytes: 331 ms, 921 μs, and 9 hnsecs
Unicode strings of 1024 bytes: 2 secs, 239 ms, 415 μs, and 5 hnsecs
Given these results, it appears that width2 is probably the best way to
go.
And now the obligatory bikeshed: what should the Phobos equivalent of
wcswidth be called? The current plan is just width(), but the name is
too simplistic and too likely to collide with user-defined identifiers.
Any suggestions? Bring on the rainbow bikeshed! :-P
T
--
Without geometry, life would be pointless. -- VS
More information about the Digitalmars-d
mailing list