Proposed Phobos equivalent of wcswidth()

Sat Jan 13 17:26:52 UTC 2018

This past week, while reviewing Phobos PR #6008, I started experimenting
with an optimized D equivalent of wcswidth().

For more details, see:

	https://issues.dlang.org/show_bug.cgi?id=7054
	https://issues.dlang.org/show_bug.cgi?id=17810

as well as the discussion on:

	https://github.com/dlang/phobos/pull/6008

Anyway, the TL;DR summary is this: given a format() spec like "%20s", in
order to insert the correct number of spaces to pad the string to 20
characters (or rather, 20 spaces in the output), we need to compute the
displayed length of the string in monospace font.  Unfortunately, given
the complexities of Unicode, this is far from trivial:

- In C, the C library doesn't even pretend to know Unicode, so the
  padding is just based on the number of bytes the string occupies.
  Obviously, for anything non-ASCII the output will be wrong
  (misaligned).

- In D, in the original naïve implementation, we try to be a little
  smarter by counting the number of dchars. Unfortunately, this is also
  wrong, because of combining diacritics like U+0301 which modify the
  preceding character and do not advance the cursor. 

- In Phobos PR #6008, we improved this to count grapheme clusters
  instead. However, this is *still* wrong, because of the existence of
  zero-width characters (don't you just love Unicode?!), and also
  because of "wide" or "full-width" East Asian block characters as
  specified by Unicode TR11 (and scarily enough, the new Emoji blocks
  are included in this "wide" category), which on a text console
  generally occupies 2 positions per grapheme rather than 1.

Eventually, the solution boils down to implementing the equivalent of
Posix wcswidth().

But a naïve implementation of this is extremely inefficient, because
segmenting a Unicode string by grapheme and *then* computing its width
is non-trivial. So inefficient that it's just too slow to use in
format(), especially if most strings you'd pass to format() are
ASCII-only or mostly ASCII.

Thankfully, std.uni provides (some of) the tools to optimize this. The
basic idea is this: we don't actually care to segment graphemes; all we
want to do is to know, given some string s, how many display positions
it will occupy, so that we can insert the right number of spaces. The
actual grapheme segmentation and typesetting is the terminal's job, and
none of format()'s business.  So we can cut some corners while still
producing the right results.

Basically my current solution consists of:

- Parsing EastAsianWidth.txt published by the Unicode Consortium to
  precompute a table of wide/full-width characters (W and F) -- this is
  not done at runtime or compile-time, but as a separate step to
  generate the source code of the table, since otherwise it's either too
  slow at runtime or would slow down Phobos compilation too much, plus
  it depends on an external file which is not practical;

- Combining this table with Unicode category Grapheme_extend, plus a
  bunch of hand-coded zero-width characters to produce a mapping of
  every dchar to 0, 1, or 2. All characters that extend a grapheme, like
  a combining diacritic, maps to 0. All characters designated as Wide or
  Full-width (excluding grapheme extenders) map to 2. Everything else
  maps to 1.

- Compiling this table into a 3-level Trie (std.uni.Trie) for O(1)
  runtime lookup per dchar.

- Computing the display width, then, is just a matter of iterating over
  dchars in the string and summing the values looked up in the trie.

Of course, no matter how optimized a width lookup is, it's still pretty
slow for an ASCII-only string, which is 90% of the use cases of
format(). So to improve this common case, the additional optimization is
to scan the string for ASCII-only bytes, and just incrementing the width
since we know ASCII characters are always 1 column wide.  Only when we
encounter a non-ASCII byte that we bother with UTF-8 decoding and the
table lookup.

Here's my current implementation:

	https://github.com/quickfur/strwidth

Here's my current benchmark results:

- walkLength is literally passing the string to std.range.walkLength,
  which is basically counting the number of code points in the string.
  As mentioned before, this does not produce the correct width.

- byGraphemeWalk is the next step up, to count the number of graphemes
  using std.uni.byGrapheme.  Unfortunately, this is still not fully
  correct.

- graphemeStrideWalk is a slight optimization of byGraphemeWalk, by not
  actually decoding the grapheme, but just computing the stride. It also
  has the virtue of being usable in CTFE. Performance-wise, it's not
  that much different from byGraphemeWalk.

- width0 is the first "correct" string width computation, but with a
  naïve, slow implementation. It serves as a baseline to compare the
  next implementations.

- width1 is the trie-optimized version of width0. It shows significant
  improvement over width0, but is still very slow for ASCII strings
  compared with walkLength.

- width2 includes the optimization of skipping ASCII-only parts of the
  string, bypassing decoding and trie lookup.

- width3 is a variant of width2 that also skips trie lookup for
  characters below U+300, the first block that contains widths not equal
  to 1 (combining diacritics).  The benchmark results don't show
  significant performance differences with width2, however.

[walkLength] (100000 iterations):
	ASCII strings of 32 bytes:	24 ms, 483 μs, and 4 hnsecs
	Unicode strings of 32 bytes:	16 ms, 829 μs, and 2 hnsecs
	ASCII strings of 128 bytes:	94 ms, 538 μs, and 9 hnsecs
	Unicode strings of 128 bytes:	59 ms, 395 μs, and 6 hnsecs
	ASCII strings of 1024 bytes:	672 ms, 73 μs, and 7 hnsecs
	Unicode strings of 1024 bytes:	376 ms, 634 μs, and 7 hnsecs
[byGraphemeWalk] (100000 iterations):
	ASCII strings of 32 bytes:	953 ms, 353 μs, and 5 hnsecs
	Unicode strings of 32 bytes:	353 ms, 934 μs, and 2 hnsecs
	ASCII strings of 128 bytes:	3 secs, 862 ms, 747 μs, and 7 hnsecs
	Unicode strings of 128 bytes:	1 sec, 302 ms, 128 μs, and 6 hnsecs
	ASCII strings of 1024 bytes:	30 secs, 993 ms, 404 μs, and 1 hnsec
	Unicode strings of 1024 bytes:	9 secs, 900 ms, 579 μs, and 1 hnsec
[graphemeStrideWalk] (100000 iterations):
	ASCII strings of 32 bytes:	816 ms, 190 μs, and 9 hnsecs
	Unicode strings of 32 bytes:	307 ms, 141 μs, and 9 hnsecs
	ASCII strings of 128 bytes:	3 secs, 317 ms, 661 μs, and 7 hnsecs
	Unicode strings of 128 bytes:	1 sec, 138 ms, 172 μs, and 7 hnsecs
	ASCII strings of 1024 bytes:	26 secs, 903 ms, and 898 μs
	Unicode strings of 1024 bytes:	8 secs, 804 ms, 48 μs, and 1 hnsec
[width0] (100000 iterations):
	ASCII strings of 32 bytes:	1 sec, 62 ms, 163 μs, and 5 hnsecs
	Unicode strings of 32 bytes:	397 ms and 517 μs
	ASCII strings of 128 bytes:	4 secs, 228 ms, 269 μs, and 2 hnsecs
	Unicode strings of 128 bytes:	1 sec, 448 ms, 926 μs, and 5 hnsecs
	ASCII strings of 1024 bytes:	33 secs, 797 ms, 690 μs, and 5 hnsecs
	Unicode strings of 1024 bytes:	11 secs, 206 ms, 915 μs, and 8 hnsecs
[width1] (100000 iterations):
	ASCII strings of 32 bytes:	173 ms, 509 μs, and 8 hnsecs
	Unicode strings of 32 bytes:	80 ms, 730 μs, and 9 hnsecs
	ASCII strings of 128 bytes:	692 ms and 117 μs
	Unicode strings of 128 bytes:	296 ms, 89 μs, and 1 hnsec
	ASCII strings of 1024 bytes:	5 secs, 486 ms, 328 μs, and 3 hnsecs
	Unicode strings of 1024 bytes:	2 secs, 208 ms, 846 μs, and 9 hnsecs
[width2] (100000 iterations):
	ASCII strings of 32 bytes:	12 ms, 791 μs, and 9 hnsecs
	Unicode strings of 32 bytes:	69 ms, 51 μs, and 5 hnsecs
	ASCII strings of 128 bytes:	50 ms, 643 μs, and 3 hnsecs
	Unicode strings of 128 bytes:	283 ms, 572 μs, and 3 hnsecs
	ASCII strings of 1024 bytes:	319 ms, 527 μs, and 4 hnsecs
	Unicode strings of 1024 bytes:	2 secs, 200 ms, 473 μs, and 3 hnsecs
[width3] (100000 iterations):
	ASCII strings of 32 bytes:	12 ms, 927 μs, and 5 hnsecs
	Unicode strings of 32 bytes:	67 ms, 952 μs, and 9 hnsecs
	ASCII strings of 128 bytes:	50 ms, 322 μs, and 7 hnsecs
	Unicode strings of 128 bytes:	283 ms, 628 μs, and 6 hnsecs
	ASCII strings of 1024 bytes:	331 ms, 921 μs, and 9 hnsecs
	Unicode strings of 1024 bytes:	2 secs, 239 ms, 415 μs, and 5 hnsecs

Given these results, it appears that width2 is probably the best way to
go.

And now the obligatory bikeshed: what should the Phobos equivalent of
wcswidth be called? The current plan is just width(), but the name is
too simplistic and too likely to collide with user-defined identifiers.
Any suggestions?  Bring on the rainbow bikeshed! :-P

T

-- 
Without geometry, life would be pointless. -- VS