Fix Phobos dependencies on autodecoding

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Aug 14 17:12:00 UTC 2019


On Wed, Aug 14, 2019 at 07:15:54AM +0000, Argolis via Digitalmars-d wrote:
> On Tuesday, 13 August 2019 at 16:18:03 UTC, H. S. Teoh wrote:
> 
> > But we can't make that the default because it's a big performance
> > hit, and many string algorithms don't actually need grapheme
> > segmentation.
> 
> Can you provide example of algorithms and use cases that don't need
> grapheme segmentation?

Most cases of string processing involve one of the following (a short
sketch in D follows the list):
- Taking substrings: does not need grapheme segmentation; you just slice
  the string.
- Copying one string to another: does not need grapheme segmentation,
  you just use memcpy (or equivalent).
- Concatenating n strings: does not need grapheme segmentation, you just
  use memcpy (or equivalent).  In D, you just use array append, or
  std.array.appender if you get fancy.
- Comparing one string to another: does not need grapheme segmentation;
  you either use strcmp/memcmp, or, if you need more delicate semantics,
  call one of the standard Unicode string collation algorithms (see
  std.uni).  Either way, your code does not need to worry about grapheme
  segmentation; besides, the Unicode collation algorithms operate at the
  code point level, not at the grapheme level.
- Matching a substring: does not need grapheme segmentation; most
  applications just need subarray matching, i.e., treat the substring as
  an opaque blob of bytes, and match it against the target.  If you need
  more delicate semantics, there are standard Unicode algorithms for
  substring matching (i.e., user code does not need to worry about the
  low-level details -- the inputs are basically opaque Unicode strings
  whose internal structure is unimportant).
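
To make this concrete, here's a minimal sketch in D of the cases above
(the string literals are arbitrary; the point is that none of these
operations ever decodes a code point, let alone segments graphemes):

    import std.array : appender;
    import std.string : indexOf;

    void main()
    {
        string s = "héllo, wörld";

        // Taking a substring: just slice the array.  ("héllo" is six
        // UTF-8 code units, because 'é' occupies two bytes.)
        string sub = s[0 .. 6];
        assert(sub == "héllo");

        // Concatenating: array append treats the operands as opaque
        // bytes; == likewise compares code unit by code unit, which
        // for valid UTF-8 is the same as comparing the raw bytes.
        string joined = sub ~ ", wörld";
        assert(joined == s);

        // Or std.array.appender, if you get fancy.
        auto buf = appender!string();
        buf.put(sub);
        buf.put(", wörld");
        assert(buf.data == s);

        // Matching a substring: plain subarray search over the code
        // units, treating the needle as an opaque blob of bytes.
        assert(s.indexOf("wörld") >= 0);
    }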

You really only need grapheme segmentation when:
- Implementing a text layout algorithm where you need to render glyphs
  to some canvas.  Usually, this is already taken care of by the GUI
  framework or the terminal emulator, so user code rarely has to worry
  about this.
- Measuring the size of some piece of text for output alignment
  purposes: in this case, grapheme segmentation isn't enough; you need
  font size information and other such details (like kerning, spacing
  parameters, etc.). Usually, you wouldn't write this yourself, but use
  a text rendering library.  So most user code doesn't actually have to
  worry about this.  (Note that iterating by graphemes does NOT give you
  the correct value for width even with a fixed-width font in a text
  mode terminal emulator, because Unicode has double-width characters,
  which occupy two cells each, and also zero-width characters, which
  count as distinct (empty) graphemes but occupy no space.)
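
That parenthetical note is easy to demonstrate with std.uni's
byGrapheme (a sketch; the column counts follow the usual terminal
rendering rules for East Asian wide characters and combining marks):

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        // "日本語" is three graphemes, but a text-mode terminal
        // renders it in *six* cells: each character is double-width.
        assert("日本語".byGrapheme.walkLength == 3);

        // 'e' plus a combining acute accent is a single grapheme
        // occupying a single cell, yet it is two code points and
        // three UTF-8 code units; no simple count gives you "width".
        assert("e\u0301".byGrapheme.walkLength == 1);
    }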


As an aside, the way most string processing is done in C/C++ (iterating
over characters) is actually wrong w.r.t. Unicode, because it's really
only reliable for ASCII inputs.  For "real" Unicode strings, you can't
get away with the character-by-character approach, even if you use
grapheme segmentation: in some writing systems, like Arabic, breaking
up a string this way can cause incorrect behaviour, such as broken
ligatures, which may not be intended.  For these sorts of operations
the application really needs to use the standard Unicode algorithms,
which depend on the *purpose* of the function, not on the mechanics of
iterating over characters: finding suitable line breaks, finding
suitable hyphenation points, etc.  There's a reason the Unicode
Consortium defines standard algorithms for these operations: naïvely
iterating over graphemes, in general, does *not* yield the correct
results in all cases.

Ultimately, the whole point behind removing autodecoding is to put the
onus on the user code to decide what kind of iteration it wants: code
units, code points, or graphemes. (Or just use one of the standard
algorithms and don't reinvent the square wheel.)
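
In code, that explicit choice looks something like this (byCodeUnit and
byDchar live in std.utf, byGrapheme in std.uni):

    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byDchar;

    void main()
    {
        string s = "a\u0301"; // 'a' followed by a combining acute

        assert(s.byCodeUnit.walkLength == 3); // UTF-8 code units
        assert(s.byDchar.walkLength    == 2); // code points
        assert(s.byGrapheme.walkLength == 1); // user-perceived chars
    }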


> Are they really SO common that the correct default is go for code
> points?

The whole point behind removing autodecoding is to stop defaulting to
code points, which is what we currently do.  We want to put the choice
in the user's hands, not silently default to iteration by code point
under an illusion of correctness that breaks down for non-trivial
inputs.
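
You can see the current default in action below: range-based code
silently sees code points, which is neither the storage size nor what
the user perceives:

    import std.range : walkLength;

    void main()
    {
        string s = "a\u0301"; // 'a' + combining acute accent

        assert(s.length == 3);     // .length counts UTF-8 code units
        assert(s.walkLength == 2); // range iteration autodecodes to
                                   // code points: not 3, and not the
                                   // single grapheme the user sees
    }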


> Is it not better to have as a default the grapheme segmentation, the
> correct way of handling a string, instead?

Grapheme segmentation is very complex, and therefore very slow.  Most
string processing doesn't actually need grapheme segmentation.  Making
it the default would mean D string processing would be excruciatingly
slow by default, and all that extra work would be mostly for nothing,
because most of the time we don't need it anyway.

Not to mention that naïve iteration over graphemes often does *not*
yield what one might think is the correct result.  For example,
measuring the width of a piece of text in a fixed-width font in a
text-mode terminal by counting graphemes is wrong, due to double-width
and zero-width characters.


T

-- 
The most powerful one-line C program: #include "/dev/tty" -- IOCCC

