char array weirdness

H. S. Teoh via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Mon Mar 28 16:29:50 PDT 2016


On Mon, Mar 28, 2016 at 04:07:22PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
[...]
> The range API considers all strings to have an element type of dchar.
> char, wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32
> respectively. One or more code units make up a code point, which is
> actually something displayable but not necessarily what you'd call a
> character (e.g.  it could be an accent). One or more code points then
> make up a grapheme, which is really what a displayable character is.
> When Andrei designed the range API, he didn't know about graphemes -
> just code units and code points, so he thought that code points were
> guaranteed to be full characters and decided that that's what we'd
> operate on for correctness' sake.
[...]
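To make the unit/point/grapheme distinction concrete, here's a small
sketch in Python (which exposes strings as sequences of code points;
the example string is mine, not from the quoted post):

```python
# "é" spelled as 'e' + U+0301 COMBINING ACUTE ACCENT: one grapheme,
# two code points, and a varying number of code units per encoding.
s = "e\u0301"
print(len(s))                           # 2 code points
print(len(s.encode("utf-8")))           # 3 UTF-8 code units ('e' is 1 byte, U+0301 is 2)
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
```

D's char/wchar/dchar correspond to the UTF-8/UTF-16/UTF-32 code units
counted here; the grapheme level sits above all three.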

Unfortunately, the fact that the default is *not* to use graphemes makes
working with non-European-language strings pretty much just as ugly and
error-prone as working with bare chars in European-language strings.

You gave the example of filter() returning wrong results when used with
a range of chars (if we didn't have autodecoding), but the same can be
said of using filter() *with* autodecoding on a string that contains
combining diacritics: your diacritics may get randomly reattached to
stuff they weren't originally attached to, or you may end up with wrong
sequences of Unicode code points (e.g. diacritics not attached to any
grapheme). Using filter() on Korean text, even with autodecoding, will
pretty much produce garbage. And so on.
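For the record, here's a minimal sketch of that failure mode in Python
(which, like autodecoding, iterates by code point; the word is my own
made-up example):

```python
s = "ame\u0301liore"   # "améliore" with a combining acute on the 'e'
# Dropping all 'e' code points strands the combining accent: it now
# attaches to the preceding 'm', a character it was never part of.
t = "".join(c for c in s if c != "e")
print(t)               # 'am\u0301lior' -- the accent sits on 'm' now
```

Filtering by code point cannot see that U+0301 belongs with the 'e'
before it; only a grapheme-aware filter could keep them together.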

So in short, we're paying a performance cost for something that's only
arguably better but still not quite there, and this cost is attached to
almost *everything* you do with strings, regardless of whether you
actually need it (e.g., when you know you're dealing with pure ASCII
data).  Even when
dealing with non-ASCII Unicode data, in many cases autodecoding
introduces a constant (and unnecessary!) overhead.  E.g., searching for
a non-ASCII character is equivalent to a substring search on the encoded
form of the character, and there is no good reason why Phobos couldn't
have done this instead of autodecoding every character while scanning
the string.  Regexes on Unicode strings could possibly be faster if the
regex engine internally converted literals in the regex into their
equivalent encoded forms and did the scanning without decoding. (IIRC
Dmitry did remark in some PR some time ago, to the effect that the regex
engine has been optimized to the point where the cost of autodecoding is
becoming visible, and the next step might be to bypass autodecoding.)
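The substring-search idea is easy to sketch in Python: encode the needle
once, then do a plain byte search, relying on UTF-8's self-synchronizing
property (a multi-byte sequence can never match in the middle of another
character's encoding). The strings are my own examples, not anything from
Phobos:

```python
haystack = "Blåbærgrød".encode("utf-8")
needle = "æ".encode("utf-8")     # b'\xc3\xa6', two code units
# A byte-level search finds the character without decoding a single
# code point of the haystack.
pos = haystack.find(needle)      # byte offset into the encoded form
print(pos)                       # 5
```

The same trick applies to literal fragments inside a regex: encode them
up front and scan code units directly.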

I argue that auto-decoding, as currently implemented, is a net minus,
even though I realize this is unlikely to change in this lifetime. It
charges a constant performance overhead yet still does not guarantee
things will behave as the user would expect (i.e., treat the string as
graphemes rather than code points).
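To illustrate that gap, a Python sketch (my own example; NFC
normalization only narrows the gap, it doesn't close it):

```python
import unicodedata

s = "n\u0303ji"   # "ñji" with "ñ" spelled as 'n' + U+0303 COMBINING TILDE
print(len(s))     # 4 code points, but only 3 user-visible characters
print(len(unicodedata.normalize("NFC", s)))  # 3 after composing into U+00F1
# NFC only helps when a precomposed code point exists; for many scripts
# and combining sequences it doesn't, so a code-point view still misleads.
```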


T

-- 
We are in class, we are supposed to be learning, we have a teacher... Is it too much that I expect him to teach me??? -- RL