char array weirdness

Tue Mar 29 02:32:49 PDT 2016

On Monday, March 28, 2016 16:29:50 H. S. Teoh via Digitalmars-d-learn wrote:
> On Mon, Mar 28, 2016 at 04:07:22PM -0700, Jonathan M Davis via
> Digitalmars-d-learn wrote: [...]
>
> > The range API considers all strings to have an element type of dchar.
> > char, wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32
> > respectively. One or more code units make up a code point, which is
> > actually something displayable but not necessarily what you'd call a
> > character (e.g.  it could be an accent). One or more code points then
> > make up a grapheme, which is really what a displayable character is.
> > When Andrei designed the range API, he didn't know about graphemes -
> > just code units and code points, so he thought that code points were
> > guaranteed to be full characters and decided that that's what we'd
> > operate on for correctness' sake.
>
> [...]
>
> Unfortunately, the fact that the default is *not* to use graphemes makes
> working with non-European language strings pretty much just as ugly and
> error-prone as working with bare char's in European language strings.

Yeah. Operating at the code point level instead of the code unit level is
correct for more text than just operating at the code unit level (especially
if you're dealing with char rather than wchar), but ultimately, it's
definitely not correct, and there's plenty of text that will be processed
incorrectly as code points.

> I argue that auto-decoding, as currently implemented, is a net minus,
> even though I realize this is unlikely to change in this lifetime. It
> charges a constant performance overhead yet still does not guarantee
> things will behave as the user would expect (i.e., treat the string as
> graphemes rather than code points).

I totally agree, and I think that _most_ of the Phobos devs agree at this
point. It's Andrei that doesn't. But we have the twin problems of figuring
out how to convince him and how to deal with the fact that changing it would
break a lot of code. Unicoding is disgusting to deal with if you want to
deal with it correctly _and_ be efficient about it, but hiding it doesn't
really work.

I think that the first steps are to make it so that the algorithms in Phobos
will operate just fine on ranges of char and wchar in addition to dchar and
move towards making it irrelevant wherever we can. Some functions (like
filter) are going to have to be told what level to operate at and would be a
serious problem if/when we switched away from auto-decoding, but many others
(such as find) can be made not to care while still operating on Unicode
correctly. And if we can get the amount code impacted low enough (at least
as far as Phobos goes), then maybe we can find a way to switch away from
auto-decoding. Ultimately though, I fear that we're stuck with it and that
we'll just have to figure out how to make it work well for those who know
what they're doing while minimizing the performance impact of auto-decoding
on those who don't know what they're doing as much as we reasonably can.

- Jonathan M Davis