Dicebot on leaving D: It is anarchy driven development in all its glory.

Jonathan M Davis newsgroup.d at jmdavisprog.com
Thu Sep 6 17:27:16 UTC 2018


On Thursday, September 6, 2018 10:44:11 AM MDT H. S. Teoh via Digitalmars-d 
wrote:
> On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
> > On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> > > // D
> > > auto a = "á";
> > > auto b = "á";
> > > auto c = "\u200B";
> > > auto x = a ~ c ~ a;
> > > auto y = b ~ c ~ b;
> > >
> > > writeln(a.length); // 2 wtf
> > > writeln(b.length); // 3 wtf
> > > writeln(x.length); // 7 wtf
> > > writeln(y.length); // 9 wtf
>
> [...]
>
> This is an unfair comparison.  In the Swift version you used .count, but
> here you used .length, which is the length of the array, NOT the number
> of characters or whatever you expect it to be.  You should rather use
> .count and specify exactly what you want to count, e.g., byCodePoint or
> byGrapheme.
>
> I suspect the Swift version will give you unexpected results if you did
> something like compare "á" to "a\u0301", for example (which, in case it
> isn't obvious, are visually identical to each other, and as far as an
> end user is concerned, should only count as 1 grapheme).
>
> Not even normalization will help you if you have a string like
> "a\u301\u302": in that case, the *only* correct way to count the number
> of visual characters is byGrapheme, and I highly doubt Swift's .count
> will give you the correct answer in that case. (I expect that Swift's
> .count will count code points, as is the usual default in many
> languages, which is unfortunately wrong when you're thinking about
> visual characters, which are called graphemes in Unicode parlance.)
>
> And even in your given example, what should .count return when there's a
> zero-width character?  If you're counting the number of visual places
> taken by the string (e.g., you're trying to align output in a
> fixed-width terminal), then *both* versions of your code are wrong,
> because zero-width characters do not occupy any space when displayed. If
> you're counting the number of code points, though, e.g., to allocate the
> right buffer size to convert to dstring, then you want to count the
> zero-width character as 1 rather than 0.  And that's not to mention
> double-width characters, which should count as 2 if you're outputting to
> a fixed-width terminal.
>
> Again I say, you need to know how Unicode works. Otherwise you can
> easily deceive yourself to think that your code (both in D and in Swift
> and in any other language) is correct, when in fact it will fail
> miserably when it receives input that you didn't think of.  Unicode is
> NOT ASCII, and you CANNOT assume there's a 1-to-1 mapping between
> "characters" and display length. Or 1-to-1 mapping between any of the
> various concepts of string "length", in fact.
>
> In ASCII, array length == number of code points == number of graphemes
> == display width.
>
> In Unicode, array length != number of code points != number of graphemes
> != display width.
>
> Code written by anyone who does not understand this is WRONG, because
> you will inevitably end up using the wrong value for the wrong thing:
> e.g., array length for number of code points, or number of code points
> for display length. Not even .byGrapheme will save you here; you *need*
> to understand that zero-width and double-width characters exist, and
> what they imply for display width. You *need* to understand the
> difference between code points and graphemes.  There is no single
> default that will work in every case, because there are DIFFERENT
> CORRECT ANSWERS depending on what your code is trying to accomplish.
> Pretending that you can just brush all this detail under the rug of a
> single number is just deceiving yourself, and will inevitably result in
> wrong code that will fail to handle Unicode input correctly.

Indeed. And unfortunately, the net result is that a large percentage of the
string-processing code out there is going to be wrong, and I don't think
that there's any way around that, because Unicode is simply too complicated
for the average programmer to understand (sad as that may be) -
especially when most of them don't want to have to understand it.

Really, I'd say that there are only three options that even might be sane if
you really have the flexibility to design a proper solution:

1. Treat strings as ranges of code units by default.

2. Don't allow strings to be ranges, to be iterated, or indexed. They're
opaque types.

3. Treat strings as ranges of graphemes.

If strings are treated as ranges of code units by default (particularly if
they're UTF-8), you'll get failures very quickly if you're dealing with
non-ASCII text and screw up the Unicode handling. It's also by far the most
performant solution, and in many cases it's exactly the right thing to do.
Obviously, something like byCodePoint or byGrapheme would then be needed in
the cases where code points or graphemes are the appropriate level to
iterate at.
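To make the difference concrete, here's a small sketch using today's Phobos
names (std.utf.byCodeUnit, std.uni.byGrapheme) on a UTF-8 string holding 'a'
plus a combining acute accent:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    string s = "a\u0301"; // 'a' + combining acute accent; renders as "á"

    writeln(s.length);                // 3 - UTF-8 code units (bytes)
    writeln(s.byCodeUnit.walkLength); // 3 - explicitly iterating code units
    writeln(s.walkLength);            // 2 - auto-decoded code points (dchar)
    writeln(s.byGrapheme.walkLength); // 1 - one visual character
}
```

Four different answers from one string, each of them the right answer for
some task and the wrong answer for the others.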

If strings are opaque types (with ways to get ranges over code units, code
points, etc.), that mostly works in that it forces you to at least try to
understand Unicode well enough to make sane choices about how you iterate
over the string. However, it doesn't completely get away from the issue of
the default, because of ==. It would be a royal pain if == didn't work, and
if it does work, you then have the question of what it's comparing. Code
units? Code points? Graphemes? Assuming that the representation is always
the same encoding, comparing code points wouldn't make any sense, but you'd
still have the question of code units or graphemes. As such, I'm not sure
that an opaque type really makes the most sense (though it's suggested often
enough).
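As a rough sketch of where == bites (OpaqueString and its member names are
invented for illustration, not an actual proposal):

```d
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical opaque string: no range primitives and no indexing,
// so callers must pick an iteration level explicitly.
struct OpaqueString
{
    private immutable(char)[] data; // stored UTF-8 code units

    auto codeUnits()  const { return data.byCodeUnit; }
    auto codePoints() const { return data.byDchar; }
    auto graphemes()  const { return data.byGrapheme; }

    // The unavoidable question: what level does this compare at?
    // Comparing raw code units is fast, but then "á" != "a\u0301",
    // even though they're the same grapheme as far as the user cares.
    bool opEquals(const OpaqueString rhs) const
    {
        return data == rhs.data;
    }
}
```

The iteration problem goes away, but opEquals still has to pick a level,
so the opaque type only moves the default rather than eliminating it.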

If strings are treated as ranges of graphemes, then that should then be
correct for everything that doesn't care about the visual representation
(and thus doesn't care about the display width of characters), but it would
be highly inefficient to do most things at the grapheme level, and it would
likely have many of the same problems that we have with strings now, such as
not being random-access and not really working as output ranges.
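For instance, a grapheme range can't be random-access, because grapheme
boundaries have to be found by scanning the text; a quick sketch:

```d
import std.range.primitives : isRandomAccessRange;
import std.uni : byGrapheme;

void main()
{
    auto g = "résumé".byGrapheme;

    // Grapheme boundaries are only discoverable by scanning, so the
    // range can't offer O(1) indexing the way an array of code units can.
    static assert(!isRandomAccessRange!(typeof(g)));
}
```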

So, if we were doing things from scratch, and it were up to me, I would
basically go with what Walter originally tried to do and make strings be
arrays of code units but with them also being ranges of code units - thereby
avoiding all of the pain that we get with trying to claim that strings don't
have capabilities that they clearly do have (such as random-access or
length). And then of course, all of the appropriate helper functions would
be available for the different levels of Unicode handling. I think that this
is the solution that quite a few of us want - though some have expressed
interest in an opaque string type, and I think that that's the direction
that RCString (or whatever it's called) may be going.

Unfortunately, right now, it's not looking like we're going to be able to
implement what we'd like here because of the code breakage issues in
removing auto-decoding. RCString may very well end up doing the right thing,
and I know that Andrei wants to then encourage it to be the default string
for everyone to use (though we don't all agree with that idea), but we're
still stuck with auto-decoding with regular strings and having to worry
about it when writing generic code. _Maybe_ someone will be able to come up
with a sane solution for moving away from auto-decoding, but it's not
seeming likely at the moment.

Either way, what needs to be done first is making sure that Phobos in
general works with ranges of char, wchar, dchar, and graphemes rather than
assuming that all ranges of characters are ranges of dchar. Fortunately,
some work has been done towards that, but it's not yet true of Phobos in
general, and it needs to be. Once it is, then the impact of auto-decoding is
reduced in general, and with Phobos depending on it as little as possible,
it then makes it saner to discuss how we might remove auto-decoding. I'm not
at all convinced that it would make it possible to sanely remove it, but
until that work is done, we definitely can't remove it regardless. And
actually, until that work is done, the workarounds for auto-decoding (e.g.
byCodeUnit) don't work as well as they should. I've done some of that work
(as have some others), but I really should figure out how to get through
enough of my todo list that I can get more done towards that goal -
particularly since I don't think that anyone is actively working the
problem. For the most part, it's only been done when someone ran into a
problem with a specific function, whereas in reality, we need to be adding
the appropriate tests for all of the string-processing functions in Phobos
and then ensure that they pass those tests.
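A sketch of what such tests might look like - driving the same Phobos
function with each character width rather than assuming dchar (the choice of
function and inputs here is just illustrative):

```d
import std.algorithm.searching : canFind;
import std.utf : byCodeUnit;

void main()
{
    // The same function, exercised with ranges of char, wchar, and dchar,
    // none of which should be silently auto-decoded to dchar.
    assert("héllo".byCodeUnit.canFind('l'));   // range of char
    assert("héllo"w.byCodeUnit.canFind('l'));  // range of wchar
    assert("héllo"d.byCodeUnit.canFind('l'));  // range of dchar
}
```

A real test suite would also need to cover ranges of graphemes and
non-ASCII needles, but this is the basic shape of it.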

- Jonathan M Davis





