Dicebot on leaving D: It is anarchy driven development in all its glory.

H. S. Teoh hsteoh at quickfur.ath.cx
Thu Sep 6 16:44:11 UTC 2018


On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via Digitalmars-d wrote:
> On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
> > // D
> > auto a = "á";
> > auto b = "á";
> > auto c = "\u200B";
> > auto x = a ~ c ~ a;
> > auto y = b ~ c ~ b;
> > 
> > writeln(a.length); // 2 wtf
> > writeln(b.length); // 3 wtf
> > writeln(x.length); // 7 wtf
> > writeln(y.length); // 9 wtf
[...]

This is an unfair comparison.  In the Swift version you used .count, but
here you used .length, which is the length of the array in code units,
NOT the number of characters or whatever else you might expect it to be.
You should instead use .count and specify exactly what you want to
count, e.g., byCodePoint or byGrapheme.
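
To illustrate with D itself, here's a minimal sketch.  It assumes, as
the differing lengths in the quoted code suggest, that one literal is
the precomposed U+00E1 and the other the decomposed 'a' + U+0301, and
it uses walkLength from std.range to do the counting:

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        auto a = "\u00E1";   // precomposed á (U+00E1)
        auto b = "a\u0301";  // decomposed: 'a' + combining acute (U+0301)

        writeln(a.length);                 // 2  (UTF-8 code units)
        writeln(b.length);                 // 3  (code units again, hence the mismatch)
        writeln(a.walkLength);             // 1  (code points; strings decode to dchar)
        writeln(b.walkLength);             // 2  (code points)
        writeln(a.byGrapheme.walkLength);  // 1  (graphemes)
        writeln(b.byGrapheme.walkLength);  // 1  (graphemes: same visual character as a)
    }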

I suspect the Swift version will give you unexpected results if you do
something like compare "á" to "a\u0301" (which, in case it isn't
obvious, are visually identical to each other, and as far as an end
user is concerned, should count as a single grapheme).
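
As a rough sketch of that distinction in D (normalize from std.uni,
whose default form is NFC, checks canonical equivalence rather than
comparing grapheme by grapheme):

    import std.stdio : writeln;
    import std.uni : normalize;

    void main()
    {
        auto composed   = "\u00E1";   // á as a single code point
        auto decomposed = "a\u0301";  // 'a' followed by a combining acute

        writeln(composed == decomposed);                        // false: == compares code units
        writeln(normalize(composed) == normalize(decomposed));  // true: both normalize to U+00E1
    }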

Not even normalization will help you if you have a string like
"a\u0301\u0302": in that case, the *only* correct way to count the
number of visual characters is byGrapheme, and I highly doubt Swift's
.count will give you the correct answer there.  (I expect that Swift's
.count counts code points, as is the usual default in many languages,
which is unfortunately wrong when you're thinking about visual
characters, which are called graphemes in Unicode parlance.)
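
A quick sketch of that case in D; the counts in the comments are what I
expect, given that there is no precomposed code point for this
particular combination, so NFC cannot collapse it to a single code
point:

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme, normalize;

    void main()
    {
        auto s = "a\u0301\u0302";  // 'a' + combining acute + combining circumflex

        writeln(s.walkLength);             // 3  (code points)
        writeln(normalize(s).walkLength);  // 2  (NFC leaves U+00E1 followed by U+0302)
        writeln(s.byGrapheme.walkLength);  // 1  (a single visual character)
    }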

And even in your given example, what should .count return when there's a
zero-width character?  If you're counting the number of visual places
taken by the string (e.g., you're trying to align output in a
fixed-width terminal), then *both* versions of your code are wrong,
because zero-width characters do not occupy any space when displayed. If
you're counting the number of code points, though, e.g., to allocate the
right buffer size to convert to dstring, then you want to count the
zero-width character as 1 rather than 0.  And that's not to mention
double-width characters, which should count as 2 if you're outputting to
a fixed-width terminal.
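
As a small sketch of that buffer-sizing case in D (the string here is
my own stand-in, built from the precomposed á and U+200B):

    import std.conv : to;
    import std.range : walkLength;
    import std.stdio : writeln;

    void main()
    {
        auto x = "\u00E1\u200B\u00E1";  // á, zero-width space, á

        auto d = x.to!dstring;  // one dchar per code point
        writeln(d.length);      // 3  (the zero-width space still needs its own slot)
        writeln(x.walkLength);  // 3  (same count, straight from the UTF-8 string)

        // On a fixed-width terminal this string occupies only 2 columns, and a
        // double-width (East Asian wide) character would take 2 columns by itself.
    }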

Again I say, you need to know how Unicode works.  Otherwise you can
easily deceive yourself into thinking that your code (in D, in Swift,
or in any other language) is correct, when in fact it will fail
miserably when it receives input you didn't think of.  Unicode is NOT
ASCII, and you CANNOT assume there's a 1-to-1 mapping between
"characters" and display width.  Or, for that matter, a 1-to-1 mapping
between any of the various concepts of string "length".

In ASCII, array length == number of code points == number of graphemes
== display width.

In Unicode, array length != number of code points != number of graphemes
!= display width.

Code written by anyone who does not understand this is WRONG, because
you will inevitably end up using one value where another is required:
e.g., array length where the number of code points is needed, or the
number of code points where display width is needed.  Not even
.byGrapheme will save you here; you *need* to understand that
zero-width and double-width characters exist, and what they imply for
display width.  You *need* to understand the difference between code
points and graphemes.  There is no single default that will work in
every case, because there are DIFFERENT CORRECT ANSWERS depending on
what your code is trying to accomplish.  Pretending that you can brush
all this detail under the rug of a single number is just deceiving
yourself, and will inevitably result in wrong code that fails to handle
Unicode input correctly.


T

-- 
It's amazing how careful choice of punctuation can leave you hanging:

