Dicebot on leaving D: It is anarchy driven development in all its glory.

aliak something at something.com
Thu Sep 6 19:04:45 UTC 2018


On Thursday, 6 September 2018 at 16:44:11 UTC, H. S. Teoh wrote:
> On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via 
> Digitalmars-d wrote:
>> On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
>> > // D
>> > auto a = "á";
>> > auto b = "á";
>> > auto c = "\u200B";
>> > auto x = a ~ c ~ a;
>> > auto y = b ~ c ~ b;
>> > 
>> > writeln(a.length); // 2 wtf
>> > writeln(b.length); // 3 wtf
>> > writeln(x.length); // 7 wtf
>> > writeln(y.length); // 9 wtf
> [...]
>
> This is an unfair comparison.  In the Swift version you used 
> .count, but here you used .length, which is the length of the 
> array, NOT the number of characters or whatever you expect it 
> to be.  You should rather use .count and specify exactly what 
> you want to count, e.g., byCodePoint or byGrapheme.
>
> I suspect the Swift version will give you unexpected results if 
> you did something like compare "á" to "a\u301", for example 
> (which, in case it isn't obvious, are visually identical to 
> each other, and as far as an end user is concerned, should only 
> count as 1 grapheme).
>
> Not even normalization will help you if you have a string like 
> "a\u301\u302": in that case, the *only* correct way to count 
> the number of visual characters is byGrapheme, and I highly 
> doubt Swift's .count will give you the correct answer in that 
> case. (I expect that Swift's .count will count code points, as 
> is the usual default in many languages, which is unfortunately 
> wrong when you're thinking about visual characters, which are 
> called graphemes in Unicode parlance.)
>
> And even in your given example, what should .count return when 
> there's a zero-width character?  If you're counting the number 
> of visual places taken by the string (e.g., you're trying to 
> align output in a fixed-width terminal), then *both* versions 
> of your code are wrong, because zero-width characters do not 
> occupy any space when displayed. If you're counting the number 
> of code points, though, e.g., to allocate the right buffer size 
> to convert to dstring, then you want to count the zero-width 
> character as 1 rather than 0.  And that's not to mention 
> double-width characters, which should count as 2 if you're 
> outputting to a fixed-width terminal.
>
> Again I say, you need to know how Unicode works. Otherwise you 
> can easily deceive yourself to think that your code (both in D 
> and in Swift and in any other language) is correct, when in 
> fact it will fail miserably when it receives input that you 
> didn't think of.  Unicode is NOT ASCII, and you CANNOT assume 
> there's a 1-to-1 mapping between "characters" and display 
> length. Or 1-to-1 mapping between any of the various concepts 
> of string "length", in fact.
>
> In ASCII, array length == number of code points == number of 
> graphemes == display width.
>
> In Unicode, array length != number of code points != number of 
> graphemes != display width.
>
> Code written by anyone who does not understand this is WRONG, 
> because you will inevitably end up using the wrong value for 
> the wrong thing: e.g., array length for number of code points, 
> or number of code points for display length. Not even 
> .byGrapheme will save you here; you *need* to understand that 
> zero-width and double-width characters exist, and what they 
> imply for display width. You *need* to understand the 
> difference between code points and graphemes.  There is no 
> single default that will work in every case, because there are 
> DIFFERENT CORRECT ANSWERS depending on what your code is trying 
> to accomplish. Pretending that you can just brush all this 
> detail under the rug of a single number is just deceiving 
> yourself, and will inevitably result in wrong code that will 
> fail to handle Unicode input correctly.
>
>
> T
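
(In D terms, the byGrapheme/byCodePoint counting suggested above
looks roughly like this - a quick, untested sketch using std.uni
and std.range:)

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byCodePoint, byGrapheme;

void main()
{
    auto a = "\u00E1";  // precomposed á
    auto b = "a\u0301"; // 'a' + combining acute, visually identical
    writeln(a.length, " ", b.length);                 // 2 3 -- UTF-8 code units
    writeln(a.byCodePoint.walkLength, " ",
            b.byCodePoint.walkLength);                // 1 2 -- code points
    writeln(a.byGrapheme.walkLength, " ",
            b.byGrapheme.walkLength);                 // 1 1 -- graphemes
}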

It's a totally fair comparison. .count in Swift is the equivalent 
of .length in D: it's what you use to get the size of an array, 
etc. For strings they've effectively implemented what D calls 
string.length as string.byGrapheme.walkLength. So it's intuitively 
correct (and yes, slower). If you don't want that default, you can 
also specify which "view" over the characters you want. E.g.

let a = "á̂"
a.count                // 1 <-- Yes, exactly as expected (graphemes)
a.unicodeScalars.count // 3 (Unicode scalars, i.e. code points)
a.utf8.count           // 5 (UTF-8 code units)
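
Roughly the D equivalents of those three views, assuming the 
string above is 'a' plus two combining marks (which is what the 
counts suggest) - an untested sketch:

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    auto a = "a\u0301\u0302";             // the same "á̂" as above, decomposed
    assert(a.byGrapheme.walkLength == 1); // what Swift's .count gives you
    assert(a.walkLength == 3);            // code points, like .unicodeScalars.count
    assert(a.length == 5);                // UTF-8 code units, like .utf8.count
}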

I don't really see any issue with a zero-width character. If you 
want to deal with screen width (i.e. pixel space), that's not the 
same thing as how many characters are in a string. And it doesn't 
matter whether you go byGrapheme or byCodePoint or byCodeUnit, 
because none of those represents a single column on screen. A 
zero-width character is 0 *width*, but it's still *one* character. 
There's no .length/size/count in any language (that I've heard of) 
that will give you screen space from its string type. You query 
the font API for that, since it depends on font size, kerning, 
style and face.
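
For example (untested sketch; U+200B is the zero-width space):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    auto s = "a\u200Bb";                  // 'a', zero-width space, 'b'
    assert(s.byGrapheme.walkLength == 3); // zero *width*, but still three characters
    assert(s.length == 5);                // the zero-width space alone is 3 UTF-8 code units
}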

And again, I agree you need to know how Unicode works. I don't 
dispute that at all. I'm just saying that having the default be 
incorrect for application logic is silly, and when people have to 
do things like string.representation.normalize.byGrapheme or 
whatever just to search for a character in a string *correctly* 
... well, just, ARGH!
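
Here's the kind of thing I mean, as a rough, untested sketch 
(searching for a precomposed "é" in a string that uses the 
decomposed form):

import std.algorithm : canFind;
import std.uni : normalize;

void main()
{
    auto haystack = "cafe\u0301"; // "café" written with a decomposed é
    auto needle = "\u00E9";       // precomposed é
    assert(!haystack.canFind(needle));                    // naive search misses it
    assert(haystack.normalize.canFind(needle.normalize)); // normalize to NFC first and it's found
}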

D makes the code-point case the default, and hence that becomes 
the simplest to use. But unfortunately, the only thing I can think 
of that actually requires a code-point representation is 
implementing the Unicode algorithms themselves (normalization, 
etc.). Here's a good read on code points: 
https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/

tl;dr: application logic does not need or want to deal with code 
points. For speed, code units work; for correctness, graphemes 
work.
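
One concrete example of the kind of thing that goes wrong (a 
rough, untested sketch): reversing a string by code point tears a 
combining mark off its base character; reversing by grapheme 
wouldn't.

import std.conv : to;
import std.range : retro;
import std.stdio : writeln;

void main()
{
    auto s = "ba\u0301d";       // "bád", with the accent as a separate combining mark
    writeln(s.retro.to!string); // prints "d́ab" -- reversing by code point moves the
                                // accent onto the 'd'; byGrapheme would keep them together
}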

Yes, you will fail miserably when you receive input you did not 
expect. That's always true. That's why we have APIs that make 
failure more or less likely. Expecting people to be Unicode 
experts before using Unicode is also unreasonable - it just makes 
it much easier to fail. I sit next to one of the guys who worked 
on Unicode in Qt, and he couldn't explain the difference between a 
grapheme and an extended grapheme cluster... I'm not saying I can, 
btw... I'm just saying Unicode is frikkin hard. And we don't need 
APIs making it harder to get right - which is exactly what 
non-correct-by-default APIs do.

To boil it down to one sentence: I think it's silly to have a 
string type that is advertised as Unicode but optimized for 
latin1-ish text, because people will use it for Unicode and get 
incorrect results with its naturally intuitive usage.

Cheers,
- Ali


