Dicebot on leaving D: It is anarchy driven development in all its glory.

Joakim dlang at joakim.fea.st
Thu Sep 6 17:19:01 UTC 2018


On Thursday, 6 September 2018 at 16:44:11 UTC, H. S. Teoh wrote:
> On Thu, Sep 06, 2018 at 02:42:58PM +0000, Dukc via 
> Digitalmars-d wrote:
>> On Thursday, 6 September 2018 at 14:17:28 UTC, aliak wrote:
>> > // D
>> > auto a = "á";
>> > auto b = "á";
>> > auto c = "\u200B";
>> > auto x = a ~ c ~ a;
>> > auto y = b ~ c ~ b;
>> > 
>> > writeln(a.length); // 2 wtf
>> > writeln(b.length); // 3 wtf
>> > writeln(x.length); // 7 wtf
>> > writeln(y.length); // 9 wtf
> [...]
>
> This is an unfair comparison.  In the Swift version you used 
> .count, but here you used .length, which is the length of the 
> array, NOT the number of characters or whatever you expect it 
> to be.  You should rather use .count and specify exactly what 
> you want to count, e.g., byCodePoint or byGrapheme.
>
> I suspect the Swift version will give you unexpected results if 
> you did something like compare "á" to "a\u0301", for example 
> (which, in case it isn't obvious, are visually identical to 
> each other, and as far as an end user is concerned, should only 
> count as 1 grapheme).
>
> Not even normalization will help you if you have a string like 
> "a\u0301\u0302": in that case, the *only* correct way to count 
> the number of visual characters is byGrapheme, and I highly 
> doubt Swift's .count will give you the correct answer in that 
> case. (I expect that Swift's .count will count code points, as 
> is the usual default in many languages, which is unfortunately 
> wrong when you're thinking about visual characters, which are 
> called graphemes in Unicode parlance.)

No, Swift counts grapheme clusters by default, so .count gives 1 
even for your combining-mark examples. I suggest you read the 
linked Swift chapter above. I think it's the wrong choice for 
performance, but they chose to emphasize intuitiveness for the 
common case.
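
For comparison, here is a rough sketch of the three different counts on 
the D side. It assumes Phobos' std.uni.byGrapheme and 
std.range.walkLength; the helper names codeUnits, codePoints and 
graphemes are made up for illustration and are not part of any library.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helpers, named only for this example.

/// Number of UTF-8 code units (bytes): what .length reports for a string.
size_t codeUnits(string s) { return s.length; }

/// Number of code points: walking a string decodes it to dchar.
size_t codePoints(string s) { return s.walkLength; }

/// Number of grapheme clusters, i.e. user-perceived characters.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    string precomposed = "\u00E1";       // á as a single code point
    string stacked = "a\u0301\u0302";    // a + two combining marks

    writeln(codeUnits(precomposed));  // 2
    writeln(codePoints(precomposed)); // 1
    writeln(graphemes(precomposed));  // 1

    writeln(codeUnits(stacked));      // 5
    writeln(codePoints(stacked));     // 3
    writeln(graphemes(stacked));      // 1
}
```

Only the grapheme count matches what Swift's .count reports by default, 
and it is the one that has to do the segmentation work on every call.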

I agree with most of the rest of what you wrote about programmers 
having no silver bullet to avoid Unicode's and languages' 
complexity.

