Dicebot on leaving D: It is anarchy driven development in all its glory.

Joakim dlang at joakim.fea.st
Thu Sep 6 10:44:45 UTC 2018


On Thursday, 6 September 2018 at 09:35:27 UTC, Chris wrote:
> On Thursday, 6 September 2018 at 08:44:15 UTC, nkm1 wrote:
>> On Wednesday, 5 September 2018 at 07:48:34 UTC, Chris wrote:
>>> On Tuesday, 4 September 2018 at 21:36:16 UTC, Walter Bright 
>>> wrote:
>>>>
>>>> Autodecode - I've suffered under that, too. The solution was 
>>>> fairly simple. Append .byCodeUnit to strings that would 
>>>> otherwise autodecode. Annoying, but hardly a showstopper.
>>>
>>> import std.array : array;
>>> import std.stdio : writefln;
>>> import std.uni : byCodePoint, byGrapheme;
>>> import std.utf : byCodeUnit;
>>>
>>> void main() {
>>>
>>>   string first = "á";
>>>
>>>   writefln("%d", first.length);  // prints 2
>>>
>>>   auto firstCU = "á".byCodeUnit; // type is `ByCodeUnitImpl` (!)
>>>
>>>   writefln("%d", firstCU.length);  // prints 2
>>>
>>>   auto firstGr = "á".byGrapheme.array;  // type is `Grapheme[]`
>>>
>>>   writefln("%d", firstGr.length);  // prints 1
>>>
>>>   auto firstCP = "á".byCodePoint.array; // type is `dchar[]`
>>>
>>>   writefln("%d", firstCP.length);  // prints 1
>>>
>>>   dstring second = "á";
>>>
>>>   writefln("%d", second.length);  // prints 1 (That was easy!)
>>>
>>>   // DMD64 D Compiler v2.081.2
>>> }
>>
>> And this has what to do with autodecoding?
>
> Nothing. I was just pointing out how awkward some basic things 
> can be. Autodecoding just adds to it in the sense that it's a 
> useless overhead that will keep string handling in limbo 
> forever and ever and ever.
>
>>
>> TBH, it looks like you're just confused about how Unicode 
>> works. None of that is something particular to D. You should 
>> probably address your concerns to the Unicode Consortium. Not 
>> that they care.
>
> I'm actually not confused since I've been dealing with Unicode 
> (and encodings in general) for quite a while now. Although I'm 
> not a Unicode expert, I know what the operations above do and 
> why. I'd only expect a modern PL to deal with Unicode correctly 
> and have some guidelines as to the nitty-gritty.

Since you understand Unicode well, enlighten us: what's the best 
default format to use for string iteration?

You can argue that D chose the wrong default by having the stdlib 
auto-decode to code points in several places; Walter and a good 
part of the core D team would agree with you, and you can add me 
to that list too. But it's not clear there should be a default 
format at all, other than whatever you started with, particularly 
for a language that values performance the way D does, since each 
choice of format comes with its own speed vs. correctness 
trade-offs.
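
To make that trade-off concrete, here is a minimal sketch of my 
own (not from Chris's snippet above), spelling "noël" with a 
combining diaeresis; each iteration scheme the stdlib offers 
reports a different count, and each has a different cost:

import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byCodePoint, byGrapheme;
import std.utf : byCodeUnit;

void main() {
    string s = "noe\u0308l";  // "noël" spelled as 'e' + U+0308

    // Code units: no decoding at all; length is O(1) over the raw UTF-8.
    writefln("code units:  %d", s.byCodeUnit.length);       // 6

    // Code points: requires decoding the UTF-8, an O(n) walk.
    writefln("code points: %d", s.byCodePoint.walkLength);  // 5

    // Graphemes: requires full cluster segmentation, the slowest of the
    // three, but closest to what a reader would call a "character".
    writefln("graphemes:   %d", s.byGrapheme.walkLength);   // 4
}

Which of those three counts is "correct" depends entirely on what 
you're doing with the string, which is exactly the point: there 
is no single right default.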

Therefore, the programmer has to understand that complexity and 
make his own choice. You're acting like there's some obvious 
choice for how to handle Unicode that we're missing here, when 
the truth is that _no programming language handles Unicode 
well_, since representing a host of world languages in a single 
format is _inherently unintuitive_ and involves significant 
efficiency trade-offs between the different formats.

> And once again, it's the user's fault as in having some basic 
> assumptions about how things should work. The user is just too 
> stoooopid to use D properly - that's all. I know this type of 
> behavior from the management of pubs and shops that had to 
> close down, because nobody would go there anymore.
>
> Do you know the book "Crónica de una muerte anunciada" 
> (Chronicle of a Death Foretold) by Gabriel García Márquez?
>
> "The central question at the core of the novella is how the 
> death of Santiago Nasar was foreseen, yet no one tried to stop 
> it."[1]
>
> [1] 
> https://en.wikipedia.org/wiki/Chronicle_of_a_Death_Foretold#Key_themes

You're not being fair here, Chris. I just saw this SO question 
that I think exemplifies how most programmers react to Unicode:

"Trying to understand the subtleties of modern Unicode is making 
my head hurt. In particular, the distinction between code points, 
characters, glyphs and graphemes - concepts which in the simplest 
case, when dealing with English text using ASCII characters, all 
have a one-to-one relationship with each other - is causing me 
trouble.

Seeing how these terms get used in documents like Matthias 
Bynens' JavaScript has a unicode problem or Wikipedia's piece on 
Han unification, I've gathered that these concepts are not the 
same thing and that it's dangerous to conflate them, but I'm kind 
of struggling to grasp what each term means.

The Unicode Consortium offers a glossary to explain this stuff, 
but it's full of "definitions" like this:

Abstract Character. A unit of information used for the 
organization, control, or representation of textual data. ...

...

Character. ... (2) Synonym for abstract character. (3) The basic 
unit of encoding for the Unicode character encoding. ...

...

Glyph. (1) An abstract form that represents one or more glyph 
images. (2) A synonym for glyph image. In displaying Unicode 
character data, one or more glyphs may be selected to depict a 
particular character.

...

Grapheme. (1) A minimally distinctive unit of writing in the 
context of a particular writing system. ...

Most of these definitions possess the quality of sounding very 
academic and formal, but lack the quality of meaning anything, or 
else defer the problem of definition to yet another glossary 
entry or section of the standard.

So I seek the arcane wisdom of those more learned than I. How 
exactly do each of these concepts differ from each other, and in 
what circumstances would they not have a one-to-one relationship 
with each other?"
https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
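
To put some numbers on that one-to-one collapse, here is a small 
example of mine (not from the SO thread): for ASCII the counts 
all agree, but the same visible "á" can be spelled either as one 
precomposed code point or as 'a' plus a combining accent, and 
then only the grapheme count still matches what the reader sees:

import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main() {
    string precomposed = "\u00E1";   // LATIN SMALL LETTER A WITH ACUTE
    string decomposed  = "a\u0301";  // 'a' + COMBINING ACUTE ACCENT

    foreach (s; [precomposed, decomposed]) {
        // code units / code points / graphemes
        writefln("units=%d points=%d graphemes=%d",
                 s.byCodeUnit.length,         // 2 vs. 3
                 s.walkLength,                // 1 vs. 2
                 s.byGrapheme.walkLength);    // 1 vs. 1
    }
}

Glyphs don't show up in code at all: they are a rendering-level 
concept, and the font decides how many glyph images it uses to 
draw each grapheme.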

Honestly, Unicode is a mess, and I believe we will all have to 
dump the Unicode standard and start over one day. Until that fine 
day, there is no neat solution for handling it, no matter how 
much you'd like to think so. Also, much of the complexity 
actually comes from the complexity of the various language 
alphabets themselves, so it cannot be waved away no matter what 
standard you come up with, though Unicode certainly adds more 
unneeded complexity on top, which is why it should be dumped.

