More fun with autodecoding

Mon Sep 10 09:15:09 UTC 2018

On Monday, September 10, 2018 2:45:27 AM MDT Chris via Digitalmars-d wrote:

> After a while your code will be cluttered with absurd stuff like
> this. `.byCodeUnit`, `.byGrapheme`, `.array` etc. Due to my
> experience with `splitter` et al. I tried to create my own
> parser to have better control over every step. After a few
> *minutes* of testing things I ran into this bug [1] that didn't
> get fixed till early 2018. I never started to write my own
> step-by-step parser. I'm glad I didn't.
>
> [1] https://issues.dlang.org/show_bug.cgi?id=16739
>
> [snip]

I suspect that that didn't get found sooner simply because using
Unicode in a switch statement is rare. Usually, Unicode characters are found
in program input and not in the program itself. And grammars typically only
involve ASCII characters (even D, which supports Unicode characters in
identifiers, doesn't have any Unicode in any of its symbols). So, while I
completely agree that using Unicode in switch statements should work, it
doesn't really surprise me that it was broken. That's really a large part of
the Unicode problem. Regardless of how a particular language or library
attempts to make using Unicode sane, a large percentage of programmers don't
ever do anything with Unicode characters (even if their programs are often
used in environments where they will end up processing Unicode characters),
and even when a programmer's native tongue requires Unicode characters,
their programs frequently do not. So, it becomes very easy to write code
that doesn't work properly with Unicode and have no clue that it doesn't.
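
For illustration, here is roughly what Unicode in a switch statement looks
like. This is only a sketch and not a reproduction of the bug linked above;
the classify function and its categories are made up for the example, and
the source file is assumed to be saved as UTF-8.

```d
import std.stdio : writeln;

// Hypothetical classifier, invented for this sketch. The non-ASCII case
// labels are the kind of thing that rarely shows up in real grammars.
string classify(dchar c)
{
    switch (c)
    {
        case 'ä', 'ö', 'ü':
            return "umlaut";
        case 'ß':
            return "eszett";
        case 'a': .. case 'z':
            return "ascii lowercase";
        default:
            return "other";
    }
}

void main()
{
    // foreach with an explicit dchar decodes the UTF-8 string to code points.
    foreach (dchar c; "aßü!")
        writeln(classify(c));
}
```

Code that only ever sees the 'a' .. 'z' branch exercised in testing gives no
hint about whether the other branches work.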

Fortunately, D does provide better tools than many languages for handling
Unicode, but the auto-decoding mess has made it considerably worse.

Still, even if we'd gotten it right, some portion of the code out there would
have to use something like byCodeUnit, byCodePoint, or byGrapheme, because
efficient Unicode processing requires that you deal with all of that mess.
The code that doesn't have to do any of that is generally code that treats
strings as opaque data. Once you actually have to do string processing,
you're pretty much screwed.
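
As a minimal sketch of why the level matters, the same string has a
different length depending on whether you count code units, code points, or
graphemes. The three count* helpers are names made up for this example; the
Phobos pieces used are std.utf.byCodeUnit, std.uni.byGrapheme, and
std.range.walkLength. The code point count relies on the auto-decoding that
Phobos currently does for strings.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

// Hypothetical helpers, named for this sketch only.

// Counts UTF-8 code units (same as s.length for a string).
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

// Counts code points; walking a string as a range auto-decodes to dchar.
size_t countCodePoints(string s)
{
    return s.walkLength;
}

// Counts grapheme clusters, i.e. user-perceived characters.
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string s = "e\u0301";

    writeln(countCodeUnits(s));  // 3
    writeln(countCodePoints(s)); // 2
    writeln(countGraphemes(s));  // 1
}
```

None of the three answers is wrong. Which one you need depends on what the
code is doing, which is why the choice can't be hidden from code that
actually processes strings.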

Doing everything at the grapheme level would eliminate most of the problems
with regard to user-friendliness, but it would kill efficiency. So, as far
as I can tell, there really isn't a great solution to be had. Unicode is
simply too complicated and messy by its very nature. Now, we've definitely
made mistakes with Phobos that make it worse, but the only programs that are
going to avoid this whole mess do so either by not dealing with Unicode, by
handling it incorrectly, or by handling it inefficiently. I think that it's
pretty much a pipe dream to be able to have completely sane and efficient
string handling using Unicode as it's currently defined.
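
To make the trade-off concrete, here is a sketch that reverses a string at
the code point level and at the grapheme level. The two function names are
made up for the example; the grapheme version follows the pattern shown in
the std.uni documentation for byCodePoint.

```d
import std.array : array;
import std.conv : text;
import std.range : retro;
import std.stdio : writeln;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: reverses code points (what auto-decoding gives you).
// Combining marks end up attached to the wrong base character.
string reverseByCodePoint(string s)
{
    return s.retro.array.text;
}

// Hypothetical helper: reverses grapheme clusters. Correct from the user's
// point of view, but it has to segment the string and allocate an array of
// Grapheme values before it can reverse anything.
string reverseByGrapheme(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    // "noël" written with 'e' followed by U+0308 COMBINING DIAERESIS
    string s = "noe\u0308l";

    writeln(reverseByCodePoint(s)); // diaeresis lands on the 'l'
    writeln(reverseByGrapheme(s));  // diaeresis stays on the 'e'
}
```

The grapheme version gives the answer a user would expect, and it pays for
it with the segmentation pass and the extra allocation.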

Regardless, we need to do a better job of it in D than we have been.

- Jonathan M Davis