The Case Against Autodecode

Jonathan M Davis via Digitalmars-d digitalmars-d at puremagic.com
Tue May 31 09:45:45 PDT 2016


On Tuesday, May 31, 2016 11:07:09 Andrei Alexandrescu via Digitalmars-d wrote:
> On 5/31/16 3:56 AM, Walter Bright wrote:
> > If there is an abstraction for strings that is efficient, consistent,
> > useful, and hides the fact that it is UTF, I am not aware of it.
>
> It's been mentioned several times: a string type that does not offer
> range primitives; instead it offers explicit primitives (such as
> byCodeUnit, byCodePoint, byGrapheme etc) that yield appropriate ranges.

Not exactly. Such a string type does not hide the fact that it's UTF.
Rather, it forces you to deal with the fact that it's UTF. I have to agree
with Walter that there really isn't a way to automatically handle Unicode
correctly and efficiently while hiding the fact that it's doing all of the
stuff that has to be done for UTF.
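
To make that concrete, here's roughly what autodecoding gives us today (an
illustrative snippet, nothing more):

import std.range.primitives : front;
import std.stdio : writeln;

void main()
{
    // 'e' followed by U+0301 (combining acute accent): one perceived character
    string s = "e\u0301";

    // With autodecoding, the range element type of a string is dchar, not char.
    static assert(is(typeof(s.front) == dchar));

    writeln(s.length); // 3: the array length still counts UTF-8 code units
    writeln(s.front);  // 'e': one decoded code point, still not the full character
}

So the code-point view is neither the code-unit view nor the character view;
the UTF details leak out no matter what the default is.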

That being said, while an array of code units really is what a string should
be under the hood, having a string type that provides byCodeUnit,
byCodePoint, and byGrapheme is an improvement over treating
immutable(char)[] as string, even if byCodeUnit returns immutable(char)[],
because it forces the programmer to decide what they want to do rather than
blindly operating on immutable(char)[] as if a char were a full character.
And as long as it provides access to each level of Unicode, it's possible
for programmers who know what they're doing to operate on Unicode
efficiently, while making it much more obvious to those who don't know what
they're doing that they don't, rather than having them blindly act as if a
char were a full character.
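
For illustration, those three levels already exist as Phobos helpers today
(byCodeUnit and byDchar in std.utf, byGrapheme in std.uni; byDchar is playing
the role of byCodePoint here):

import std.utf : byCodeUnit, byDchar;
import std.uni : byGrapheme;
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // "cafe" with a combining acute accent on the final 'e'
    string s = "cafe\u0301";

    writeln(s.byCodeUnit.walkLength); // 6 UTF-8 code units
    writeln(s.byDchar.walkLength);    // 5 code points
    writeln(s.byGrapheme.walkLength); // 4 graphemes (user-perceived characters)
}

Three different answers to "how long is this string," which is exactly why
the programmer has to pick a level explicitly.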

There's really no reason why we couldn't define a string type that operated
that way while continuing to treat arrays of char the way that we do now in
the language, though transitioning to such a scheme is not at all
straightforward in terms of avoiding code breakage. Defining a String type
would be simple enough, and any function in Phobos which accepted a string
could be changed to accept a String, but we'd have problems with many
functions which currently returned string, since changing what they returned
would break code.
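
Just as a rough sketch of what I mean (illustrative only, not a worked-out
design; it simply wraps the existing Phobos helpers under renamed imports):

import utf = std.utf;
import uni = std.uni;

struct String
{
    private immutable(char)[] data;

    this(immutable(char)[] s) { data = s; }

    // Explicit views only; the caller has to pick a Unicode level.
    auto byCodeUnit()  { return utf.byCodeUnit(data); }
    auto byCodePoint() { return utf.byDchar(data); }    // code points as dchar
    auto byGrapheme()  { return uni.byGrapheme(data); }

    // Deliberately no front/popFront/empty/opIndex/opSlice, so the type
    // cannot be iterated or indexed as if one element were one character.
}

The point of leaving out the range and indexing primitives is that the only
operations available are the explicit ones, so choosing a Unicode level is
unavoidable.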

But even if Phobos were somehow completely changed over to use a new String
type, and even if the string alias were deprecated/removed, we'd still have
to deal with arrays of char, wchar, and dchar and run the risk of someone
using those and having problems because they didn't treat them as arrays of
code units. We can't really prevent that; all we can do is make string/String
something that makes the Unicode issue obvious, so that folks are less
likely to blindly treat chars as full characters. But even then, it wouldn't
be hard for folks to simply use the wrong Unicode level. All we'd really be
doing is shoving the issue in their faces so that they'd have to acknowledge
it on some level and maybe then actually learn enough to operate on Unicode
strings correctly.

But then again, since all you're really doing at that point is shoving the
Unicode issues in folks' faces by making strings neither ranges nor
indexable and forcing them to call byCodeUnit, byCodePoint, byGrapheme,
etc., I don't know that it actually solves much over treating
immutable(char)[] as string. Programmers still have to learn enough Unicode
to handle it correctly, just like they do now (whether we have autodecoding
or not). And such a string type really doesn't make the Unicode handling any
easier. It just makes the Unicode issues harder to ignore.

The Unicode problem is a lot like the floating point problems that have been
discussed recently. Programmers want it to "just work" without having to
worry about the details, but that really doesn't work, and while the average
programmer may not understand either floating point or Unicode properly,
they do have to work with both on a regular basis.

I'm not at all convinced that having string be an alias of immutable(char)[]
was a mistake, but having a struct that's not a range may very well be an
improvement. It _would_ at least make some of the Unicode issues more
obvious, but it doesn't really solve much from what I can see.

- Jonathan M Davis


