First Impressions!

Jonathan M Davis newsgroup.d at jmdavisprog.com
Thu Nov 30 19:07:36 UTC 2017


On Thursday, November 30, 2017 18:32:46 A Guy With a Question via 
Digitalmars-d wrote:
> On Thursday, 30 November 2017 at 17:56:58 UTC, Jonathan M Davis wrote:
> > On Thursday, November 30, 2017 03:37:37 Walter Bright via
> > Digitalmars-d wrote:
> > Language-wise, I think that most of the use of UTF-16 is driven
> > by the fact that Java went with UCS-2 / UTF-16, and C# followed
> > suit (both because it was copying Java and because the Win32 API
> > had gone with UCS-2 / UTF-16). So, that's had a lot of influence
> > on folks, though most other languages have gone with UTF-8 for
> > backwards compatibility and because it typically takes up less
> > space for non-Asian text. But the use of UTF-16 in Windows,
> > Java, and C# does seem to have resulted in some folks thinking
> > that wide characters mean Unicode and narrow characters mean
> > ASCII.
> >
> > - Jonathan M Davis
>
> I think it also simplifies the logic. You are not always looking
> to represent the code points symbolically; often you just want to
> see what information the string contains. So, if you can
> practically treat a code point as the unit of data behind the
> scenes, the logic gets simpler.

Even if that were true, UTF-16 code units are not code points. If you want
to operate on code points, you have to go to UTF-32. And even at UTF-32,
you have to worry about Unicode normalization; otherwise, the same text can
be represented by different sequences of code points, even when all you
care about is code points and not graphemes. And of course, some code
really does care about graphemes, since those are the actual characters.
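
To put it concretely, here's a quick sketch with Phobos (untested, but it
should be close):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni;   // byGrapheme, normalize, NFC

void main()
{
    // The same user-perceived character as two different code point sequences:
    string precomposed = "\u00E9";   // 'é' as a single code point
    string decomposed  = "e\u0301";  // 'e' followed by a combining acute accent

    writeln(precomposed == decomposed);                              // false
    writeln(precomposed.normalize!NFC == decomposed.normalize!NFC);  // true
    writeln(decomposed.byGrapheme.walkLength);                       // 1 grapheme
}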

Ultimately, you have to understand how code units, code points, and
graphemes work and what a particular algorithm is doing so that you know at
which level you should operate and where the pitfalls are. Some code can
operate on code units and be fine; some can operate on code points; and
some can operate on graphemes. But there is no one-size-fits-all solution
that makes it all magically easy and efficient.
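
In D, the choice of level is explicit in the code. A rough sketch (the code
point count relies on strings auto-decoding to ranges of dchar):

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    string s = "an\u0303";  // "a" plus "n" with a combining tilde, i.e. "añ"

    writeln(s.length);                 // 4 UTF-8 code units
    writeln(s.walkLength);             // 3 code points
    writeln(s.byGrapheme.walkLength);  // 2 graphemes
}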

And UTF-16 does _nothing_ to improve any of this over UTF-8. It's just a
different way to encode code points. If anything, it makes things worse: it
usually takes up more space than UTF-8, and it makes it easier to miss when
you screw up your Unicode handling, because a much larger portion of UTF-16
code units are valid code points than is the case with UTF-8, even though
they still aren't all valid code points. So, if you use UTF-8, you're more
likely to catch your mistakes.
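
For example (a quick sketch comparing the encodings via D's string literal
types):

import std.stdio : writeln;

void main()
{
    // Plain ASCII: UTF-16 doubles the size.
    writeln("hello".length * char.sizeof);    // 5 bytes as UTF-8
    writeln("hello"w.length * wchar.sizeof);  // 10 bytes as UTF-16

    // Outside the BMP, UTF-16 still needs multiple code units (a surrogate
    // pair), and a lone surrogate is not a valid code point.
    writeln("\U0001F600".length);   // 4 UTF-8 code units
    writeln("\U0001F600"w.length);  // 2 UTF-16 code units
    writeln("\U0001F600"d.length);  // 1 UTF-32 code unit, i.e. 1 code point
}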

Honestly, I think that the only good reason to use UTF-16 is if you're
interacting with existing APIs that use UTF-16, and even then, I think that
in most cases, you're better off using UTF-8 and converting to UTF-16 only
when you have to. Strings eat less memory that way, and mistakes are more
easily caught. And if you're writing cross-platform code in D, then Windows
is really the only place that you're typically going to have to deal with
UTF-16, so it definitely works better in general to favor UTF-8 in D
programs. But regardless, at least D gives you the tools to deal with the
different Unicode encodings relatively cleanly and easily, so you can use
whichever Unicode encoding you need to. Most D code is going to use UTF-8
though.
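
For the Windows case, the usual pattern is to keep everything UTF-8
internally and convert right at the API boundary, e.g. (untested sketch,
assuming the core.sys.windows bindings):

version(Windows)
{
    import core.sys.windows.windows;  // MessageBoxW, MB_OK, etc.
    import std.utf : toUTF16z;        // UTF-8 -> null-terminated UTF-16

    void showMessage(string title, string text)
    {
        // Convert only at the call site; the rest of the program stays UTF-8.
        MessageBoxW(null, text.toUTF16z, title.toUTF16z, MB_OK);
    }
}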

- Jonathan M Davis


