Why UTF-8/16 character encodings?

H. S. Teoh hsteoh at quickfur.ath.cx
Sat May 25 21:48:19 PDT 2013


On Sat, May 25, 2013 at 04:14:34PM -0700, Jonathan M Davis wrote:
> On Saturday, May 25, 2013 01:42:20 Walter Bright wrote:
> > On 5/25/2013 12:33 AM, Joakim wrote:
> > > At what cost?  Most programmers completely punt on unicode,
> > > because they just don't want to deal with the complexity. Perhaps
> > > you can deal with it and don't mind the performance loss, but I
> > > suspect you're in the minority.
> > 
> > I think you stand alone in your desire to return to code pages. I
> > have years of experience with code pages and the unfixable misery
> > they produce. This has disappeared with Unicode. I find your
> > arguments unpersuasive when stacked against my experience. And yes,
> > I have made a living writing high performance code that deals with
> > characters, and you are quite off base with claims that UTF-8 has
> > inevitable bad performance - though there is inefficient code in
> > Phobos for it, to be sure.
> > 
> > My grandfather wrote a book that consists of mixed German, French,
> > and Latin words, using special characters unique to those languages.
> > Another failing of code pages is it fails miserably at any such
> > mixed language text.  Unicode handles it with aplomb.
> > 
> > I can't even write an email to Rainer Schütze in English under your
> > scheme.
> > 
> > Code pages simply are no longer practical nor acceptable for a
> > global community. D is never going to convert to a code page system,
> > and even if it did, there's no way D will ever convince the world to
> > abandon Unicode, and so D would be as useless as EBCDIC.
> > 
> > I'm afraid your quest is quixotic.
> 
> All I've got to say on this subject is "Thank you Walter Bright for
> building Unicode into D!"
[...]

Ditto here!

In fact, Unicode support in D (esp. UTF-8) was one of the major factors
that convinced me to adopt D. I had been trying to write
language-agnostic programs in C/C++, and ... let's just say that it was
one gigantic hairy mess, and required lots of system-dependent hacks and
unfounded assumptions ("it appears to work so I think the code's correct
even though according to spec it shouldn't have worked"). I18n support
in libc was spotty and incomplete, with many common functions breaking
in unexpected ways once you stepped outside ASCII, and libraries like
gettext addressed some of the issues but not all. Getting *real* i18n
support required a full-fledged i18n library like libicu, which in turn
required custom string types. The whole experience was so painful that
I've since avoided doing any i18n in C/C++ at all.

Then along came D with native Unicode support built right into the
language. And not just UTF-16 shoved down your throat like Java does;
UTF-8, UTF-16, and UTF-32 are all equally supported.
You cannot imagine what a happy camper I have been ever since!! Yes,
Phobos still has a ways to go in terms of performance w.r.t. UTF-8
strings, but what we have right now is already far, far superior to the
situation in C/C++, and things can only get better.


T

-- 
Freedom of speech: the whole world has no right *not* to hear my spouting off!

