The Case Against Autodecode

Marco Leise via Digitalmars-d digitalmars-d at puremagic.com
Mon May 30 12:28:03 PDT 2016


On Mon, 30 May 2016 17:35:36 +0000,
Chris <wendlec at tcd.ie> wrote:

> I was actually talking about ICU with a colleague today. Could it 
> be that Unicode itself is broken? I've often heard criticism of 
> Unicode but never looked into it.

You have to compare to the situation before, when every
operating system with every localization had its own encoding.
Have some text file with ASCII art in a DOS code page? Doesn't
render on Windows with the same locale. Open Cyrillic text on
a Latin system? Indigestible. Someone wrote a website on
Windows and incorrectly tagged it with an ISO charset? The
browser has to fix it up for them.
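
To make that concrete, here is a minimal Python sketch of the
pre-Unicode situation. Python is used only for illustration, and
the sample bytes and variable names are made up for this example.
It decodes the same bytes under different legacy code pages, which
is roughly what happened whenever a file crossed systems.

```python
# Three bytes of DOS box-drawing "ASCII art", as stored in a file.
art_bytes = b"\xc9\xcd\xbb"

# Read with the DOS code page (CP437) the art looks as intended.
art_dos = art_bytes.decode("cp437")

# Read with the Windows code page (CP1252) of the same Western
# locale, the very same bytes become accented letters.
art_windows = art_bytes.decode("cp1252")

# Cyrillic text saved on a system that uses Windows-1251 ...
cyrillic_bytes = "Привет".encode("cp1251")

# ... and opened on a Latin system that assumes CP1252.
cyrillic_on_latin = cyrillic_bytes.decode("cp1252")

# With one shared encoding (UTF-8 here) the text survives the trip.
roundtrip = "Привет".encode("utf-8").decode("utf-8")

print(art_dos, art_windows, cyrillic_on_latin, roundtrip)
```

None of the wrong decodings raises an error. The bytes are valid
in every one of these code pages, so the damage is silent.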

One objection I remember was the Han Unification:
https://en.wikipedia.org/wiki/Han_unification
Not everyone liked how Chinese, Japanese and Korean were
represented with a common set of ideographs. At the time
Unicode was still 16-bit and the unified symbols would already
make up 32% of all code points.

In my eyes many of the perceived problems of Unicode stem
from the fact that it raises awareness of the different
writing systems all over the globe in a way that we didn't
have to deal with when software was developed locally instead
of globally on GitHub, when the target was Windows instead of
cross-platform and mobile, when we were lucky if we localized
for a couple of Latin-script languages, but Asia was a real
barrier.

I don't know what you and your colleague discussed about ICU,
but it was likely whether you should add another dependency
and what alternatives there are. In Linux user space, almost
everything is an outside project, an extra library, most of
them with alternatives. My own research led me to the
conclusion that there is one set of libraries without real
alternatives: ICU -> HarfBuzz -> Pango
That's the go-to chain for Unicode text, from text processing
through rendering to layout. Moreover many successful
open-source projects make use of it: LibreOffice, SQLite, Qt,
libxml2, WebKit to name a few.
Unicode is here to stay, no matter what could have been done
better in the past, and I think it is perfectly safe to bet on
ICU on Linux for what e.g. Windows has built in.

Otherwise just do as Adam Ruppe said:
> Don't mess with strings. Get them from the user, store them
> without modification, spit them back out again.

:p

-- 
Marco


