Why UTF-8/16 character encodings?

Dmitry Olshansky dmitry.olsh at gmail.com
Sat May 25 10:03:41 PDT 2013


25-May-2013 10:44, Joakim wrote:
> On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
>> You seem to think that not only UTF-8 is bad encoding but also one
>> unified encoding (code-space) is bad(?).
> Yes, on the encoding, if it's a variable-length encoding like UTF-8, no,
> on the code space.  I was originally going to title my post, "Why
> Unicode?" but I have no real problem with UCS, which merely standardized
> a bunch of pre-existing code pages.  Perhaps there are a lot of problems
> with UCS also, I just haven't delved into it enough to know.  My problem
> is with these dumb variable-length encodings, so I was precise in the
> title.
>

UCS-2 is dead and gone, next in line after "640K is enough for everyone".
Simply put, Unicode decided to take into account the full diversity of
languages instead of ~80% of them. Hard to add anything else. No offense
meant, but it feels like you are living in a universe that is 5-7 years
behind the current state of things. UTF-16 (the successor to UCS-2) is
not random-access either, and it's shitty beyond measure; UTF-8 is a
shining gem in comparison.

>> Separate code spaces were the case before Unicode (and utf-8). The
>> problem is not only that without header text is meaningless (no easy
>> slicing) but the fact that encoding of data after header strongly
>> depends a variety of factors -  a list of encodings actually. Now
>> everybody has to keep a (code) page per language to at least know if
>> it's 2 bytes per char or 1 byte per char or whatever. And you still
>> work on a basis that there is no combining marks and regional specific
>> stuff :)
> Everybody is still keeping code pages, UTF-8 hasn't changed that.

Legacy. Hard to switch overnight. Usage graphs indicate that a few years
from now you might never encounter a legacy encoding anymore, only
UTF-8/UTF-16.

>  Does
> UTF-8 not need "to at least know if it's 2 bytes per char or 1 byte per
> char or whatever?"

UTF-8 carries that information in the bytes themselves: the lead byte of
every sequence tells you how long that sequence is. You don't need any
extra information kept in sync with the text, unlike a header-based scheme.
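
Roughly speaking (a quick sketch, not tested; the function name is mine,
and Phobos' std.utf.stride does essentially this lookup for you):

// Sketch: the first byte of every UTF-8 sequence encodes how many
// bytes the sequence occupies, so no out-of-band header is needed.
uint sequenceLength(ubyte lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx -- plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // continuation byte / invalid lead
}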

> It has to do that also. Everyone keeps talking about
> "easy slicing" as though UTF-8 provides it, but it doesn't.  Phobos
> turns UTF-8 into UTF-32 internally for all that ease of use, at least
> doubling your string size in the process.  Correct me if I'm wrong, that
> was what I read on the newsgroup sometime back.

Indeed you are wrong: searching for a UTF-8 substring in a UTF-8 string
does no decoding at all, and it returns a slice of the remainder of the
original string.
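
Something along these lines (an untested sketch; it assumes find's
string/string path compares code units and hands back a slice, which is
exactly the behavior described above):

import std.algorithm : find;
import std.stdio : writeln;

void main()
{
    string haystack = "привет, мир"; // UTF-8 under the hood
    string needle   = "мир";

    // No decoding to dchar here: the match works on code units, which
    // is safe because UTF-8 is self-synchronizing. The result is a
    // slice of the balance of the original string, not a copy.
    auto rest = find(haystack, needle);
    writeln(rest); // prints "мир"
}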

>
>> In fact it was even "better" nobody ever talked about header they just
>> assumed a codepage with some global setting. Imagine yourself creating
>> a font rendering system these days - a hell of an exercise in
>> frustration (okay how do I render 0x88 ? mm if that is in codepage XYZ
>> then ...).
> I understand that people were frustrated with all the code pages out
> there before UCS standardized them, but that is a completely different
> argument than my problem with UTF-8 and variable-length encodings.  My
> proposed simple, header-based, constant-width encoding could be
> implemented with UCS and there go all your arguments about random code
> pages.

No, they don't go away: have you ever seen native Korean or Chinese
codepages? The problems with your header-based approach are self-evident
in that there is no single sane way to deal with it across locales
(which you simply ignore, as noted below).

>> This just shows you don't care for multilingual stuff at all. Imagine
>> any language tutor/translator/dictionary on the Web. For instance most
>> languages need to intersperse ASCII (also keep in mind e.g. HTML
>> markup). Books often feature citations in native language (or e.g.
>> latin) along with translations.
> This is a small segment of use and it would be handled fine by an
> alternate encoding.

??? That simply makes no sense. As of today some legacy encodings have
no intersection at all. Or do you want to add N*(N-1) cross-encodings,
one for every pair of languages? What about three languages in one string?

>> Now also take into account math symbols, currency symbols and beyond.
>> Also these days cultures are mixing in wild combinations so you might
>> need to see the text even if you can't read it. Unicode is not only
>> "encode characters from all languages". It needs to address universal
>> representation of symbolics used in writing systems at large.
> I take your point that it isn't just languages, but symbols also.  I see
> no reason why UTF-8 is a better encoding for that purpose than the kind
> of simple encoding I've suggested.
>
>> We want monoculture! That is to understand each without all these
>> "par-le-vu-france?" and codepages of various complexity(insanity).
> I hate monoculture, but then I haven't had to decipher some screwed-up
> codepage in the middle of the night. ;)

So you have never had trouble with internationalization? What languages
do you use (read/speak/etc.)?

>That said, you could standardize
> on UCS for your code space without using a bad encoding like UTF-8, as I
> said above.

Fixed-width UCS-2 was already a myth ~5 years ago. Early adopters of
Unicode fell into that trap (Java, Windows NT). You shouldn't.

>> Want small - use compression schemes which are perfectly fine and get
>> to the precious 1byte per codepoint with exceptional speed.
>> http://www.unicode.org/reports/tr6/
> Correct me if I'm wrong, but it seems like that compression scheme
> simply adds a header and then uses a single-byte encoding, exactly what
> I suggested! :)

That is essentially it, but it's far more flexible: it handles
multilingual strings just fine, as well as lone codepoints from anywhere
in the full Unicode range.

> But I get the impression that it's only for sending over
> the wire, ie transmision, so all the processing issues that UTF-8
> introduces would still be there.

Use a MIME type etc. Standards are always a bit messy and suboptimal;
their acceptance rate is one of the chief advantages they have. Unicode
has enormous momentum now, and not a single organization aside from the
Unicode Consortium even tries to do this dirty work (i.e. i18n).

>> And borrowing the arguments from from that rant: locale is borked shit
>> when it comes to encodings. Locales should be used for tweaking visual
>> like numbers, date display an so on.
> Is that worse than every API simply assuming UTF-8, as he says? Broken
> locale support in the past, as you and others complain about, doesn't
> invalidate the concept.

It's a combinatorial blowup, and it has some hard walls to run into.
Consider adding another encoding, say for Tuvan. Now you have to add 2*n
conversion routines to match it against all the other codepages/locales.

Beyond that, there are many things to consider in internationalization,
and you would have to special-case them all per codepage.

> If they're screwing up something so simple,
> imagine how much worse everyone is screwing up something complex like
> UTF-8?

UTF-8 is pretty darn simple. All it does is map the codepoint range
[0..10FFFF] to sequences of octets. It does that well and stays
compatible with ASCII; even the little rant you posted acknowledged
that. So are you against Unicode as a whole, or what?
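
For the record, the whole mapping fits in a handful of lines. A rough
sketch (no validation of surrogates or the upper bound; Phobos'
std.utf.encode is the real thing):

ubyte[] toUtf8(dchar c)
{
    // Codepoints below 0x80 pass through as a single octet --
    // that is the ASCII compatibility part.
    if (c < 0x80)    return [cast(ubyte)c];
    if (c < 0x800)   return [cast(ubyte)(0xC0 | (c >> 6)),
                             cast(ubyte)(0x80 | (c & 0x3F))];
    if (c < 0x10000) return [cast(ubyte)(0xE0 | (c >> 12)),
                             cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
                             cast(ubyte)(0x80 | (c & 0x3F))];
    return [cast(ubyte)(0xF0 | (c >> 18)),              // up to 0x10FFFF
            cast(ubyte)(0x80 | ((c >> 12) & 0x3F)),
            cast(ubyte)(0x80 | ((c >> 6) & 0x3F)),
            cast(ubyte)(0x80 | (c & 0x3F))];
}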

-- 
Dmitry Olshansky
