Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sat May 25 02:05:32 PDT 2013


On Saturday, 25 May 2013 at 08:42:46 UTC, Walter Bright wrote:
> I think you stand alone in your desire to return to code pages.
Nobody is talking about going back to code pages.  I'm talking 
about going to single-byte encodings, which need not carry the 
problems you had with code pages way back when.

> I have years of experience with code pages and the unfixable 
> misery they produce. This has disappeared with Unicode. I find 
> your arguments unpersuasive when stacked against my experience. 
> And yes, I have made a living writing high performance code 
> that deals with characters, and you are quite off base with 
> claims that UTF-8 has inevitable bad performance - though there 
> is inefficient code in Phobos for it, to be sure.
How can a variable-width encoding possibly compete with a 
constant-width encoding?  You have not articulated a reason for 
this.  Do you believe there is a performance loss with 
variable-width, but that it is not significant and therefore 
worth it?  Or do you believe it can be implemented with no loss?  
That is what I asked above, but you did not answer.
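For concreteness, here is a minimal sketch (in Python, since the 
thread never pins down an implementation) of the access-cost 
difference being debated: finding the i-th character of well-formed 
UTF-8 requires scanning every character before it, while a 
fixed-width encoding is a direct byte slice.  The function names 
are mine, not from the thread.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th code point of UTF-8 bytes by scanning.

    Variable-width: every preceding character must be walked, so
    random access is O(index).  Assumes well-formed UTF-8 (each
    position examined is a valid lead byte).
    """
    pos = 0
    for _ in range(index):
        b = data[pos]
        if b < 0x80:
            pos += 1      # 1-byte sequence (ASCII)
        elif b < 0xE0:
            pos += 2      # 2-byte sequence
        elif b < 0xF0:
            pos += 3      # 3-byte sequence
        else:
            pos += 4      # 4-byte sequence
    b = data[pos]
    length = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return data[pos:pos + length].decode("utf-8")


def fixed_char_at(data: bytes, index: int, width: int = 1) -> bytes:
    """Fixed-width: the index-th character is an O(1) byte slice."""
    return data[index * width:(index + 1) * width]


s = "naïve café".encode("utf-8")
print(utf8_char_at(s, 2))  # prints 'ï', found only after scanning earlier bytes
```

Whether this O(index) scan matters in practice (iteration is still 
linear either way, and real code rarely needs random code-point 
access) is exactly the question being argued here.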

> My grandfather wrote a book that consists of mixed German, 
> French, and Latin words, using special characters unique to 
> those languages. Another failing of code pages is it fails 
> miserably at any such mixed language text. Unicode handles it 
> with aplomb.
I see no reason why single-byte encodings wouldn't do a better 
job at such mixed-language text.  You'd just need a larger, more 
complex header, or you could keep each string in a single language 
and use a separate format to compose them together for your book.  
This would be so much easier than UTF-8 that I cannot see how 
anyone could argue for a variable-length encoding instead.
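The post never spells out what this "larger, more complex header" 
would look like, so the following is purely a hypothetical sketch 
of one way single-byte runs might compose mixed-language text: a 
run table of (charset id, length) pairs over a concatenated 
one-byte-per-character payload.  The registry, function names, and 
choice of ISO 8859 charsets are all my assumptions, not the 
poster's design.

```python
# Hypothetical run-table scheme (illustrative only, not from the post):
# header = list of (charset_id, byte_count); payload = the runs,
# each encoded one byte per character in its own single-byte charset.
CHARSETS = {0: "latin-1", 1: "iso8859_7"}  # assumed registry: Western, Greek


def encode_runs(runs):
    """runs: list of (charset_id, text) -> (header, payload)."""
    header, payload = [], bytearray()
    for cs_id, text in runs:
        data = text.encode(CHARSETS[cs_id])  # exactly one byte per character
        header.append((cs_id, len(data)))
        payload += data
    return header, bytes(payload)


def decode_runs(header, payload):
    """Rebuild the mixed-language string by decoding each run."""
    out, pos = [], 0
    for cs_id, n in header:
        out.append(payload[pos:pos + n].decode(CHARSETS[cs_id]))
        pos += n
    return "".join(out)


header, payload = encode_runs([(0, "Grüß dich, "), (1, "καλημέρα")])
print(decode_runs(header, payload))  # prints 'Grüß dich, καλημέρα'
```

Within each run, indexing is a direct byte offset; the cost has 
moved into maintaining and consulting the run table, which is the 
trade-off Walter's objections are aimed at.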

> I can't even write an email to Rainer Schütze in English under 
> your scheme.
Why not?  You seem to think that my scheme doesn't implement 
multi-language text at all, whereas I pointed out, from the 
beginning, that it could be trivially done also.

> Code pages simply are no longer practical nor acceptable for a 
> global community. D is never going to convert to a code page 
> system, and even if it did, there's no way D will ever convince 
> the world to abandon Unicode, and so D would be as useless as 
> EBCDIC.
I'm afraid you and others here seem to translate "single-byte 
encodings" to "code pages" in your heads, then recoil in horror 
as you remember all your problems with broken implementations of 
code pages, even though those problems are not intrinsic to 
single-byte encodings.

I'm not asking you to consider this for D.  I just wanted to 
discuss why UTF-8 is used at all.  I had hoped for some technical 
evaluations of its merits, but I seem to simply be dredging up a 
bunch of repressed memories about code pages instead. ;)

The world may not "abandon Unicode," but it will abandon UTF-8, 
because it's a dumb idea.  Unfortunately, such dumb ideas (XML, 
anyone?) often proliferate until someone comes up with something 
better to show how dumb they are.  Perhaps it won't be the D 
programming language that does that, but it would be easy to 
implement my idea in D, so maybe it will be a D-based library 
someday. :)

> I'm afraid your quest is quixotic.
I'd argue the opposite, considering most programmers still can't 
wrap their heads around UTF-8.  If someone can just get a 
single-byte encoding implemented and in front of them, I suspect 
it will be UTF-8 that will be considered quixotic. :D
