Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Fri May 24 23:44:29 PDT 2013


On Friday, 24 May 2013 at 21:21:27 UTC, Dmitry Olshansky wrote:
> You seem to think that not only is UTF-8 a bad encoding but 
> also that one unified encoding (code-space) is bad(?).
Yes on the encoding, if it's a variable-length encoding like 
UTF-8; no on the code space.  I was originally going to title my 
post "Why Unicode?", but I have no real problem with UCS, which 
merely standardized a bunch of pre-existing code pages.  Perhaps 
there are a lot of problems with UCS too; I just haven't delved 
into it enough to know.  My problem is with these dumb 
variable-length encodings, so I was precise in the title.
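
To be concrete about what "variable-length" costs, here's a quick 
D sketch (nothing more than an illustration): each of these 
strings is a single character to the reader, yet UTF-8 spends a 
different number of bytes on each one, so you can't know where 
the Nth character starts without walking all the bytes before it.

import std.stdio;

void main()
{
    // Each string below is one character, but .length counts
    // UTF-8 code units (bytes), and the counts all differ.
    foreach (s; ["a", "é", "日", "😀"])
        writefln("%s takes %s byte(s) in UTF-8", s, s.length);
    // Prints 1, 2, 3 and 4 bytes respectively.
}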

> Separate code spaces were the case before Unicode (and utf-8). 
> The problem is not only that without a header the text is 
> meaningless (no easy slicing) but the fact that the encoding of 
> data after the header strongly depends on a variety of factors 
> - a list of encodings, actually. Now everybody has to keep a 
> (code) page per language to at least know if it's 2 bytes per 
> char or 1 byte per char or whatever. And you still work on the 
> basis that there are no combining marks and regional specific 
> stuff :)
Everybody is still keeping code pages; UTF-8 hasn't changed that. 
  Does UTF-8 not need "to at least know if it's 2 bytes per char 
or 1 byte per char or whatever?"  It has to do that also.  
Everyone keeps talking about "easy slicing" as though UTF-8 
provides it, but it doesn't.  Phobos turns UTF-8 into UTF-32 
internally for all that ease of use, at least doubling your 
string size in the process.  Correct me if I'm wrong, but that 
was what I read on the newsgroup some time back.
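
That conversion is what I'm referring to: as far as I understand, 
Phobos's range primitives decode a narrow string to dchar, i.e. 
UTF-32 code points, on the fly.  A minimal sketch of what I mean 
(assuming the usual std.range/std.stdio behaviour):

import std.range;
import std.stdio;

void main()
{
    string s = "héllo";      // stored as UTF-8: 6 bytes, 5 characters
    writeln(s.length);       // 6 -- length counts bytes, not characters
    // Range primitives decode the UTF-8 and hand back 32-bit
    // code points, which is the UTF-32 conversion in question.
    static assert(is(typeof(s.front) == dchar));
    foreach (dchar c; s)     // each step decodes to UTF-32
        write(c, ' ');
    writeln();
}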

> In fact it was even "better": nobody ever talked about a 
> header, they just assumed a codepage with some global setting. 
> Imagine yourself creating a font rendering system these days - 
> a hell of an exercise in frustration (okay, how do I render 
> 0x88? mm, if that is in codepage XYZ then ...).
I understand that people were frustrated with all the code pages 
out there before UCS standardized them, but that is a completely 
different argument from my problem with UTF-8 and variable-length 
encodings.  My proposed simple, header-based, constant-width 
encoding could be implemented with UCS, and there go all your 
arguments about random code pages.
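
Just so it's clear what kind of scheme I mean, here is a toy 
sketch in D.  Everything in it -- the single-window header, the 
256-code-point window size, the names -- is made up purely for 
illustration, and a real design would need to handle mixed-script 
text, but it shows why fetching the Nth character becomes a 
constant-time array index instead of a decode loop.

import std.exception : enforce;

// Toy "header + constant-width" text: the header names a base
// offset into UCS, and every character is stored as one byte
// relative to that base.
struct WindowText
{
    ushort base;    // header: start of a 256-code-point UCS window
    ubyte[] data;   // exactly one byte per character

    // The i-th character is always data[i]: O(1), no decoding.
    dchar opIndex(size_t i) const
    {
        return cast(dchar)(base + data[i]);
    }
}

WindowText encode(dstring s, ushort base)
{
    ubyte[] buf;
    foreach (dchar c; s)
    {
        enforce(c >= base && c < base + 256,
                "code point falls outside the declared window");
        buf ~= cast(ubyte)(c - base);
    }
    return WindowText(base, buf);
}

Usage would then be something like auto t = encode("привет"d, 
0x0400); dchar c = t[3]; -- the fourth character is read 
directly, with no scanning of the bytes before it.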

> This just shows you don't care for multilingual stuff at all. 
> Imagine any language tutor/translator/dictionary on the Web. 
> For instance most languages need to intersperse ASCII (also 
> keep in mind e.g. HTML markup). Books often feature citations 
> in native language (or e.g. latin) along with translations.
This is a small segment of usage, and it would be handled fine by 
an alternate encoding.

> Now also take into account math symbols, currency symbols and 
> beyond. Also these days cultures are mixing in wild 
> combinations so you might need to see the text even if you 
> can't read it. Unicode is not only "encode characters from all 
> languages". It needs to address universal representation of 
> symbolics used in writing systems at large.
I take your point that it isn't just languages, but symbols also. 
  I see no reason why UTF-8 is a better encoding for that purpose 
than the kind of simple encoding I've suggested.

> We want monoculture! That is, to understand each other without 
> all these "par-le-vu-france?" and codepages of various 
> complexity (insanity).
I hate monoculture, but then I haven't had to decipher some 
screwed-up codepage in the middle of the night. ;) That said, you 
could standardize on UCS for your code space without using a bad 
encoding like UTF-8, as I said above.

> Want small - use compression schemes which are perfectly fine 
> and get to the precious 1 byte per codepoint with exceptional 
> speed.
> http://www.unicode.org/reports/tr6/
Correct me if I'm wrong, but it seems like that compression 
scheme simply adds a header and then uses a single-byte encoding, 
exactly what I suggested! :) But I get the impression that it's 
only for sending over the wire, i.e. transmission, so all the 
processing issues that UTF-8 introduces would still be there.
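
By "processing issues" I mean things like this: even with 
Phobos's own helpers, getting at the Nth character of a UTF-8 
string still means decoding every character before it.  A small 
sketch, if I have the std.utf API (toUTFindex and stride) right:

import std.stdio;
import std.utf : stride, toUTFindex;

void main()
{
    string s = "日本語のテキスト";   // 8 characters, 24 bytes in UTF-8
    // Finding where character #5 lives requires decoding
    // characters 0 through 4 first; there is no O(1) jump.
    size_t b = toUTFindex(s, 5);
    writeln(b);                          // 15, not 5
    writeln(s[b .. b + stride(s, b)]);   // "キ"
}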

> And borrowing the arguments from that rant: locale is borked 
> shit when it comes to encodings. Locales should be used for 
> tweaking visuals like numbers, date display and so on.
Is that worse than every API simply assuming UTF-8, as he says?  
Broken locale support in the past, as you and others complain 
about, doesn't invalidate the concept.  If they're screwing up 
something so simple, imagine how much worse everyone is screwing 
up something complex like UTF-8.
