Why UTF-8/16 character encodings?

H. S. Teoh hsteoh at quickfur.ath.cx
Sun May 26 07:35:42 PDT 2013


On Sun, May 26, 2013 at 11:59:19AM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 20:52:41 UTC, H. S. Teoh wrote:
> >And just how exactly does that help with slicing? If anything, it
> >makes slicing way hairier and error-prone than UTF-8. In fact, this
> >one point alone already defeated any performance gains you may have
> >had with a single-byte encoding. Now you can't do *any* slicing at
> >all without convoluted algorithms to determine what encoding is where
> >at the endpoints of your slice, and the resulting slice must have new
> >headers to indicate the start/end of every different-language
> >substring.  By the time you're done with all that, you're going way
> >slower than processing UTF-8.
>
> There are no convoluted algorithms, it's a simple check if the
> string contains any two-byte encodings, a check which can be done
> once and cached.

IHBT. You said that to handle multilanguage strings, your header would
have a list of starting/ending points indicating which encoding should
be used for which substring(s). That has nothing to do with two-byte
encodings. So, please show us the code: given a string containing, say,
English and French substrings, what will the header look like? And
what's the algorithm to take a slice of such a string?


> If it's single-byte all the way through, no problems whatsoever with
> slicing.

Huh?! How are there no problems with slicing? Let's say you have a
string that contains both English and French. According to your scheme,
you'll have some kind of header format that lets you say bytes 0-123 are
English, bytes 124-129 are French, and bytes 130-200 are English. Now
let's say I want a substring from 120 to 125. How would this be done?
And what about if I want a substring from 120 to 140? Or 126 to 130?
What if the string contains several runs of French?

Please show us the code.
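
To make the slicing question concrete, here is a minimal Python sketch of
what a header-based scheme might have to do. The run-list layout, the
`slice_runs` name, and the language tags are all assumptions made for
illustration; nothing like this has been posted in the thread, and the
real proposal may differ.

```python
from typing import List, Tuple

# Hypothetical header entry: (start, end, language), with end exclusive.
Run = Tuple[int, int, str]

def slice_runs(runs: List[Run], lo: int, hi: int) -> List[Run]:
    """Build the header for bytes [lo, hi), rebased so the slice starts at 0."""
    out = []
    for start, end, lang in runs:
        # Clip each run to the requested range; drop runs that fall outside it.
        s, e = max(start, lo), min(end, hi)
        if s < e:
            out.append((s - lo, e - lo, lang))
    return out

# Bytes 0-123 English, 124-129 French, 130-200 English (inclusive ranges
# from the example above, stored here with exclusive ends).
header = [(0, 124, "en"), (124, 130, "fr"), (130, 201, "en")]

print(slice_runs(header, 120, 126))  # bytes 120..125 -> [(0, 4, 'en'), (4, 6, 'fr')]
print(slice_runs(header, 120, 141))  # bytes 120..140 -> three runs
print(slice_runs(header, 126, 131))  # bytes 126..130 -> [(0, 4, 'fr'), (4, 5, 'en')]
```

Even in this simplified form, every slice walks the run list and allocates
a new header, so the cost grows with the number of language runs in the
string rather than being a fixed "little arithmetic calculation".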


> If there are two-byte languages included, the slice function will have
> to do a little arithmetic calculation before slicing.  You will also
> need a few arithmetic ops to create the new header for the slice.  The
> point is that these operations will be much faster than decoding every
> code point to slice UTF-8.

You haven't proven that this "little arithmetic calculation" will be
faster than manipulating UTF-8. What if I have an English text that
contains quotations of Chinese, French, and Greek snippets? Math
symbols?  Please show us (1) how such a string should be encoded under
your scheme, and (2) the code that will slice such a string in an efficient
way, according to your proposed encoding scheme.

(And before you dismiss such a string as unlikely or write it off as
rare, consider a technical math paper that cites the work of Chinese and
French authors -- a rather common thing these days. You'd need the extra
characters just to be able to cite their names, even if none of the
actual Chinese or French is quoted verbatim. Greek in general is used
all over math anyway, since for whatever reason mathematicians just love
Greek symbols, so it pretty much needs to be included by default.)
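
For comparison, here is a minimal Python sketch of slicing such a mixed
string in UTF-8 by byte offset. It relies only on the fact that UTF-8
continuation bytes have the bit pattern 10xxxxxx, so a code point boundary
can be found without decoding anything. The helper names `is_boundary` and
`snap_back` and the sample text are assumptions for illustration.

```python
def is_boundary(buf: bytes, i: int) -> bool:
    """True if byte offset i does not fall inside a multi-byte sequence."""
    # Continuation bytes look like 0b10xxxxxx; anything else starts a code point.
    return i == 0 or i >= len(buf) or (buf[i] & 0xC0) != 0x80

def snap_back(buf: bytes, i: int) -> int:
    """Move i left to the nearest code point boundary (at most 3 steps in valid UTF-8)."""
    while not is_boundary(buf, i):
        i -= 1
    return i

# English, Chinese, and Greek in one string, with no header and no run list.
text = "Cauchy, 陈省身, α+β".encode("utf-8")

print(len(text))                       # 24 bytes: 1 per ASCII, 3 per CJK, 2 per Greek
print(is_boundary(text, 9))            # False: offset 9 is inside the first CJK character
print(snap_back(text, 9))              # 8
print(text[8:17].decode("utf-8"))      # the three CJK characters
```

Taking the slice itself is just a pointer and a length; the boundary check
is only needed when the offsets come from somewhere that does not already
guarantee they land on code point boundaries.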


> >Again I say, I'm not 100% sold on UTF-8, but what you're proposing
> >here is far worse.
> Well, I'm glad you realize some problems with UTF-8, :) even if you
> dismiss my alternative out of hand.

Clearly, we're not seeing what you're seeing here. So instead of making
general statements about the superiority of your scheme, you might want
to show us the actual code.  So far, I haven't seen anything that
convinces me that your scheme is any better.  In fact, from what I can
see, it's a lot worse, and you're just evading pointed questions about
how to address those problems.  Maybe that's a wrong perception, but not
having any actual code to look at, I'm having a hard time believing your
claims. Right now I'm leaning towards agreeing with Walter that you're
just trolling us (and rather successfully at that).

So, please show us the code. Otherwise, I think I should just stop
responding, as we're obviously not on the same page and this discussion
isn't getting anywhere.


T

-- 
Some ideas are so stupid that only intellectuals could believe them. -- George Orwell


More information about the Digitalmars-d mailing list