Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sun May 26 04:31:28 PDT 2013
On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
>> I have noted from the beginning that these large alphabets
>> have to be encoded to
>> two bytes, so it is not a true constant-width encoding if you
>> are mixing one of
>> those languages into a single-byte encoded string. But this
>> "variable length"
>> encoding is so much simpler than UTF-8, there's no comparison.
>
> If it's one byte sometimes, or two bytes sometimes, it's
> variable length. You overlook that I've had to deal with this.
> It isn't "simpler", there's actually more work to write code
> that adapts to one or two byte encodings.
It is variable length, with the advantage that only strings
containing a few Asian languages are variable-length, as opposed
to UTF-8 having every non-English language string be
variable-length. It may be more work to write library code to
handle my encoding, but efficiency and ease of use are
paramount.
>> So let's see: first you say that my scheme has to be variable
>> length because I
>> am using two bytes to handle these languages,
>
> Well, it *is* variable length or you have to disregard Chinese.
> You cannot have it both ways. Code to deal with two bytes is
> significantly different than code to deal with one. That means
> you've got a conditional in your generic code - that isn't
> going to be faster than the conditional for UTF-8.
Hah, I have explicitly said several times that I'd use a two-byte
encoding for Chinese and I already acknowledged that such a
predominantly single-byte encoding is still variable-length. The
problem is that _you_ try to have it both ways: first you claimed
it is variable-length because I support Chinese that way, then
you claimed I don't support Chinese.
Yes, there will be conditionals, just as there are several
conditionals in Phobos depending on whether a language supports
uppercase or not. The question is whether the conditionals for
a single-byte encoding will execute faster than decoding every
UTF-8 character. This is a matter of engineering judgement; I
see no reason why you think decoding every UTF-8 character is
faster.
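To make the comparison concrete, here is a rough sketch in C of
the per-character work in each case. The single-byte side assumes
a hypothetical per-run code page taken from the header; all the
names below are mine, not part of any finished spec.
----------------------------------
/* UTF-8: every iteration must inspect the lead byte and branch on
   how many continuation bytes follow (assumes well-formed input,
   validation omitted). */
unsigned decode_utf8(const unsigned char **p)
{
    unsigned c = *(*p)++;
    if (c < 0x80)                       /* 1 byte: ASCII */
        return c;
    if (c < 0xE0)                       /* 2 bytes */
        return ((c & 0x1F) << 6) | (*(*p)++ & 0x3F);
    if (c < 0xF0) {                     /* 3 bytes */
        c = (c & 0x0F) << 12;
        c |= (*(*p)++ & 0x3F) << 6;
        return c | (*(*p)++ & 0x3F);
    }
    c = (c & 0x07) << 18;               /* 4 bytes */
    c |= (*(*p)++ & 0x3F) << 12;
    c |= (*(*p)++ & 0x3F) << 6;
    return c | (*(*p)++ & 0x3F);
}

/* Proposed scheme: the branch depends only on the run's code page,
   known from the (hypothetical) header, not on the data itself, so
   it is trivially predictable and can be hoisted out of a loop over
   a run. */
unsigned decode_proposed(const unsigned char **p, int two_byte_run)
{
    unsigned c = *(*p)++;
    if (two_byte_run)                   /* e.g. a CJK run */
        c = (c << 8) | *(*p)++;
    return c;
}
----------------------------------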
>> then you claim I don't handle
>> these languages. This kind of blatant contradiction within
>> two posts can only
>> be called... trolling!
>
> You gave some vague handwaving about it, and then dismissed it
> as irrelevant, along with more handwaving about what to do with
> text that has embedded words in multiple languages.
If it was mere "vague handwaving," how did you know I planned to
use two bytes to encode Chinese? I'm not sure why you're
continuing along this contradictory path.
I didn't "handwave" about multi-language strings; I gave specific
ideas about how they might be implemented. I'm not claiming to
have a bullet-proof and detailed single-byte encoding spec, just
spitballing some ideas on how to do it better than the abominable
UTF-8.
> Worse, there are going to be more than 256 of these encodings -
> you can't even have a byte to specify them. Remember, Unicode
> has approximately 256,000 characters in it. How many code pages
> is that?
There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
maybe another 50 symbolic sets. That leaves space for another
100 or so new scripts. Maybe you are so worried about
future-proofing that you'd use two bytes to signify the alphabet,
but I wouldn't. I think it's more likely that we'll ditch
scripts than add them. ;) Most of those symbol sets should not be
in UCS.
> I was being kind saying you were trolling, as otherwise I'd be
> saying your scheme was, to be blunt, absurd.
I think it's absurd to use a self-synchronizing text encoding
from 20 years ago, one that is really only useful when streaming
text, which nobody does today. There may have been a time when
ASCII compatibility was paramount, when nobody cared about
internationalization and almost all libraries only took ASCII
input: that is not the case today.
> I'll be the first to admit that a lot of great ideas have been
> initially dismissed by the experts as absurd. If you really
> believe in this, I recommend that you write it up as a real
> article, taking care to fill in all the handwaving with
> something specific, and include some benchmarks to prove your
> performance claims. Post your article on reddit, stackoverflow,
> hackernews, etc., and look for fertile ground for it. I'm sorry
> you're not finding fertile ground here (so far, nobody has
> agreed with any of your points), and this is the wrong place
> for such proposals anyway, as D is simply not going to switch
> over to it.
Let me admit in return that I might be completely wrong about my
single-byte encoding representing a step forward from UTF-8.
While this discussion has produced no argument that I'm wrong, it's
possible we've all missed something salient, some deal-breaker.
As I said before, I'm not proposing that D "switch over." I was
simply asking people who know or at the very least use UTF-8 more
than most, as a result of employing one of the few languages with
Unicode support baked in, why they think UTF-8 is a good idea.
I was hoping for a technical discussion on the merits, before I
went ahead and implemented this single-byte encoding. Since
nobody has been able to point out a reason why my encoding
wouldn't be much better than UTF-8, I see no reason not to go
forward with my implementation. I may write something up after
implementation: most people don't care about ideas, only results,
to the point where almost nobody can reason at all about ideas.
> Remember, extraordinary claims require extraordinary evidence,
> not handwaving and assumptions disguised as bold assertions.
I don't think my claims are extraordinary or backed by
"handwaving and assumptions." Some people can reason about such
possible encodings, even in the incomplete form I've sketched
out, without having implemented them, if they know what they're
doing.
On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
> On 5/25/2013 2:51 PM, Walter Bright wrote:
>> On 5/25/2013 12:51 PM, Joakim wrote:
>>> For a multi-language string encoding, the header would
>>> contain a single byte for every language used in the string,
>>> along with multiple
>>> index bytes to signify the start and finish of every run of
>>> single-language
>>> characters in the string. So, a list of languages and a list
>>> of pure
>>> single-language substrings.
>>
>> Please implement the simple C function strstr() with this
>> simple scheme, and
>> post it here.
>>
>> http://www.digitalmars.com/rtl/string.html#strstr
>
> I'll go first. Here's a simple UTF-8 version in C. It's not the
> fastest way to do it, but at least it is correct:
> ----------------------------------
> char *strstr(const char *s1,const char *s2) {
>     size_t len1 = strlen(s1);
>     size_t len2 = strlen(s2);
>     if (!len2)
>         return (char *) s1;
>     char c2 = *s2;
>     while (len2 <= len1) {
>         if (c2 == *s1)
>             if (memcmp(s2,s1,len2) == 0)
>                 return (char *) s1;
>         s1++;
>         len1--;
>     }
>     return NULL;
> }
There is no question that a UTF-8 implementation of strstr can be
simpler to write in C and D for multi-language strings that
include Korean/Chinese/Japanese. But while the strstr
implementation for my encoding would contain more conditionals
and lines of code, it would be far more efficient. For instance,
because the header tells you where all the single-language
substrings are, you can potentially rule out searching vast
swathes of the string: any run that doesn't contain the search
string's language, or is too short to hold it, can be skipped
outright.
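For example, here is a rough sketch of that skipping, under a
header layout I'm inventing purely for illustration (a run table
with a language byte, character width, and byte offsets per
single-language run). It only handles a single-language needle,
and none of these names come from an actual spec.
----------------------------------
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of the proposed header: one entry per
   single-language run, giving its language byte, its character width
   (1, or 2 for e.g. CJK), and its byte offsets in the data. */
typedef struct {
    unsigned char lang;    /* language/code page of this run      */
    unsigned char width;   /* bytes per character in this run     */
    size_t start, end;     /* run occupies data[start .. end)     */
} Run;

typedef struct {
    const Run *runs;       /* run table decoded from the header   */
    size_t nruns;
    const char *data;      /* raw character data                  */
} MLString;

/* Search for a single-language needle of nlen bytes: any run in a
   different language, or too short to hold the needle, is ruled out
   without touching its bytes; matching runs are scanned one
   character at a time. */
const char *ml_strstr(const MLString *hay, const char *needle,
                      size_t nlen, unsigned char needle_lang)
{
    for (size_t i = 0; i < hay->nruns; i++) {
        const Run *r = &hay->runs[i];
        if (r->lang != needle_lang || r->end - r->start < nlen)
            continue;                     /* skip the whole run   */
        for (size_t j = r->start; j + nlen <= r->end; j += r->width)
            if (memcmp(hay->data + j, needle, nlen) == 0)
                return hay->data + j;
    }
    return NULL;
}
----------------------------------
The point is that a run in the wrong language is rejected with one
comparison, no matter how long it is.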
Even if you're searching a single-language string, which won't
have those speedups, your naive implementation checks every byte
of the UTF-8 haystack, even continuation bytes, to see if it
might match the first byte of the search string, even though no
continuation byte ever will. You can avoid this by partially
decoding the lead bytes of UTF-8 characters and skipping over
continuation bytes, as I've mentioned earlier in this thread, but
you've then added more lines of code to your pretty, simple
function and added decoding overhead to every iteration of the
while loop.
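Concretely, that modified loop would look something like this, a
sketch of the change I mean rather than a drop-in replacement:
continuation bytes are 10xxxxxx, so one extra mask-and-compare per
iteration is enough to skip them.
----------------------------------
#include <string.h>

/* strstr over UTF-8 that only attempts a match at bytes that can
   start a character: continuation bytes (10xxxxxx) are skipped, at
   the cost of an extra test on every iteration. */
char *strstr_utf8(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (((unsigned char) *s1 & 0xC0) != 0x80  /* not a continuation byte */
            && c2 == *s1
            && memcmp(s2, s1, len2) == 0)
            return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}
----------------------------------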
My single-byte encoding has none of these problems; in fact, it's
much faster and uses less memory for the same function, while
providing additional speedups, from the header, that are not
available to UTF-8.
Finally, being able to write simple yet inefficient functions
like this is not the test of a good encoding, as strstr is a
library function, and making library developers' lives easier is
a low priority for any good format. The primary goals are ease
of use for library consumers, i.e. app developers, and the speed
and efficiency of the code. You are trading away the latter two
for the former with this implementation. That is not a good
tradeoff.
Perhaps it was a good trade 20 years ago when everyone rolled
their own code and nobody bothered waiting for those floppy disks
to arrive with expensive library code. It is not a good trade
today.