Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sun May 26 04:31:28 PDT 2013


On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
>> I have noted from the beginning that these large alphabets 
>> have to be encoded to
>> two bytes, so it is not a true constant-width encoding if you 
>> are mixing one of
>> those languages into a single-byte encoded string.  But this 
>> "variable length"
>> encoding is so much simpler than UTF-8, there's no comparison.
>
> If it's one byte sometimes, or two bytes sometimes, it's 
> variable length. You overlook that I've had to deal with this. 
> It isn't "simpler", there's actually more work to write code 
> that adapts to one or two byte encodings.
It is variable length, but with the advantage that only strings 
containing a few Asian languages are variable-length, whereas in 
UTF-8 every string containing non-English text is 
variable-length.  It may be more work to write library code to 
handle my encoding, but efficiency and ease of use are paramount.
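
To illustrate the practical difference, here's a quick sketch 
(the function names are made up and it assumes valid input): in a 
single-byte encoding, finding the nth character of a 
single-language string is a plain array index, while in UTF-8 you 
have to walk the string and skip continuation bytes.
----------------------------------
#include <stddef.h>

/* Single-byte encoding: the nth character of a single-language
   string is just the nth byte, so lookup is constant time. */
char nth_char_single_byte(const char *s, size_t n) {
    return s[n];
}

/* UTF-8: the nth character can only be found by walking the string
   and skipping continuation bytes (those of the form 10xxxxxx). */
const char *nth_char_utf8(const char *s, size_t n) {
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {  /* ASCII or lead byte */
            if (n == 0)
                return s;      /* start of the nth character */
            n--;
        }
    }
    return NULL;               /* fewer than n+1 characters */
}
----------------------------------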

>> So let's see: first you say that my scheme has to be variable 
>> length because I
>> am using two bytes to handle these languages,
>
> Well, it *is* variable length or you have to disregard Chinese. 
> You cannot have it both ways. Code to deal with two bytes is 
> significantly different than code to deal with one. That means 
> you've got a conditional in your generic code - that isn't 
> going to be faster than the conditional for UTF-8.
Hah, I have explicitly said several times that I'd use a two-byte 
encoding for Chinese, and I have already acknowledged that such a 
predominantly single-byte encoding is still variable-length.  The 
problem is that _you_ try to have it both ways: first you claimed 
it is variable-length because I support Chinese that way, then 
you claimed I don't support Chinese.

Yes, there will be conditionals, just as there are several 
conditionals in Phobos depending on whether a language supports 
uppercase or not.  The question is whether the conditionals for a 
single-byte encoding will execute faster than decoding every 
UTF-8 character.  This is a matter of engineering judgement; I 
see no reason to think decoding every UTF-8 character would be 
faster.
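
To make the comparison concrete, here's a rough sketch (my own 
illustration, not a spec; it assumes valid input and that the 
header already says whether a run uses one- or two-byte 
characters):
----------------------------------
#include <stddef.h>

/* Proposed encoding: the width of every character in a run is known
   from the header, so counting characters needs no per-byte branching. */
size_t count_chars_fixed(size_t byte_len, int two_byte_script) {
    return two_byte_script ? byte_len / 2 : byte_len;
}

/* UTF-8: every character's lead byte must be classified before you
   can even step to the next character. */
size_t count_chars_utf8(const unsigned char *s, size_t len) {
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char c = s[i];
        if (c < 0x80)       i += 1;   /* ASCII */
        else if (c < 0xE0)  i += 2;   /* 2-byte sequence */
        else if (c < 0xF0)  i += 3;   /* 3-byte sequence */
        else                i += 4;   /* 4-byte sequence */
        n++;
    }
    return n;
}
----------------------------------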

>> then you claim I don't handle
>> these languages.  This kind of blatant contradiction within 
>> two posts can only
>> be called... trolling!
>
> You gave some vague handwaving about it, and then dismissed it 
> as irrelevant, along with more handwaving about what to do with 
> text that has embedded words in multiple languages.
If it was mere "vague handwaving," how did you know I planned to 
use two bytes to encode Chinese?  I'm not sure why you're 
continuing along this contradictory path.

I didn't "handwave" about multi-language strings, I gave specific 
ideas about how they might be implemented.  I'm not claiming to 
have a bullet-proof and detailed single-byte encoding spec, just 
spitballing some ideas on how to do it better than the abominable 
UTF-8.

> Worse, there are going to be more than 256 of these encodings - 
> you can't even have a byte to specify them. Remember, Unicode 
> has approximately 256,000 characters in it. How many code pages 
> is that?
There are 72 modern scripts in Unicode 6.1, 28 ancient scripts, 
and maybe another 50 symbolic sets: roughly 150 in all, which 
leaves room in a single identifying byte for another 100 or so 
new scripts.  Maybe you are so worried about future-proofing that 
you'd use two bytes to signify the alphabet, but I wouldn't.  I 
think it's more likely that we'll ditch scripts than add them. ;) 
Most of those symbol sets should not be in UCS.

> I was being kind saying you were trolling, as otherwise I'd be 
> saying your scheme was, to be blunt, absurd.
I think it's absurd to use a self-synchronizing text encoding 
from 20 years ago, one that is really only useful when streaming 
text, which nobody does today.  There may have been a time when 
ASCII compatibility was paramount, when nobody cared about 
internationalization and almost all libraries only took ASCII 
input: that is not the case today.

> I'll be the first to admit that a lot of great ideas have been 
> initially dismissed by the experts as absurd. If you really 
> believe in this, I recommend that you write it up as a real 
> article, taking care to fill in all the handwaving with 
> something specific, and include some benchmarks to prove your 
> performance claims. Post your article on reddit, stackoverflow, 
> hackernews, etc., and look for fertile ground for it. I'm sorry 
> you're not finding fertile ground here (so far, nobody has 
> agreed with any of your points), and this is the wrong place 
> for such proposals anyway, as D is simply not going to switch 
> over to it.
Let me admit in return that I might be completely wrong about my 
single-byte encoding representing a step forward from UTF-8.  
While this discussion has produced no argument that I'm wrong, 
it's possible we've all missed something salient, some 
deal-breaker.  As I said before, I'm not proposing that D "switch 
over."  I was simply asking people who know, or at the very least 
use, UTF-8 more than most, as a result of employing one of the 
few languages with Unicode support baked in, why they think UTF-8 
is a good idea.

I was hoping for a technical discussion on the merits before I 
went ahead and implemented this single-byte encoding.  Since 
nobody has been able to point out a reason why my encoding 
wouldn't be much better than UTF-8, I see no reason not to go 
forward with my implementation.  I may write something up after 
implementation: most people don't care about ideas, only results, 
to the point where almost nobody can reason about ideas at all.

> Remember, extraordinary claims require extraordinary evidence, 
> not handwaving and assumptions disguised as bold assertions.
I don't think my claims are extraordinary or backed by 
"handwaving and assumptions."  Some people can reason about such 
possible encodings, even in the incomplete form I've sketched 
out, without having implemented them, if they know what they're 
doing.

On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
> On 5/25/2013 2:51 PM, Walter Bright wrote:
>> On 5/25/2013 12:51 PM, Joakim wrote:
>>> For a multi-language string encoding, the header would
>>> contain a single byte for every language used in the string, 
>>> along with multiple
>>> index bytes to signify the start and finish of every run of 
>>> single-language
>>> characters in the string. So, a list of languages and a list 
>>> of pure
>>> single-language substrings.
>>
>> Please implement the simple C function strstr() with this 
>> simple scheme, and
>> post it here.
>>
>> http://www.digitalmars.com/rtl/string.html#strstr
>
> I'll go first. Here's a simple UTF-8 version in C. It's not the 
> fastest way to do it, but at least it is correct:
> ----------------------------------
> char *strstr(const char *s1,const char *s2) {
>     size_t len1 = strlen(s1);
>     size_t len2 = strlen(s2);
>     if (!len2)
>         return (char *) s1;
>     char c2 = *s2;
>     while (len2 <= len1) {
>         if (c2 == *s1)
>             if (memcmp(s2,s1,len2) == 0)
>                 return (char *) s1;
>         s1++;
>         len1--;
>     }
>     return NULL;
> }
There is no question that a UTF-8 implementation of strstr can be 
simpler to write, in C or D, for multi-language strings that 
include Korean/Chinese/Japanese.  But while the strstr 
implementation for my encoding would contain more conditionals 
and lines of code, it would be far more efficient.  For instance, 
because the header tells you where all the single-language 
substrings are, you can potentially rule out searching vast 
swathes of the string, because those runs either aren't in the 
right language or aren't long enough to hold the string you're 
searching for.
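
Here's a rough sketch of the kind of thing I mean (the struct 
layout and function names below are placeholders, not a finished 
spec, and it only handles a single-language needle):
----------------------------------
#include <string.h>
#include <stddef.h>

typedef struct {
    unsigned char lang;   /* language/script id for this run */
    unsigned char width;  /* bytes per character in this run: 1 or 2 */
    size_t start;         /* byte offset of the run in the payload */
    size_t len;           /* byte length of the run */
} Run;

typedef struct {
    const unsigned char *bytes;  /* payload, header already parsed out */
    const Run *runs;
    size_t nruns;
} MLString;

/* Find a single-language needle inside a multi-language haystack,
   skipping every run whose language can't match or that is too short
   to contain the needle. */
const unsigned char *ml_strstr(const MLString *hay, const MLString *needle) {
    if (needle->nruns != 1)
        return NULL;                      /* sketch: one-run needles only */
    const Run *nr = &needle->runs[0];
    for (size_t r = 0; r < hay->nruns; r++) {
        const Run *hr = &hay->runs[r];
        if (hr->lang != nr->lang || hr->len < nr->len)
            continue;                     /* whole run ruled out by header */
        const unsigned char *p = hay->bytes + hr->start;
        for (size_t i = 0; i + nr->len <= hr->len; i += hr->width) {
            if (memcmp(p + i, needle->bytes + nr->start, nr->len) == 0)
                return p + i;
        }
    }
    return NULL;
}
----------------------------------
The point is that the language/length check rules out an entire 
run before a single byte of it is read.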

Even if you're searching a single-language string, which won't 
have those speedups, your naive implementation checks every byte 
of the UTF-8 string, including continuation bytes, to see if it 
might match the first byte of the search string, even though no 
continuation byte can ever match.  You can avoid this by 
partially decoding the lead bytes of UTF-8 characters and 
skipping over continuation bytes, as I mentioned earlier in this 
thread, but then you've added more lines of code to your pretty 
yet simple function and added decoding overhead to every 
iteration of the while loop.
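
For reference, here's a sketch of that variant (assuming the 
haystack is valid UTF-8): the search only restarts at character 
boundaries, but it pays for classifying a lead byte on every 
iteration of the loop.
----------------------------------
#include <string.h>
#include <stddef.h>

/* Same brute-force search as above, but it steps over whole UTF-8
   characters instead of single bytes, since a match can only begin
   at a character boundary. */
char *strstr_utf8_skip(const char *s1, const char *s2) {
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *)s1;
    char c2 = *s2;
    while (len2 <= len1) {
        if (c2 == *s1 && memcmp(s2, s1, len2) == 0)
            return (char *)s1;
        unsigned char c = (unsigned char)*s1;  /* decode just the width */
        size_t step = (c < 0x80) ? 1
                    : (c < 0xE0) ? 2
                    : (c < 0xF0) ? 3 : 4;
        if (step > len1)
            break;                             /* truncated sequence */
        s1 += step;
        len1 -= step;
    }
    return NULL;
}
----------------------------------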

My single-byte encoding has none of these problems; in fact, it's 
much faster and uses less memory for the same function, while 
providing additional speedups, from the header, that are not 
available to UTF-8.

Finally, being able to write simple yet inefficient functions 
like this is not the test of a good encoding: strstr is a library 
function, and making library developers' lives easier is a low 
priority for any good format.  The primary goals are ease of use 
for library consumers, i.e. app developers, and the speed and 
efficiency of the code.  Your implementation trades away speed 
and efficiency for the convenience of the library developer.  
That is not a good tradeoff.

Perhaps it was a good trade 20 years ago when everyone rolled 
their own code and nobody bothered waiting for those floppy disks 
to arrive with expensive library code.  It is not a good trade 
today.

