Why UTF-8/16 character encodings?

John Colvin john.loughran.colvin at gmail.com
Sun May 26 05:49:41 PDT 2013


On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
>>> I have noted from the beginning that these large alphabets 
>>> have to be encoded to
>>> two bytes, so it is not a true constant-width encoding if you 
>>> are mixing one of
>>> those languages into a single-byte encoded string.  But this 
>>> "variable length"
>>> encoding is so much simpler than UTF-8, there's no comparison.
>>
>> If it's one byte sometimes, or two bytes sometimes, it's 
>> variable length. You overlook that I've had to deal with this. 
>> It isn't "simpler"; there's actually more work to write code 
>> that adapts to one- or two-byte encodings.
> It is variable length, with the advantage that only strings 
> containing a few Asian languages are variable-length, as 
> opposed to UTF-8 making every non-English string 
> variable-length.  It may be more work to write library code to 
> handle my encoding, but efficiency and ease of use are 
> paramount.
>
>>> So let's see: first you say that my scheme has to be variable 
>>> length because I
>>> am using two bytes to handle these languages,
>>
>> Well, it *is* variable length or you have to disregard 
>> Chinese. You cannot have it both ways. Code to deal with two 
>> bytes is significantly different than code to deal with one. 
>> That means you've got a conditional in your generic code - 
>> that isn't going to be faster than the conditional for UTF-8.
> Hah, I have explicitly said several times that I'd use a 
> two-byte encoding for Chinese and I already acknowledged that 
> such a predominantly single-byte encoding is still 
> variable-length.  The problem is that _you_ try to have it both 
> ways: first you claimed it is variable-length because I support 
> Chinese that way, then you claimed I don't support Chinese.
>
> Yes, there will be conditionals, just as there are several 
> conditionals in Phobos depending on whether a language supports 
> uppercase or not.  The question is whether the conditionals for 
> single-byte encoding will execute faster than decoding every 
> UTF-8 character.  This is a matter of engineering judgement; I 
> see no reason why you think decoding every UTF-8 character is 
> faster.
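>
> To sketch what I mean (the language codes and header layout here 
> are placeholders, nothing is specified yet), counting the 
> characters in a string under the two schemes might look roughly 
> like this:
> ----------------------------------
> #include <stddef.h>
> #include <stdint.h>
>
> /* Placeholder assumption: language codes >= 0x80 denote the
>    two-byte scripts (Chinese/Japanese/Korean). */
> static int is_two_byte_lang(uint8_t lang) { return lang >= 0x80; }
>
> /* Proposed scheme: one branch per string (or per run), after
>    which every character has the same known width. */
> static size_t count_chars_proposed(size_t len, uint8_t lang) {
>     return is_two_byte_lang(lang) ? len / 2 : len;
> }
>
> /* UTF-8: every byte must be examined to find the character
>    boundaries, since widths vary from 1 to 4 bytes. */
> static size_t count_chars_utf8(const uint8_t *s, size_t len) {
>     size_t n = 0;
>     for (size_t i = 0; i < len; i++)
>         if ((s[i] & 0xC0) != 0x80)   /* skip continuation bytes */
>             n++;
>     return n;
> }
> ----------------------------------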
>
>>> then you claim I don't handle
>>> these languages.  This kind of blatant contradiction within 
>>> two posts can only
>>> be called... trolling!
>>
>> You gave some vague handwaving about it, and then dismissed it 
>> as irrelevant, along with more handwaving about what to do 
>> with text that has embedded words in multiple languages.
> If it was mere "vague handwaving," how did you know I planned 
> to use two bytes to encode Chinese?  I'm not sure why you're 
> continuing along this contradictory path.
>
> I didn't "handwave" about multi-language strings, I gave 
> specific ideas about how they might be implemented.  I'm not 
> claiming to have a bullet-proof and detailed single-byte 
> encoding spec, just spitballing some ideas on how to do it 
> better than the abominable UTF-8.
>
>> Worse, there are going to be more than 256 of these encodings 
>> - you can't even have a byte to specify them. Remember, 
>> Unicode has approximately 256,000 characters in it. How many 
>> code pages is that?
> There are 72 modern scripts in Unicode 6.1, 28 ancient scripts, 
> maybe another 50 symbolic sets.  That leaves space for another 
> 100 or so new scripts.  Maybe you are so worried about 
> future-proofing that you'd use two bytes to signify the 
> alphabet, but I wouldn't.  I think it's more likely that we'll 
> ditch scripts than add them. ;) Most of those symbol sets 
> should not be in UCS.
>
>> I was being kind saying you were trolling, as otherwise I'd be 
>> saying your scheme was, to be blunt, absurd.
> I think it's absurd to use a self-synchronizing text encoding 
> from 20 years ago, one that is really only useful when streaming 
> text, which nobody does today.  There may have been a time when 
> ASCII compatibility was paramount, when nobody cared about 
> internationalization and almost all libraries only took ASCII 
> input: that is not the case today.
>
>> I'll be the first to admit that a lot of great ideas have been 
>> initially dismissed by the experts as absurd. If you really 
>> believe in this, I recommend that you write it up as a real 
>> article, taking care to fill in all the handwaving with 
>> something specific, and include some benchmarks to prove your 
>> performance claims. Post your article on reddit, 
>> stackoverflow, hackernews, etc., and look for fertile ground 
>> for it. I'm sorry you're not finding fertile ground here (so 
>> far, nobody has agreed with any of your points), and this is 
>> the wrong place for such proposals anyway, as D is simply not 
>> going to switch over to it.
> Let me admit in return that I might be completely wrong about 
> my single-byte encoding representing a step forward from UTF-8.
> While this discussion has produced no argument that I'm wrong, 
> it's possible we've all missed something salient, some 
> deal-breaker.  As I said before, I'm not proposing that D 
> "switch over."  I was simply asking people who know, or at the 
> very least use, UTF-8 more than most, as a result of employing 
> one of the few languages with Unicode support baked in, why 
> they think UTF-8 is a good idea.
>
> I was hoping for a technical discussion on the merits, before I 
> went ahead and implemented this single-byte encoding.  Since 
> nobody has been able to point out a reason for why my encoding 
> wouldn't be much better than UTF-8, I see no reason not to go 
> forward with my implementation.  I may write something up after 
> implementation: most people don't care about ideas, only 
> results, to the point where almost nobody can reason at all 
> about ideas.
>
>> Remember, extraordinary claims require extraordinary evidence, 
>> not handwaving and assumptions disguised as bold assertions.
> I don't think my claims are extraordinary or backed by 
> "handwaving and assumptions."  Some people can reason about 
> such possible encodings, even in the incomplete form I've 
> sketched out, without having implemented them, if they know 
> what they're doing.
>
> On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
>> On 5/25/2013 2:51 PM, Walter Bright wrote:
>>> On 5/25/2013 12:51 PM, Joakim wrote:
>>>> For a multi-language string encoding, the header would
>>>> contain a single byte for every language used in the string, 
>>>> along with multiple
>>>> index bytes to signify the start and finish of every run of 
>>>> single-language
>>>> characters in the string. So, a list of languages and a list 
>>>> of pure
>>>> single-language substrings.
>>>
>>> Please implement the simple C function strstr() with this 
>>> simple scheme, and
>>> post it here.
>>>
>>> http://www.digitalmars.com/rtl/string.html#strstr
>>
>> I'll go first. Here's a simple UTF-8 version in C. It's not 
>> the fastest way to do it, but at least it is correct:
>> ----------------------------------
>> #include <string.h>   /* strlen, memcmp */
>>
>> char *strstr(const char *s1, const char *s2) {
>>    size_t len1 = strlen(s1);
>>    size_t len2 = strlen(s2);
>>    if (!len2)                /* empty needle matches at the start */
>>        return (char *) s1;
>>    char c2 = *s2;            /* first byte of the needle */
>>    while (len2 <= len1) {
>>        if (c2 == *s1)
>>            if (memcmp(s2, s1, len2) == 0)
>>                return (char *) s1;
>>        s1++;
>>        len1--;
>>    }
>>    return NULL;
>> }
> There is no question that a UTF-8 implementation of strstr can 
> be simpler to write in C and D for multi-language strings that 
> include Korean/Chinese/Japanese.  But while the strstr 
> implementation for my encoding would contain more conditionals 
> and lines of code, it would be far more efficient.  For 
> instance, because you know where all the language substrings 
> are from the header, you can potentially rule out searching 
> vast swathes of the string, because they don't contain the same 
> languages or lengths as the string you're searching for.
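>
> To make that concrete, here is a rough sketch of the header 
> layout I have in mind; the field names and widths are 
> placeholders, nothing is final:
> ----------------------------------
> #include <stddef.h>
> #include <stdint.h>
>
> /* One run of characters in a single language/script. */
> typedef struct {
>     uint8_t  lang;    /* index into langs[] below */
>     uint32_t start;   /* byte offset where the run begins */
>     uint32_t end;     /* byte offset just past the end of the run */
> } LangRun;
>
> /* A multi-language string: a list of languages, a list of pure
>    single-language runs, then the encoded characters. */
> typedef struct {
>     uint8_t  nlangs;  /* number of distinct languages used */
>     uint8_t *langs;   /* one byte per language */
>     uint32_t nruns;
>     LangRun *runs;    /* start/finish of every single-language run */
>     uint8_t *data;    /* the character data itself */
>     size_t   len;     /* length of data in bytes */
> } MLString;
> ----------------------------------
> A strstr over two such strings could compare the langs lists and 
> run lengths first and skip any run that can't possibly contain 
> the needle, before comparing a single byte of character data.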
>
> Even if you're searching a single-language string, which won't 
> have those speedups, your naive implementation checks every 
> byte, even continuation bytes, in UTF-8 to see if they might 
> match the first letter of the search string, even though no 
> continuation byte will match.  You can avoid this by partially 
> decoding the leading bytes of UTF-8 characters and skipping 
> over continuation bytes, as I've mentioned earlier in this 
> thread, but you've then added more lines of code to your pretty 
> yet simple function and added decoding overhead to every 
> iteration of the while loop.
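>
> For illustration, roughly what that modification looks like: the 
> same loop as yours, with one extra test per iteration so that 
> only code point boundaries are treated as candidate positions 
> (just a sketch, not tuned):
> ----------------------------------
> #include <string.h>
>
> char *strstr_utf8(const char *s1, const char *s2) {
>     size_t len1 = strlen(s1);
>     size_t len2 = strlen(s2);
>     if (!len2)
>         return (char *) s1;
>     char c2 = *s2;
>     while (len2 <= len1) {
>         /* Continuation bytes (10xxxxxx) can never be the first
>            byte of a valid UTF-8 needle, so skip them. */
>         if (((unsigned char) *s1 & 0xC0) != 0x80
>             && c2 == *s1
>             && memcmp(s2, s1, len2) == 0)
>             return (char *) s1;
>         s1++;
>         len1--;
>     }
>     return NULL;
> }
> ----------------------------------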
>
> My single-byte encoding has none of these problems; in fact, 
> it's much faster and uses less memory for the same function, 
> while providing additional speedups, from the header, that are 
> not available to UTF-8.
>
> Finally, being able to write simple yet inefficient functions 
> like this is not the test of a good encoding, as strstr is a 
> library function, and making library developers' lives easier 
> is a low priority for any good format.  The primary goals are 
> ease of use for library consumers, i.e. app developers, and 
> speed and efficiency of the code.  You are trading away the 
> latter two for the former with this implementation.  That is 
> not a good tradeoff.
>
> Perhaps it was a good trade 20 years ago when everyone rolled 
> their own code and nobody bothered waiting for those floppy 
> disks to arrive with expensive library code.  It is not a good 
> trade today.

I suggest you make an attempt at writing strstr and post it. Code 
speaks louder than words.

