Why UTF-8/16 character encodings?

Walter Bright newshound2 at digitalmars.com
Sat May 25 14:32:52 PDT 2013


On 5/25/2013 1:03 PM, Joakim wrote:
> On Saturday, 25 May 2013 at 19:30:25 UTC, Walter Bright wrote:
>> On the other hand, Joakim even admits his single-byte encoding is variable
>> length, as otherwise he simply dismisses the rarely used (!) Chinese,
>> Japanese, and Korean languages, as well as any text that contains words from
>> more than one language.
> I have noted from the beginning that these large alphabets have to be encoded
> in two bytes, so it is not a true constant-width encoding if you are mixing
> one of those languages into a single-byte encoded string.  But this "variable
> length" encoding is so much simpler than UTF-8 that there's no comparison.

If it's one byte sometimes, or two bytes sometimes, it's variable length. You 
overlook that I've had to deal with this. It isn't "simpler"; there's actually 
more work to write code that adapts to one- or two-byte encodings.
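
To make that concrete, here is a minimal sketch in D of what decoding such a
scheme might look like - assuming, since no concrete format has been given,
that a lead byte with the high bit set starts a two-byte character; the name
and the convention here are purely illustrative:

    // Hypothetical one-or-two-byte scheme (convention assumed, since none
    // has been spelled out): a byte below 0x80 is a character by itself;
    // a byte at or above 0x80 combines with the byte that follows.
    // Bounds and validity checks omitted for brevity.
    dchar decodeHypothetical(const(ubyte)[] s, ref size_t i)
    {
        immutable ubyte b = s[i++];
        if (b < 0x80)                 // one-byte character
            return cast(dchar) b;
        // two-byte character - the branch you cannot avoid
        return cast(dchar)(((b & 0x7F) << 8) | s[i++]);
    }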


>> I suspect he's trolling us, and quite successfully.
> Ha, I wondered who would pull out this insult, quite surprised to see it's
> Walter.  It seems to be the trend on the internet to accuse anybody you
> disagree with of trolling; I am honestly surprised to see Walter stoop so low.
> Considering I'm the only one making any cogent arguments here, perhaps I should
> wonder if you're all trolling me. ;)
>
> On Saturday, 25 May 2013 at 19:35:42 UTC, Walter Bright wrote:
>> I suspect the Chinese, Koreans, and Japanese would take exception to being
>> called irrelevant.
> Irrelevant only because they are a small subset of the UCS.  I have noted that
> they would also be handled by a two-byte encoding.
>
>> Good luck with your scheme that can't handle languages written by billions of
>> people!
> So let's see: first you say that my scheme has to be variable length because I
> am using two bytes to handle these languages,

Well, it *is* variable length, or you have to disregard Chinese. You cannot have 
it both ways. Code to deal with two bytes is significantly different from code 
to deal with one. That means you've got a conditional in your generic code - 
and that isn't going to be faster than the conditional for UTF-8.
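
For comparison, a sketch of the same branch in a UTF-8 decoder (one- and
two-byte sequences only, no error handling; this is not std.utf.decode, just
an illustration) - the shape of the test on the lead byte is the same:

    // UTF-8, one- and two-byte sequences only, validation left out.
    dchar decodeUtf8Sketch(const(ubyte)[] s, ref size_t i)
    {
        immutable ubyte b = s[i++];
        if (b < 0x80)                 // ASCII, one byte
            return cast(dchar) b;
        // two-byte sequence: 110xxxxx 10xxxxxx
        assert((b & 0xE0) == 0xC0, "longer sequences elided in this sketch");
        return cast(dchar)(((b & 0x1F) << 6) | (s[i++] & 0x3F));
    }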


> then you claim I don't handle
> these languages.  This kind of blatant contradiction within two posts can only
> be called... trolling!

You gave some vague handwaving about it, and then dismissed it as irrelevant, 
along with more handwaving about what to do with text that embeds words from 
multiple languages.

Worse, there are going to be more than 256 of these encodings - a single byte 
can't even specify which one is in use. Remember, Unicode has approximately 
256,000 characters in it. How many code pages is that?
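
Taking that rough figure at face value, the arithmetic looks like this:

    import std.stdio : writeln;

    void main()
    {
        enum characters = 256_000;  // figure from above, taken at face value
        enum perPage    = 256;      // what a one-byte page selector can address
        enum pages      = (characters + perPage - 1) / perPage;
        writeln(pages, " code pages"); // 1000 - far more than a byte can name
    }

Even counting only the characters Unicode has actually assigned so far, you
still end up well past 256 pages.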

I was being kind in saying you were trolling, as otherwise I'd be saying your 
scheme is, to be blunt, absurd.

---------------------------------------

I'll be the first to admit that a lot of great ideas have been initially 
dismissed by the experts as absurd. If you really believe in this, I recommend 
that you write it up as a real article, taking care to fill in all the 
handwaving with something specific, and include some benchmarks to prove your 
performance claims. Post your article on reddit, stackoverflow, hackernews, 
etc., and look for fertile ground for it. I'm sorry you're not finding fertile 
ground here (so far, nobody has agreed with any of your points), and this is the 
wrong place for such proposals anyway, as D is simply not going to switch over 
to it.

Remember, extraordinary claims require extraordinary evidence, not handwaving 
and assumptions disguised as bold assertions.


