Why UTF-8/16 character encodings?
Declan
oyscal at 163.com
Sun May 26 05:22:52 PDT 2013
On Sunday, 26 May 2013 at 11:31:31 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 21:32:55 UTC, Walter Bright wrote:
>>> I have noted from the beginning that these large alphabets
>>> have to be encoded to
>>> two bytes, so it is not a true constant-width encoding if you
>>> are mixing one of
>>> those languages into a single-byte encoded string. But this
>>> "variable length"
>>> encoding is so much simpler than UTF-8, there's no comparison.
>>
>> If it's one byte sometimes, or two bytes sometimes, it's
>> variable length. You overlook that I've had to deal with this.
>> It isn't "simpler", there's actually more work to write code
>> that adapts to one or two byte encodings.
> It is variable length, with the advantage that only strings
> containing a few Asian languages are variable-length, as
> opposed to UTF-8 having every non-English language string be
> variable-length. It may be more work to write library code to
> handle my encoding, perhaps, but efficiency and ease of use are
> paramount.
>
>>> So let's see: first you say that my scheme has to be variable
>>> length because I
>>> am using two bytes to handle these languages,
>>
>> Well, it *is* variable length or you have to disregard
>> Chinese. You cannot have it both ways. Code to deal with two
>> bytes is significantly different than code to deal with one.
>> That means you've got a conditional in your generic code -
>> that isn't going to be faster than the conditional for UTF-8.
> Hah, I have explicitly said several times that I'd use a
> two-byte encoding for Chinese and I already acknowledged that
> such a predominantly single-byte encoding is still
> variable-length. The problem is that _you_ try to have it both
> ways: first you claimed it is variable-length because I support
> Chinese that way, then you claimed I don't support Chinese.
>
> Yes, there will be conditionals, just as there are several
> conditionals in phobos depending on whether a language supports
> uppercase or not. The question is whether the conditionals for
> single-byte encoding will execute faster than decoding every
> UTF-8 character. This is a matter of engineering judgement; I
> see no reason why you think decoding every UTF-8 character is
> faster.
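For reference, a minimal sketch of the per-code-point branching a
UTF-8 decoder performs. This is not code from either poster,
validation is omitted, and it is only meant to make the decoding cost
being argued about concrete:
----------------------------------
#include <stddef.h>
#include <stdint.h>

/* Decode one code point starting at s; assumes well-formed UTF-8. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {                       /* 1 byte: ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {      /* 2-byte sequence */
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    } else if ((s[0] & 0xF0) == 0xE0) {      /* 3-byte sequence */
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    } else {                                 /* 4-byte sequence */
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
}
----------------------------------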
>
>>> then you claim I don't handle
>>> these languages. This kind of blatant contradiction within
>>> two posts can only
>>> be called... trolling!
>>
>> You gave some vague handwaving about it, and then dismissed it
>> as irrelevant, along with more handwaving about what to do
>> with text that has embedded words in multiple languages.
> If it was mere "vague handwaving," how did you know I planned
> to use two bytes to encode Chinese? I'm not sure why you're
> continuing along this contradictory path.
>
> I didn't "handwave" about multi-language strings, I gave
> specific ideas about how they might be implemented. I'm not
> claiming to have a bullet-proof and detailed single-byte
> encoding spec, just spitballing some ideas on how to do it
> better than the abominable UTF-8.
>
>> Worse, there are going to be more than 256 of these encodings
>> - you can't even have a byte to specify them. Remember,
>> Unicode has approximately 256,000 characters in it. How many
>> code pages is that?
> There are 72 modern scripts in Unicode 6.1, 28 ancient scripts,
> maybe another 50 symbolic sets. That leaves space for another
> 100 or so new scripts. Maybe you are so worried about
> future-proofing that you'd use two bytes to signify the
> alphabet, but I wouldn't. I think it's more likely that we'll
> ditch scripts than add them. ;) Most of those symbol sets
> should not be in UCS.
>
>> I was being kind saying you were trolling, as otherwise I'd be
>> saying your scheme was, to be blunt, absurd.
> I think it's absurd to use a self-synchronizing text encoding
> from 20 years ago, that is really only useful when streaming
> text, which nobody does today. There may have been a time when
> ASCII compatibility was paramount, when nobody cared about
> internationalization and almost all libraries only took ASCII
> input: that is not the case today.
>
>> I'll be the first to admit that a lot of great ideas have been
>> initially dismissed by the experts as absurd. If you really
>> believe in this, I recommend that you write it up as a real
>> article, taking care to fill in all the handwaving with
>> something specific, and include some benchmarks to prove your
>> performance claims. Post your article on reddit,
>> stackoverflow, hackernews, etc., and look for fertile ground
>> for it. I'm sorry you're not finding fertile ground here (so
>> far, nobody has agreed with any of your points), and this is
>> the wrong place for such proposals anyway, as D is simply not
>> going to switch over to it.
> Let me admit in return that I might be completely wrong about
> my single-byte encoding representing a step forward from UTF-8.
> While this discussion has produced no argument that I'm wrong,
> it's possible we've all missed something salient, some
> deal-breaker. As I said before, I'm not proposing that D
> "switch over." I was simply asking people who know or at the
> very least use UTF-8 more than most, as a result of employing
> one of the few languages with Unicode support baked in, why
> they think UTF-8 is a good idea.
>
> I was hoping for a technical discussion on the merits, before I
> went ahead and implemented this single-byte encoding. Since
> nobody has been able to point out a reason for why my encoding
> wouldn't be much better than UTF-8, I see no reason not to go
> forward with my implementation. I may write something up after
> implementation: most people don't care about ideas, only
> results, to the point where almost nobody can reason at all
> about ideas.
>
>> Remember, extraordinary claims require extraordinary evidence,
>> not handwaving and assumptions disguised as bold assertions.
> I don't think my claims are extraordinary or backed by
> "handwaving and assumptions." Some people can reason about
> such possible encodings, even in the incomplete form I've
> sketched out, without having implemented them, if they know
> what they're doing.
>
> On Saturday, 25 May 2013 at 22:01:13 UTC, Walter Bright wrote:
>> On 5/25/2013 2:51 PM, Walter Bright wrote:
>>> On 5/25/2013 12:51 PM, Joakim wrote:
>>>> For a multi-language string encoding, the header would
>>>> contain a single byte for every language used in the string,
>>>> along with multiple
>>>> index bytes to signify the start and finish of every run of
>>>> single-language
>>>> characters in the string. So, a list of languages and a list
>>>> of pure
>>>> single-language substrings.
>>>
>>> Please implement the simple C function strstr() with this
>>> simple scheme, and
>>> post it here.
>>>
>>> http://www.digitalmars.com/rtl/string.html#strstr
>>
>> I'll go first. Here's a simple UTF-8 version in C. It's not
>> the fastest way to do it, but at least it is correct:
>> ----------------------------------
>> char *strstr(const char *s1, const char *s2) {
>>     size_t len1 = strlen(s1);
>>     size_t len2 = strlen(s2);
>>     if (!len2)
>>         return (char *) s1;
>>     char c2 = *s2;
>>     while (len2 <= len1) {
>>         if (c2 == *s1)
>>             if (memcmp(s2, s1, len2) == 0)
>>                 return (char *) s1;
>>         s1++;
>>         len1--;
>>     }
>>     return NULL;
>> }
> There is no question that a UTF-8 implementation of strstr can
> be simpler to write in C and D for multi-language strings that
> include Korean/Chinese/Japanese. But while the strstr
> implementation for my encoding would contain more conditionals
> and lines of code, it would be far more efficient. For
> instance, because you know where all the language substrings
> are from the header, you can potentially rule out searching
> vast swathes of the string, because they don't contain the same
> languages or lengths as the string you're searching for.
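To make the quoted header idea concrete, here is a purely
hypothetical sketch; the proposal never fixes a layout, so every name
and field below is an illustrative assumption rather than part of the
actual scheme:
----------------------------------
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header entry: one per single-language run. */
typedef struct {
    uint8_t language;   /* code-page/script id of the run     */
    size_t  start;      /* byte offset of the run in the text */
    size_t  length;     /* run length in bytes                */
} LangRun;

typedef struct {
    size_t         run_count;   /* number of runs in the header      */
    const LangRun *runs;        /* runs, in string order             */
    const char    *bytes;       /* the encoded characters themselves */
} TaggedString;

/* Skip whole runs whose language or length rules out a match, and
 * search byte-wise only inside plausible runs.  Multi-language
 * needles are omitted to keep the sketch short. */
const char *tagged_find(const TaggedString *hay, const TaggedString *ndl)
{
    if (ndl->run_count != 1)
        return NULL;
    uint8_t     lang  = ndl->runs[0].language;
    size_t      nlen  = ndl->runs[0].length;
    const char *nbyte = ndl->bytes + ndl->runs[0].start;
    for (size_t i = 0; i < hay->run_count; i++) {
        const LangRun *r = &hay->runs[i];
        if (r->language != lang || r->length < nlen)
            continue;                       /* rule out the whole run */
        const char *p   = hay->bytes + r->start;
        const char *end = p + r->length - nlen;
        for (; p <= end; p++)
            if (memcmp(p, nbyte, nlen) == 0)
                return p;
    }
    return NULL;
}
----------------------------------
The run check is where the claimed speedup would come from: a run in
the wrong language, or one shorter than the needle, is skipped without
touching its bytes.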
>
> Even if you're searching a single-language string, which won't
> have those speedups, your naive implementation checks every
> byte, even continuation bytes, in UTF-8 to see if they might
> match the first letter of the search string, even though no
> continuation byte will match. You can avoid this by partially
> decoding the leading bytes of UTF-8 characters and skipping
> over continuation bytes, as I've mentioned earlier in this
> thread, but you've then added more lines of code to your pretty
> yet simple function and added decoding overhead to every
> iteration of the while loop.
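Again purely for illustration, and not code from either poster: one
way the "skip continuation bytes" tweak might be bolted onto the
simple version above, assuming well-formed UTF-8 input:
----------------------------------
#include <stddef.h>
#include <string.h>

char *strstr_utf8(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1);
    size_t len2 = strlen(s2);
    if (!len2)
        return (char *) s1;
    char c2 = *s2;
    while (len2 <= len1) {
        /* Bytes of the form 10xxxxxx only ever continue a code point,
         * so they can never begin a match and are skipped outright. */
        if (((unsigned char)*s1 & 0xC0) != 0x80
            && c2 == *s1
            && memcmp(s2, s1, len2) == 0)
            return (char *) s1;
        s1++;
        len1--;
    }
    return NULL;
}
----------------------------------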
>
> My single-byte encoding has none of these problems; in fact,
> it's much faster and uses less memory for the same function,
> while providing additional speedups, from the header, that are
> not available to UTF-8.
>
> Finally, being able to write simple yet inefficient functions
> like this is not the test of a good encoding, as strstr is a
> library function, and making library developers' lives easier
> is a low priority for any good format. The primary goals are
> ease of use for library consumers, ie app developers, and speed
> and efficiency of the code. You are trading away the latter two
> for the former with this implementation. That is not a good
> tradeoff.
>
> Perhaps it was a good trade 20 years ago when everyone rolled
> their own code and nobody bothered waiting for those floppy
> disks to arrive with expensive library code. It is not a good
> trade today.
Okay, I give up, you win. I'm starting to think your name might mean "joking"?