Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sat May 25 12:02:42 PDT 2013
On Saturday, 25 May 2013 at 18:09:26 UTC, Diggory wrote:
> On Saturday, 25 May 2013 at 08:07:42 UTC, Joakim wrote:
>> On Saturday, 25 May 2013 at 07:48:05 UTC, Diggory wrote:
>>> I think you are a little confused about what unicode actually
>>> is... Unicode has nothing to do with code pages and nobody
>>> uses code pages any more except for compatibility with legacy
>>> applications (with good reason!).
>> Incorrect.
>>
>> "Unicode is an effort to include all characters from previous
>> code pages into a single character enumeration that can be
>> used with a number of encoding schemes... In practice the
>> various Unicode character set encodings have simply been
>> assigned their own code page numbers, and all the other code
>> pages have been technically redefined as encodings for various
>> subsets of Unicode."
>> http://en.wikipedia.org/wiki/Code_page#Relationship_to_Unicode
>>
>
> That confirms exactly what I just said...
No, that directly _contradicts_ what you said about Unicode
having "nothing to do with code pages." All UCS did is take a
bunch of existing code pages and standardize them into one
massive character set. For example, ISCII was a pre-existing
single-byte encoding, and Unicode "largely preserves the ISCII
layout within each block."
http://en.wikipedia.org/wiki/ISCII
A code page is just a table of mappings; UCS is simply a much
larger, standardized table of such mappings.
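To make that concrete, here is a toy D sketch (the table is
hypothetical, loosely following the Latin-1 layout) of what a
single-byte code page amounts to:

import std.stdio;

void main()
{
    // A legacy code page is essentially a 256-entry table mapping
    // byte values to characters; UCS is the same idea with one
    // much larger, standardized table.
    dchar[256] page;                // toy code page, Latin-1-like
    foreach (i; 0 .. 256)
        page[i] = cast(dchar) i;    // ISO-8859-1 maps 1:1 onto U+0000..U+00FF

    ubyte b = 0xE9;                 // 'é' in Latin-1
    writeln(page[b]);               // prints: é
}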
>>> You said that phobos converts UTF-8 strings to UTF-32 before
>>> operating on them but that's not true. As it iterates over
>>> UTF-8 strings it iterates over dchars rather than chars, but
>>> that's not in any way inefficient so I don't really see the
>>> problem.
>> And what's a dchar? Let's check:
>>
>> dchar : unsigned 32 bit UTF-32
>> http://dlang.org/type.html
>>
>> Of course that's inefficient, you are translating your whole
>> encoding over to a 32-bit encoding every time you need to
>> process it. Walter as much as said so up above.
>
> Given that all the machine registers are at least 32-bits
> already it doesn't make the slightest difference. The only
> additional operations on top of ascii are when it's a
> multi-byte character, and even then it's some simple bit
> manipulation which is as fast as any variable width encoding is
> going to get.
I see you've quietly abandoned your claim that phobos doesn't
convert UTF-8 to UTF-32 internally. Perhaps converting to UTF-32
is "as fast as any variable width encoding is going to get," but
my claim is that single-byte encodings will be faster.
> The only alternatives to a variable width encoding I can see
> are:
> - Single code page per string
> This is completely useless because now you can't concatenate
> strings of different code pages.
I wouldn't be so quick to ditch this. There is a real argument
to be made that strings in different languages are sufficiently
different that there should be no multi-language strings. Is
this the best route? I'm not sure, but I certainly wouldn't
dismiss it out of hand.
> - Multiple code pages per string
> This just makes everything overly complicated and is far slower
> to decode what the actual character is than UTF-8.
I disagree; this would still be far faster to decode than
UTF-8, particularly if you designed your header right.
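Purely as a hypothetical sketch of such a header (the Run and
MultiPageString layout is made up for illustration, not a
worked-out proposal), something like this keeps decoding at one
table lookup per byte:

// One entry per run of characters that share a code page.
struct Run
{
    ushort codePage;   // which 256-entry table this run uses
    uint   length;     // how many single-byte characters follow
}

// Header up front, then one byte per character in the payload.
struct MultiPageString
{
    Run[]   runs;
    ubyte[] bytes;
}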
> - String with escape sequences to change code page
> Can no longer access characters in the middle or end of the
> string, you have to parse the entire string every time which
> completely negates the benefit of a fixed width encoding.
I didn't think of this possibility, but you may be right that
it's sub-optimal.
>>> Also your complaint that UTF-8 reserves the short characters
>>> for the english alphabet is not really relevant - the
>>> characters with longer encodings tend to be rarer (such as
>>> special symbols) or carry more information (such as chinese
>>> characters where the same sentence takes only about 1/3 the
>>> number of characters).
>> The vast majority of non-english alphabets in UCS can be
>> encoded in a single byte. It is your exceptions that are not
>> relevant.
>
> Well obviously... That's like saying "if you know what the
> exact contents of a file are going to be anyway you can
> compress it to a single byte!"
>
> ie. It's possible to devise an encoding which will encode any
> given string to an arbitrarily small size. It's still
> completely useless because you'd have to know the string in
> advance...
No, it's not the same at all. The contents of an
arbitrary-length file cannot be compressed to a single byte; you
would have collisions galore. But since most non-English
alphabets have fewer than 256 characters, each of them can be
uniquely encoded with a single byte per character, with the
header determining which language's code page to use. I don't
understand your analogy at all.
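To spell out what I have in mind, here is a hypothetical sketch
in D (the one-byte header and the page tables are made up for
illustration): the first byte picks the language's code page,
and every byte after it is exactly one character.

// Hypothetical single-byte scheme, not an existing standard.
dstring decode(const(ubyte)[] data, const dchar[256][] pages)
{
    auto page = pages[data[0]];    // header byte selects the code page
    dchar[] result;
    foreach (b; data[1 .. $])
        result ~= page[b];         // fixed width: one lookup per byte
    return result.idup;
}

The point is that there is no per-character branching on lead
bytes; the length in characters is just the byte length minus
the header.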
> - A useful encoding has to be able to handle every unicode
> character
> - As I've shown the only space-efficient way to do this is
> using a variable length encoding like UTF-8
You haven't shown this.
> - Given the frequency distribution of unicode characters, UTF-8
> does a pretty good job at encoding higher frequency characters
> in fewer bytes.
No, it does a very bad job of this. Every non-ASCII character
takes at least two bytes to encode in UTF-8, whereas my
single-byte scheme would encode every alphabet with fewer than
256 characters using one byte per character.
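You can check the byte counts yourself in D, since .length on a
string counts UTF-8 code units (bytes), not characters:

import std.stdio;

void main()
{
    writeln("e".length);   // 1 byte  (ASCII)
    writeln("é".length);   // 2 bytes (Latin)
    writeln("ц".length);   // 2 bytes (Cyrillic)
    writeln("अ".length);   // 3 bytes (Devanagari)
}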
> - Yes you COULD encode non-english alphabets in a single byte
> but doing so would be inefficient because it would mean the
> more frequently used characters take more bytes to encode.
Not sure what you mean by this.