Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Sun May 26 01:44:23 PDT 2013


On Saturday, 25 May 2013 at 18:56:42 UTC, Diggory wrote:
> "limited success of UTF-8"
>
> Becoming the de-facto standard encoding EVERYWHERE except for 
> windows which uses UTF-16 is hardly a failure...
So you admit that UTF-8 hasn't been used on the vast majority of 
computers since the inception of Unicode.  That's what I call 
limited success, thank you for agreeing with me. :)

> I really don't understand your hatred for UTF-8 - it's simple 
> to decode and encode, fast and space-efficient. Fixed width 
> encodings are not inherently fast, the only thing they are 
> faster at is if you want to randomly access the Nth character 
> instead of the Nth byte. In the rare cases that you need to do 
> a lot of this kind of random access there exists UTF-32...
Space-efficient?  Do you even understand what a single-byte 
encoding is?  Suffice it to say, a single-byte encoding beats 
UTF-8 on all these measures, not just one.
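To put a number on it, here's a quick check anyone can run in D; 
the single-byte figure assumes a hypothetical Cyrillic code page 
in my scheme, not anything that exists in Phobos:

import std.stdio;

void main()
{
    string s = "привет";   // 6 Cyrillic characters
    writeln(s.length);     // 12: UTF-8 spends 2 bytes on each
    // A single-byte Cyrillic encoding stores the same 6
    // characters in 6 bytes, plus a small per-string header.
}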

> Any fixed width encoding which can encode every unicode 
> character must use at least 3 bytes, and using 4 bytes is 
> probably going to be faster because of alignment, so I don't 
> see what the great improvement over UTF-32 is going to be.
Slaps head.  You don't need "at least 3 bytes" because you're 
packing language info in the header.  I don't think you even know 
what I'm talking about.
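To spell it out, here is roughly the layout I have in mind, 
sketched in D.  Every name is mine and purely illustrative, not 
a finished format:

struct Run
{
    ubyte lang;    // which <=256-character alphabet this run uses
    uint  start;   // byte offset of the run within the payload
}

struct SbString
{
    Run[]   header;   // empty or a single entry for most strings
    ubyte[] payload;  // one byte per character for small alphabets
}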

>> slicing does require decoding
> Nope.
Of course it does, at least partially.  There is no other way to 
know where the code points are.

>> I didn't mean that people are literally keeping code pages.  I 
>> meant that there's not much of a difference between code pages 
>> with 2 bytes per char and the language character sets in UCS.
>
> Unicode doesn't have "language character sets". The different 
> planes only exist for organisational purposes; they don't affect 
> how characters are encoded.
Nobody's talking about different planes.  I'm talking about all 
the different language character sets in this list:

http://en.wikipedia.org/wiki/List_of_Unicode_characters

>> ?!  It's okay because you deem it "coherent in its scheme?"  I 
>> deem headers much more coherent. :)
>
> Sure, if you change the word "coherent" to mean something 
> completely different... Coherent means that you store related 
> things together, i.e. everything that you need to decode a 
> character in the same place, not spread out between part of a 
> character and a header.
Coherent means that the organizational pieces fit together and 
make sense conceptually, not that everything is stored together.  
My point is that putting the language info in a header seems much 
more coherent to me than ramming that info into every character.

>> but I suspect substring search not requiring decoding is the 
>> exception for UTF-8 algorithms, not the rule.
> The only time you need to decode is when you need to do some 
> transformation that depends on the code point such as 
> converting case or identifying which character class a 
> particular character belongs to. Appending, slicing, copying, 
> searching, replacing, etc. basically all the most common text 
> operations can all be done without any encoding or decoding.
Slicing by byte, which is the only way to slice without decoding, 
is useless; I have to laugh that you even include it. :) All 
these basic operations can be done very fast, often faster than 
UTF-8, in a single-byte encoding.  Once you start talking code 
points, it's no contest: UTF-8 flat out loses.
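If anyone doubts that byte slicing is useless, here is what it 
does to a two-byte character in D; std.utf.validate is real 
Phobos, the rest is just a demonstration:

import std.stdio;
import std.utf : validate, UTFException;

void main()
{
    string s = "héllo";         // 'é' takes 2 bytes in UTF-8
    auto chopped = s[0 .. 2];   // slices through the middle of 'é'
    try
        validate(chopped);      // throws: invalid UTF-8 sequence
    catch (UTFException e)
        writeln("sliced mid-character: not valid UTF-8");
}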

On Saturday, 25 May 2013 at 19:42:41 UTC, Diggory wrote:
>> All a code page is is a table of mappings, UCS is just a much 
>> larger, standardized table of such mappings.
>
> UCS has nothing to do with code pages; it was designed as a 
> replacement for them. A code page is a strict subset of the 
> possible characters; UCS is the entire set of possible 
> characters.
"[I]t was designed as a replacement for them" by combining 
several of them into a master code page and removing 
redundancies.  Functionally, they are the same and historically 
they maintain the same layout in at least some cases.  To then 
say, UCS has "nothing to do with code pages" is just dense.

>> I see you've abandoned without note your claim that phobos 
>> doesn't convert UTF-8 to UTF-32 internally.  Perhaps 
>> converting to UTF-32 is "as fast as any variable width 
>> encoding is going to get" but my claim is that single-byte 
>> encodings will be faster.
>
> I haven't "abandoned my claim". It's a simple fact that phobos 
> does not convert UTF-8 strings to UTF-32 strings before it uses 
> them.
>
> i.e. the difference between this:
> import std.conv : to;
> string mystr = ...;
> dstring temp = mystr.to!dstring; // decodes the whole string up front
> for (size_t i = 0; i < temp.length; ++i)
>     process(temp[i]);
>
> and this:
> import std.utf : decode;
> string mystr = ...;
> size_t i = 0;
> while (i < mystr.length) {
>     dchar current = decode(mystr, i); // decodes one code point at a time
>     process(current);
> }
>
> And if you can't see why the latter example is far more 
> efficient I give up...
I take your point that phobos is often decoding char by char as 
it iterates through, but there are still functions in std.string 
that convert the entire string, as in your first example.  The 
point is that you are forced to decode everything to UTF-32, 
whether char by char or all at once.  Your latter example may be 
marginally more efficient, but it is only useful for functions 
that start from the beginning and walk the string in a single 
direction, which not all operations do.
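To show how pervasive the decoding is: in D, even plain range 
iteration over a string hands you decoded dchars, whether you 
asked for them or not:

import std.array : front;
import std.stdio;

void main()
{
    string s = "héllo";
    // Phobos range primitives auto-decode narrow strings: front
    // is a dchar, and every call to it decodes a code point.
    static assert(is(typeof(s.front) == dchar));
    writeln(s.front);   // 'h'
}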

>>> - Multiple code pages per string
>>> This just makes everything overly complicated and is far 
>>> slower to decode what the actual character is than UTF-8.
>> I disagree, this would still be far faster than UTF-8, 
>> particularly if you designed your header right.
>
> The cache misses alone caused by simply accessing the separate 
> headers would be a larger overhead than decoding UTF-8 which 
> takes a few assembly instructions and has perfect locality and 
> can be efficiently pipelined by the CPU.
Lol, you think a few potential cache misses are going to be 
slower than repeatedly decoding every single UTF-8 character, 
pipelined assembly or not? :D

> Then there's all the extra processing involved combining the 
> headers when you concatenate strings. Plus you lose the one 
> benefit a fixed width encoding has because random access is no 
> longer possible without first finding out which header controls 
> the location you want to access.
There would be a few arithmetic operations on substring indices 
when concatenating strings, hardly anything.
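Using the hypothetical layout I sketched earlier, concatenation 
is just an array append plus shifting the second header's 
offsets:

SbString cat(SbString a, SbString b)
{
    auto runs = a.header.dup;
    foreach (r; b.header)   // shift b's runs past a's payload
        runs ~= Run(r.lang, cast(uint)(r.start + a.payload.length));
    return SbString(runs, a.payload ~ b.payload);
}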

Random access is still not only possible, it is incredibly fast 
in most cases: you just have to check first if the header lists 
any two-byte encodings.  This can be done once and cached as a 
property of the string (set a boolean no_two_byte_encoding once 
and simply have the slice operator check it before going ahead), 
just as you could add a property to UTF-8 strings to allow quick 
random access if they happen to be pure ASCII.  The difference is 
that only strings that include the two-byte encoded 
Korean/Chinese/Japanese characters would require a bit more 
calculation for slicing in my scheme, whereas _every_ non-ASCII 
UTF-8 string requires full decoding to allow random access.  This 
is a clear win for my single-byte encoding, though maybe not the 
complete demolition of UTF-8 you were hoping for. ;)
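Continuing the earlier sketch, the fast path looks something 
like this; firstCjk and the scan stub are hypothetical 
placeholders:

enum ubyte firstCjk = 250;  // hypothetical: ids from here up are
                            // the two-byte CJK character sets

bool noTwoByteRuns(const SbString s)
{
    foreach (r; s.header)
        if (r.lang >= firstCjk)
            return false;
    return true;   // compute this once and cache it on the string
}

ubyte[] sliceChars(SbString s, size_t lo, size_t hi)
{
    if (noTwoByteRuns(s))
        return s.payload[lo .. hi];  // char index == byte index: O(1)
    // only strings with CJK runs fall through to a header-guided scan
    assert(0, "map character indices to byte offsets via the header");
}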

>> No, it's not the same at all.  The contents of an 
>> arbitrary-length file cannot be compressed to a single byte, 
>> you would have collisions galore.  But since most non-english 
>> alphabets are less than 256 characters, they can all be 
>> uniquely encoded in a single byte per character, with the 
>> header determining what language's code page to use.  I don't 
>> understand your analogy whatsoever.
>
> It's very simple - the more information you have about the type 
> of data you are compressing at the time of writing the 
> algorithm, the better compression ratio you can get, to the 
> point that if you know exactly what the file is going to 
> contain you can compress it to nothing. This is why you have 
> specialised compression algorithms for images, video, audio, 
> etc.
This may be mostly true in general, but your specific example of 
compressing down to a byte is nonsense.  For any arbitrarily long 
data, there are always limits to compression.  What any of this 
has to do with my single-byte encoding, I have no idea.

> It doesn't matter how few characters non-english alphabets have 
> - unless you know WHICH alphabet it is beforehand you can't 
> store it in a single byte. Since any given character could be 
> in any alphabet the best you can do is look at the 
> probabilities of different characters appearing and use shorter 
> representations for more common ones. (This is the basis for 
> all lossless compression) The english alphabet plus 0-9 and 
> basic punctuation are by far the most common characters used on 
> computers so it makes sense to use one byte for those and 
> multiple bytes for rarer characters.
How many times have I said that "you know WHICH alphabet it is 
beforehand" because that info is stored in the header?  That is 
why I specifically said, from my first post, that multi-language 
strings would have more complex headers, which I later pointed 
out could list all the different language substrings within a 
multi-language string.  Your silly exposition of how compression 
works makes me wonder if you understand anything about how a 
single-byte encoding would work.

Perhaps it made sense to use one byte for ASCII characters and 
relegate _every other language_ to multiple bytes two decades 
ago.  It doesn't make sense today.

>>> - As I've shown the only space-efficient way to do this is 
>>> using a variable length encoding like UTF-8
>> You haven't shown this.
> If you had thought through your suggestion of multiple code 
> pages per string you would see that I had.
You are not packaging and transmitting the code pages with the 
string, just as you do not ship the entire UCS with every UTF-8 
string.  A single-byte encoding is going to be more 
space-efficient for the vast majority of strings, everybody knows 
this.

>> No, it does a very bad job of this.  Every non-ASCII character 
>> takes at least two bytes to encode, whereas my single-byte 
>> encoding scheme would encode every alphabet with less than 256 
>> characters in a single byte.
>
> And strings with mixed characters would use lots of memory and 
> be extremely slow. Common when using proper names, quotes, 
> inline translations, graphical characters, etc. etc. Not to 
> mention the added complexity to actually implement the 
> algorithms.
Ah, you have finally stumbled across the path to a good argument, 
though I'm not sure how, given your seeming ignorance of how 
single-byte encodings work. :) There _is_ a degenerate case with 
my particular single-byte encoding (not the ones you list, which 
would still be faster and use less memory than UTF-8): strings 
that use many, if not all, character sets.  So the worst case 
scenario might be something like a string that had 100 
characters, every one from a different language.  In that case, I 
think it would still be smaller than the equivalent UTF-8 string, 
but not by much.
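Back-of-the-envelope, assuming two header bytes per run (a 
language byte plus a one-byte run length): 100 single-character 
runs cost 200 header bytes plus 100 payload bytes, 300 in total, 
against roughly 200-300 bytes of UTF-8 for the same mix of 
mostly two- and three-byte characters.  Close either way in this 
pathological case, and the header shrinks to almost nothing as 
soon as runs are longer than one character.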

There might be some complexity in implementing the algorithms, 
but on net, likely less than UTF-8, while being much more usable 
for most programmers.

On Saturday, 25 May 2013 at 22:41:59 UTC, Diggory wrote:
> 1) Take the byte at a particular offset in the string
> 2) If it is ASCII then we're done
> 3) Otherwise count the number of '1's at the start of the byte 
> - this is how many bytes make up the character (there's even an 
> ASM instruction to do this)
> 4) This first byte will look like '1110xxxx' for a 3 byte 
> character, '11110xxx' for a 4 byte character, etc.
> 5) All following bytes are of the form '10xxxxxx'
> 6) Now just concatenate all the 'x's together and add an offset 
> to get the code point
Not sure why you chose to write this basic UTF-8 stuff out, other 
than to bluster on without much use.
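For the record, the whole procedure is a few lines of D; this is 
the happy path only, with none of the validation a real decoder 
like std.utf.decode has to do:

dchar decodeOne(const(ubyte)[] s, ref size_t i)
{
    uint b = s[i++];
    if (b < 0x80)                  // step 2: ASCII, done
        return cast(dchar) b;
    int extra = b >= 0xF0 ? 3      // steps 3/4: the leading 1s give
              : b >= 0xE0 ? 2      // the count of trailing bytes
              : 1;
    uint c = b & (0x3F >> extra);  // keep the lead byte's payload bits
    foreach (_; 0 .. extra)        // step 5: '10xxxxxx' continuations
        c = (c << 6) | (s[i++] & 0x3F);
    return cast(dchar) c;          // step 6: the code point
}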

> Note that this is CONSTANT TIME, O(1) with minimal branching so 
> well suited to pipelining (after the initial byte the other 
> bytes can all be processed in parallel by the CPU) and only 
> sequential memory access so no cache misses, and zero 
> additional memory requirements
It is constant time _per character_.  You have to do it for 
_every_ non-ASCII character in your string, so the decoding adds 
up.

> Now compare your encoding:
> 1) Look up the offset in the header using binary search: O(log 
> N) lots of branching
It is difficult to reason about the header because it all 
depends on the number of languages used and how many substrings 
there are.  There are worst-case scenarios that could approach 
something like O(log n), but they are extremely unlikely in 
real-world use.  Most of the time, this lookup would be O(1).

> 2) Look up the code page ID in a massive array of code pages to 
> work out how many bytes per character
Hardly: this could be done by a simple lookup function that 
checks whether the language is one of the few alphabets that 
require two bytes.

> 3) Hope this array hasn't been paged out and is still in the 
> cache
> 4) Extract that many bytes from the string and combine them 
> into a number
Lol, I love how you think this is worth listing as a separate 
step for the few two-byte encodings, yet have no problem with 
doing this for every non-ASCII character in UTF-8.

> 5) Look up this new number in yet another large array specific 
> to the code page
Why?  The language byte and character byte uniquely specify the 
character, just like your Unicode code point above.  If you were 
simply re-encoding the UCS in a single-byte encoding, you would 
arrange your scheme so that the UCS code point could be 
generated trivially from those two bytes.
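Concretely, with made-up block bases (the shape of the thing, 
not real assignments):

dchar baseOf(ubyte lang)
{
    switch (lang)
    {
        case 1:  return 0x0400;   // say, the Cyrillic block
        case 2:  return 0x0370;   // say, the Greek block
        default: return 0;        // ASCII/Latin
    }
}

dchar toCodePoint(ubyte lang, ubyte ch)
{
    return cast(dchar)(baseOf(lang) + ch);  // two bytes in, one code point out
}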

> This is O(log N), has lots of branching so no pipelining (every 
> stage depends on the result of the stage before), lots of 
> random memory access so lots of cache misses, lots of 
> additional memory requirements to store all those tables, and 
> an algorithm that isn't even any easier to understand.
Wrong on practically every count, as detailed above.

> Plus every other algorithm to operate on it except for decoding 
> is insanely complicated.
They are still much _less_ complicated than UTF-8, that's the 
comparison that matters.

