Why UTF-8/16 character encodings?

Joakim joakim at airpost.net
Fri May 24 13:45:35 PDT 2013


On Friday, 24 May 2013 at 20:37:58 UTC, Joakim wrote:
>> 3. Even if I have a string that is 99% ASCII then I have to 
>> pay extra bytes for every character just because 1% wasn't 
>> ASCII. With UTF-8, I only pay the extra bytes when needed.
> I don't understand what you mean here.  If your string has a 
> thousand non-ASCII characters, the UTF-8 version will have one 
> or two thousand more characters, ie 1 or 2 KB more.  My format 
> would add a couple bytes in the header for each non-ASCII 
> language character used, that's it.  It's a clear win for my 
> format.
Sorry, I was a bit imprecise.  Here's what I meant to write:

I don't understand what you mean here.  If your string has a
thousand non-ASCII characters, the UTF-8 version will have one
or two thousand more bytes, i.e. 1 or 2 KB more.  My format
would add a couple bytes in the header for each non-ASCII
language used, that's it.  It's a clear win for my format.
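
To make the arithmetic concrete, here is a rough sketch in Python. The UTF-8 byte counts are real. The header-based format is only modelled, under the assumption that it costs 2 header bytes per non-ASCII language and one byte per character; the names HEADER_BYTES_PER_LANGUAGE and header_format_extra_bytes are made up for illustration and are not part of any actual specification.

```python
def utf8_extra_bytes(text: str) -> int:
    """Bytes UTF-8 uses beyond one byte per code point."""
    return len(text.encode("utf-8")) - len(text)


# Assumption: "a couple bytes" in the header means 2 bytes per language,
# and every character in the body then fits in a single byte.
HEADER_BYTES_PER_LANGUAGE = 2


def header_format_extra_bytes(languages_used: int) -> int:
    """Overhead of the hypothetical header-based format."""
    return HEADER_BYTES_PER_LANGUAGE * languages_used


# A thousand non-ASCII characters from a single language each.
cyrillic = "\u044f" * 1000   # U+044F takes 2 bytes in UTF-8
hiragana = "\u3042" * 1000   # U+3042 takes 3 bytes in UTF-8

print(utf8_extra_bytes(cyrillic))      # extra bytes for the Cyrillic string
print(utf8_extra_bytes(hiragana))      # extra bytes for the hiragana string
print(header_format_extra_bytes(1))    # header overhead for one language
```

Under these assumptions the UTF-8 overhead grows with the number of non-ASCII characters, while the header overhead depends only on how many languages appear in the string.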

