Why UTF-8/16 character encodings?
Joakim
joakim at airpost.net
Sat May 25 12:51:42 PDT 2013
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
> You can map a codepage to a subset of UCS :)
> That's what they do internally anyway.
> If I take you right, you propose to define a string as a
> header that denotes a set of windows in code space? I still
> fail to see how that would scale; see below.
Something like that. For a multi-language string encoding, the
header would contain a single byte for every language used in
the string, along with index bytes marking the start and end of
every run of single-language characters. So: a list of
languages and a list of pure single-language substrings. This
is just off the top of my head; I'm not suggesting it's
definitive.
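To make that concrete, here's a rough D sketch of the layout I
have in mind. All the names here are mine and hypothetical, and
I've used full offsets where a real format might squeeze the
indices into one or two bytes; it's just the "list of languages
plus list of runs" idea spelled out:

// One run of single-language characters in the payload.
struct Run
{
    ubyte lang;  // index into HeaderString.languages
    uint start;  // byte offset where the run begins
    uint end;    // byte offset one past where the run ends
}

// A multi-language string: the header (languages + runs)
// followed by the single-byte-encoded characters themselves.
struct HeaderString
{
    ubyte[] languages; // one byte per language used in the string
    Run[] runs;        // the pure single-language substrings
    ubyte[] payload;   // the encoded characters
}

The other sketches later in this post build on these two types.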
> Mm... strictly speaking (let's turn that argument backwards):
> what are the algorithms that require slicing, say [5..$], of
> a string without ever looking at it left to right, searching
> it, etc.?
I don't know; I was just pointing out that the claims of easy
slicing with UTF-8 are wrong. But a single-byte encoding would
also be much faster to scan: as I noted above, no decoding is
necessary, and processing single bytes will always be faster
than processing multi-byte sequences, even without decoding.
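To illustrate the slicing point, here's a small, self-contained
D example. Slicing a UTF-8 string at an arbitrary byte index
can split a multi-byte sequence, while a single-byte encoding
can be sliced anywhere:

import std.stdio : writeln;
import std.utf : validate, UTFException;

void main()
{
    string s = "héllo";    // 'é' is two bytes in UTF-8
    auto tail = s[2 .. $]; // byte 2 is the middle of 'é'
    try
    {
        validate(tail);    // throws on the mangled slice
    }
    catch (UTFException e)
    {
        writeln("s[2 .. $] split a code point");
    }

    // A single-byte encoding has no such boundaries: every
    // byte index is a character boundary, so any slice is
    // well-formed.
    immutable ubyte[] latin1 = [0x68, 0xE9, 0x6C, 0x6C, 0x6F];
    auto ok = latin1[2 .. $]; // "llo" in Latin-1, always valid
}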
> What would it look like? Or how would the processing go?
I detailed it a bit above. As I mentioned earlier in this
thread, functions like toUpper would execute much faster
because you could skip over substrings in scripts that have no
uppercase, whereas with UTF-8 you have to scan them anyway.
>> long before you mentioned this Unicode compression
>> scheme.
>
> It does inline headers, or rather tags, that hop between
> fixed char windows. It's not random access, nor does it claim
> to be.
I wasn't criticizing it, just saying that it seems to be
superficially similar to my scheme. :)
>> version of my single-byte encoding scheme! You do raise a
>> good point:
>> the only reason why we're likely using such a bad encoding in
>> UTF-8 is
>> that nobody else wants to tackle this hairy problem.
>
> Yup, where have you been, say, almost 10 years ago? :)
I was in grad school, avoiding writing my thesis. :) I'd never
have thought I'd be discussing Unicode today; I didn't even
know what it was back then.
>> Not necessarily. But that is actually one of the advantages
>> of single-byte encodings, as I have noted above. toUpper is
>> a NOP for a single-byte-encoded string in an Asian script;
>> you can't do that with a UTF-8 string.
>
> But you have to check what encoding it's in, and given that
> not all codepages are that simple to uppercase, some generic
> algorithm is required.
You have to check the language, but my point is that you can
look at the header and know that toUpper has nothing to do for
a single-byte-encoded string in an Asian script that has no
uppercase characters. With UTF-8, you have to decode the entire
string to find that out.
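Here's a minimal sketch of that shortcut in D. scriptHasCase is
an assumed lookup table mapping my hypothetical language bytes
to "does this script have case distinctions"; the point is only
that one pass over the tiny header replaces a full decode of
the payload:

import std.algorithm.searching : any;

// Assumed case table: true if the (hypothetical) language byte
// names a script with case distinctions; caseless scripts like
// Chinese, Japanese or Thai stay false.
immutable bool[256] scriptHasCase = () {
    bool[256] t;
    t[0x00] = true; // e.g. Latin
    t[0x01] = true; // e.g. Cyrillic
    return t;
}();

// Inspect only the header's language list, never the payload.
// If no script in the string has case, toUpper is a no-op for
// the whole string.
bool toUpperIsNop(const(ubyte)[] headerLanguages)
{
    return !headerLanguages.any!(lang => scriptHasCase[lang]);
}

void main()
{
    assert(toUpperIsNop([0x02, 0x03]));  // all caseless scripts
    assert(!toUpperIsNop([0x00, 0x02])); // a Latin run present
}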
>> They may seem superficially similar but they're not. For
>> example, from
>> the beginning, I have suggested a more complex header that can
>> enable
>> multi-language strings, as one possible solution. I don't
>> think code
>> pages provided that.
>
> The problem is how you would define an uppercase algorithm
> for a multilingual string with 3 distinct 256-character
> codespaces (windows)? I bet it won't be pretty.
How is it done now? It isn't pretty with UTF-8 either, as some
scripts have uppercase characters and others don't. The version
of toUpper for my encoding would be similar, but it would do
less work, because it wouldn't have to be invoked for every
character in the string.
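Roughly like this, reusing the HeaderString, Run and
scriptHasCase sketches from earlier in this post. upperByte
stands in for whatever per-language uppercase table a real
implementation would use; the point is that caseless runs are
skipped wholesale instead of being examined character by
character:

// Hypothetical per-language uppercase for one byte. A real
// implementation would index one 256-byte table per cased
// language; only an ASCII-style Latin window (lang 0x00) is
// filled in here.
ubyte upperByte(ubyte lang, ubyte c)
{
    if (lang == 0x00 && c >= 'a' && c <= 'z')
        return cast(ubyte)(c - ('a' - 'A'));
    return c; // everything else left alone in this sketch
}

// Uppercase a HeaderString run by run. Runs in caseless
// scripts are skipped without touching the payload at all;
// only the cased runs are walked byte by byte.
void toUpperInPlace(ref HeaderString s)
{
    foreach (run; s.runs)
    {
        if (!scriptHasCase[run.lang])
            continue; // e.g. a Chinese run: nothing to do

        foreach (i; run.start .. run.end)
            s.payload[i] = upperByte(run.lang, s.payload[i]);
    }
}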
> I still don't see how your solution scales to beyond 256
> different codepoints per string (= multiple pages/parts of UCS
> ;) ).
I assume you're talking about the Chinese, Korean, etc.
alphabets? I mentioned those to Walter earlier: they would need
a two-byte encoding. There's no way around that, but they would
still be easier to deal with than UTF-8, because of the header.
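To sketch what I mean, here's a hypothetical extension of the
Run type from my first example, with a made-up charWidth field
tagging a run as one or two bytes per character. Indexing
within such a run stays a constant-time multiply, where UTF-8
needs a left-to-right decode to find the same character:

// Hypothetical extension of Run for large scripts: each run
// carries a fixed character width.
struct WideRun
{
    ubyte lang;
    ubyte charWidth; // 1 for small alphabets, 2 for e.g. Chinese
    uint start;      // byte offset of the run in the payload
    uint end;
}

// The n-th character of a run, found with one multiply rather
// than a variable-length decode.
const(ubyte)[] charAt(const(ubyte)[] payload, WideRun r, size_t n)
{
    immutable off = r.start + n * r.charWidth;
    return payload[off .. off + r.charWidth];
}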