Why UTF-8/16 character encodings?

Dmitry Olshansky dmitry.olsh at gmail.com
Sat May 25 12:58:22 PDT 2013


25-May-2013 23:51, Joakim wrote:
> On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
>> You can map a codepage to a subset of UCS :)
>> That's what they do internally anyway.
>> If I understand you right, you propose to define a string as a header that
>> denotes a set of windows in code space? I still fail to see how that
>> would scale; see below.
> Something like that.  For a multi-language string encoding, the header
> would contain a single byte for every language used in the string, along
> with multiple index bytes to signify the start and finish of every run
> of single-language characters in the string. So, a list of languages and
> a list of pure single-language substrings.  This is just off the top of
> my head, I'm not suggesting it is definitive.
>
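
To make the quoted scheme concrete, here is a rough Python sketch of a string stored as a header of single-language runs plus a payload of one byte per character. Everything in it is an assumption for illustration: the codepage table, the (tag, start, end) run layout and the function names are hypothetical, not something specified in this thread.

```python
# Hypothetical table: one-byte language tag -> single-byte codepage.
CODEPAGES = {0: "latin-1", 1: "cp1251", 2: "cp1253"}

def can_encode(ch, tag):
    """True if the codepage behind this tag has a byte for ch."""
    try:
        ch.encode(CODEPAGES[tag])
        return True
    except UnicodeEncodeError:
        return False

def encode_runs(text):
    """Return (runs, payload): runs is the header, a list of
    [tag, start, end) offsets into payload; payload holds one byte
    per character, each encoded in its run's codepage."""
    runs, payload = [], bytearray()
    for ch in text:
        if runs and can_encode(ch, runs[-1][0]):
            runs[-1][2] += 1          # extend the current run
        else:
            tag = next((t for t in CODEPAGES if can_encode(ch, t)), None)
            if tag is None:
                raise ValueError(f"no codepage covers {ch!r}")
            runs.append([tag, len(payload), len(payload) + 1])
        payload += ch.encode(CODEPAGES[runs[-1][0]])
    return runs, bytes(payload)

def char_at(runs, payload, i):
    """Decode character i; the byte alone is meaningless without the header."""
    for tag, start, end in runs:
        if start <= i < end:
            return payload[i:i + 1].decode(CODEPAGES[tag])
    raise IndexError(i)

runs, payload = encode_runs("привет κόσμε")
print(runs)                        # [[1, 0, 7], [2, 7, 12]]
print(char_at(runs, payload, 7))   # κ
```

Note that even in this toy version no byte of the payload can be interpreted without walking the header first, and slicing or concatenating strings means rewriting the run list.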

Runs away in horror :) It's a mess even before you get to the details.

Another point about sometimes using a 2-byte encoding: welcome to the
nice world of big-endian vs. little-endian byte order, i.e. the very
trap UTF-16 has stepped into.
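
A small Python illustration of that trap, using only the standard codecs. The sample character is an arbitrary choice; any code point whose two bytes differ would do.

```python
ch = "\u0434"  # CYRILLIC SMALL LETTER DE

le = ch.encode("utf-16-le")   # low byte first
be = ch.encode("utf-16-be")   # high byte first
print(le.hex(), be.hex())     # 3404 0434

# Reading little-endian bytes as big-endian silently yields a different,
# perfectly valid code point instead of an error.
misread = le.decode("utf-16-be")
print(hex(ord(misread)))      # 0x3404

# UTF-8 is defined as a byte sequence, so there is no byte order to get wrong.
print(ch.encode("utf-8").hex())  # d0b4
```

Any 2-byte unit in a mixed encoding would need the same kind of agreement on byte order (or a marker such as UTF-16's BOM) between writer and reader.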

-- 
Dmitry Olshansky

