Why UTF-8/16 character encodings?
Dmitry Olshansky
dmitry.olsh at gmail.com
Sat May 25 12:58:22 PDT 2013
25-May-2013 23:51, Joakim wrote:
> On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
>> You can map a codepage to a subset of UCS :)
>> That's what they do internally anyway.
>> If I understand you right, you propose to define a string as a header
>> that denotes a set of windows in code space? I still fail to see how
>> that would scale; see below.
> Something like that. For a multi-language string encoding, the header
> would contain a single byte for every language used in the string, along
> with multiple index bytes to signify the start and finish of every run
> of single-language characters in the string. So, a list of languages and
> a list of pure single-language substrings. This is just off the top of
> my head, I'm not suggesting it is definitive.
>
Runs away in horror :) It's a mess even before you get to the details.
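
To make the quoted proposal concrete, here is a minimal sketch of one possible layout. It assumes one byte per language id, one-byte start/end offsets for each run, and a single-byte code per character within a run. The name `encode_runs` and the field order are hypothetical; the proposal above does not pin any of this down.

```python
# Hypothetical layout for illustration only, not an existing format.
def encode_runs(runs):
    """Pack runs of (language_id, payload_bytes) behind a header.

    Assumed layout:
      [language count][language ids...][run count]
      [(language slot, start, end) per run][payload bytes...]
    """
    languages = []        # distinct language ids, in order of first use
    table = bytearray()   # three one-byte fields per run
    body = bytearray()    # concatenated single-byte character codes
    for lang, payload in runs:
        if lang not in languages:
            languages.append(lang)
        start = len(body)
        body += payload
        end = len(body)
        # One-byte offsets cannot point past position 255.
        if end > 255:
            raise ValueError("one-byte offsets cannot address this string")
        table += bytes([languages.index(lang), start, end])
    header = (bytes([len(languages)]) + bytes(languages)
              + bytes([len(runs)]) + bytes(table))
    return header + bytes(body)


# Three runs in two (made-up) languages, 7 payload bytes in total.
encoded = encode_runs([(1, b"abc"), (2, b"\xe0\xe1"), (1, b"de")])
header_size = len(encoded) - 7
```

Under these assumptions the 7 payload bytes carry a 13-byte header, and every slice or concatenation has to rewrite the run table. Wider offsets would lift the 255-byte limit at the cost of an even larger header.
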
Another point about sometimes using a 2-byte encoding: welcome to the
nice world of big-endian vs. little-endian, i.e. the very trap UTF-16
has stepped into.
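
A small sketch of that byte-order trap, using only the Python standard library. The sample character (U+0416) is picked arbitrarily for illustration; any 16-bit code unit whose two bytes differ behaves the same way.

```python
import struct

code_unit = 0x0416  # U+0416 CYRILLIC CAPITAL LETTER ZHE, one 16-bit unit

# The same code unit serialized in both byte orders.
big = struct.pack(">H", code_unit)     # bytes 04 16
little = struct.pack("<H", code_unit)  # bytes 16 04

# These match what the explicit-endianness UTF-16 codecs produce.
same_as_codecs = (big == "\u0416".encode("utf-16-be")
                  and little == "\u0416".encode("utf-16-le"))

# A reader that assumes the wrong byte order gets a different code unit.
misread = struct.unpack("<H", big)[0]  # 0x1604 instead of 0x0416
```

Without a byte-order mark or an out-of-band agreement, the reader cannot tell which of the two interpretations the writer intended. A byte-oriented encoding such as UTF-8 has no such ambiguity.
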
--
Dmitry Olshansky
More information about the Digitalmars-d
mailing list