Why UTF-8/16 character encodings?

Juan Manuel Cabo juanmanuel.cabo at gmail.com
Sat May 25 13:20:09 PDT 2013


On Saturday, 25 May 2013 at 19:51:43 UTC, Joakim wrote:
> On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky 
> wrote:
>> You can map a codepage to a subset of UCS :)
>> That's what they do internally anyway.
>> If I take you right, you propose to define string as a header 
>> that denotes a set of windows in code space? I still fail to 
>> see how that would scale; see below.
> Something like that.  For a multi-language string encoding, the 
> header would contain a single byte for every language used in 
> the string, along with multiple index bytes to signify the 
> start and finish of every run of single-language characters in 
> the string.  So, a list of languages and a list of pure 
> single-language substrings.  This is just off the top of my 
> head, I'm not suggesting it is definitive.
>

You obviously are not thinking it through. Such an encoding would 
have O(n^2) complexity for building a string by appending 
characters/symbols in different languages, since each append 
would have to update the header at the beginning of the string 
and move the contents forward to make room. Not to mention that 
it wouldn't be backwards compatible with ASCII routines, and the 
complexity of such a header would have to be carried all the way 
to the font rendering routines in the OS.
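To make the cost argument concrete, here is a minimal sketch (in Python, purely illustrative; the header layout is assumed, since the proposal above never pins one down) of such a header-based format, where appending a character from a language not yet listed must grow the header at the front of the buffer and shift everything after it:

```python
# Hypothetical layout (an assumption, not a real format):
#   [language bytes...][payload bytes...]
# Appending a character from a NEW language grows the header at
# the front, which shifts the entire existing buffer right.

def append_new_language_char(buf: bytearray, lang: int, ch: int) -> int:
    """Append ch, registering a new language; return bytes shifted."""
    moved = len(buf)          # every existing byte must move over
    buf[:0] = bytes([lang])   # insert language byte at the front: O(n)
    buf.append(ch)            # the payload byte itself: O(1)
    return moved

s = bytearray()
total_moved = 0
n = 1000
for i in range(n):            # n appends, each in a "new" language
    total_moved += append_new_language_char(s, i % 256, 0x41)

# Each iteration shifts the 2*i bytes accumulated so far, so the
# total work is 2*(0+1+...+(n-1)) = O(n^2), versus O(n) total for
# plain UTF-8 appends, which only ever touch the end of the buffer.
assert total_moved == 2 * sum(range(n))
```

Even treating the index bytes as free, the front-of-string header alone makes mixed-language appends quadratic; UTF-8 avoids this because each code point is self-describing and no global prefix ever needs rewriting.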

Mixing multiple languages/symbols in one string is a blessing of 
modern, humane computing. It is the norm rather than the 
exception in most of the world.

--jm



More information about the Digitalmars-d mailing list