Why UTF-8/16 character encodings?
Dmitry Olshansky
dmitry.olsh at gmail.com
Fri May 24 14:21:25 PDT 2013
On 24-May-2013 21:05, Joakim wrote:
> On Friday, 24 May 2013 at 09:49:40 UTC, Jacob Carlborg wrote:
>> toUpper/toLower cannot be done in place if they are to handle all of
>> Unicode. Some characters change their length when converted to/from
>> uppercase. Examples are the German sharp S (ß) and the Turkish
>> dotted/dotless I.
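To see the quoted point concretely - a minimal D sketch using nothing
beyond Phobos, counting code points with std.range.walkLength:

import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    string lower = "ß";  // U+00DF LATIN SMALL LETTER SHARP S
    string upper = "SS"; // its full uppercase per Unicode SpecialCasing

    // One code point becomes two, so an in-place conversion over a
    // fixed-size dchar or UTF-16 buffer cannot work in general.
    writeln(lower.walkLength); // 1
    writeln(upper.walkLength); // 2
}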
>
> This triggered a long-standing bugbear of mine: why are we using these
> variable-length encodings at all? Does anybody really care about UTF-8
> being "self-synchronizing," ie does anybody actually use that in this
> day and age? Sure, it's backwards-compatible with ASCII and the vast
> majority of usage is probably just ASCII, but that means the other
> languages don't matter anyway. Not to mention taking the valuable 8-bit
> real estate for English and dumping the longer encodings on everyone else.
>
> I'd just use a single-byte header to signify the language and then put
> the vast majority of languages in a single-byte encoding, with the few
> exceptional languages with more than 256 characters encoded in two
> bytes.
You seem to think that not only is UTF-8 a bad encoding, but that one
unified encoding (code space) is bad as well(?).
Separate code spaces were the norm before Unicode (and UTF-8). The
problem is not only that without the header the text is meaningless (no
easy slicing), but that the encoding of the data after the header
depends on a variety of factors - a whole list of encodings, in fact.
Now everybody has to keep a (code) page per language just to know
whether it's 2 bytes per char, 1 byte per char, or whatever. And that
still assumes there are no combining marks or region-specific stuff :)
In fact it was even "better": nobody ever talked about a header, they
just assumed a codepage from some global setting. Imagine yourself
creating a font rendering system these days - a hell of an exercise in
frustration (okay, how do I render 0x88? Hmm, if that is in codepage
XYZ then ...).
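Compare that with UTF-8, where no header or global state is needed: the
encoding is self-synchronizing, so from any byte offset you can find
the start of the current code point locally. A minimal sketch
(codePointStart is my name for it; Phobos has the real thing in
std.utf's stride/strideBack):

import std.stdio : writeln;

// UTF-8 continuation bytes all have the form 0b10xxxxxx, so from any
// byte offset we can back up to the start of the current code point
// with no header, codepage table, or global state.
size_t codePointStart(const(char)[] s, size_t i)
{
    while (i > 0 && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "ascii + кириллица";
    foreach (i; 0 .. s.length)
        assert((s[codePointStart(s, i)] & 0xC0) != 0x80); // always a lead byte
    writeln("every byte offset resolves to a code point boundary");
}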
> OK, that doesn't cover multi-language strings, but that is what,
> 0.000001% of usage?
This just shows you don't care about multilingual stuff at all. Imagine
any language tutor/translator/dictionary on the Web. Most non-Latin
text needs to intersperse ASCII anyway (keep in mind e.g. HTML markup).
Books often feature citations in the original language (or e.g. Latin)
along with translations.
Now also take into account math symbols, currency symbols and beyond.
These days cultures mix in wild combinations, so you might need to
display text even if you can't read it. Unicode is not only about
encoding characters from all languages; it needs to address the
universal representation of the symbols used in writing systems at
large.
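One unified code space means all of that lands in a single untagged
string. A trivial sketch - with per-language headers this one line
would be several fragments plus bookkeeping:

import std.stdio : writefln;
import std.range : walkLength;

void main()
{
    // English, German, Russian, math and currency in one string.
    string s = "München ↔ Москва: π ≈ 3.14159, fare €5";
    // Decoding needs no language information at all.
    writefln("%s code points in %s bytes, no language tags",
             s.walkLength, s.length);
}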
> Make your header a little longer and you could handle
> those also. Yes, it wouldn't be strictly backwards-compatible with
> ASCII, but it would be so much easier to internationalize. Of course,
> there's also the monoculture we're creating; love this UTF-8 rant by
> tuomov, author of one of the first tiling window managers for Linux:
>
We want monoculture! That is, to understand each other without all the
"parlez-vous français?" moments and codepages of varying
complexity (insanity).
Want it small? Use compression schemes, which work perfectly fine and
get you to the precious 1 byte per code point with exceptional speed:
http://www.unicode.org/reports/tr6/
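The core trick in that report (SCSU) is a small set of 128-character
windows. Here is a toy, single-window sketch of the idea - toyEncode is
made up, and the real algorithm switches windows dynamically, handles
surrogates, etc.:

import std.stdio : writefln;

// Toy sketch of SCSU's window idea: code points inside one fixed
// 128-character window become a single byte; ASCII passes through.
ubyte[] toyEncode(dstring s, dchar windowBase)
{
    ubyte[] result;
    foreach (dchar c; s)
    {
        if (c < 0x80)
            result ~= cast(ubyte) c;                        // ASCII as-is
        else if (c >= windowBase && c < windowBase + 0x80)
            result ~= cast(ubyte)(0x80 + (c - windowBase)); // 1 byte
        else
            assert(0, "outside window - real SCSU would switch windows");
    }
    return result;
}

void main()
{
    dstring s = "Привет, world!"d;
    auto bytes = toyEncode(s, '\u0400'); // Cyrillic block starts at U+0400
    writefln("%s code points -> %s bytes", s.length, bytes.length); // 14 -> 14
}

14 code points come out as 14 bytes - the precious 1 byte per code
point - while the underlying code space stays unified.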
> http://tuomov.bitcheese.net/b/archives/2006/08/26/T20_16_06
>
> The emperor has no clothes; what am I missing?
And borrowing an argument from that rant: locale is borked shit when it
comes to encodings. Locales should be used for tweaking visuals like
number and date display and so on.
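A quick illustration, with D calling the C locale machinery from
druntime (the German locale name is an assumption about what is
installed):

import core.stdc.locale : setlocale, LC_NUMERIC;
import core.stdc.stdio : printf;

void main()
{
    // Locale is a presentation knob: the same value renders with a
    // different decimal separator, while the bytes of a UTF-8 string
    // never depend on it.
    printf("%f\n", 3.5);                  // "3.500000" in the "C" locale
    setlocale(LC_NUMERIC, "de_DE.UTF-8"); // assumes this locale exists
    printf("%f\n", 3.5);                  // "3,500000" if it does
}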
--
Dmitry Olshansky