First Impressions!

Patrick Schluter Patrick.Schluter at bbox.fr
Sun Dec 3 12:36:48 UTC 2017


On Saturday, 2 December 2017 at 22:16:09 UTC, Joakim wrote:
> On Friday, 1 December 2017 at 23:16:45 UTC, H. S. Teoh wrote:
>> On Fri, Dec 01, 2017 at 03:04:44PM -0800, Walter Bright via 
>> Digitalmars-d wrote:
>>> On 11/30/2017 9:23 AM, Kagamin wrote:
>>> > On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki 
>>> > cattermole wrote:
>>> > > Be aware Microsoft is alone in thinking that UTF-16 was 
>>> > > awesome. Everybody else standardized on UTF-8 for Unicode.
>>> > 
>>> > UCS2 was awesome. UTF-16 is used by Java, JavaScript, 
>>> > Objective-C, Swift, Dart and ms tech, which is 28% of tiobe 
>>> > index.
>>> 
>>> "was" :-) Those are pretty much pre-surrogate pair designs, 
>>> or based
>>> on them (Dart compiles to JavaScript, for example).
>>> 
>>> UCS2 has serious problems:
>>> 
>>> 1. Most strings are in ascii, meaning UCS2 doubles memory 
>>> consumption. Strings in the executable file are twice the 
>>> size.
>>
>> This is not true in Asia, esp. where the CJK block is 
>> extensively used. A CJK block character is 3 bytes in UTF-8, 
>> meaning that string sizes are 150% of the UCS2 encoding.  If 
>> your code contains a lot of CJK text, that's a lot of bloat.
>
> Yep, that's why five years back many of the major Chinese sites 
> were still not using UTF-8:
>
> http://xahlee.info/w/what_encoding_do_chinese_websites_use.html

Summary

Taiwan sites almost all use UTF-8. Very old ones still use BIG5.

Mainland China sites mostly still use GBK or GB2312, but a few 
newer ones use UTF-8.

Many top Japan, Korea, sites also use UTF-8, but some uses EUC 
(Extended Unix Code) variants.

This probably means that UTF-8 might dominate in the future.

mmmh
>
> That led that Chinese guy to also rant against UTF-8 a couple 
> years ago:
>
> http://xahlee.info/comp/unicode_utf8_encoding_propaganda.html

A rant from someone who reproaches a video for not providing 
reasons why UTF-8 is good, while himself not providing any reasons 
why UTF-8 is bad. I'm not denying that UTF-8 has issues, only 
pointing out that the ranter gives no useful info on what issues 
"Asians" actually encounter with it, besides legacy reasons (which 
are important but do not enter into judging the technical quality 
of an encoding).
Add to that that he advocates GB18030, which is quite inferior to 
UTF-8 except in the area of legacy support (here are some of the 
advantages of UTF-8 that GB18030 does not possess: 
self-synchronization, algorithmic mapping of code points, error 
detection).
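
To make the synchronization and error-detection points concrete, 
here is a small Python sketch. It is only an illustration under my 
own assumptions: the helper names (utf8_resync, 
find_ambiguous_gb18030_char) are made up for this example, and it 
relies on Python's built-in "utf-8" and "gb18030" codecs. It shows 
that a UTF-8 reader dropped into the middle of a character can find 
the next boundary by looking at the bytes alone, while the trail 
byte of a 2-byte GB18030 character can fall in the ASCII range and 
so is silently misread when taken out of context.

```python
def utf8_resync(data: bytes, pos: int) -> int:
    """Return the first code point boundary at or after pos."""
    # UTF-8 continuation bytes always look like 10xxxxxx, so skip them.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def find_ambiguous_gb18030_char():
    """Find a CJK character whose 2-byte GB18030 form ends in an ASCII-range byte."""
    for cp in range(0x4E00, 0x9FA6):
        enc = chr(cp).encode("gb18030")
        if len(enc) == 2 and enc[1] < 0x80:
            return chr(cp), enc
    return None


text = "編碼 test"
utf8 = text.encode("utf-8")

# Self-synchronization: land in the middle of the first character and recover.
start = utf8_resync(utf8, 1)
recovered = utf8[start:].decode("utf-8")  # the text minus its first character

# Error detection: a truncated sequence is rejected by a strict decoder.
try:
    utf8[:2].decode("utf-8")
    truncated_detected = False
except UnicodeDecodeError:
    truncated_detected = True

# GB18030: the trail byte on its own decodes without any error,
# yielding an ASCII character unrelated to the original one.
char, enc = find_ambiguous_gb18030_char()
misread = enc[1:].decode("gb18030")

print(start, truncated_detected, misread.isascii())
```

The search loop is there so the example does not depend on me 
remembering a particular code table entry; any character it finds 
demonstrates the same ambiguity.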
If his only beef with UTF-8 is the size of CJK text, then he 
shouldn't argue for UTF-32, as he seems to do at the end: UTF-32 
takes 4 bytes per character where UTF-8 takes 3 for CJK.
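
The size argument is easy to check. A quick sketch in Python, 
assuming its built-in codecs and a sample string I picked myself 
(six CJK characters, all in the BMP); the "-le" codec variants are 
used so that no byte order mark is counted:

```python
sample = "漢字編碼測試"  # six CJK characters, all in the BMP

# Encoded size in bytes for each encoding discussed in the thread.
sizes = {
    enc: len(sample.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le", "gb18030")
}

for enc, size in sizes.items():
    print(f"{enc:10} {size:3d} bytes")
# utf-8       18 bytes
# utf-16-le   12 bytes
# utf-32-le   24 bytes
# gb18030     12 bytes
```

So for pure CJK text UTF-8 is 150% of the UTF-16 or GB18030 size, 
and UTF-32 is 200%. Real pages also contain ASCII markup, which 
shifts the balance back towards UTF-8.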

