First Impressions!

Joakim dlang at joakim.fea.st
Sat Dec 2 22:16:09 UTC 2017


On Friday, 1 December 2017 at 23:16:45 UTC, H. S. Teoh wrote:
> On Fri, Dec 01, 2017 at 03:04:44PM -0800, Walter Bright via 
> Digitalmars-d wrote:
>> On 11/30/2017 9:23 AM, Kagamin wrote:
>> > On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki 
>> > cattermole wrote:
>> > > Be aware Microsoft is alone in thinking that UTF-16 was 
>> > > awesome. Everybody else standardized on UTF-8 for Unicode.
>> > 
>> > UCS2 was awesome. UTF-16 is used by Java, JavaScript, 
>> > Objective-C, Swift, Dart and ms tech, which is 28% of tiobe 
>> > index.
>> 
>> "was" :-) Those are pretty much pre-surrogate pair designs, or 
>> based on them (Dart compiles to JavaScript, for example).
>> 
>> UCS2 has serious problems:
>> 
>> 1. Most strings are in ascii, meaning UCS2 doubles memory 
>> consumption. Strings in the executable file are twice the size.
>
> This is not true in Asia, esp. where the CJK block is 
> extensively used. A CJK block character is 3 bytes in UTF-8, 
> meaning that string sizes are 150% of the UCS2 encoding.  If 
> your code contains a lot of CJK text, that's a lot of bloat.

Yep, that's why five years back many of the major Chinese sites 
were still not using UTF-8:

http://xahlee.info/w/what_encoding_do_chinese_websites_use.html
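
To put rough numbers on the size difference described above, here is a 
minimal sketch, assuming Python 3. The helper name encoded_sizes and 
the two sample strings are my own choices for illustration, not 
anything from the thread. UTF-16 stands in for UCS2 here, since the 
two encode characters in the Basic Multilingual Plane identically.

```python
# Compare encoded sizes of ASCII and CJK text in UTF-8 and UTF-16.
# The sample strings are illustrative, not taken from the thread.

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # "-le" avoids the 2-byte BOM
    return utf8, utf16

ascii_sample = "hello world"  # 11 ASCII characters
cjk_sample = "你好世界"  # 4 characters from the CJK Unified Ideographs block

for label, sample in (("ASCII", ascii_sample), ("CJK", cjk_sample)):
    utf8, utf16 = encoded_sizes(sample)
    ratio = utf8 / utf16
    print(f"{label}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes, ratio = {ratio:.2f}")
```

The ASCII sample comes out at half the size in UTF-8, while the CJK 
sample comes out at 1.5 times the size, which matches the 150% figure 
quoted above. Real pages mix markup (mostly ASCII) with text, so the 
overall ratio for a given site will fall somewhere in between.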

That also led the author of that page to rant against UTF-8 a 
couple of years ago:

http://xahlee.info/comp/unicode_utf8_encoding_propaganda.html

Considering China buys more smartphones than the US and Europe 
combined, it's time people started recognizing its importance 
when it comes to issues like this:

https://www.statista.com/statistics/412108/global-smartphone-shipments-global-region/

Regarding the unique representation issue Jonathan brings up, 
I've heard people say that was to provide an easier path for 
legacy encodings, i.e. some used combining characters and others 
didn't, so Unicode chose to accommodate both so that both groups 
would move to Unicode.  It would be nice if the Unicode people 
spent their time pruning and regularizing what they have, rather 
than adding more useless stuff.
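
The non-unique representation mentioned above can be seen in a short 
sketch, assuming Python 3 and its standard unicodedata module. The 
accented letter is just an example I picked; any character with both 
a precomposed and a combining form would behave the same way.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

# The two strings render the same but differ as code point sequences.
print(precomposed == combining)  # False

# Normalizing to a common form makes them comparable.
nfc = unicodedata.normalize("NFC", combining)    # compose where possible
nfd = unicodedata.normalize("NFD", precomposed)  # decompose fully
print(nfc == precomposed)  # True
print(nfd == combining)    # True
```

This is the cost of accommodating both conventions: code that compares 
or hashes strings has to normalize them first if it wants the two 
spellings to be treated as equal.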

Speaking of which, I completely agree with Walter and Jonathan 
that there was no need to add emoji and other such symbols to 
Unicode; they should never have been added.  Unicode is supposed 
to standardize long-existing characters, not promote marginal new 
symbols to characters.  If there's a real need for them, chat 
software will figure out a way to handle it without adding such 
symbols to the Unicode character set.

