First Impressions!

Patrick Schluter Patrick.Schluter at bbox.fr
Sat Dec 2 10:35:50 UTC 2017


On Friday, 1 December 2017 at 23:16:45 UTC, H. S. Teoh wrote:
> On Fri, Dec 01, 2017 at 03:04:44PM -0800, Walter Bright via 
> Digitalmars-d wrote:
>> On 11/30/2017 9:23 AM, Kagamin wrote:
>> > On Tuesday, 28 November 2017 at 03:37:26 UTC, rikki 
>> > cattermole wrote:
>> > > Be aware Microsoft is alone in thinking that UTF-16 was 
>> > > awesome. Everybody else standardized on UTF-8 for Unicode.
>> > 
>> > UCS2 was awesome. UTF-16 is used by Java, JavaScript, 
>> > Objective-C, Swift, Dart and ms tech, which is 28% of tiobe 
>> > index.
>> 
>> "was" :-) Those are pretty much pre-surrogate pair designs, or 
>> based
>> on them (Dart compiles to JavaScript, for example).
>> 
>> UCS2 has serious problems:
>> 
>> 1. Most strings are in ascii, meaning UCS2 doubles memory 
>> consumption. Strings in the executable file are twice the size.
>
> This is not true in Asia, esp. where the CJK block is 
> extensively used. A CJK block character is 3 bytes in UTF-8, 
> meaning that string sizes are 150% of the UCS2 encoding.  If 
> your code contains a lot of CJK text, that's a lot of bloat.

That's true in theory, but in practice it's not that severe, as 
the CJK languages are never isolated and appear embedded in a lot 
of ASCII. You can read a case study here [1] which shows 106% for 
Simplified Chinese, 76% for Traditional Chinese, 129% for 
Japanese and 94% for Korean. These numbers are for pure text. 
Publish it on the web embedded in bloated HTML and there goes the 
size advantage of UTF-16.
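
To make the arithmetic concrete, here is a minimal Python sketch that 
compares the encoded size of the same text in UTF-8 and UTF-16. The 
sample strings and the helper name utf8_to_utf16_ratio are made up for 
illustration; they are not taken from the case study in [1], and real 
documents will give different ratios.

```python
def utf8_to_utf16_ratio(text: str) -> float:
    """Return the UTF-8 size of text as a fraction of its UTF-16 size."""
    utf8_size = len(text.encode("utf-8"))
    # "utf-16-le" is used so that no 2-byte BOM is counted.
    utf16_size = len(text.encode("utf-16-le"))
    return utf8_size / utf16_size


# Hypothetical samples, chosen only for illustration.
pure_cjk = "统一码" * 10                       # 30 CJK characters from the BMP
html = '<p class="note">' + "统一码" + "</p>"  # same kind of text inside ASCII markup

print(f"pure CJK:  {utf8_to_utf16_ratio(pure_cjk):.0%}")  # 150%
print(f"with HTML: {utf8_to_utf16_ratio(html):.0%}")      # 63%
```

The pure CJK sample reproduces the theoretical 150% (3 bytes against 
2 per character), while even a small amount of ASCII markup pushes 
the ratio below 100%, because every ASCII character costs 1 byte in 
UTF-8 and 2 bytes in UTF-16.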

>
> But then again, in non-Latin locales you'd generally store your 
> strings separately of the executable (usually in l10n files), 
> so this may not be that big an issue. But the blanket statement 
> "Most strings are in ASCII" is not correct.
>
False, in the sense that isolated pure text is rare; it is 
generally delivered inside some file format, most often an 
ASCII-based one such as docx, odf, tmx, xliff, Akoma Ntoso, etc.

[1]: 
https://stackoverflow.com/questions/6883434/at-all-times-text-encoded-in-utf-8-will-never-give-us-more-than-a-50-file-size


