First Impressions!

Walter Bright newshound2 at digitalmars.com
Sat Dec 2 10:02:55 UTC 2017


On 12/1/2017 3:16 PM, H. S. Teoh wrote:
> This is not true in Asia, esp. where the CJK block is extensively used.
> A CJK block character is 3 bytes in UTF-8, meaning that string sizes are
> 150% of the UCS2 encoding.  If your code contains a lot of CJK text,
> that's a lot of bloat.
> 
> But then again, in non-Latin locales you'd generally store your strings
> separately of the executable (usually in l10n files), so this may not be
> that big an issue. But the blanket statement "Most strings are in ASCII"
> is not correct.

Are you sure about that? I know that Asian languages will be longer in UTF-8.
But how much of the data that programs handle is in those languages? The language of
business, science, programming, aviation, and engineering is English.
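
For reference, the size difference under discussion can be checked with a short
sketch. This is only an illustration in Python: the sample strings and the
helper name encoded_sizes are made up for this example and are not from the
thread. UTF-16 is used here in place of UCS2, which gives the same byte counts
for these characters because they all lie in the Basic Multilingual Plane.

```python
# Illustrative sample strings (assumptions, not taken from the thread).
samples = {
    "ascii": "hello",
    "cjk": "你好世界",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {utf8} bytes in UTF-8, {utf16} bytes in UTF-16")

# Prints:
# ascii: 5 bytes in UTF-8, 10 bytes in UTF-16
# cjk: 12 bytes in UTF-8, 8 bytes in UTF-16
```

The CJK sample is 150% of its UTF-16 size in UTF-8, while the ASCII sample is
50% of its UTF-16 size, so which encoding is smaller depends on the mix of text.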

Of course, D itself is agnostic about that. The compiler, for example, accepts 
strings, identifiers, and comments in Chinese in UTF-16 format.

