Wide characters support in D

Walter Bright newshound1 at digitalmars.com
Tue Jun 8 13:26:41 PDT 2010


bearophile wrote:
> Walter Bright:
>> The problem with dchars is that strings of them consume memory at a
>> prodigious rate.
> 
> Warning: lazy musings ahead.
> 
> I hope we'll soon have computers with 200+ GB of RAM, where using strings
> with chars narrower than 32 bits is in most cases a premature optimization
> (just as today it is often a silly optimization to use arrays of 16-bit ints
> instead of 32-bit or 64-bit ints; only special situations found with the
> profiler can justify arrays of shorts in a low-level language).
> 
> Even in PCs with 200 GB of RAM, the first levels of CPU cache can be very
> small (like 32 KB), and cache misses are costly, so even when huge amounts of
> RAM are present, reducing the size of strings can still be useful for
> performance.
> 
> A possible solution to this problem can be some kind of real-time hardware
> compression/decompression between the CPU and the RAM. UTF-8 can be a good
> enough way to compress 32-bit strings. So we are back to writing low-level
> programs that have to deal with UTF-8.
> 
> To avoid this, CPUs and RAM can compress/decompress the text transparently to
> the programmer. Unfortunately UTF-8 is a variable-length encoding, so maybe
> it can't be done transparently enough. So a smarter and better compression
> algorithm can be used to keep all this transparent enough (not fully
> transparent, some low-level situations can require code that deals with the
> compression).

I strongly suspect that the encode/decode time for UTF-8 is more than
compensated for by the up to 4x reduction in memory usage compared with dchars.
I wrote a large app 10 years ago using dchars throughout, and the effects of
the memory consumption were murderous.
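
As a rough illustration of the size difference, here is a minimal sketch in
Python rather than D. The helper name encoded_sizes and the sample strings are
hypothetical, chosen only for illustration; UTF-32 stands in for an array of
dchars, which stores one 32-bit unit per code point.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (utf8_bytes, utf32_bytes) for text, without a byte-order mark."""
    utf8_bytes = len(text.encode("utf-8"))
    # The "-le" variant is used so no 4-byte BOM is prepended to the output.
    utf32_bytes = len(text.encode("utf-32-le"))
    return utf8_bytes, utf32_bytes


# Hypothetical sample strings, picked only to show how the ratio varies.
samples = {
    "ascii": "The quick brown fox jumps over the lazy dog",
    "accented": "\u00dcbergr\u00f6\u00dfe caf\u00e9",
    "cjk": "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8",
}

for name, text in samples.items():
    u8, u32 = encoded_sizes(text)
    print(f"{name}: utf-8 = {u8} bytes, utf-32 = {u32} bytes, "
          f"ratio = {u32 / u8:.2f}x")
```

The 4x figure is the best case and holds for pure ASCII text. Code points that
need 2, 3, or 4 bytes in UTF-8 narrow the gap, so the saving depends on how
much of the text is ASCII.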

(As the recent article on memory consumption shows, large data structures can
have huge negative speed consequences because of virtual memory, cache misses,
and multiple cores trying to access the same memory.)

https://lwn.net/Articles/250967/

Keep in mind that the overwhelming bulk of UTF-8 text is ASCII, and requires
only one cycle to "decode".
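
The reason ASCII is so cheap shows up in the shape of a decoder. The following
is a simplified sketch in Python, not D and not any library's actual
implementation; the function name decode_utf8 is hypothetical. It assumes
reasonably well-formed input and does not reject overlong encodings or
surrogate code points. It cannot show cycle counts, only that an ASCII byte
takes a single comparison before it is used as the code point.

```python
def decode_utf8(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of integer code points (simplified)."""
    code_points = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII fast path: the byte value is the code point itself.
            code_points.append(b)
            i += 1
            continue
        # Multi-byte sequence: the lead byte's high bits give the length.
        if b >> 5 == 0b110:
            length, cp = 2, b & 0x1F
        elif b >> 4 == 0b1110:
            length, cp = 3, b & 0x0F
        elif b >> 3 == 0b11110:
            length, cp = 4, b & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        # Each continuation byte (10xxxxxx) contributes six more bits.
        for k in range(1, length):
            c = data[i + k]
            if c >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)
        code_points.append(cp)
        i += length
    return code_points
```

Only bytes with the high bit set reach the slower branch, so mostly-ASCII text
spends nearly all of its time on the first comparison.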

