Wide characters support in D

Tue Jun 8 07:46:11 PDT 2010

On 2010-06-08 09:22:02 -0400, Ruslan Nikolaev <nruslan_devel at yahoo.com> said:

> you don't need to provide instances for every other character type, and 
> at the same time - use native character encoding available on system.

In my opinion, thinking this will work is a fallacy. Here's why...

Generally Linux systems use UTF-8 so I guess the "system encoding" 
there will be UTF-8. But then if you start to use Qt you have to use 
UTF-16, and you might still have to intermix UTF-8 to work with other 
libraries in the backend (libraries which are not necessarily D 
libraries, nor system libraries). So you may have a UTF-8 backend (such 
as the MySQL library), UTF-8 "system encoding" glue code, and UTF-16 
GUI code (Qt). That might be a good or a bad choice, depending on 
various factors, such as whether the glue code sends more strings to 
the backend or the GUI.

Now try to port the thing to Windows where you define the "system 
encoding" as UTF-16. Now you still have the same UTF-8 backend, and the 
same UTF-16 GUI code, but for some reason you're changing the glue code 
in the middle to UTF-16? Sure, it can be made to work, but all the 
string conversions will start to happen elsewhere, which may change the 
performance characteristics and add some potential for bugs, and all 
this for no real reason.

The problem is that what you call "system encoding" is only the 
encoding used by the system frameworks. It is relevant when working 
with the system frameworks, but when you're working with any other API, 
you'll probably want to use the same character type as that API does, 
not necessarily the "system encoding". Not all programs are based on 
extensive use of the system frameworks. In some situations you'll want 
to use UTF-16 on Linux, or UTF-8 on Windows, because you're dealing 
with libraries that expect that (Qt, MySQL).
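
To make that concrete, here is a rough sketch of glue code that stays in 
UTF-8 on every platform and converts only at the boundary with a UTF-16 
API. The functions backendFetch, guiSetLabel and showRecord are 
hypothetical stand-ins I made up for illustration (they are not MySQL or 
Qt calls); only toUTF16 from std.utf is a real Phobos function.

```d
import std.utf : toUTF16;

// Hypothetical stand-in for a UTF-8 backend API (e.g. a database client).
string backendFetch()
{
    return "na\u00EFve caf\u00E9";
}

// Hypothetical stand-in for a UTF-16 GUI API.
// Returns the number of UTF-16 code units it received.
size_t guiSetLabel(wstring text)
{
    return text.length;
}

// Glue code: keeps strings in the backend's encoding (UTF-8) and
// converts only when handing them to the UTF-16 side.
size_t showRecord()
{
    string record = backendFetch();
    return guiSetLabel(toUTF16(record));
}
```

With this arrangement the conversion happens at the same place whether 
the program runs on Linux or Windows, so the performance characteristics 
don't shift when porting.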

A compiler switch is a poor choice there, because you can't mix 
libraries compiled with different compiler switches when that switch 
changes the default character type.

In most cases, it's much better in my opinion if the programmer just 
uses the same character type as one of the libraries the program uses, 
sticks to that, and is aware of what he's doing. If someone really wants 
to deal with the complexity of supporting both character types depending 
on the environment the program runs on, it's easy to create "tchar" and 
"tstring" aliases that depend on whether it's Windows or Linux, or on a 
custom version flag from a compiler switch, but that'll be his choice 
and his responsibility to make everything work. But I think in this case 
a better option might be to abstract all those 'strings' under a single 
type that works with all UTF encodings (something like [mtext]).

[mtext]: http://www.dprogramming.com/mtext.php
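
For reference, the "tchar"/"tstring" aliases mentioned above could look 
roughly like this. It's only a sketch: the names tchar, tstring and toT, 
and the UseUTF16 version identifier, are all made up for the example and 
aren't part of Phobos or the compiler.

```d
import std.conv : to;

// Pick the code unit type per platform, or from a custom flag
// (e.g. passing -version=UseUTF16 to dmd). All names are hypothetical.
version (Windows)
    alias tchar = wchar;    // UTF-16 code units
else version (UseUTF16)
    alias tchar = wchar;    // UTF-16 requested explicitly
else
    alias tchar = char;     // UTF-8 code units

alias tstring = immutable(tchar)[];

// Convert a UTF-8 string to whichever encoding tstring ended up with.
tstring toT(string s)
{
    return to!tstring(s);
}
```

Code written against tstring still has to convert at every API that 
expects the other encoding, which is the responsibility I was talking 
about.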

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/