Wide characters support in D

Steven Schveighoffer schveiguy at yahoo.com
Mon Jun 7 18:54:46 PDT 2010


On Mon, 07 Jun 2010 17:48:09 -0400, Ruslan Nikolaev  
<nruslan_devel at yahoo.com> wrote:

> Note: I posted this already on runtime D list, but I think that list was  
> a wrong one for this question. Sorry for duplication :-)
>
> Hi. I am new to D. It looks like D supports 3 types of characters: char,  
> wchar, dchar. This is cool, however, I have some questions about it:
>
> 1. When we have 2 methods (one with wchar[] and another with char[]),  
> how D will determine which one to use if I pass a string "hello world"?
> 2. Many libraries (e.g. tango or phobos) don't provide functions/methods  
> (or have incomplete support) for wchar/dchar
> e.g. writefln probably assumes char[] for strings like "Number %d..."
> 3. Even if they do support, it is kind of annoying to provide methods  
> for all 3 types of chars. Especially, if we want to use native mode  
> (e.g. for Windows wchar is better, for Linux char is better). E.g.  
> Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,  
> wchar_t[] argv) and so on, and they should be native (in a sense that no  
> conversion is necessary when we do, for instance, _wopen). Linux doesn't  
> have them as UTF-8 is used widely there.
>
> Since D language is targeted on system programming, why not to try to  
> use whatever works better on a particular system (e.g. char will be 2  
> bytes on Windows and 1 byte on Linux; it can be a compiler switch, and  
> all libraries can be compiled properly on a particular system). It's  
> still necessary to have all 3 types of char for cooperation with C. But  
> in those cases byte, short and int will do their work. For this kind of  
> situation, it would be nice to have some built-in functions for  
> transparent conversion from char to byte/short/int and vice versa  
> (especially, if conversion only happens if needed on a particular  
> platform).
>
> In my opinion, to separate notion of character from byte would be nice,  
> and it makes sense as a particular platform uses either UTF-8 or UTF-16  
> natively. Programmers may write universal code (like TCHAR on Windows).  
> Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to  
> make this mistake again?

One thing that may not be clear from your reading of D's docs: any string
representable by one character type is also representable by the other
two.  This means that a function that takes a char[] can also be given a
dchar[], provided it is sent through a converter first (e.g. toUtf8 in
Tango, I think).
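
A minimal sketch of that conversion, assuming Phobos rather than Tango: it
uses toUTF8, toUTF16 and toUTF32 from std.utf as found in current Phobos.
countCodeUnits is a hypothetical function made up for illustration; it
stands in for any API that only accepts UTF-8 text.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical function that only accepts UTF-8 text.
size_t countCodeUnits(const(char)[] s)
{
    return s.length;
}

void main()
{
    // "h\u00e9llo" is "hello" with an e-acute: 5 code points.
    dstring wide = "h\u00e9llo"d;

    // Re-encode the UTF-32 string as UTF-8 before calling the char[] API.
    string narrow = toUTF8(wide);

    // The e-acute needs two UTF-8 code units, so 6 in total.
    assert(countCodeUnits(narrow) == 6);

    // The same text fits in 5 UTF-16 code units.
    assert(toUTF16(narrow).length == 5);

    // Converting back gives the original string, so nothing was lost.
    assert(toUTF32(narrow) == wide);
}
```

The round trip is the point: the three types differ only in encoding, not
in which strings they can hold.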

So D's char is decidedly not like byte or ubyte, or C's char.

In general, I use char (UTF-8) because I am used to C and ASCII (which is
represented exactly in UTF-8).  But because char is a UTF-8 code unit, a
char[] can hold any Unicode string.
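
A small sketch of what that means in practice. codePoints is a hypothetical
helper written for this example; it relies only on the language feature
that foreach over a char[] with a dchar loop variable decodes the UTF-8 as
it goes.

```d
// Hypothetical helper: counts code points by letting foreach decode
// the UTF-8 code units into dchar values.
size_t codePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // The three character types are fixed-size code units.
    static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);

    // Pure ASCII: one code unit per character, exactly as in C.
    string ascii = "hello";
    assert(ascii.length == 5);
    assert(codePoints(ascii) == 5);

    // "hello" with an e-acute, a space, and two CJK characters.
    string mixed = "h\u00e9llo \u4e16\u754c";
    assert(mixed.length == 13);     // 1+2+1+1+1+1 + 3+3 code units
    assert(codePoints(mixed) == 8); // but only 8 characters
}
```

Note that .length counts code units, not characters, so the two only agree
for ASCII text.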

-Steve

