Wide characters support in D
Robert Clipsham
robert at octarineparrot.com
Mon Jun 7 15:46:34 PDT 2010
On 07/06/10 22:48, Ruslan Nikolaev wrote:
> Note: I posted this already on runtime D list, but I think that list
> was a wrong one for this question. Sorry for duplication :-)
>
> Hi. I am new to D. It looks like D supports 3 types of characters:
> char, wchar, dchar. This is cool, however, I have some questions
> about it:
>
> 1. When we have 2 methods (one with wchar[] and another with char[]),
> how D will determine which one to use if I pass a string "hello
> world"?
If you pass "Hello World", this is always a string (char[] in D1,
immutable(char)[] in D2). If you want to specify a type with a string
literal, you can use "Hello World"w or "Hello World"d for wstring
and dstring respectively.
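A minimal sketch of the above, assuming a D2 compiler. The overload set `which` is hypothetical and exists only to show which overload each literal selects; as far as I know an unsuffixed literal is an exact match for the string overload, but treat that as something to check with your compiler rather than a guarantee.

```d
import std.stdio;

// Hypothetical overloads, named for illustration only.
string which(string s)  { return "string (UTF-8)"; }
string which(wstring s) { return "wstring (UTF-16)"; }
string which(dstring s) { return "dstring (UTF-32)"; }

// The suffix fixes the type of the literal at compile time.
static assert(is(typeof("Hello World")  == string));
static assert(is(typeof("Hello World"w) == wstring));
static assert(is(typeof("Hello World"d) == dstring));

void main()
{
    writeln(which("Hello World"));   // no suffix: typed as string
    writeln(which("Hello World"w));  // w suffix: wstring
    writeln(which("Hello World"d));  // d suffix: dstring
}
```

If an API only accepts wstring, adding the w suffix at the call site avoids any runtime conversion.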
> 2. Many libraries (e.g. tango or phobos) don't provide
> functions/methods (or have incomplete support) for wchar/dchar e.g.
> writefln probably assumes char[] for strings like "Number %d..."
In Tango most, if not all, string functions are templated, so they work
with all string types: char[], wchar[] and dchar[]. I don't know how well
Phobos supports other string types. I know Phobos 1 is extremely limited
for types other than char[], but I don't know about Phobos 2.
> 3. Even if they do support, it is kind of annoying to provide methods
> for all 3 types of chars. Especially, if we want to use native mode
> (e.g. for Windows wchar is better, for Linux char is better). E.g.
> Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,
> wchar_t[] argv) and so on, and they should be native (in a sense that
> no conversion is necessary when we do, for instance, _wopen). Linux
> doesn't have them as UTF-8 is used widely there.
Enter templates! You can write the function once and have it work with
all three string types with little effort involved. All the lower level
functions that interact with the operating system are abstracted away
nicely for you in both Tango and Phobos, so you'll never have to deal
with this for basic functions. For your own functions, it's a simple
matter of templating them in most cases.
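As a rough sketch of what templating looks like, assuming D2 and Phobos 2: `countCodePoints` is a hypothetical helper, not a library function, and `isSomeString` comes from std.traits. It is written once and accepts all three string types.

```d
import std.traits : isSomeString;

// Hypothetical helper: counts code points in any of the three string types.
size_t countCodePoints(S)(S s) if (isSomeString!S)
{
    size_t n = 0;
    // Asking foreach for dchar makes it decode UTF-8 or UTF-16 on the fly.
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // "h\u00e9llo" has 5 code points regardless of encoding.
    assert(countCodePoints("h\u00e9llo")  == 5);
    assert(countCodePoints("h\u00e9llo"w) == 5);
    assert(countCodePoints("h\u00e9llo"d) == 5);

    // The number of code units differs: U+00E9 takes two bytes in UTF-8.
    assert("h\u00e9llo".length  == 6);
    assert("h\u00e9llo"w.length == 5);
}
```

The template constraint rejects non-string arguments at compile time, so a wrong call fails with a clear error rather than deep inside the function body.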
> Since D language is targeted on system programming, why not to try to
> use whatever works better on a particular system (e.g. char will be 2
> bytes on Windows and 1 byte on Linux; it can be a compiler switch,
> and all libraries can be compiled properly on a particular system).
> It's still necessary to have all 3 types of char for cooperation with
> C. But in those cases byte, short and int will do their work. For
> this kind of situation, it would be nice to have some built-in
> functions for transparent conversion from char to byte/short/int and
> vice versa (especially, if conversion only happens if needed on a
> particular platform).
This is something C did wrong. If compilers are free to choose their own
width for the string type, you end up with the mess C has, where every
library introduces its own custom types to make sure they're the
expected width, e.g. uint32_t. Having things the other way around
makes life far easier: int is always 32 bits and signed, for example, and
the same applies to strings. You can use version blocks if you want to
specify a type which changes based on platform. I wouldn't recommend it
though, as it just makes life harder in the long run.
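For illustration only, here is roughly what the version-block approach would look like, assuming D2 alias syntax. `NativeChar` and `NativeString` are hypothetical names, not part of Tango or Phobos. The static asserts show the fixed widths that the language guarantees on every platform.

```d
import std.conv : to;

// These widths are fixed by the language, not chosen by the compiler.
static assert(char.sizeof  == 1);
static assert(wchar.sizeof == 2);
static assert(dchar.sizeof == 4);
static assert(int.sizeof   == 4);

// Hypothetical platform-dependent alias (not recommended, see above).
version (Windows)
{
    alias NativeChar = wchar;
}
else
{
    alias NativeChar = char;
}
alias NativeString = immutable(NativeChar)[];

void main()
{
    // std.conv.to transcodes when the target encoding differs and is
    // effectively a no-op when it already matches.
    NativeString s = to!NativeString("Hello World");
    assert(s.length == 11); // ASCII only, so 11 code units either way
}
```

Every function that takes a NativeString now behaves differently per platform, which is the long-run cost mentioned above.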
> In my opinion, to separate notion of character from byte would be
> nice, and it makes sense as a particular platform uses either UTF-8
> or UTF-16 natively. Programmers may write universal code (like TCHAR
> on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably
> but why D has to make this mistake again?
They are different types in D, so I'm not sure what you mean. byte/ubyte
have no encoding associated with them, while char is always UTF-8, wchar
is always UTF-16 and dchar is always UTF-32.
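A small sketch of the distinction, assuming D2 and Phobos 2 (`representation` is from std.string). Moving between text and raw bytes has to be spelled out explicitly.

```d
import std.string : representation;

// char and ubyte are distinct types, even though both are 8 bits wide.
static assert(!is(char == ubyte));
static assert(!is(char == byte));
static assert(char.sizeof == ubyte.sizeof);

void main()
{
    string s = "h\u00e9";

    // View the UTF-8 code units as plain bytes with no encoding attached.
    immutable(ubyte)[] raw = s.representation;
    assert(raw.length == 3);
    assert(raw[1] == 0xC3 && raw[2] == 0xA9); // UTF-8 encoding of U+00E9

    // Going back needs an explicit cast; the programmer vouches that
    // the bytes are valid UTF-8.
    auto back = cast(string) raw;
    assert(back == s);

    // The default initializers differ too: char starts as an invalid
    // UTF-8 code unit, ubyte starts at zero.
    assert(char.init == 0xFF);
    assert(ubyte.init == 0);
}
```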
Robert