Wide character support in D

Nick Sabalausky a at a.a
Mon Jun 7 18:57:11 PDT 2010


"Ruslan Nikolaev" <nruslan_devel at yahoo.com> wrote in message 
news:mailman.122.1275952601.24349.digitalmars-d at puremagic.com...
> Ok, ok... that was just a suggestion... Thanks, for reply about "Hello 
> world" representation. Was postfix "w" and "d" added initially or just 
> recently? I did not know about it. I thought D does automatic conversion 
> for string literals.
>

The postfixes 'c', 'w', and 'd' have been there a long time. But D does have 
a little bit of automatic conversion. Let me try to clarify:

    "hello"c  // string, UTF-8
    "hello"w  // wstring, UTF-16
    "hello"d  // dstring, UTF-32
    "hello"   // Depends how you use it

Suppose I have a function that takes a UTF-8 string, and I call it:

    void cfoo(string a) {}

    cfoo("hello"c); // Works
    cfoo("hello"w); // Error, wrong type
    cfoo("hello"d); // Error, wrong type
    cfoo("hello");  // Works, assumed to be UTF-8 string

If I make a different function that takes a UTF-16 wstring instead:

    void wfoo(wstring a) {}

    wfoo("hello"c); // Error, wrong type
    wfoo("hello"w); // Works
    wfoo("hello"d); // Error, wrong type
    wfoo("hello");  // Works, assumed to be UTF-16 wstring

And then, a UTF-32 dstring version would be similar:

    void dfoo(dstring a) {}

    dfoo("hello"c); // Error, wrong type
    dfoo("hello"w); // Error, wrong type
    dfoo("hello"d); // Works
    dfoo("hello");  // Works, assumed to be UTF-32 dstring

As you can see, the literals with postfixes are always the exact type you 
specify. If you have no postfix, then you get whatever the compiler expects 
it to be.

But, then the question is, what happens if any of those types can be used? 
Which does the compiler choose?

    void Tfoo(T)(T a)
    {
        // When compiling, display the type used.
        pragma(msg, T.stringof);
    }

    Tfoo("hello");

(Normally you'd want to add a constraint that T must be one of the string 
types, so that no one tries to pass in an int or float or something. I 
skipped that here.)

In that, Tfoo isn't expecting any particular type of string, it can take any 
type. And "hello" doesn't have a postfix, so the compiler uses the default: 
UTF-8 string.
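If you do want that constraint, std.traits provides isSomeString, which works 
in a template constraint. Just a sketch (same Tfoo as above, only restricted):

    import std.traits;

    void Tfoo(T)(T a) if (isSomeString!T)
    {
        // Printed at compile time for each instantiation
        pragma(msg, T.stringof);
    }

    Tfoo("hello");   // OK, T is string (the default for a bare literal)
    Tfoo("hello"w);  // OK, T is wstring
    //Tfoo(42);      // Error: 42 doesn't satisfy isSomeString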

> Yes, templates may help. However, that unnecessary make code bigger (since 
> we have to compile it for every char type).<

It only generates code for the types that are actually used. If, for 
instance, your program never uses anything except UTF-8, then only one 
version of the function gets made: the UTF-8 version. If you don't use 
every char type, it doesn't generate code for every char type - just the 
ones you actually use.
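You can actually watch that happen with pragma(msg). A sketch ("process" is 
just an illustrative name):

    void process(T)(T s)
    {
        pragma(msg, "Instantiated for: " ~ T.stringof);
    }

    void main()
    {
        process("hello");  // instantiates the string version...
        process("hello"c); // ...same instantiation, string again
    }

Compiling that prints "Instantiated for: string" exactly once; no wstring or 
dstring code is ever generated, because nothing asked for it.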

>The other problem is that it allows programmer to choose which one to use. 
>He or she may just prefer char[] as UTF-8 (or wchar[] as UTF-16). That will 
>be fine on platform that supports this encoding natively (e.g. for file 
>system operations, screen output, etc.), whereas it will cause conversion 
>overhead on the other. I don't think there is any problem with having 
>different size of char. In fact, that would make programs better (since 
>application programmers will have to think in terms of characters as 
>opposed to bytes). Not to say that it's a big overhead, but unnecessary 
>one. Having said this, I do agree that there must be some flexibility (e.g. 
>in Java char[] is always 2 bytes), however, I don't believe that this 
>flexibility should be available for application programmer.

That's not good. First of all, UTF-16 is a lousy encoding, it combines the 
worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned like 
UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS 
uses it natively, it's still best to do most internal processing in either 
UTF-8 or UTF-32. (And with templated string functions, if the programmer 
actually does want to use the native type in the *rare* cases where he's 
making enough OS calls that it would actually matter, he can still do so.)
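To make the multibyte point concrete, here's a sketch using a character 
outside the Basic Multilingual Plane (U+1D11E, musical G clef), where UTF-16 
has to fall back to a surrogate pair:

    void main()
    {
        string  s = "\U0001D11E";  // 4 UTF-8 code units
        wstring w = "\U0001D11E"w; // 2 UTF-16 code units (a surrogate pair)
        dstring d = "\U0001D11E"d; // 1 UTF-32 code unit

        assert(s.length == 4);
        assert(w.length == 2);
        assert(d.length == 1);
    }

So .length on a wstring is no more a character count than it is on a string: 
you get UTF-8-style decoding complexity *and*, for ASCII-heavy text, roughly 
double the storage.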

Secondly, the programmer *should* be able to use whatever type he decides is 
appropriate. If he wants to stick with native, he can do so, but he 
shouldn't be forced into choosing between "use the native encoding" and 
"abuse the type system by pretending that an int is a character". For 
instance, complex low-level text processing *relies* on knowing exactly what 
encoding is being used and coding specifically to that encoding. As an 
example, I'm currently working on a generalized parser library ( 
http://www.dsource.org/projects/goldie ). Something like that is complex 
enough already that implementing the internal lexer natively for each 
possible native text encoding is just not worthwhile, especially since the 
text hardly ever gets passed to or from any OS calls that expect any 
particular encoding. Or maybe you're on a fancy OS that can handle any 
encoding natively. Or maybe the programmer is in a low-memory (or 
very-large-data) situation and needs the space savings of UTF-8 regardless 
of OS and doesn't care about speed. Or maybe they're actually *writing* an 
OS (most modern languages are completely useless for writing an OS; D 
isn't). A language or a library should *never* assume it knows the 
programmer's needs better than the programmer does.

Also, C already tried the approach of platform-dependent type sizes (e.g. 
C's "int"), and it ended up being a big PITA that everyone had to invent 
hacks to work around.

> System programmers (i.e. OS programmers) may choose to think as they 
> expect it to be (since char width option can be added to compiler).<

See that's the thing, D is intended as a systems language, so a D programmer 
must be able to easily handle it that way whenever they need to.

>TCHAR in Windows is a good example of it. Whenever you need to determine 
>size of element (e.g. for allocation), you can use 'sizeof'. Again, it does 
>not mean that you're deprived of char/wchar/dchar capability. It still can 
>be supported (e.g. via ubyte/ushort/uint) for the sake of interoperability 
>or some special cases. Special string constants (e.g. ""b, ""w, ""d) can be 
>supported, too. My only point is that it would be good to have universal 
>char type that depends on platform.

You can have that easily:

    version(Windows)
        alias wstring tstring;
    else
        alias string tstring;
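With that alias, unpostfixed literals still Just Work, since they take on 
whatever type tstring happens to be on the current platform ("osFunc" is 
just a hypothetical example):

    void osFunc(tstring s) { /* hand s off to a native API */ }

    osFunc("hello"); // UTF-16 on Windows, UTF-8 elsewhere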

Besides, just because you *can* get a job done a certain way doesn't mean a 
language shouldn't try to offer a better way to those who want one.

> That, in turns, allows to have unified char for all libraries on this 
> platform.
>

With templated text functions, there is very little to be gained from a 
unified char type. It just wouldn't serve any real purpose; all it would do 
is cause problems for anyone who needs to work at the low level.

-------------------------------
Not sent from an iPhone.



