Wide character support in D
Nick Sabalausky
a at a.a
Mon Jun 7 18:57:11 PDT 2010
"Ruslan Nikolaev" <nruslan_devel at yahoo.com> wrote in message
news:mailman.122.1275952601.24349.digitalmars-d at puremagic.com...
> Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello
> world" representation. Were the postfixes "w" and "d" added initially or just
> recently? I did not know about them. I thought D does automatic conversion
> for string literals.
>
The postfixes 'c', 'w' and 'd' have been there a long time. But D does have
a little bit of automatic conversion. Let me try to clarify:
"hello"c // string, UTF-8
"hello"w // wstring, UTF-16
"hello"d // dstring, UTF-32
"hello" // Depends how you use it
Suppose I have a function that takes a UTF-8 string, and I call it:
void cfoo(string a) {}
cfoo("hello"c); // Works
cfoo("hello"w); // Error, wrong type
cfoo("hello"d); // Error, wrong type
cfoo("hello"); // Works, assumed to be UTF-8 string
If I make a different function that takes a UTF-16 wstring instead:
void wfoo(wstring a) {}
wfoo("hello"c); // Error, wrong type
wfoo("hello"w); // Works
wfoo("hello"d); // Error, wrong type
wfoo("hello"); // Works, assumed to be UTF-16 wstring
And then, a UTF-32 dstring version would be similar:
void dfoo(dstring a) {}
dfoo("hello"c); // Error, wrong type
dfoo("hello"w); // Error, wrong type
dfoo("hello"d); // Works
dfoo("hello"); // Works, assumed to be UTF-32 dstring
As you can see, the literals with postfixes are always the exact type you
specify. If you have no postfix, then you get whatever the compiler expects
it to be.
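You can check the default for yourself with type inference (a minimal
snippet; typeof and static assert are standard D):
auto s = "hello";
static assert(is(typeof(s) == string)); // with nothing to say otherwise, "hello" is UTF-8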
But, then the question is, what happens if any of those types can be used?
Which does the compiler choose?
void Tfoo(T)(T a)
{
    // When compiling, display the type used.
    pragma(msg, T.stringof);
}
Tfoo("hello");
(Normally you'd want to add a constraint that T must be one of the string
types, so that nobody tries to pass in an int or float or something. I
skipped that here, but it might look like the sketch below.)
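For instance, a minimal sketch of such a constraint (this assumes
std.traits, whose isSomeString template matches the string types):
import std.traits : isSomeString;

void Tfoo(T)(T a) if (isSomeString!T)
{
    // Accepts string, wstring, dstring (and other character arrays),
    // but rejects int, float, etc. at compile time.
    pragma(msg, T.stringof);
}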
In that, Tfoo isn't expecting any particular type of string, it can take any
type. And "hello" doesn't have a postfix, so the compiler uses the default:
UTF-8 string.
> Yes, templates may help. However, that unnecessarily makes the code bigger
> (since we have to compile it for every char type).
It only generates code for the types that are actually needed. If, for
instance, your program never uses anything except UTF-8, then only one
version of the function will be made - the UTF-8 version. If you don't use
every char type, then it doesn't generate code for every char type - just the
ones you actually use.
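A hypothetical snippet to see that in action (the pragma(msg, ...) fires
once per instantiated type, at compile time):
void process(T)(T s)
{
    pragma(msg, "instantiated for: " ~ T.stringof);
}

void main()
{
    process("hello"); // prints "instantiated for: string" while compiling;
                      // no wstring or dstring version is ever generated
}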
>The other problem is that it allows the programmer to choose which one to
>use. He or she may just prefer char[] as UTF-8 (or wchar[] as UTF-16). That
>will be fine on a platform that supports this encoding natively (e.g. for
>file system operations, screen output, etc.), whereas it will cause
>conversion overhead on the other. I don't think there is any problem with
>having different sizes of char. In fact, that would make programs better
>(since application programmers will have to think in terms of characters as
>opposed to bytes). Not to say that it's a big overhead, but an unnecessary
>one. Having said this, I do agree that there must be some flexibility (e.g.
>in Java char[] is always 2 bytes); however, I don't believe that this
>flexibility should be available to the application programmer.
That's not good. First of all, UTF-16 is a lousy encoding; it combines the
worst of both UTF-8 and UTF-32: thanks to surrogate pairs it's multi-unit and
non-word-aligned like UTF-8, but it still wastes a lot of space like UTF-32.
So even if your OS uses it natively, it's still best to do most internal
processing in either UTF-8 or UTF-32. (And with templated string functions,
if the programmer actually does want to use the native type in the *rare*
cases where he's making enough OS calls for it to matter, he still can.)
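To make the "multi-unit" point concrete (an illustrative snippet; the code
point is arbitrary):
// U+1D11E (MUSICAL SYMBOL G CLEF) is outside the Basic Multilingual
// Plane, so UTF-16 has to encode it as a surrogate pair:
wstring s = "\U0001D11E"w;
assert(s.length == 2); // two UTF-16 code units for one character
So wstring indexing is no more "one element per character" than string
indexing is.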
Secondly, the programmer *should* be able to use whatever type he decides is
appropriate. If he wants to stick with native, he can do so, but he
shouldn't be forced into choosing between "use the native encoding" and
"abuse the type system by pretending that an int is a character". For
instance, complex low-level text processing *relies* on knowing exactly what
encoding is being used and coding specifically to that encoding. As an
example, I'm currently working on a generalized parser library (
http://www.dsource.org/projects/goldie ). Something like that is complex
enough already that implementing the internal lexer natively for each
possible native text encoding is just not worthwhile, especially since the
text hardly ever gets passed to or from any OS calls that expect any
particular encoding. Or maybe you're on a fancy OS that can handle any
encoding natively. Or maybe the programmer is in a low-memory (or
very-large-data) situation and needs the space savings of UTF-8 regardless
of OS and doesn't care about speed. Or maybe they're actually *writing* an
OS (most modern languages are completely useless for writing an OS; D
isn't). A language or a library should *never* assume it knows the
programmer's needs better than the programmer does.
Also, C already tried the approach of platform-dependent type sizes (e.g.
C's "int"), and it turned out to be a big PITA disaster that everyone ended
up having to invent hacks to work around.
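(D, by contrast, fixes the size of its core types on every platform, which
you can verify directly:)
static assert(char.sizeof  == 1); // UTF-8 code unit
static assert(wchar.sizeof == 2); // UTF-16 code unit
static assert(dchar.sizeof == 4); // UTF-32 code unit
static assert(int.sizeof   == 4); // unlike C's int, always 32 bits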
> System programmers (i.e. OS programmers) may choose to treat it as they
> expect it to be (since a char-width option can be added to the compiler).
See that's the thing, D is intended as a systems language, so a D programmer
must be able to easily handle it that way whenever they need to.
>TCHAR in Windows is a good example of it. Whenever you need to determine the
>size of an element (e.g. for allocation), you can use 'sizeof'. Again, it
>does not mean that you're deprived of char/wchar/dchar capability. It can
>still be supported (e.g. via ubyte/ushort/uint) for the sake of
>interoperability or some special cases. Special string constants (e.g. ""b,
>""w, ""d) can be supported, too. My only point is that it would be good to
>have a universal char type that depends on the platform.
You can have that easily:
version(Windows)
    alias wstring tstring;
else
    alias string tstring;
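And then use it like any other string type ("tstring" is just the name we
made up above):
tstring greeting = "hello"; // wstring on Windows, string elsewhere
Tfoo(greeting);             // templates instantiate for the native width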
Besides, just because you *can* get a job done a certain way doesn't mean a
language shouldn't try to offer a better way to those who want one.
> That, in turn, allows having a unified char for all libraries on the
> platform.
>
With templated text functions, there is very little benefit to be gained
from having a unified char; it just wouldn't serve any real purpose. All it
would do is cause problems for anyone who needs to work at the low level.
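To illustrate, here's a sketch of our own (not a Phobos function): foreach
with a dchar loop variable decodes any of the three encodings, so one
template body covers them all:
size_t countChar(T)(T s, dchar c) if (isSomeString!T)
{
    size_t n = 0;
    foreach (dchar ch; s) // decodes UTF-8/16/32 to code points
        if (ch == c) ++n;
    return n;
}

assert(countChar("hello"c, 'l') == 2);
assert(countChar("hello"w, 'l') == 2);
assert(countChar("hello"d, 'l') == 2);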
-------------------------------
Not sent from an iPhone.