[Issue 5016] to!() can not convert from wide characters to char

Sun Jan 9 15:18:20 PST 2011

http://d.puremagic.com/issues/show_bug.cgi?id=5016

Jonathan M Davis <jmdavisProg at gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg at gmx.com

--- Comment #2 from Jonathan M Davis <jmdavisProg at gmx.com> 2011-01-09 15:16:32 PST ---
char is explicitly defined to be a UTF-8 code unit. wchar is explicitly defined
to be a UTF-16 code unit. dchar is explicitly defined to be a UTF-32 code unit.
In UTF-8 and UTF-16, it can take multiple code units to make up a code point,
whereas it always takes exactly one UTF-32 code unit to make a code point. A
code point is what you would normally think of as a character. This is all
standard Unicode stuff and getting rid of it would be foolish. It's used all
over the place in computing, not just in D.
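
As a minimal sketch of the code unit / code point distinction (the choice of
U+1F600 is just an illustrative assumption; any code point outside the Basic
Multilingual Plane behaves the same way), the same single code point has a
different length in each string type, because .length counts code units:

```d
import std.stdio : writeln;

void main()
{
    // One code point, U+1F600, written as an escape so the example does
    // not depend on the encoding of the source file.
    string  s = "\U0001F600";   // UTF-8
    wstring w = "\U0001F600"w;  // UTF-16
    dstring d = "\U0001F600"d;  // UTF-32

    // .length is the number of code units, not code points.
    writeln(s.length); // 4
    writeln(w.length); // 2
    writeln(d.length); // 1
}
```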

Part of the trick to dealing with char and wchar correctly is that if you wish
to deal with code points / characters (_not_ code units), then _never_ deal
with char and wchar individually. That's why most of std.string deals with
entire strings at a time. If you want to deal with an individual character, you
either use a dchar or one of the string types - e.g. 'a' as a dchar or "a" as a
string type. You shouldn't be converting from dchar to char and vice versa (or
between either of those and wchar). It really doesn't make sense. What makes
sense is converting between string types.
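
A short sketch of converting between string types with std.conv.to, which is
the conversion that does make sense here. The sample text "na\u00EFve" is only
an assumption for illustration:

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string s = "na\u00EFve";      // 5 code points; U+00EF needs 2 UTF-8 code units

    // Whole strings are transcoded, so multi-unit sequences stay intact.
    wstring w = to!wstring(s);
    dstring d = to!dstring(s);
    writeln(s.length, " ", w.length, " ", d.length); // 6 5 5

    // The round trip back to UTF-8 gives the original string.
    assert(to!string(d) == s);

    // A single character is held either as a dchar or as a one-character
    // string, not as an individual char or wchar.
    dchar c = '\u00EF';
    string one = "\u00EF";
    assert(one.length == 2);
    assert(to!dstring(one)[0] == c);
}
```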

On the whole, what D does works fantastically, but you need to understand the
basics of Unicode. The best place to look would probably be The D Programming
Language by Andrei Alexandrescu, since it applies directly to D, but there are
plenty of places online to find info on Unicode, and you can look at the online
docs on arrays for more info about them: http://is.gd/krYRH .

What it comes down to really is that you use whatever string type you need
based on size - string, wstring, or dstring - or the need to be able to treat
an individual array index as a character. If you need to be able to use random
access on a string (including using them in algorithms in std.algorithm which
require random access ranges), or if you need to be able to alter individual
characters in place, then use dstring or dchar[]. Otherwise, save space and use
either string or wstring (string would generally be better unless you're using
primarily Asian characters, since they tend to take 3 bytes in UTF-8 and 2 in
UTF-16).
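
To make the size trade-off concrete, here is a rough sketch. byteSize is a
hypothetical helper written for this example, not a Phobos function, and the
sample strings are arbitrary:

```d
import std.stdio : writeln;

// Hypothetical helper: bytes occupied by the array's contents.
size_t byteSize(T)(const(T)[] text)
{
    return text.length * T.sizeof;
}

void main()
{
    // Three CJK code points (U+65E5 U+672C U+8A9E), written as escapes.
    writeln(byteSize("\u65E5\u672C\u8A9E"));  // 9  (UTF-8, 3 bytes each)
    writeln(byteSize("\u65E5\u672C\u8A9E"w)); // 6  (UTF-16, 2 bytes each)
    writeln(byteSize("\u65E5\u672C\u8A9E"d)); // 12 (UTF-32, 4 bytes each)

    // ASCII text is smallest as UTF-8.
    writeln(byteSize("hello"));  // 5
    writeln(byteSize("hello"d)); // 20

    // With dchar[], index 1 is the second character, and it can be
    // replaced in place.
    dchar[] buf = "h\u00E9llo"d.dup;
    buf[1] = 'e';
    assert(buf == "hello"d);
}
```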

There are functions which specifically take a dchar, so you can give them a
character then, but most deal entirely in strings, even if what you really care
about is an individual character. So, generally just treat individual
characters as strings with one character.

Take a look at the functions in std.utf: http://is.gd/krZLW . e.g.
std.utf.count() can be used to tell you how many code points / characters there
are in a string, and std.utf.stride() will tell you how many code units a
particular character is so that you can index into a string or wstring if you
have to.
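
As a sketch of how those two functions can be combined to find where a given
character starts in a string. codeUnitIndex is a hypothetical helper made up
for this example, and it assumes valid UTF-8 with n no larger than the number
of characters:

```d
import std.stdio : writeln;
import std.utf : count, stride;

// Hypothetical helper: code unit index at which the n-th character
// (counting from 0) starts.
size_t codeUnitIndex(string s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
        i += stride(s, i); // skip one whole character
    return i;
}

void main()
{
    string s = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    writeln(s.length); // 6 code units
    writeln(count(s)); // 5 code points

    // Slice out the third character without cutting a sequence in half.
    size_t i = codeUnitIndex(s, 2);
    writeln(i);                        // 3
    writeln(s[i .. i + stride(s, i)]); // l
}
```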

When using foreach, make sure that you give the type as dchar. e.g.

string str = "hello world";

foreach(dchar c; str)
    writeln(c);

will print out each character individually, whereas using char (which is the
default if you don't give a type) or wchar would print out the individual code
units (which isn't generally very useful). foreach is smart enough to convert
the string to the appropriate type on the fly while iterating over it, so if
you give it dchar, it'll take each code point at a time instead of each code
unit.
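
With plain ASCII like "hello world" the two loops behave the same, so here is
a small sketch with one non-ASCII character to show the difference.
countUnits and countPoints are hypothetical helpers written only for this
example:

```d
import std.stdio : writeln;

// Iterations when foreach walks the UTF-8 code units.
size_t countUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

// Iterations when foreach decodes to code points on the fly.
size_t countPoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string str = "h\u00E9llo"; // U+00E9 takes two UTF-8 code units

    writeln(countUnits(str));  // 6
    writeln(countPoints(str)); // 5
}
```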

I'm sure that there are other things that would be useful to point out, but
that's all that comes to mind at the moment. On the whole, the way D handles
strings is fantastic. You just have to realize that you're dealing with UTF-8,
UTF-16, and UTF-32 code units instead of code points when you have a char,
wchar, or dchar respectively. dchar/UTF-32 is the only type where code units
and code points are the same size.

There has been some talk of various improvements to how all of this works (like
possibly making dchar the default type for foreach with string types), so some
incremental improvements may be made to iron out some of the wrinkles, but
strings in D are designed the way that they are on purpose, and it's not likely
to be drastically changed. For the most part, the problem is not the design but
rather understanding what the design is so that you can use it properly.

If you want to avoid the whole issue, then you can just use dstring everywhere,
but that _will_ result in using about 4 times the amount of memory as you would
need with string if you're dealing primarily with ASCII characters.
