Wide characters support in D

Ruslan Nikolaev nruslan_devel at yahoo.com
Mon Jun 7 19:26:02 PDT 2010


> It only generates code for the types that are actually needed. If, for
> instance, your program never uses anything except UTF-8, then only one
> version of the function will be made - the UTF-8 version.  If you don't
> use every char type, then it doesn't generate it for every char type -
> just the ones you choose to use.

Not quite right. If we build system dynamic libraries, or other commonly used dynamic libraries, we will have to compile every instantiation ourselves unless we want to burden the user with it. Otherwise, the same code will be duplicated in users' programs over and over again.
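To make the dynamic library point concrete, here is a minimal sketch. The function names (countSpaces and the three wrappers) are hypothetical and only for illustration; the assumption is that a shared library can ship only concrete symbols, so the library author has to force each instantiation, for example through plain wrapper functions:

```d
// Generic implementation: the compiler emits one copy per character type
// that is actually instantiated somewhere.
size_t countSpaces(C)(const(C)[] s)
{
    size_t n = 0;
    foreach (C c; s)
        if (c == ' ')
            ++n;
    return n;
}

// Hypothetical exported entry points. Each wrapper forces one instantiation
// at library build time, so all three versions end up in the library.
export size_t countSpacesUtf8(const(char)[] s)   { return countSpaces!char(s); }
export size_t countSpacesUtf16(const(wchar)[] s) { return countSpaces!wchar(s); }
export size_t countSpacesUtf32(const(dchar)[] s) { return countSpaces!dchar(s); }

void main()
{
    // The same text in all three encodings gives the same answer.
    assert(countSpacesUtf8("a b c") == 2);
    assert(countSpacesUtf16("a b c"w) == 2);
    assert(countSpacesUtf32("a b c"d) == 2);
}
```

If the library exported only the template instead, every client program would instantiate (and carry) its own copy of the code.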

> That's not good. First of all, UTF-16 is a lousy encoding, it combines
> the worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned
> like UTF-8, but it still wastes a lot of space like UTF-32. So even if
> your OS uses it natively, it's still best to do most internal processing
> in either UTF-8 or UTF-32. (And with templated string functions, if the
> programmer actually does want to use the native type in the *rare* cases
> where he's making enough OS calls that it would actually matter, he can
> still do so.)
>

First of all, UTF-16 is not a lousy encoding. It needs 2 bytes for most characters (not much waste, especially if you consider other languages). Only REALLY rare characters need 4 bytes. UTF-8, on the other hand, needs 1 to 3 bytes for the same common characters, and also 4 bytes for the REALLY rare ones. In UTF-16 a multi-unit sequence (a surrogate pair) is the exception, whereas in UTF-8 a multi-byte sequence is the rule. When something is an exception, it won't affect performance in most cases; when something is the rule, it will.
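The byte counts above can be checked with a short sketch. It uses std.utf.codeLength from Phobos; the sample characters are my own choice for illustration (ASCII, Latin, Cyrillic, CJK, and one character outside the Basic Multilingual Plane):

```d
import std.stdio : writefln;
import std.utf : codeLength;

void main()
{
    // Illustrative sample: 'A', e-acute, Cyrillic zhe, a CJK ideograph,
    // and U+1D11E (musical G clef), which lies outside the BMP.
    dchar[] samples = ['A', '\u00E9', '\u0436', '\u4E2D', '\U0001D11E'];

    foreach (dchar c; samples)
    {
        // codeLength!C gives the number of code units of type C needed for c;
        // multiplying by the unit size gives the size in bytes.
        size_t utf8Bytes  = codeLength!char(c)  * char.sizeof;
        size_t utf16Bytes = codeLength!wchar(c) * wchar.sizeof;
        writefln("U+%05X  UTF-8: %s bytes  UTF-16: %s bytes",
                 cast(uint) c, utf8Bytes, utf16Bytes);
    }
}
```

For these samples UTF-8 needs 1, 2, 2, 3 and 4 bytes, while UTF-16 needs 2, 2, 2, 2 and 4 bytes. Only the last character requires a surrogate pair in UTF-16.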

Finally, UTF-16 is used by a variety of systems and tools: Windows, Java, C#, Qt and many others. The developers of these systems chose UTF-16 even though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8.


> Secondly, the programmer *should* be able to use whatever type he
> decides is appropriate. If he wants to stick with native, he can do

Why? He or she can just convert to UTF-32 (dchar) whenever a character needs to be examined more closely. At least, that's what should be done anyway.
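As a rough sketch of what I mean by converting to dchar on demand (the sample string is made up; foreach decoding and std.utf.toUTF32 are standard D/Phobos features):

```d
import std.stdio : writeln;
import std.utf : toUTF32;

void main()
{
    // A UTF-16 string with one character outside the BMP (U+1D11E).
    wstring ws = "a\U0001D11Ez"w;

    // .length counts UTF-16 code units, so the surrogate pair counts twice.
    writeln(ws.length);   // 4

    // Asking foreach for dchar decodes surrogate pairs on the fly.
    size_t codePoints = 0;
    foreach (dchar c; ws)
        ++codePoints;
    writeln(codePoints);  // 3

    // Alternatively, convert the whole string to UTF-32 once and index it.
    dstring ds = toUTF32(ws);
    writeln(ds.length);   // 3
    assert(ds[1] == '\U0001D11E');
}
```

The string stays in its native encoding for storage and OS calls, and is decoded to dchar only where individual characters matter.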

> 
> You can have that easily:
> 
> version(Windows)
>     alias wstring tstring;
> else
>     alias string tstring;
> 

See, that's my point. Nobody is going to do this unless the above is standardized by the language. Everybody will stick to one particular type (either char or wchar).


> 
> With templated text functions, there is very little benefit to be
> gained from having a unified char. Just wouldn't serve any real

See my comment above about templates and dynamic libraries.

Ruslan


      

