Wide characters support in D

Nick Sabalausky a at a.a
Mon Jun 7 22:55:40 PDT 2010


"Ruslan Nikolaev" <nruslan_devel at yahoo.com> wrote in message 
news:mailman.124.1275963971.24349.digitalmars-d at puremagic.com...

Nick wrote:
> It only generates code for the types that are actually needed. If, for
> instance, your program never uses anything except UTF-8, then only one
> version of the function will be made - the UTF-8 version. If you don't use
> every char type, then it doesn't generate it for every char type - just the
> ones you choose to use.

>Not quite right. If we create system dynamic libraries or commonly used 
>dynamic libraries, we will have to compile every instance unless we want to 
>burden the user with this. Otherwise, the same code will be duplicated in 
>users' programs over and over again.<

That's a rather minor issue. I think you're overestimating the amount of 
bloat that occurs from having three string types versus one. The absolute 
worst-case scenario would be a library that contains nothing but 
text-processing functions. That would triple in size, but what's the biggest 
such lib you've ever seen anyway? And for most libs, only a fraction is 
going to be taken up by text processing, so the difference won't be 
particularly large. In fact, the difference would likely be dwarfed anyway 
by the bloat incurred from all the other templated code (i.e., code which 
would be largely unaffected by the number of string types), and yes, *that* 
can get to be a problem, but it's an entirely separate one.
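
To illustrate the point about templated string functions (just a sketch with 
a made-up function, not anything from Phobos): the compiler only emits the 
instantiations a program actually uses, so a UTF-8-only program only pays 
for the UTF-8 version.

    import std.stdio;

    // Counts spaces in a string of any character type. Code is generated
    // only for the Char types the program actually instantiates.
    size_t countSpaces(Char)(const(Char)[] s)
    {
        size_t n = 0;
        foreach (Char c; s)
            if (c == ' ')
                ++n;
        return n;
    }

    void main()
    {
        // Only the char (UTF-8) instantiation ends up in the binary here;
        // no wchar or dchar versions are emitted unless they're used.
        writeln(countSpaces("one two three"));
    }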

> That's not good. First of all, UTF-16 is a lousy encoding, it combines the
> worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned like
> UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS
> uses it natively, it's still best to do most internal processing in either
> UTF-8 or UTF-32. (And with templated string functions, if the programmer
> actually does want to use the native type in the *rare* cases where he's
> making enough OS calls that it would actually matter, he can still do so.)
>
>
>First of all, UTF-16 is not a lousy encoding. It requires 2 bytes for most 
>characters (not much wastage, especially if you consider other languages). 
>Only for REALLY rare chars do you need 4 bytes. Whereas UTF-8 will require 
>from 1 to 3 bytes for the same common characters, and also 4 bytes for 
>REALLY rare ones. In UTF-16 a surrogate is an exception, whereas in UTF-8 
>multi-byte sequences are the rule (when something is an exception, it won't 
>affect performance in most cases; when something is a rule, it will).<

Maybe "lousy" is too strong a word, but aside from compatibility with other 
libs/software that use it (which I'll address separately), UTF-16 is not 
particularly useful compared to UTF-8 and UTF-32:

Non-latin-alphabet language: UTF-8 vs UTF-16:

The real-world difference in sizes is minimal. But UTF-8 has some advantages: 
The nature of the encoding makes backwards-scanning cheaper and easier. 
Also, as Walter said, bugs in the handling of multi-code-unit characters 
become fairly obvious. Advantages of UTF-16: None.
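
The backwards-scanning point comes from UTF-8 being self-synchronizing: 
continuation bytes always have the bit pattern 10xxxxxx, so you can step 
back to the start of the previous character without decoding anything. A 
rough sketch (backUpOneChar is just an illustrative helper, not a Phobos 
function):

    // Steps back from index i (the start of a code point, or s.length) to
    // the start of the previous code point. UTF-8 continuation bytes always
    // look like 0b10xxxxxx, so we just skip over them.
    size_t backUpOneChar(const(char)[] s, size_t i)
    {
        do
        {
            --i;
        } while (i > 0 && (s[i] & 0xC0) == 0x80);
        return i;
    }

    void main()
    {
        string s = "héllo";
        assert(backUpOneChar(s, s.length) == s.length - 1); // back over 'o'
        assert(backUpOneChar(s, 3) == 1);           // back over the 2-byte 'é'
    }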

Latin-alphabet language: UTF-8 vs UTF-16:

All the same UTF-8 advantages for non-latin-alphabet languages still apply, 
plus there's a space savings: Under UTF-8, *most* characters are going to be 
1 byte. Yes, there will be the occasional 2+ byte character, but they're so 
much less common that the overhead compared to ASCII (I'm only using ASCII 
as a baseline here, for the sake of comparisons) would only be around 0% to 
15% depending on the language. UTF-16, however, has a consistent 100% 
overhead (slightly more when you count surrogate pairs, but I'll just leave 
it at 100%). So, depending on language, UTF-16 would be around 70%-100% 
larger than UTF-8. That's not insignificant.

Any language: UTF-32 vs UTF-16:

Using UTF-32 takes up extra space, but when that matters, UTF-8 already has 
the advantage over UTF-16 anyway regardless of whether or not UTF-8 is 
providing a space savings (see above), so the question of UTF-32 vs UTF-16 
becomes useless. The rest of the time, UTF-32 has these advantages: 
Guaranteed one code-unit per character. And the code-unit size is a better 
fit for typical CPUs, which generally handle 32-bit values faster than 8- or 
16-bit ones. Advantages of UTF-16: None.

So compatibility with certain tools/libs is really the only reason ever to 
choose UTF-16.
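
For concreteness, here's a rough sketch of the size comparison (the text is 
just an illustrative ASCII sample; the ratios shift with the language, as 
described above). .length counts code units, so multiplying by the code-unit 
size gives the byte cost of each encoding:

    import std.stdio;

    void main()
    {
        string  s8  = "hello"; // UTF-8:  1 byte per character for ASCII text
        wstring s16 = "hello"; // UTF-16: 2 bytes per character for same text
        dstring s32 = "hello"; // UTF-32: 4 bytes per character, always

        writefln("UTF-8:  %s bytes", s8.length  * char.sizeof);
        writefln("UTF-16: %s bytes", s16.length * wchar.sizeof);
        writefln("UTF-32: %s bytes", s32.length * dchar.sizeof);
    }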

>Finally, UTF-16 is used by a variety of systems/tools: Windows, Java, C#, 
>Qt and many others. Developers of these systems chose to use UTF-16 even 
>though some of them (e.g. Java, C#, Qt) were developed in the era of UTF-8<

First of all, it's not exactly unheard of for big projects to make a 
sub-optimal decision.

Secondly, Java and Windows adopted 16-bit encodings back when many people 
were still under the mistaken impression that doing so would allow them to 
hold any character in one code-unit. If that had been true, then it would 
indeed have 
had at least certain advantages over UTF-8. But by the time the programming 
world at large knew better, it was too late for Java or Windows to 
re-evaluate the decision; they'd already jumped in with both feet. C# and 
.NET use UTF-16 because Windows does. I don't know about Qt, but judging by 
how long Wikipedia says it's been around, I'd say it's probably the same 
story.

As for choosing to use UTF-16 because of interfacing with other tools and 
libs that use it: That's certainly a good reason to use UTF-16. But it's 
about the only reason. And it's a big mistake to just assume that the 
overhead of converting to/from UTF-16 when crossing those API borders is 
always going to outweigh all other concerns:

For instance, if you're writing an app that does a large amount of 
text-processing on relatively small amounts of text and only deals a little 
bit with a UTF-16 API, then the overhead of operating on 16 bits at a time 
can easily outweigh the overhead from the UTF-16 <-> UTF-32 conversions.

Or, maybe the app you're writing is more memory-limited than speed-limited.

There are perfectly legitimate reasons to want to use an encoding other than 
the OS-native one. Why force those people to circumvent the type system to do 
it? Especially in a language that's intended to be usable as a systems 
language. Just to potentially save a couple megs on some .dll or .so?
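
For reference, here's what paying the conversion cost only at the API border 
looks like, as a minimal sketch using std.utf: internal processing stays in 
UTF-8, and the UTF-16 form only exists around the foreign call.

    import std.utf;

    void main()
    {
        string text = "processed internally as UTF-8";

        // Convert right before handing the text to a UTF-16 API...
        wstring utf16 = toUTF16(text);

        // ...and convert any UTF-16 result back for further internal work.
        // (toUTF16z in the same module gives a zero-terminated pointer for
        // calling Windows "W" functions directly.)
        string back = toUTF8(utf16);
    }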

> Secondly, the programmer *should* be able to use whatever type he decides
> is appropriate. If he wants to stick with native, he can do
>
>Why? He/She can just use conversion to UTF-32 (dchar) whenever a better 
>understanding of the characters is needed. At least, that's what should be 
>done anyway.<

Weren't you saying that the main point of just having one string type (the 
OS-native string) was to avoid unnecessary conversions? But now you're 
arguing that it's fine to do unnecessary conversions and to have multiple 
string types?
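
For the record, converting to dchar on demand is already cheap in D: a 
foreach over a narrow string with a dchar loop variable decodes code points 
on the fly, without building a dstring. A minimal sketch:

    import std.stdio;

    void main()
    {
        string s = "naïve"; // stored as UTF-8

        // Asking for a dchar loop variable makes foreach decode each code
        // point as it goes, so character-level work doesn't require
        // converting the whole string to UTF-32 up front.
        foreach (dchar c; s)
            writeln(c);
    }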

>
> You can have that easily:
>
> version(Windows)
>     alias wstring tstring;
> else
>     alias string tstring;
>
>See that's my point. Nobody is going to do this unless the above is 
>standardized by the language. Everybody will stick to something particular 
>(either char or wchar).<

True enough. I don't have anything against having something like that in the 
std library as long as the others are still available too. Could be useful 
in a few cases. I do think having it *instead* of the three types is far too 
presumptuous, though.
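
As a hedged sketch of how such a library alias could sit alongside the 
existing types (tstring here is the hypothetical alias from the quoted 
example, not an actual Phobos name):

    import std.conv;

    version (Windows)
        alias wstring tstring; // OS-native UTF-16 on Windows
    else
        alias string tstring;  // UTF-8 elsewhere

    void main()
    {
        // Code that genuinely wants the OS-native encoding can use the
        // alias, while std.conv.to still bridges to and from the other
        // string types when needed.
        tstring native = to!tstring("hello");
        string  utf8   = to!string(native);
    }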



