To Walter, about char[] initialization by FF

Mon Jul 31 15:51:52 PDT 2006

Andrew Fedoniouk wrote:
> The problem as I can see is this:
> D propose to use transport encoding for manipulation purposes
> which is main problem imo here - transport encodings are not
> designed for the manipulation - it is extremely difficult to use
> them for manipulation in practice as we may see.

I disagree with the characterization that it is "extremely difficult" to use 
for manipulation. foreach's direct support for it, as well as the 
functions in std.utf, make it straightforward. DMDScript is built around 
UTF-8, and manipulating multibyte characters in it has not turned out to 
be a significant problem.
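
As a rough illustration of that point, here is a minimal sketch of both 
approaches. It assumes a present-day D compiler and Phobos (the 2006 
library differs in some details, such as the string alias), and the 
helper names countCodePoints and codePoints are made up for this example.

```d
import std.utf : decode;

// Count code points by letting foreach decode the UTF-8 on the fly.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s) // each multibyte sequence arrives as one dchar
        ++n;
    return n;
}

// The same walk done by hand: decode() returns the code point that
// starts at index i and advances i past its bytes.
dchar[] codePoints(const(char)[] s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i);
    return result;
}

void main()
{
    string s = "na\u00EFve \u65E5\u672C"; // "naive" with a diaeresis, then two kanji
    assert(s.length == 13);               // 13 UTF-8 code units (bytes)
    assert(countCodePoints(s) == 8);      // 8 code points
    assert(codePoints(s)[2] == '\u00EF'); // third code point is the two-byte one
}
```

The array length counts bytes, while the foreach loop and decode() both 
see whole characters, so code that needs characters never has to pick 
the bytes apart itself.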

It's also certainly easier than code-page-based multibyte designs like 
Shift-JIS (I used to write code for Shift-JIS).

> Encoding like UTF-8 and UTF-16 are almost useless
> with let's say Windows API, say TextOutA and TextOutW functions.
> Neither one of them will accept D's char[] and wchar[] directly.
> 
> - ***A  functions in Windows take byte string (LPSTR) and current
>   codepage id  to render text. ( byte + codepage = Unicode Code Point )

Win9x only supports the A functions, and Phobos does a translation of 
the output into the Win9x code page when running on Win9x. Of course, 
this fails when one has characters not supported by Win9x, but code 
pages aren't going to help that either.

Win9x is obsolete anyway, and there's no reason to cripple a new 
language by accommodating the failures of an obsolete system.

When running on NT or later versions of Windows, the W functions are used 
instead, which work directly with UTF-16. Later versions of Windows also 
support UTF-8 with the A functions.

> - ***W functions in Windows use LPWSTR things which are
>   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
>   (  cast(dword) word  = Unicode Code Point )
>   Only few functions in Windows API treat LPWSTR as UTF-16.

BMP is a proper subset of UTF-16. The only difference is that BMP 
doesn't do the 2-word surrogate pair encodings. But those are reserved 
in BMP anyway, so there is no conflict. Windows has been upgraded to 
handle them. Early versions of NT that couldn't handle surrogate pairs 
didn't work with those code points anyway, so nothing is gained by going 
to code pages.

So, the W functions can and do take UTF-16 directly, and in fact the 
Phobos implementation does use the W functions, transmitting wchar[] to 
them, and it works fine.
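
To make the BMP versus surrogate pair distinction concrete, here is a 
small sketch using std.utf.toUTF16. It assumes a present-day D compiler 
and Phobos, the helper name utf16Units is made up for this example, and 
no Windows call is made, so this shows only the encoding side.

```d
import std.utf : toUTF16;

// Number of 16-bit code units needed to hold s as UTF-16.
size_t utf16Units(string s)
{
    return toUTF16(s).length;
}

void main()
{
    // U+65E5 lies inside the BMP, so it is a single wchar whose value
    // is the code point itself.
    wstring bmp = toUTF16("\u65E5");
    assert(bmp.length == 1 && bmp[0] == 0x65E5);

    // U+1D11E lies outside the BMP, so it becomes a surrogate pair.
    wstring astral = toUTF16("\U0001D11E");
    assert(astral.length == 2);
    assert(astral[0] == 0xD834); // high surrogate, range D800..DBFF
    assert(astral[1] == 0xDD1E); // low surrogate, range DC00..DFFF

    assert(utf16Units("\u65E5\U0001D11E") == 3);
}
```

For text that stays within the BMP, the wchar[] contents are identical 
under either reading, which is why treating LPWSTR as UTF-16 costs 
nothing for code that only handles BMP characters.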

The neat thing about Phobos is that it adapts to whether you are using 
Win9x, full 32-bit Windows, or Linux, and adjusts the char output 
accordingly so it "just works."

> -----------------
> "D strings are utf encoded sequences only" is a design mistake, IMO.
> On disk (serialized form) - yes. But not in memory for manipulation please.

There isn't any better method of handling international character sets 
in a portable way. Code pages have serious, crippling, unfixable 
problems - including all the downsides of multibyte systems (because the 
Asian code pages are multibyte).