To Walter, about char[] initialization by FF
Walter Bright
newshound at digitalmars.com
Mon Jul 31 15:51:52 PDT 2006
Andrew Fedoniouk wrote:
> The problem as I can see is this:
> D proposes to use transport encoding for manipulation purposes
> which is main problem imo here - transport encodings are not
> designed for the manipulation - it is extremely difficult to use
> them for manipulation in practice as we may see.
I disagree with the characterization that it is "extremely difficult" to use
for manipulation. foreach's direct support for it, as well as the
functions in std.utf, make it straightforward. DMDScript is built around
UTF-8, and manipulating multibyte characters in it has not turned out to
be a significant problem.
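
To illustrate why walking a UTF-8 string one character at a time is not hard, here is a minimal sketch of the decoding loop. It is written in Python rather than D and is not the Phobos implementation; the name decode_utf8 is made up for this example, and it skips the overlong-form and surrogate checks that a production decoder would need.

```python
def decode_utf8(data: bytes):
    """Yield one Unicode code point per UTF-8 sequence in data.

    Illustrative only: does not reject overlong forms or encoded surrogates.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte alone tells us how long the sequence is.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # Each continuation byte contributes its low 6 bits.
        for b in data[i + 1 : i + length]:
            if b >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte near offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += length
```

The caller only ever sees whole code points, which is the same effect as a foreach over a char[] with a dchar loop variable.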
It's also certainly easier than code-page-based multibyte designs like
Shift-JIS (I used to write code for Shift-JIS).
> Encoding like UTF-8 and UTF-16 are almost useless
> with let's say Windows API, say TextOutA and TextOutW functions.
> Neither one of them will accept D's char[] and wchar[] directly.
>
> - ***A functions in Windows take byte string (LPSTR) and current
> codepage id to render text. ( byte + codepage = Unicode Code Point )
Win9x only supports the A functions, and Phobos does a translation of
the output into the Win9x code page when running on Win9x. Of course,
this fails when one has characters not supported by Win9x, but code
pages aren't going to help that either.
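
As a rough sketch of what translating output into a code page means, and of how it fails: the Python snippet below is only an analogy, not what Phobos does internally. The function name to_code_page and the choice of cp1252 as the target code page are assumptions made for the example.

```python
def to_code_page(text: str, code_page: str = "cp1252") -> bytes:
    """Translate text into a single code page.

    Characters the code page cannot represent are replaced with '?',
    which is the lossy behavior described above.
    """
    return text.encode(code_page, errors="replace")


if __name__ == "__main__":
    print(to_code_page("caf\u00e9"))      # representable in cp1252
    print(to_code_page("\u65e5\u672c"))   # not representable in cp1252
```

Text that fits the code page survives the trip; anything else is lost, no matter how the string was stored in memory beforehand.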
Win9x is obsolete anyway, and there's no reason to cripple a new
language by accommodating the failures of an obsolete system.
When running on NT or later Windows, the W functions are used instead,
which work directly with UTF-16. Later versions of Windows also support
UTF-8 with the A functions.
> - ***W functions in Windows use LPWSTR things which are
> sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
> ( cast(dword) word = Unicode Code Point )
> Only few functions in Windows API treat LPWSTR as UTF-16.
BMP is a proper subset of UTF-16. The only difference is that BMP
doesn't do the 2-word surrogate pair encodings. But those are reserved
in BMP anyway, so there is no conflict. Windows has been upgraded to
handle them. Early versions of NT that couldn't handle surrogate pairs
didn't work with those code points anyway, so nothing is gained by going
to code pages.
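
The relationship between BMP code points and UTF-16 can be sketched in a few lines. This is Python, not D, and the function name to_utf16_units is invented for the example; the arithmetic is the standard UTF-16 encoding rule.

```python
def to_utf16_units(cp: int) -> list[int]:
    """Encode one Unicode code point as a list of 16-bit UTF-16 code units."""
    if 0xD800 <= cp <= 0xDFFF:
        # This range is reserved for surrogates, so it never collides
        # with a real BMP character.
        raise ValueError("surrogate code points cannot be encoded")
    if cp > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if cp < 0x10000:
        # BMP: the code unit is simply the code point.
        return [cp]
    # Outside the BMP: split the 20-bit offset into a surrogate pair.
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
```

For any BMP character the result is the single 16-bit value a BMP-only API expects, which is why passing wchar[] to the W functions works.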
So, the W functions can and do take UTF-16 directly, and in fact the
Phobos implementation does use the W functions, transmitting wchar[] to
them, and it works fine.
The neat thing about Phobos is it adapts to whether you are using Win9x,
full 32 bit Windows, or Linux, and adjusts the char output accordingly
so it "just works."
> -----------------
> "D strings are utf encoded sequences only" is a design mistake, IMO.
> On disk (serialized form) - yes. But not in memory for manipulation please.
There isn't any better method of handling international character sets
in a portable way. Code pages have serious, crippling, unfixable
problems - including all the downsides of multibyte systems (because the
Asian code pages are multibyte).
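
One concrete example of the multibyte downside, sketched in Python with its built-in shift_jis codec (the helper name contains_ascii_byte is made up for this example). In Shift-JIS the second byte of a two-byte character can fall in the ASCII range, so a naive byte search for an ASCII character can match in the middle of a character. In UTF-8 every byte of a multibyte sequence is 0x80 or above, so that cannot happen.

```python
def contains_ascii_byte(text: str, encoding: str, ascii_char: str) -> bool:
    """Report whether the encoded bytes of text contain the given ASCII byte."""
    return ascii_char.encode("ascii") in text.encode(encoding)


if __name__ == "__main__":
    kanji = "\u8868"  # a single kanji; contains no backslash
    # Shift-JIS encodes it as 0x95 0x5C, and 0x5C is the ASCII backslash.
    print(contains_ascii_byte(kanji, "shift_jis", "\\"))
    # UTF-8 encodes it as 0xE8 0xA1 0xA8; no byte is in the ASCII range.
    print(contains_ascii_byte(kanji, "utf-8", "\\"))
```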