To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Mon Jul 31 18:23:19 PDT 2006


"Walter Bright" <newshound at digitalmars.com> wrote in message 
news:eam1ec$10e1$1 at digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> The problem, as I see it, is this:
>> D proposes to use transport encodings for manipulation purposes,
>> which is the main problem here, IMO - transport encodings are not
>> designed for manipulation - it is extremely difficult to use
>> them for manipulation in practice, as we may see.
>
> I disagree with the characterization that it is "extremely difficult" to use 
> for manipulation. foreach's direct support for it, as well as the 
> functions in std.utf, make it straightforward. DMDScript is built around 
> UTF-8, and manipulating multibyte characters in it has not turned out to 
> be a significant problem.
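
For illustration, a minimal sketch of that foreach support (D1-era code; the
sample string is arbitrary): declaring the loop variable as dchar makes
foreach decode the UTF-8 octets into full code points on the fly.

    import std.stdio;

    void main()
    {
        char[] s = "naïve";      // five code points, six UTF-8 octets
        foreach (dchar c; s)     // foreach decodes the octets on the fly
            writefln("U+%04X", cast(uint) c);
    }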

Sorry, but strings in DMDScript are quite different:
0) there is no such thing as char in JavaScript.
1) strings are Strings - not vectors of octets - js::string[] and d::char[] 
are different things.
2) they are not supposed to be passed to any OS API.
3) there are 12 or so methods on the String class in JS - a limited perimeter -
so the model you've chosen to store them in is irrelevant -
in some implementations they are even represented as a list of fixed runs.

>
> It's also certainly easier than codepage-based multibyte designs like 
> Shift-JIS (I used to write code for Shift-JIS).
>
>> Encodings like UTF-8 and UTF-16 are almost useless
>> with, let's say, the Windows API - say, the TextOutA and TextOutW functions.
>> Neither of them will accept D's char[] and wchar[] directly.
>>
>> - ***A  functions in Windows take a byte string (LPSTR) and the current
>>   codepage id to render text. ( byte + codepage = Unicode Code Point )
>
> Win9x only supports the A functions,

You are not right here.

TextOutA and TextOutW are both supported by Win98.
And the intention in Harmonia was to use only those ***W
functions that come out of the box on Win98 (without the need for MSLU).
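
For example, a sketch of what such a call could look like from D; the
TextOutW binding is written out by hand here for illustration, and drawText
is a made-up helper, not Harmonia code:

    import std.utf;

    // Hand-written binding for illustration; real code would import
    // the Windows API declarations instead.
    extern (Windows) int TextOutW(void* hdc, int x, int y,
                                  wchar* str, int len);

    // D keeps text as UTF-8; the W API wants UTF-16 code units,
    // so convert at the boundary.
    void drawText(void* hdc, int x, int y, char[] s)
    {
        wchar[] w = toUTF16(s);
        TextOutW(hdc, x, y, w.ptr, cast(int) w.length);
    }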

> and Phobos does a translation of the output into the Win9x code page when 
> running on Win9x. Of course, this fails when one has characters not 
> supported by Win9x, but code pages aren't going to help that either.
>
> Win9x is obsolete anyway, and there's no reason to cripple a new language 
> by accommodating the failures of an obsolete system.

There is a huge market of embedded devices.
If you think that computer evolution expands only in the more-RAM-and-speed
direction, then you are in trouble.

http://www.litepc.com/graphics/eossystem.jpg


>
> When running on NT or later Windows, the W functions are used instead 
> which work directly with UTF-16. Later Windows also support UTF-8 with the 
> A functions.

http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx


>
>> - ***W functions in Windows use LPWSTR things, which are
>>   sequences of code points from the Unicode Basic Multilingual Plane (BMP).
>>   (  cast(dword) word  = Unicode Code Point )
>>   Only a few functions in the Windows API treat LPWSTR as UTF-16.
>
> BMP is a proper subset of UTF-16. The only difference is that BMP doesn't 
> do the 2-word surrogate pair encodings. But those are reserved in BMP 
> anyway, so there is no conflict. Windows has been upgraded to handle them. 
> Early versions of NT that couldn't handle surrogate pairs didn't work with 
> those code points anyway, so nothing is gained by going to code pages.

Sorry, this scares me: "BMP is a proper subset of UTF-16".
UTF-16 is the group name of *byte stream encodings*
(UTF-16LE and UTF-16BE) of the Unicode code set.

BTW: which of these UTFs does D use? Platform dependent, I believe.
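
For concreteness, this is what the surrogate-pair mechanism mentioned above
does to a code point outside the BMP (U+1D11E is just an example):

    void main()
    {
        uint cp = 0x1D11E;                    // musical symbol G clef
        cp -= 0x10000;                        // offset into supplementary planes
        wchar hi = cast(wchar)(0xD800 | (cp >> 10));    // high surrogate: 0xD834
        wchar lo = cast(wchar)(0xDC00 | (cp & 0x3FF));  // low surrogate:  0xDD1E
        assert(hi == 0xD834 && lo == 0xDD1E);
    }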


>
> So, the W functions can and do take UTF-16 directly, and in fact the 
> Phobos implementation does use the W functions, transmitting wchar[] to 
> them, and it works fine.
>
> The neat thing about Phobos is it adapts to whether you are using Win9x, 
> full 32-bit Windows, or Linux, and adjusts the char output accordingly so 
> it "just works."
>

It should work well. Efficiently, I mean.
The language shall be as agnostic to the meaning of char as possible.
It shall not prevent you from writing efficient algorithms.

>> -----------------
>> "D strings are utf encoded sequences only" is a design mistake, IMO.
>> On disk (serialized form) - yes. But not in memory for manipulation 
>> please.
>
> There isn't any better method of handling international character sets in 
> a portable way. Code pages have serious, crippling, unfixable problems - 
> including all the downsides of multibyte systems (because the Asian code 
> pages are multibyte).

We are speaking different languages:

A: "strings are UTF-encoded sequences only" is a design mistake.
W: "use any encoding other than UTF" is a design mistake.

Different meanings, eh?

Forget about codepages.
Let those who are aware of them deal with them efficiently.
A "codepage" (c) Walter (e.g. ASCII) is an efficient way of
representing text. That is it.

Others who can afford the full set will work with full 21-bit values.
Practically, 16 bits (the BMP) is enough, but...
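
A minimal sketch of the model argued for here, using Phobos's own std.utf
conversions (processText is a made-up name): decode once into fixed-width
code points, manipulate freely, and re-encode only when serializing.

    import std.utf;

    char[] processText(char[] utf8)
    {
        dchar[] text = toUTF32(utf8);  // one element per code point, O(1) indexing
        // ... manipulate text[i] freely here ...
        return toUTF8(text);           // serialize back for disk / transport
    }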

Andrew Fedoniouk.
http://terrainformatica.com
