To Walter, about char[] initialization by FF

Mon Jul 31 23:02:14 PDT 2006

Andrew Fedoniouk wrote:
> "Walter Bright" <newshound at digitalmars.com> wrote in message 
> news:eam1ec$10e1$1 at digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> The problem as I can see is this:
>>> D proposes to use a transport encoding for manipulation purposes,
>>> which is the main problem here imo - transport encodings are not
>>> designed for manipulation - it is extremely difficult to use
>>> them for manipulation in practice, as we can see.
>> I disagree with the characterization that it is "extremely difficult" to use 
>> for manipulation. foreach's direct support for it, as well as the 
>> functions in std.utf, make it straightforward. DMDScript is built around 
>> UTF-8, and manipulating multibyte characters in it has not turned out to 
>> be a significant problem.
> 
> Sorry but strings in DMDScript are quite different in terms of
> 0) there is no such thing as char in JavaScript.

ECMA-262, 3rd edition (JavaScript), defines the source character set to 
be UTF-16, and the source character set is what JS programs manipulate 
for strings and characters.

> 1) strings are Strings - not vectors of octets - js::string[] and d::char[] 
> are different things.
> 2) are not supposed to be used by any OS API.
> 3) there are 12 or so methods of String class in JS - limited perimeter -
> what model you've chosen to store them is irrelevant -
> in some implementations they represented even by list of fixed runs.

I agree that how it's stored in the JS implementation is irrelevant. My 
point was that in DMDScript they are stored as UTF-8 strings, and they 
work with only minor extra effort - DMDScript implements all the string 
handling functions JS defines.
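
To make "minor extra effort" concrete, here is a minimal sketch of walking 
a UTF-8 byte string one code point at a time. It is written in Python 
purely for illustration; decode_utf8 is a hypothetical name, not a 
std.utf or DMDScript function, and the sketch assumes well-formed input 
(it does not reject overlong forms or encoded surrogates).

```python
def decode_utf8(data: bytes):
    """Yield (byte_offset, code_point) pairs from UTF-8 encoded bytes.

    Illustrative only: assumes well-formed input and does not reject
    overlong encodings or encoded surrogates.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # 0xxxxxxx: 1-byte sequence (ASCII)
            length, cp = 1, lead
        elif (lead >> 5) == 0b110:      # 110xxxxx: 2-byte sequence
            length, cp = 2, lead & 0x1F
        elif (lead >> 4) == 0b1110:     # 1110xxxx: 3-byte sequence
            length, cp = 3, lead & 0x0F
        elif (lead >> 3) == 0b11110:    # 11110xxx: 4-byte sequence
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + length]:
            if (b & 0xC0) != 0x80:      # continuation bytes are 10xxxxxx
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        yield i, cp
        i += length


if __name__ == "__main__":
    for offset, cp in decode_utf8("naïve €5".encode("utf-8")):
        print(offset, f"U+{cp:04X}")
```

This is roughly the work a D foreach over a char[] with a dchar loop 
variable has to do on the programmer's behalf.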

>>> - ***A  functions in Windows take byte string (LPSTR) and current
>>>   codepage id  to render text. ( byte + codepage = Unicode Code Point )
>> Win9x only supports the A functions,
> 
> You are not right here.
> 
> TextOutA and TextOutW are both supported by Win98.
> And intention in Harmonia was to use only those ***W
> functions which come out of the box on Win98 (without need of MSLU)

You're right in that Win98 exports a small handful of W functions 
without MSLU - but what those W functions actually do under the hood is 
translate the data based on the current code page and then call the 
corresponding A function. In other words, the Win9x W functions are 
rather pointless and don't support characters that are not in the 
current code page anyway. MSLU extends the same poor behavior to a bunch 
more pseudo W functions. This is why Phobos does not call W functions 
under Win9x.
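
The loss is easy to picture with a small sketch. It uses Python's codecs 
as a stand-in for the Win9x translation step; through_code_page is a 
hypothetical helper, and the real Windows conversion may substitute 
characters differently (best-fit mappings, for instance), so treat this 
only as an illustration of the principle.

```python
def through_code_page(text: str, code_page: str) -> str:
    """Round-trip text through a single-byte code page.

    Characters the code page cannot represent come back as '?'.
    Stand-in for a W function that converts to the current code page
    and then calls the A function; not the actual Windows behavior.
    """
    return text.encode(code_page, errors="replace").decode(code_page)


if __name__ == "__main__":
    # Western European code page: the Cyrillic part does not survive.
    print(through_code_page("Grüße, Привет", "cp1252"))
    # Cyrillic code page: the Cyrillic part survives.
    print(through_code_page("Привет", "cp1251"))
```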

Conversely, the A functions under NT and later translate the characters 
to - you guessed it - UTF-16 and then call the corresponding W function. 
This is why Phobos under NT does not call the A functions.

>> Win9x is obsolete anyway, and there's no reason to cripple a new language 
>> by accommodating the failures of an obsolete system.
> 
> There is a huge market of embedded devices.
> If you think that computer evolution expands only in the more-ram-speed
> direction then you are in trouble.
> 
> http://www.litepc.com/graphics/eossystem.jpg

I agree there's a huge ecosystem of 32 bit embedded processors. And D 
works fine with Win9x - it just isn't crippled by Win9x's shortcomings.

>> When running on NT or later Windows, the W functions are used instead 
>> which work directly with UTF-16. Later Windows also support UTF-8 with the 
>> A functions.
> http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx

That is consistent with what I wrote about it.

>>> - ***W functions in Windows use LPWSTR things which are
>>>   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
>>>   (  cast(dword) word  = Unicode Code Point )
>>>   Only few functions in Windows API treat LPWSTR as UTF-16.
>> BMP is a proper subset of UTF-16. The only difference is that BMP doesn't 
>> do the 2-word surrogate pair encodings. But those are reserved in BMP 
>> anyway, so there is no conflict. Windows has been upgraded to handle them. 
>> Early versions of NT that couldn't handle surrogate pairs didn't work with 
>> those code points anyway, so nothing is gained by going to code pages.
> 
> Sorry this scares me "BMP is a proper subset of UTF-16"
> UTF-16 is a group name of *byte stream encodings*
> (UTF-16LE and UTF-16BE) of Unicode Code Set.
> 
> BTW: which one of these UTFs does D use? Platform dependent, I believe.

D has been used for many years with foreign languages under Windows. If 
UTF-16 didn't work with Windows, I think it would have come up by now <g>.

As for whether it is LE or BE, it is whatever the local platform is, 
just like ints, shorts, longs, etc. are.
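
A small sketch of what "whatever the local platform is" means in bytes, 
again in Python for illustration only. native_utf16 is a hypothetical 
helper; the comparison against struct.pack shows that the native-order 
encoding is simply the 16-bit code units laid out like native shorts.

```python
import struct
import sys


def native_utf16(text: str) -> bytes:
    """Encode text as UTF-16 in the byte order of the local platform."""
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return text.encode(codec)


if __name__ == "__main__":
    text = "A€"                                  # code units 0x0041, 0x20AC
    print(text.encode("utf-16-le").hex(" "))     # little-endian byte layout
    print(text.encode("utf-16-be").hex(" "))     # big-endian byte layout
    # Same bytes as two native-order unsigned 16-bit integers.
    print(native_utf16(text) == struct.pack("=2H", 0x0041, 0x20AC))
```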

>> So, the W functions can and do take UTF-16 directly, and in fact the 
>> Phobos implementation does use the W functions, transmitting wchar[] to 
>> them, and it works fine.
>>
>> The neat thing about Phobos is it adapts to whether you are using Win9x, 
>> full 32 bit Windows, or Linux, and adjusts the char output accordingly so 
>> it "just works."
> 
> It should work well. Efficiently, I mean.

Yes.

> The language shall be agnostic to the meaning of char as much as possible.

That's C/C++'s approach, and it does not work very well. Check out 
tchar.h, there's a lovely disaster <g>. For another example, just try 
using std::string with Shift-JIS.
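
One well-known hazard, sketched in Python for illustration: in Shift-JIS 
the second byte of a double-byte character can be 0x5C, the same value as 
an ASCII backslash, so encoding-agnostic byte code sees a backslash that 
is not there. The function name below is hypothetical.

```python
def find_backslash(data: bytes) -> int:
    """Byte-level search for an ASCII backslash; -1 if not found.

    This is what encoding-agnostic code effectively does when it scans
    a byte string for a path separator or an escape character.
    """
    return data.find(b"\\")


if __name__ == "__main__":
    text = "表示"                     # contains no backslash at all
    sjis = text.encode("shift_jis")   # first character is bytes 0x95 0x5C
    utf8 = text.encode("utf-8")
    print(find_backslash(sjis))       # false match inside the first character
    print(find_backslash(utf8))       # no match: multibyte bytes are all >= 0x80
```

In UTF-8 every byte of a multibyte sequence has its high bit set, so this 
kind of false match against ASCII cannot happen.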

> It shall not prevent you to write effective algorithms.

Does UTF-8 prevent writing effective algorithms? I don't see how. 
DMDScript works, and is faster than any other JS implementation out 
there, including my own C++ version <g>. And frankly, my struggles with 
trying to internationalize C++ code for DMDScript are what led to D's 
support for UTF. The D implementation is shorter, simpler, and faster 
than the C++ one (which uses wchar_t).
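
Two properties of UTF-8 help here, sketched below in Python for 
illustration (the function names are hypothetical, and this is not code 
from DMDScript): substring search can run directly on the bytes, and from 
any byte offset the start of the enclosing character is at most three 
bytes back.

```python
def char_start(data: bytes, i: int) -> int:
    """Return the offset of the first byte of the character containing data[i].

    Continuation bytes look like 10xxxxxx, so step back over them.
    Assumes well-formed UTF-8 and 0 <= i < len(data).
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


def find_text(haystack: bytes, needle: bytes) -> int:
    """Plain byte search; with well-formed UTF-8 on both sides a match
    always begins on a character boundary, so no decoding is needed."""
    return haystack.find(needle)


if __name__ == "__main__":
    data = "a€b".encode("utf-8")      # bytes: 61 E2 82 AC 62
    print(char_start(data, 3))        # inside the euro sign, steps back to its lead byte
    print(find_text(data, "€b".encode("utf-8")))
```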

> Practically it is enough to have 16 (BMP) but...

I agree you can write code using BMP and ignore surrogate pairs today, 
and you'll probably never notice the bugs. But sooner or later, the 
surrogate pair problem is going to show up. Windows, Java, and 
JavaScript have all had to go back and rework their string handling to 
deal with surrogate pairs.
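
For reference, a minimal sketch of the surrogate pair arithmetic, in 
Python for illustration (to_utf16_units is a hypothetical name). A BMP 
code point is a single 16-bit unit equal to the code point itself; 
anything above U+FFFF takes two units, which is exactly where code that 
assumes "one word = one character" goes wrong.

```python
def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code unit(s) for one Unicode code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]                   # BMP: the unit is the code point
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 | (v >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]


if __name__ == "__main__":
    for cp in (0x0041, 0x20AC, 0x1D11E):
        print(f"U+{cp:04X}", [f"{u:04X}" for u in to_utf16_units(cp)])
```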