To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Tue Aug 1 01:44:55 PDT 2006


"Walter Bright" <newshound at digitalmars.com> wrote in message 
news:eamql8$1jgc$1 at digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> "Walter Bright" <newshound at digitalmars.com> wrote in message 
>> news:eam1ec$10e1$1 at digitaldaemon.com...
>>> Andrew Fedoniouk wrote:
>>>> The problem as I see it is this:
>>>> D proposes to use a transport encoding for manipulation purposes,
>>>> which is the main problem here imo - transport encodings are not
>>>> designed for manipulation - it is extremely difficult to use
>>>> them for manipulation in practice, as we may see.
>>> I disagree with the characterization that it is "extremely difficult" to
>>> use for manipulation. foreach's direct support for it, as well as the
>>> functions in std.utf, make it straightforward. DMDScript is built around
>>> UTF-8, and manipulating multibyte characters in it has not turned out to
>>> be a significant problem.
>>
>> Sorry, but strings in DMDScript are quite different:
>> 0) there is no such thing as char in JavaScript.
>
> ECMAScript 262-3 (Javascript) defines the source character set to be 
> UTF-16, and the source character set is what JS programs manipulate for 
> strings and characters.

Walter, please forget about such a thing as a "UTF-16 character set" -
it is nonsense.

Regarding ECMA-262:

"A conforming implementation of this International standard shall interpret 
characters in conformance with the
Unicode Standard, Version 2.1 or later, and ISO/IEC 10646-1 with either 
UCS-2 or UTF-16 as the adopted

encoding form..."

That is quite different from your interpretation. The compiler accepts the
input stream as either BMP codes or the full Unicode set encoded using
UTF-16. There is no mention that String[n] will return a UTF-16 code unit.
That would be weird.
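
And yet that is exactly what D's wchar[] gives you today - a minimal
illustration, using one code point outside the BMP:

    void main()
    {
        wchar[] s = "\U0001D11E"w;  // MUSICAL SYMBOL G CLEF, one code point
        assert(s.length == 2);      // ...but two UTF-16 code units
        assert(s[0] == 0xD834);     // s[0] is a lone high surrogate
        assert(s[1] == 0xDD1E);     // s[1] is the low surrogate
    }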


>
>> 1) strings are Strings - not vectors of octets - js::string[] and
>> d::char[] are different things.
>> 2) they are not supposed to be used by any OS API.
>> 3) there are 12 or so methods in JS's String class - a limited perimeter -
>> so whatever model you've chosen to store them with is irrelevant -
>> in some implementations they are even represented by a list of fixed runs.
>
> I agree that how it's stored in the JS implementation is irrelevant. My
> point was that in DMDScript they are stored as UTF-8 strings, and they work
> with only minor extra effort - DMDScript implements all the string handling
> functions JS defines.

Again, it is up to you how they are stored internally and what you did there.

In D the situation is completely different - char and char[] are right
there, open to all winds.
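
A minimal sketch of what I mean - indexing hands you raw UTF-8 code units,
and only foreach's dchar decoding gives you whole characters:

    import std.stdio;

    void main()
    {
        char[] s = "Привет";    // six Cyrillic letters, stored as UTF-8
        assert(s.length == 12); // two bytes per letter: 12 code units
        assert(s[0] == 0xD0);   // s[0] is a raw byte, not a character
        foreach (dchar c; s)    // foreach decodes code points on the fly
            writefln("U+%04X", cast(uint) c);
    }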


>
>
>>>> - ***A functions in Windows take a byte string (LPSTR) and the current
>>>>   codepage id to render text. (byte + codepage = Unicode Code Point)
>>> Win9x only supports the A functions,
>>
>> You are not right here.
>>
>> TextOutA and TextOutW are both supported by Win98.
>> And the intention in Harmonia was to use only those ***W
>> functions which come out of the box on Win98 (without the need for MSLU).
>
> You're right in that Win98 exports a small handful of W functions without 
> MSLU - but what those W functions actually do under the hood is translate 
> the data based on the current code page and then call the corresponding A 
> function. In other words, the Win9x W functions are rather pointless and 
> don't support characters that are not in the current code page anyway. 
> MSLU extends the same poor behavior to a bunch more pseudo W functions. 
> This is why Phobos does not call W functions under Win9x.

I wouldn't be so pessimistic about Win98 :)


>
> Conversely, the A functions under NT and later translate the characters 
> to - you guessed it - UTF-16 and then call the corresponding W function. 
> This is why Phobos under NT does not call the A functions.
>

Ok. And how do you call the A functions, then?

Do you use the proposed koi8chars, latin1chars, etc.?

You are using char for that. But wait - char cannot contain anything other
than UTF-8 :-P
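
For the record, the usual dance looks something like this - a sketch only,
with a hypothetical toAnsi() helper and the Win32 declaration written out by
hand; note that the result is bytes in the current ANSI code page, so
strictly speaking it does not belong in a char[] at all:

    import std.utf;   // toUTF16

    // normally pulled from a Win32 binding; declared by hand here
    extern (Windows) int WideCharToMultiByte(
        uint cp, uint flags, wchar* src, int srcLen,
        char* dst, int dstLen, char* defChar, int* usedDefChar);

    const uint CP_ACP = 0;   // "current ANSI code page"

    // UTF-8 -> current code page, for feeding to an A function
    ubyte[] toAnsi(char[] utf8)
    {
        wchar[] w = toUTF16(utf8);   // transcode to UTF-16 first
        int n = WideCharToMultiByte(CP_ACP, 0, w.ptr, cast(int) w.length,
                                    null, 0, null, null);
        ubyte[] a = new ubyte[n];
        WideCharToMultiByte(CP_ACP, 0, w.ptr, cast(int) w.length,
                            cast(char*) a.ptr, n, null, null);
        return a;   // code page bytes, deliberately NOT char[]
    }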


>
>>> Win9x is obsolete anyway, and there's no reason to cripple a new 
>>> language by accommodating the failures of an obsolete system.
>>
>> There is a huge market of embedded devices.
>> If you think that computer evolution expands only in the
>> more-RAM-and-speed direction, then you are in trouble.
>>
>> http://www.litepc.com/graphics/eossystem.jpg
>
> I agree there's a huge ecosystem of 32 bit embedded processors. And D 
> works fine with Win9x - it just isn't crippled by Win9x's shortcomings.
>
>
>>> When running on NT or later Windows, the W functions are used instead 
>>> which work directly with UTF-16. Later Windows also support UTF-8 with 
>>> the A functions.
>> http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx
>
> That is consistent with what I wrote about it.
>

No doubt about it.
>
>>>> - ***W functions in Windows use LPWSTR things, which are
>>>>   sequences of code points from the Unicode Basic Multilingual Plane
>>>>   (BMP). (cast(dword) word = Unicode Code Point)
>>>>   Only a few functions in the Windows API treat LPWSTR as UTF-16.
>>> BMP is a proper subset of UTF-16. The only difference is that BMP 
>>> doesn't do the 2-word surrogate pair encodings. But those are reserved 
>>> in BMP anyway, so there is no conflict. Windows has been upgraded to 
>>> handle them. Early versions of NT that couldn't handle surrogate pairs 
>>> didn't work with those code points anyway, so nothing is gained by going 
>>> to code pages.
>>
>> Sorry, but "BMP is a proper subset of UTF-16" scares me.
>> UTF-16 is a group name for the *byte stream encodings*
>> (UTF-16LE and UTF-16BE) of the Unicode code set.
>>
>> BTW: which of these UTFs does D use? Platform dependent, I believe.
>
> D has been used for many years with foreign languages under Windows. If 
> UTF-16 didn't work with Windows, I think it would have come up by now <g>.
>
> As for whether it is LE or BE, it is whatever the local platform is, just 
> like ints, shorts, longs, etc. are.
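
(For what it's worth, that is easy to check - a minimal sketch; on x86 the
answer comes out little-endian, i.e. UTF-16LE in memory:)

    void main()
    {
        wchar c = 0x0414;            // CYRILLIC CAPITAL LETTER DE
        ubyte* p = cast(ubyte*) &c;
        assert(p[0] == 0x14);        // low byte first on x86...
        assert(p[1] == 0x04);        // ...i.e. UTF-16LE in memory
    }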

>
>>> So, the W functions can and do take UTF-16 directly, and in fact the 
>>> Phobos implementation does use the W functions, transmitting wchar[] to 
>>> them, and it works fine.
>>>
>>> The neat thing about Phobos is it adapts to whether you are using Win9x, 
>>> full 32 bit Windows, or Linux, and adjusts the char output accordingly 
>>> so it "just works."
>>
>> It should work well. Efficient, I mean.
>
> Yes.
>
>> The language shall be agnostic to the meaning of char as much as 
>> possible.
>
> That's C/C++'s approach, and it does not work very well. Check out 
> tchar.h, there's a lovely disaster <g>. For another, just try using 
> std::string with shift-JIS.
>
>> It shall not prevent you from writing effective algorithms.
>
> Does UTF-8 prevent writing effective algorithms? I don't see how.
> DMDScript works, and is faster than any other JS implementation out there,
> including my own C++ version <g>. And frankly, my struggles with trying to
> internationalize the C++ code for DMDScript are what led to D's support for
> UTF. The D implementation is shorter, simpler, and faster than the C++ one
> (which uses wchar's).
>
>
>> Practically it is enough to have 16 bits (the BMP), but...
>
> I agree you can write code using the BMP and ignore surrogate pairs today,
> and you'll probably never notice the bugs. But sooner or later, the
> surrogate pair problem is going to show up. Windows, Java, and Javascript
> have all had to go back and rework things to deal with surrogate pairs.
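
Concretely, the "problem" is just that code unit counts and character counts
diverge - a sketch:

    void main()
    {
        wchar[] s = "A\U0001D11EB"w; // three characters, one outside the BMP
        assert(s.length == 4);       // four UTF-16 code units...
        size_t n;
        foreach (dchar c; s)         // ...but decoding finds three code points
            n++;
        assert(n == 3);
    }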

Why? JavaScript, for example, has no such thing as char.

String.charAt() returns - guess what? Correct: a String object.

No char - no problem :D

Why do they need to redefine anything, then?
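
In D terms the same approach is a few lines - a sketch, assuming std.utf's
stride() behaves as documented:

    import std.utf;   // stride

    // charAt in the JavaScript sense: always a one-character string,
    // never a bare code unit (i is a byte offset here, not a char index)
    char[] charAt(char[] s, size_t i)
    {
        return s[i .. i + stride(s, i)];
    }

    void main()
    {
        char[] s = "Привет";
        assert(charAt(s, 0) == "П");   // a whole character comes back
    }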

Again - let people decide what char is and how to interpret it, and that
will be it.

Phobos can work with UTF-8/16 and satisfy you and other UTF masochists (no
offence implied). Ordinary people will do their own strings anyway. Just give
them opAssign and a dtor in structs and you will see an explosion of perfect
strings. The proposed char#[] (read-only arrays) would also help here. Oh.....

Changing char's init value to 0 will not harm anybody, but it will allow char
to be used for purposes other than UTF-8 - which is, after all, only one of
the 40 or so encodings in active use.
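
For reference, today's behavior - char.init is 0xFF precisely because it is
an invalid UTF-8 code unit, so uninitialized "strings" get caught early:

    void main()
    {
        char c;                  // default-initialized
        assert(c == 0xFF);       // char.init is 0xFF: never valid in UTF-8
        char[4] buf;             // static arrays fill with char.init too
        assert(buf[0] == 0xFF);
    }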

For persistence purposes (in the compiled EXE), UTF is probably the best
choice. But at runtime - please, not at the language level.

Educated IMO, of course.

Andrew.




