To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Tue Aug 1 19:57:08 PDT 2006


(I hope this long dialog will help all of us better understand what Unicode
is.)

"Walter Bright" <newshound at digitalmars.com> wrote in message 
news:eao5st$2r1f$1 at digitaldaemon.com...
> Andrew Fedoniouk wrote:
> > Compiler accepts input stream as either BMP codes or the full Unicode set
> > encoded using UTF-16.
>
> BMP is a subset of UTF-16.

Walter, with the deepest respect, it is not. They are two different things.

UTF-16 is a variable-length encoding: a stream of 16-bit code units.
The Unicode BMP, strictly speaking, is a range of code points.

If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes, you
are in trouble. See:

A sequence of the two words D834 DD1E, taken as UTF-16, gives you
one Unicode character with code point 0x1D11E (the musical G clef).
The same sequence interpreted as a UCS-2 sequence gives you
two character codes (invalid and non-printable, but still two).

At the very least, you will get a different string length.
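
(A sketch of that difference in present-day D, using std.utf.decode from
Phobos; the literal below encodes to exactly the two code units D834 DD1E:)

import std.stdio : writefln;
import std.utf : decode;

void main()
{
    wstring s = "\U0001D11E"w;  // musical G clef, stored as D834 DD1E
    assert(s.length == 2);      // UCS-2 view: two 16-bit codes
    size_t i = 0;
    dchar c = decode(s, i);     // UTF-16 view: one code point
    writefln("U+%05X from %d code units", cast(uint) c, i); // U+1D11E from 2
}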

>
> > There is no mention that String[n] will return a UTF-16 code
> > unit. That would be weird.
>
> String.charCodeAt() will give you the utf-16 code unit.
>
>>> Conversely, the A functions under NT and later translate the characters 
>>> to - you guessed it - UTF-16 and then call the corresponding W function. 
>>> This is why Phobos under NT does not call the A functions.
>> Ok. And how do you call A functions?
>
> Take a look at std.file for an example.

You mean here?

char* namez = toMBSz(name);
h = CreateFileA(namez, GENERIC_WRITE, 0, null, CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, cast(HANDLE) null);

The char* here is far from being a UTF-8 sequence: toMBSz converts the name
to the current ANSI code page, not to UTF-8.
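
(For contrast, the W path Walter mentions would look roughly like this,
mirroring the fragment above and assuming std.utf.toUTF16z:)

import std.utf : toUTF16z;

const(wchar)* namez = toUTF16z(name);  // UTF-8 -> zero-terminated UTF-16
h = CreateFileW(namez, GENERIC_WRITE, 0, null, CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, cast(HANDLE) null);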

>
>
>>> Windows, Java, and Javascript have all had to go back and redo to deal 
>>> with surrogate pairs.
>> Why? JavaScript, for example, has no such thing as char.
>> String.charAt() returns guess what? Correct - String object.
>> No char - no problem :D
>
> See String.fromCharCode() and String.charCodeAt()

ECMA-262

String.prototype.charCodeAt (pos)
Returns a number (a nonnegative integer less than 2^16) representing the 
code point value of the
character at position pos in the string....

As you may see, it returns a (Unicode) *code point* from the BMP set,
which is far from the UTF-16 code unit you declared above.

Relaxing "a nonnegative integer less than 2^16" to
"a nonnegative integer less than 2^21" will not harm anybody. Or at least
such probability is vanishingly small.
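
(Sketched in D, such a relaxed accessor might look like the following;
codePointAt is a hypothetical name, decode is from std.utf:)

import std.utf : decode;

// Hypothetical: charCodeAt relaxed to return whole code points (< 2^21).
dchar codePointAt(wstring s, size_t pos)
{
    return decode(s, pos);  // a surrogate pair comes back as one value
}

unittest
{
    assert(codePointAt("\U0001D11E"w, 0) == 0x1D11E);
}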

>
>> Again - let people decide of what char is and how to interpret it And 
>> that will be it.
>
> I've already explained the problems C/C++ have with that. They're real
> problems, bad and unfixable enough that there are official proposals to
> add new UTF basic types to C++.

Basic types of what?

>
>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists 
>> (no offence implied).
>
> C++'s experience with this demonstrates that char* does not work very well 
> with UTF-8. It's not just my experience, it's why new types were proposed 
> for C++ (and not by me).

Because char in C is not supposed to hold multi-byte encodings.
At least std::string is strictly a single-byte thing by definition, and this
is perfectly fine. There is wchar_t for holding the OS-supported range in full:
on Win32, wchar_t is 16-bit (UCS-2 legacy), and in GCC/*nix it is 32-bit.
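
(For what it's worth, D itself offers both widths; in D these hold:)

static assert(wchar.sizeof == 2);  // same width as Win32 wchar_t
static assert(dchar.sizeof == 4);  // same width as wchar_t under GCC/*nix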

>
>> Ordinary people will do their own strings anyway. Just give them opAssign
>> and a dtor in structs and you will see an explosion of perfect strings. The
>> char#[] (read-only arrays) will also benefit here. oh.....
>>
>> Changing the char init value to 0 will not harm anybody, but will allow
>> char to be used for purposes other than UTF-8; it is only one of the ~40
>> encodings in active use anyway.
>>
>> For persistence purposes (in a compiled EXE), UTF is probably the best
>> choice. But at runtime, please, not at the language level.
>
> ubyte[] will enable you to use any encoding you wish - and that's what 
> it's there for.

Thus the whole set of Windows API headers (and std.c.string, for example)
seen in D would have to be rewritten to accept ubyte[], since char in D is
not char in C. Is this the idea?
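
(The friction that implies, sketched; core.stdc.string is the present-day
name for std.c.string, and the KOI8-R bytes are invented for illustration:)

import core.stdc.string : strlen;

void main()
{
    // Pretend these bytes are KOI8-R rather than UTF-8:
    immutable(ubyte)[] raw = cast(immutable(ubyte)[]) "hello\0";
    // The C prototype takes char*, so every call site needs a cast:
    auto n = strlen(cast(const(char)*) raw.ptr);
    assert(n == 5);
}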

Andrew.
