To Walter, about char[] initialization by FF
Andrew Fedoniouk
news at terrainformatica.com
Tue Aug 1 19:57:08 PDT 2006
(Hope this long dialog will help all of us better understand what Unicode
is)
"Walter Bright" <newshound at digitalmars.com> wrote in message
news:eao5st$2r1f$1 at digitaldaemon.com...
> Andrew Fedoniouk wrote:
> > Compiler accepts input stream as either BMP codes or full unicode set
> encoded using UTF-16.
>
> BMP is a subset of UTF-16.
Walter, with deepest respect: it is not. They are two different things.
UTF-16 is a variable-length encoding - a stream of 16-bit code units.
The Unicode BMP, strictly speaking, is a range of numbers: the code
points U+0000..U+FFFF.
If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes you
are in trouble. See:
The sequence of two words D834 DD1E as UTF-16 will give you
one Unicode character with code 0x1D11E (musical G clef).
And the same sequence interpreted as a UCS-2 sequence will
give you two (invalid, non-printable, but still) character codes.
At the very least you will get different string lengths.
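For the record, the distinction is easy to check mechanically; a small
Python sketch (Python used here only as a convenient stand-in for any
UTF-16 consumer):

```python
# The same four bytes, read two ways: as UTF-16 (variable-length) they
# decode to one character; as UCS-2 (fixed 16-bit words) they are two codes.
import struct

units = [0xD834, 0xDD1E]                   # the surrogate pair from above
raw = struct.pack("<2H", *units)           # the same 4 bytes in memory

as_utf16 = raw.decode("utf-16-le")         # one code point: U+1D11E
print(len(as_utf16), hex(ord(as_utf16)))   # 1 0x1d11e

as_ucs2 = list(struct.unpack("<2H", raw))  # two 16-bit "characters"
print(len(as_ucs2), [hex(u) for u in as_ucs2])  # 2 ['0xd834', '0xdd1e']
```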
>
> > There is no mentioning that String[n] will return you utf-16 code
> > unit. That will be weird.
>
> String.charCodeAt() will give you the utf-16 code unit.
>
>>> Conversely, the A functions under NT and later translate the characters
>>> to - you guessed it - UTF-16 and then call the corresponding W function.
>>> This is why Phobos under NT does not call the A functions.
>> Ok. And how do you call A functions?
>
> Take a look at std.file for an example.
You mean here?:

    char* namez = toMBSz(name);
    h = CreateFileA(namez, GENERIC_WRITE, 0, null, CREATE_ALWAYS,
        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, cast(HANDLE)null);

The char* here is far from being a UTF-8 sequence.
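To make the point concrete: toMBSz converts to the current ANSI code
page, whose bytes generally differ from the UTF-8 bytes D's char[] is
defined to hold. A Python sketch (cp1252 is an assumption here - a
typical Western Windows code page):

```python
# An "A" Windows API expects the current ANSI code page (assumed cp1252
# here), not UTF-8; the byte sequences differ for any non-ASCII name.
name = "caf\u00e9"                  # "café"
ansi = name.encode("cp1252")        # roughly what toMBSz would produce
utf8 = name.encode("utf-8")         # what D defines char[] to contain
print(ansi)                         # b'caf\xe9'
print(utf8)                         # b'caf\xc3\xa9'
assert ansi != utf8
```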
>
>
>>> Windows, Java, and Javascript have all had to go back and redo to deal
>>> with surrogate pairs.
>> Why? JavaScript, for example, has no such thing as char.
>> String.charAt() returns guess what? Correct - a String object.
>> No char - no problem :D
>
> See String.fromCharCode() and String.charCodeAt()
ECMA-262
String.prototype.charCodeAt (pos)
Returns a number (a nonnegative integer less than 2^16) representing the
code point value of the
character at position pos in the string....
As you may see, it returns a (Unicode) *code point* from the BMP set,
which is far from the UTF-16 code unit you declared above.
Relaxing "a nonnegative integer less than 2^16" to
"a nonnegative integer less than 2^21" would not harm anybody - or at
least the probability of harm is vanishingly small.
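For reference, here is what the 16-bit-code-unit view actually yields
for a supplementary character - a Python sketch standing in for a JS
engine (the helper char_code_at is hypothetical, written to mirror the
spec's algorithm):

```python
# Emulating ECMA-262 charCodeAt on a string viewed as 16-bit code units.
# For a supplementary character such as U+1D11E, positions 0 and 1 hold
# the two surrogate halves, not the character's code point.
def char_code_at(s: str, pos: int) -> int:
    units = s.encode("utf-16-le")          # JS-style sequence of 16-bit units
    return units[2 * pos] | (units[2 * pos + 1] << 8)

clef = "\U0001D11E"                        # musical G clef
print(hex(char_code_at(clef, 0)))          # 0xd834 (high surrogate)
print(hex(char_code_at(clef, 1)))          # 0xdd1e (low surrogate)
```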
>
>> Again - let people decide of what char is and how to interpret it And
>> that will be it.
>
> I've already explained the problems C/C++ have with that. They're real
> problems, bad and unfixable enough that there are official proposals to
> add new UTF basic types to C++.
Basic types of what?
>
>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists
>> (no offence implied).
>
> C++'s experience with this demonstrates that char* does not work very well
> with UTF-8. It's not just my experience, it's why new types were proposed
> for C++ (and not by me).
Because char in C is not supposed to hold multi-byte encodings.
At least std::string is strictly a single-byte thing by definition, and this
is perfectly fine. There is wchar_t for holding the OS-supported range in
full: on Win32 wchar_t is 16-bit (UCS-2 legacy), and in GCC/*nix it is 32-bit.
>
>> Ordinary people will do their own strings anyway. Just give them opAssign
>> and dtor in structs and you will see explosion of perfect strings. That
>> char#[] (read-only arrays) will also benefit here. oh.....
>>
>> Changing char init value to 0 will not harm anybody but will allow to use
>> char for other than
>>
>> utf-8 purposes - it is only one from 40 in active use encodings anyway.
>>
>> For persistence purposes (in compiled EXE) utf is the best choice
>> probably. But in runtime - please not on language level.
>
> ubyte[] will enable you to use any encoding you wish - and that's what
> it's there for.
Thus the whole set of Windows API headers (and std.c.string, for example)
seen in D would have to be rewritten to accept ubyte[], as char in D is not
char in C.
Is this the idea?
Andrew.
More information about the Digitalmars-d
mailing list