To Walter, about char[] initialization by FF

Walter Bright newshound at digitalmars.com
Tue Aug 1 21:31:40 PDT 2006


Andrew Fedoniouk wrote:
> "Walter Bright" <newshound at digitalmars.com> wrote in message 
>> BMP is a subset of UTF-16.
> 
> Walter, with deepest respect, but it is not. Two different things.
> 
> UTF-16 is a variable-length encoding - a byte stream.
> Unicode BMP is, strictly speaking, a range of numbers.
> 
> If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes, you
> are in trouble. See:
> 
> The sequence of two words D834 DD1E as UTF-16 will give you
> one Unicode character with code 0x1D11E (musical G clef).
> The same sequence interpreted as a UCS-2 sequence will
> give you two (invalid, non-printable, but still) character codes.
> At the very least you will get a different string length.

The only thing that UTF-16 adds is semantics for characters that are 
invalid in BMP. That makes UTF-16 a superset. It doesn't matter whether 
you're strictly speaking, or whether the jargon is different. UTF-16 is 
a superset of BMP, once you cut past the jargon and look at the 
underlying reality.
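
For instance (a minimal sketch in present-day D, using std.utf.decode; 
the library spelling has changed over the years, but the principle 
hasn't):

import std.stdio;
import std.utf;

void main()
{
    // The same two code units as above: D834 DD1E
    wchar[] s = [cast(wchar)0xD834, cast(wchar)0xDD1E];

    size_t i = 0;
    dchar c = decode(s, i); // interpreted as UTF-16: one code point
    writefln("U+%05X", c);  // prints U+1D11E; i is now 2

    // Interpreted as UCS-2, the same two words would be two invalid,
    // non-printable character codes, and a different string length.
}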


>>> Ok. And how do you call A functions?
>> Take a look at std.file for an example.
> 
> You mean here?:
> 
> char* namez = toMBSz(name);
> h = CreateFileA(namez, GENERIC_WRITE, 0, null, CREATE_ALWAYS,
>     FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, cast(HANDLE)null);
> A char* here is far from a UTF-8 sequence.

You could argue that for clarity namez should have been written as a 
ubyte*, but in the above code it would make no difference.

>>>> Windows, Java, and Javascript have all had to go back and redo to deal 
>>>> with surrogate pairs.
>>> Why? JavaScript for example has no such thing as char.
>>> String.charAt() returns, guess what? Correct - a String object.
>>> No char - no problem :D
>> See String.fromCharCode() and String.charCodeAt()
> 
> ECMA-262
> 
> String.prototype.charCodeAt (pos)
> Returns a number (a nonnegative integer less than 2^16) representing 
> the code point value of the character at position pos in the string....
> 
> As you can see, it returns a (Unicode) *code point* from the BMP set,
> but that is far from the UTF-16 code unit you've declared above.

There is no difference.

> Relaxing "a nonnegative integer less than 2^16" to
> "a nonnegative integer less than 2^21" will not harm anybody.
> Or at least such probability is vanishingly small.

It'll break any code trying to deal with surrogate pairs.
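
Such code does arithmetic like the following (a sketch in D of the 
standard UTF-16 pairing formula; the function name is mine):

dchar combineSurrogates(wchar hi, wchar lo)
{
    // The standard pairing arithmetic: yields U+10000 .. U+10FFFF.
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

unittest
{
    assert(combineSurrogates(0xD834, 0xDD1E) == 0x1D11E); // the G clef again
}

Have charCodeAt() start returning whole code points, and every caller 
doing that pairing silently gets garbage.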


>>> Again - let people decide what char is and how to interpret it, and 
>>> that will be it.
>> I've already explained the problems C/C++ have with that. They're real 
>> problems, bad and unfixable enough that there are official proposals to 
>> add new UTF basic types to C++.
> 
> Basic types of what?

Basic types for UTF-8 and UTF-16. Ironically, they wind up being very 
much like D's char and wchar types.

>>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists 
>>> (no offence implied).
>> C++'s experience with this demonstrates that char* does not work very well 
>> with UTF-8. It's not just my experience, it's why new types were proposed 
>> for C++ (and not by me).
> Because char in C is not supposed to hold multibyte encodings.

Standard functions in the C standard library to deal with multibyte 
encodings have been there since 1989. Compiler extensions to deal with 
Shift-JIS and other multibyte encodings have been there since the 
mid-80's. They don't work very well, but nevertheless they are there 
and supported.
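
They're even callable from D through the C bindings (a sketch; whether 
the decode succeeds depends on the environment supplying a UTF-8 
locale, which is an assumption, not a guarantee):

import core.stdc.locale : setlocale, LC_ALL;
import core.stdc.stdlib : mbtowc;
import core.stdc.stddef : wchar_t;
import std.stdio : writeln;

void main()
{
    setlocale(LC_ALL, "");  // adopt the environment's locale
    wchar_t wc;
    // mbtowc has been in the C standard library since C89.
    int len = mbtowc(&wc, "é".ptr, 4);  // "é" is two bytes in UTF-8
    writeln(len);  // 2 under a UTF-8 locale; other locales may differ
}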

> At least std::string is strictly a single-byte thing by definition. And 
> this is perfectly fine.

As long as you're dealing with ASCII only <g>. That world has been left 
behind, though.

> There is wchar_t for holding the OS-supported range in full.
> On Win32 wchar_t is 16 bits (UCS-2 legacy); in GCC/*nix it is 32 bits.

That's just the trouble with wchar_t. It's implementation defined, which 
means its use is non-portable. The Win32 version cannot handle surrogate 
pairs as a single character. Linux has the opposite problem - you can't 
have UTF-16 strings in any non-kludgy way. Trying to write 
internationalized code with wchar_t that works correctly on both Win32 
and Linux is an exercise in frustration. What you wind up doing is 
abstracting away the char type - giving up on help from the standard 
libraries and writing your own text processing code from scratch.
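
D's C bindings make the difference plain (a sketch; core.stdc.stddef is 
where present-day D declares C's wchar_t):

import core.stdc.stddef : wchar_t;

// Implementation defined: two bytes on Win32, four on most Unix systems.
version (Windows)
    static assert(wchar_t.sizeof == 2); // UCS-2 legacy, surrogate-blind
else version (Posix)
    static assert(wchar_t.sizeof == 4); // UTF-32; no native UTF-16 strings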

I've been through this with real projects. It doesn't "work just fine"; 
it is a lot of extra work. Translating the code to D is nice: you 
essentially give that whole mess the heave-ho.

BTW, you talked a lot earlier about memory efficiency. Linux's 32-bit 
wchar_t eats memory like nothing else.

> Thus the whole set of Windows API headers (and std.c.string, for
> example) seen in D has to be rewritten to accept ubyte[], as char in D
> is not char in C.

You're right that a C char isn't a D char. All that means is that one 
must be careful, when calling C functions that take char*'s, to pass 
them data in the form that the particular C function expects. This is 
true for all of C's data types - even int.

> Is this the idea?

The vast majority (perhaps even all) of C standard string handling 
functions that accept char* will work with UTF-8 without modification. 
No rewrite required.
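
For example (a sketch; it relies on the fact that UTF-8 never embeds a 
zero byte inside a multibyte sequence, so strlen, strcpy, strcat and 
friends pass it through untouched):

import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    string s = "naïve";         // the 'ï' is two bytes in UTF-8
    const(char)* p = toStringz(s);
    assert(strlen(p) == 6);     // counts bytes, which is all strlen ever did
}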

You've implied all this doesn't work, by saying things must be 
rewritten, that it's extremely difficult to deal with UTF-8, that BMP 
is not a subset of UTF-16, etc. This is not my experience at all. If 
you've got some persuasive code examples, I'd like to see them.


