To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Tue Aug 1 22:17:56 PDT 2006


>> As you may see, it is returning a (Unicode) *code point* from the BMP set,
>> but that is far from the UTF-16 code unit you've declared above.
>
> There is no difference.
>
>> Relaxing "a nonnegative integer less than 2^16" to
>> "a nonnegative integer less than 2^21" will not harm anybody.
>> Or at least the probability of harm is vanishingly small.
>
> It'll break any code trying to deal with surrogate pairs.
>

There is no such thing as a surrogate pair in UCS-2.
A JS string does not hold UTF-16 code units - only full code points.
See the spec.


>>>> Phobos can work with UTF-8/16 and satisfy you and other UTF-masochists
>>>> (no offence implied).
>>> C++'s experience with this demonstrates that char* does not work very 
>>> well with UTF-8. It's not just my experience, it's why new types were 
>>> proposed for C++ (and not by me).
>> Because char in C is not supposed to hold multi-byte encodings.
>
> Standard functions in the C standard library to deal with multibyte 
> encodings have been there since 1989. Compiler extensions to deal with 
> shift-JIS and other multibyte encodings have been there since the mid 
> 80's. They don't work very well, but nevertheless, are there and 
> supported.
>
>> At least std::string is strictly a single-byte thing by definition. And
>> this is perfectly fine.
>
> As long as you're dealing with ASCII only <g>. That world has been left 
> behind, though.


C string functions can be used with multibyte encodings for one sole
reason: every byte encoding defines code 0 as the NUL character, and no
encoding in practical use lets a byte with code 0 appear in the middle
of a sequence. They were all built with C string processing in mind.
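A minimal sketch of that point in plain C (a hypothetical snippet, not
from the thread):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* UTF-8 for "naïve": the 'ï' is the two-byte sequence
           0xC3 0xAF, and neither byte is 0, so the usual
           NUL-terminated machinery keeps working. */
        const char *s = "na\xC3\xAFve";
        char buf[16];

        strcpy(buf, s);                    /* copies through the NUL */
        printf("%zu bytes\n", strlen(s));  /* prints 6: bytes, not characters */
        return 0;
    }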


>
>> There is wchar_t for holding the OS-supported range in full.
>> On Win32 wchar_t is 16-bit (UCS-2 legacy), and in GCC/*nix it is 32-bit.
>
> That's just the trouble with wchar_t. It's implementation defined, which 
> means its use is non-portable. The Win32 version cannot handle surrogate 
> pairs as a single character. Linux has the opposite problem - you can't 
> have UTF-16 strings in any non-kludgy way. Trying to write 
> internationalized code with wchar_t that works correctly on both Win32 and 
> Linux is an exercise in frustration. What you wind up doing is abstracting 
> away the char type - giving up on help from the standard libraries and 
> writing your own text processing code from scratch.
>
> I've been through this with real projects. It doesn't work just fine, and 
> is a lot of extra work. Translating the code to D is nice; you essentially 
> give that whole mess a heave-ho.
>
> BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit 
> wchar_t eats memory like nothing else.

Agreed. As I said, if you need efficiency, use byte/word encodings +
mapping.

dchar is no better than Linux's wchar_t.
Please don't say that I should use UTF-8 for that - it simply does not
work in my cases - too expensive.
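For a rough feel of the cost, a hedged sketch (assuming GCC/Linux,
where wchar_t is 4 bytes, and ASCII-only sample text):

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        const char    *utf8 = "hello";   /* 5 bytes + NUL      */
        const wchar_t *wide = L"hello";  /* 5 code units + NUL */

        /* With a 4-byte wchar_t the same ASCII text occupies four
           times the storage of its UTF-8 form. */
        printf("utf-8: %zu bytes\n", strlen(utf8) + 1);
        printf("wide : %zu bytes\n", (wcslen(wide) + 1) * sizeof(wchar_t));
        return 0;
    }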

>
>> Thus the whole set of Windows API headers (and std.c.string, for example)
>> seen in D has to be rewritten to accept ubyte[], as char in D is not char
>> in C.
>
> You're right that a C char isn't a D char. All that means is one must be 
> careful when calling C functions that take char*'s to pass it data in the 
> form that particular C function expects. This is true for all C's data 
> types - even int.
>
>> Is this the idea?
>
> The vast majority (perhaps even all) of C standard string handling 
> functions that accept char* will work with UTF-8 without modification. No 
> rewrite required.

Correct. As I said, that is because 0 is NUL in UTF-8 too - not 0xFF or
anything else exotic.
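It rests on more than NUL alone: in UTF-8 every byte of a multibyte
sequence has its high bit set, so a plain ASCII byte such as '/' can
never occur inside one. A small illustrative snippet (hypothetical,
plain C):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "a/ï/b" in UTF-8: the pair 0xC3 0xAF encodes 'ï'. No byte
           of it equals the ASCII '/' (0x2F), so strchr finds only the
           real separators. */
        const char *path = "a/\xC3\xAF/b";
        const char *slash = strchr(path, '/');

        printf("first '/' at offset %td\n", slash - path);  /* prints 1 */
        return 0;
    }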

>
> You've implied all this doesn't work, by saying things must be rewritten, 
> that it's extremely difficult to deal with UTF-8, that BMP is not a subset 
> of UTF-16, etc. This is not my experience at all. If you've got some 
> persuasive code examples, I'd like to see them.

I am not saying that it "must be rewritten". Sorry, but it is you who
proposes to rewrite all the string processing functions of the standard
libraries mankind has today.

Or I don't quite understand your idea with UTFs.

Java did change the string world by introducing just char (a single
UCS-2 code point) and no variations. Is that good or bad? From a
uniformity point of view - good. For efficiency - bad. I've seen a lot
of reinvented char-as-byte wheels in professional packages.

Andrew. 




