To Walter, about char[] initialization by FF
Andrew Fedoniouk
news at terrainformatica.com
Tue Aug 1 22:17:56 PDT 2006
>> As you may see it is returning (unicode) *code point* from BMP set
>> but it is far from UTF-16 code unit you've declared above.
>
> There is no difference.
>
>> Relaxing "a nonnegative integer less than 2^16" to
>> "a nonnegative integer less than 2^21" will not harm anybody.
>> Or at least such a probability is vanishingly small.
>
> It'll break any code trying to deal with surrogate pairs.
>
There is no such thing as a surrogate pair in UCS-2.
A JS string does not hold UTF-16 code units - only full code points.
See the spec.
>>>>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists
>>>> (no offence implied).
>>> C++'s experience with this demonstrates that char* does not work very
>>> well with UTF-8. It's not just my experience, it's why new types were
>>> proposed for C++ (and not by me).
>> Because char in C is not supposed to hold multi-byte encodings.
>
> Standard functions in the C standard library to deal with multibyte
> encodings have been there since 1989. Compiler extensions to deal with
> shift-JIS and other multibyte encodings have been there since the mid
> 80's. They don't work very well, but nevertheless, are there and
> supported.
>
>> At least std::string is strictly single byte thing by definition. And
>> this
>> is perfectly fine.
>
> As long as you're dealing with ASCII only <g>. That world has been left
> behind, though.
C string functions can be used with multibyte encodings for one sole
reason: all byte encodings define code 0 as the NUL character, and no
encoding in practical use allows a byte with code 0 to appear in the
middle of a sequence. They were all built with C string processing in mind.
>
>> There is wchar_t for holding OS supported range in full.
>> On Win32 - wchar_t is 16bit (UCS-2 legacy) and in GCC/*nix it is 32bit.
>
> That's just the trouble with wchar_t. It's implementation defined, which
> means its use is non-portable. The Win32 version cannot handle surrogate
> pairs as a single character. Linux has the opposite problem - you can't
> have UTF-16 strings in any non-kludgy way. Trying to write
> internationalized code with wchar_t that works correctly on both Win32 and
> Linux is an exercise in frustration. What you wind up doing is abstracting
> away the char type - giving up on help from the standard libraries and
> writing your own text processing code from scratch.
>
> I've been through this with real projects. It doesn't work just fine, and
> is a lot of extra work. Translating the code to D is nice, you essentially
> give that whole mess a heave-ho.
>
> BTW, you talked earlier a lot about memory efficiency. Linux's 32 bit
> wchar_t eats memory like nothing else.
Agreed. As I said - if you need efficiency, use byte/word encodings plus
a mapping.
dchar is no better than Linux's wchar_t.
And please don't say that I should use UTF-8 for that - it simply does
not work in my cases; it is too expensive.
>
>> Thus the whole set of Windows API headers (and std.c.string for example)
>> seen in D has to be rewritten to accept ubyte[]. As char in D is not char
>> in C
>
> You're right that a C char isn't a D char. All that means is one must be
> careful when calling C functions that take char*'s to pass it data in the
> form that particular C function expects. This is true for all C's data
> types - even int.
>
>> Is this the idea?
>
> The vast majority (perhaps even all) of C standard string handling
> functions that accept char* will work with UTF-8 without modification. No
> rewrite required.
Correct. As I said, that is because 0 is NUL in UTF-8 too - not 0xFF or
anything else exotic.
>
> You've implied all this doesn't work, by saying things must be rewritten,
> that it's extremely difficult to deal with UTF-8, that BMP is not a subset
> of UTF-16, etc. This is not my experience at all. If you've got some
> persuasive code examples, I'd like to see them.
I am not saying that anything "must be rewritten". Sorry, but it is you who
proposes to rewrite all the string processing functions of the standard
libraries mankind has today.
Or perhaps I don't quite understand your idea with UTFs.
Java did change the string world by introducing a single char type (one
UCS-2 code point) and no variations. Is that good or bad? From the
uniformity point of view - good. For efficiency - bad. I've seen a lot of
reinvented char-as-byte wheels in professional packages.
Andrew.