To Walter, about char[] initialization by FF

Andrew Fedoniouk news at terrainformatica.com
Sat Jul 29 23:47:46 PDT 2006


"Unknown W. Brackets" <unknown at simplemachines.org> wrote in message 
news:eahcqu$4d$1 at digitaldaemon.com...
> It really sounds to me like you're looking for UCS-2, then (e.g. as used 
> in JavaScript, etc.)  For that, length calculation (which is what I 
> presume you mean) is inexpensive.
>

Well, let's speak in terms of JavaScript if that is easier:

String.substring(start, end)...

What do these start and end mean to you?
I don't think you would be interested in byte indexes
into a UTF-8 sequence.
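
To make the point concrete, here is a rough C sketch of what a codepoint-based
index costs when the text is stored as UTF-8. The function name utf8_offset is
my own invention for illustration, and the sketch assumes the input is valid,
NUL-terminated UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte offset of codepoint number `index` in a valid UTF-8
   string, or the byte length of the string if it holds fewer codepoints.
   (Hypothetical helper, not part of any library.) */
size_t utf8_offset(const char *s, size_t index)
{
    size_t i = 0;
    assert(s != NULL);
    while (s[i] != '\0') {
        /* Continuation bytes look like 10xxxxxx; any other byte
           starts a new codepoint. */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (index == 0)
                return i;
            --index;
        }
        ++i;
    }
    return i;
}
```

Finding codepoint N means walking the bytes from the start, so a
substring by codepoint index is linear in the string length rather than
constant time.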

> As to your below assertion, I disagree.  What I think you meant was:
>
> "char[] is not designed for effective multi-byte text processing."

What is "multi-byte text processing"?
Processing of text, i.e. a sequence of codepoints of the alphabet?
What is 'multi-byte' doing there? By multi-byte I believe you mean
a method of encoding codepoints for transmission. Is this correct?

You need real codepoints to do something meaningful with them...
How these codepoints are stored in memory (as byte, word or dword)
depends on your task, the amount of memory you have and the alphabet
you are using.
E.g. if you are counting the frequency of Russian words used on the Internet,
you had better not do this in Java: its 16-bit chars make it twice as
expensive as C with a single-byte encoding, without any need.
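
A small C illustration of that factor of two, assuming KOI8-R as the
single-byte encoding and 16-bit units for the Java-style representation.
The array and function names are made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The word "привет" (6 letters) in two in-memory representations. */
static const unsigned char word_koi8r[] =      /* KOI8-R, 1 byte per letter */
    { 0xD0, 0xD2, 0xC9, 0xD7, 0xC5, 0xD4 };
static const uint16_t word_utf16[] =           /* 16-bit units, 2 bytes per letter */
    { 0x043F, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442 };

/* Bytes occupied by each representation. */
size_t koi8r_bytes(void) { return sizeof word_koi8r; }
size_t utf16_bytes(void) { return sizeof word_utf16; }

/* Extra bytes the 16-bit form costs over the single-byte form. */
size_t utf16_overhead(void)
{
    assert(sizeof word_utf16 >= sizeof word_koi8r);
    return sizeof word_utf16 - sizeof word_koi8r;
}
```

The same six letters take 6 bytes in one form and 12 in the other.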

So the phrase "multi-byte text processing" is fuzzy to me.

(Seems like I am not clear enough with my subset of English.)

>
> I will agree that wchar[] would be much better in that case, and even that 
> limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably 
> make things significantly easier to work with.
>
> Nonetheless, I was only commenting on how D is currently designed and 
> implemented.  Perhaps there was some misunderstanding here.
>
> Even so, I don't see how initializing it to FF makes any problem.  I think 
> everyone understands that char[] is meant to hold UTF-8, and if you don't 
> like that or don't want to use it, there are other methods available to 
> you (heh, you can even use UTF-32!)
>
> I don't see that the initialization of these variables will cause anyone 
> any problems.  The only time I want such a variable initialized to 0 is 
> when I use a numeric type, not a character type (and then, I try to use = 
> 0 anyway.)
>
> It seems like what you may want to do is simply this:
>
> typedef ushort ucs2_t = 0;
>
> And use that type.  Mission accomplished.  Or, use various different 
> encodings - in which case I humbly suggest:
>
> typedef ubyte latin1_t = 0;
> typedef ushort ucs2_t = 0;
> typedef ubyte koi8r_t = 0;
> typedef ubyte big5_t = 0;
>
> And so on, so on, so on...
>
> -[Unknown]

I like the last statement "..., so on, so on..."
Sounds promising enough.

Just for information:
strlen(const char* str) works with *all*
single-byte encodings in C.
For multi-byte encodings (e.g. UTF-8) it returns
the length of the sequence in octets.
But these are, strictly speaking, not chars in terms of C
but bytes (unsigned chars).
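
A minimal C sketch of the difference, assuming valid NUL-terminated UTF-8
input. utf8_codepoints is a hypothetical helper written for this example,
not a standard function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count codepoints in a valid UTF-8 string by counting every byte
   that is not a continuation byte (bit pattern 10xxxxxx). */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the two-letter Russian word "Да" encoded as the bytes D0 94 D0 B0,
strlen() reports 4 octets while utf8_codepoints() reports 2 codepoints.
For plain ASCII the two agree.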


>
>
>> So statement: "char[] in D supposed to hold only UTF-8 encoded text"
>> immediately leads us to "D is not designed for effective text 
>> processing".
>>
>> Is this logic clear? 




