To Walter, about char[] initialization by FF

Unknown W. Brackets unknown at simplemachines.org
Sun Jul 30 09:51:32 PDT 2006


Yes, you're right, most of the time I wouldn't (although a significant 
portion of the time, I would.)  Even so, this is why I would use UCS-2, 
and not UTF-8.  Why are you held up on char[]?

My point is that char[] is only trouble when you're dealing with text 
that is not ISO-8859-1.  I'm a great fan of localization and 
internationalization, but in all honesty the largest part of my text 
processing/analysis is with ISO-8859-1 strings.

Generally, user input I don't analyze.  Caret placement I leave to be 
handled by the libraries I use.  That is, when I use char[].

So again, I will agree that, in D, char[] is not a good choice for 
strings you are expecting to contain possibly-internationalized data.

I'm perfectly aware of what strlen (and str.length in D) do... it's 
similar to what they do in practically all other languages (unless you 
know the encoding is UCS-2, etc.)  For example, I work with PHP a lot 
and it doesn't even have (with the versions I support) built-in support 
for Unicode.  This makes text processing fun!

-[Unknown]


> "Unknown W. Brackets" <unknown at simplemachines.org> wrote in message 
> news:eahcqu$4d$1 at digitaldaemon.com...
>> It really sounds to me like you're looking for UCS-2, then (e.g. as used 
>> in JavaScript, etc.)  For that, length calculation (which is what I 
>> presume you mean) is inexpensive.
>>
> 
> Well, let's speak in terms of JavaScript if it is easier:
> 
> String.substring(start, end)...
> 
> What do these start and end mean to you?
> I don't think that you will be interested in indexes
> of bytes in a UTF-8 sequence.
> 
>> As to your below assertion, I disagree.  What I think you meant was:
>>
>> "char[] is not designed for effective multi-byte text processing."
> 
> What is "multi-byte text processing"?
> Processing of text, i.e. a sequence of codepoints of the alphabet?
> What is 'multi-byte' doing there? By multi-byte I believe you mean
> a method of encoding codepoints for transmission. Is this correct?
> 
> You need real codepoints to do something meaningful with them...
> How these codepoints are stored in memory (as byte, word or dword)
> depends on your task, the amount of memory you have and the alphabet
> you are using.
> E.g. if you are counting the frequency of Russian words used on the
> internet, you had better not do this in Java: it is twice as expensive
> as in C without any need.
> 
> So the phrase "multi-byte text processing" is fuzzy on this end.
> 
> (Seems like I am not clear enough with my subset of English.)
> 
>> I will agree that wchar[] would be much better in that case, and even that 
>> limiting it to UCS-2 (which is, afaik, a subset of UTF-16) would probably 
>> make things significantly easier to work with.
>>
>> Nonetheless, I was only commenting on how D is currently designed and 
>> implemented.  Perhaps there was some misunderstanding here.
>>
>> Even so, I don't see how initializing it to FF causes any problem.  I think 
>> everyone understands that char[] is meant to hold UTF-8, and if you don't 
>> like that or don't want to use it, there are other methods available to 
>> you (heh, you can even use UTF-32!)
>>
>> I don't see that the initialization of these variables will cause anyone 
>> any problems.  The only time I want such a variable initialized to 0 is 
>> when I use a numeric type, not a character type (and then, I try to use = 
>> 0 anyway.)
>>
>> It seems like what you may want to do is simply this:
>>
>> typedef ushort ucs2_t = 0;
>>
>> And use that type.  Mission accomplished.  Or, use various different 
>> encodings - in which case I humbly suggest:
>>
>> typedef ubyte latin1_t = 0;
>> typedef ushort ucs2_t = 0;
>> typedef ubyte koi8r_t = 0;
>> typedef ubyte big5_t = 0;
>>
>> And so on, so on, so on...
>>
>> -[Unknown]
> 
> I like the last statement "..., so on, so on..."
> Sounds promising enough.
> 
> Just for information:
> strlen(const char* str) works with *all*
> single-byte encodings in C.
> For multi-byte encodings (e.g. UTF-8) it returns
> the length of the sequence in octets.
> Strictly speaking, these are not chars in terms of C
> but bytes (unsigned chars).
> 
> 
>>
>>> So the statement "char[] in D is supposed to hold only UTF-8 encoded text"
>>> immediately leads us to "D is not designed for effective text 
>>> processing".
>>>
>>> Is this logic clear? 
> 
> 


