To Walter, about char[] initialization by FF

Thomas Kuehne thomas-dloop at kuehne.cn
Wed Aug 2 12:10:37 PDT 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Fedoniouk wrote on 2006-07-31:
>
> "Thomas Kuehne" <thomas-dloop at kuehne.cn> wrote in message 
> news:ls52q3-3o8.ln1 at birke.kuehne.cn...
>>
>> Oskar Linde wrote on 2006-07-31:
>>> Serg Kovrov wrote:
>>
>>>> For example,
>>>> char[] str = "тест";
>>>> the word "test" in Russian is 4 Cyrillic characters, which would give you
>>>> str.length 8, so the length property is of no use unless you are sure
>>>> that the string contains Latin characters only.
>>>
>>> It is actually not very often that you need to count the number of
>>> characters as opposed to the number of (UTF-8) code units. Counting the
>>> number of characters is also a rather expensive operation. All the
>>> ordinary operations (searching, slicing, concatenation, sub-string
>>> search, etc) operate on code units rather than characters.
>>>
>>> It is easy to implement your own character count though:
>>>
>>> size_t count(char[] arr) {
>>>     size_t n = 0;
>>>     // a dchar loop variable makes foreach decode one code point per step
>>>     foreach (dchar c; arr)
>>>         n++;
>>>     return n;
>>> }
>>>
>>> assert("тест".count() == 4);
>>>
>>> Also note that:
>>>
>>> assert("тест"d.length == 4);
>>
>> I hate to be pedantic, but dchar[] can only be used to count the code
>> points, not the characters. A "character" can be composed of more than
>> one code point/dchar. This feature is frequently used for accents, marks
>> and some Asian scripts.
>>
>> -> http://www.unicode.org
>>
>
>
> Right, Thomas,
>
> an umlaut can exist as a separate code point,
> so A with umlaut can be represented by two code points.
> But as far as I remember, the intention was and is
> for Unicode to also have all full forms like "A-with-umlaut".

http://www.unicode.org/faq/char_combmark.html#13

I won't argue about the intention here.
Post this statement on 
<unicode at unicode.org> (http://www.unicode.org/consortium/distlist.html)
and let's see the various responses ;)


> So you can always "compress" multi-code-point forms into
> their single-code-point counterparts.

Not always. For a common use case see
http://www.unicode.org/faq/han_cjk.html#7
http://www.unicode.org/faq/han_cjk.html#9
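
To make the three different counts concrete, here is a minimal sketch in
present-day D. It assumes a Phobos that ships std.uni.byGrapheme,
std.uni.normalize and std.utf.byDchar, none of which were available when
this thread was written. The helper names codeUnits, codePoints and
graphemes are made up for illustration. The sequence 'g' + U+0308 is used
as an example that, as far as I know, has no precomposed code point.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;
import std.utf : byDchar;

// Number of UTF-8 code units (what char[].length reports).
size_t codeUnits(string s) { return s.length; }

// Number of code points (what dchar[].length would report).
size_t codePoints(string s) { return s.byDchar.walkLength; }

// Number of grapheme clusters, roughly "user-perceived characters".
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    string precomposed = "\u00C4";  // A-with-umlaut as one code point
    string combining   = "A\u0308"; // 'A' followed by COMBINING DIAERESIS

    assert(codeUnits(precomposed) == 2 && codePoints(precomposed) == 1);
    assert(codeUnits(combining) == 3 && codePoints(combining) == 2);
    assert(graphemes(precomposed) == 1 && graphemes(combining) == 1);

    // NFC normalization "compresses" this pair into the single code point.
    assert(normalize!NFC(combining) == precomposed);

    // 'g' + COMBINING DIAERESIS: assumed to have no precomposed form,
    // so NFC leaves two code points for one visible character.
    string noSingleForm = "g\u0308";
    assert(codePoints(normalize!NFC(noSingleForm)) == 2);
    assert(graphemes(noSingleForm) == 1);
}
```

So neither .length on char[] nor .length on dchar[] counts characters in
the general case, and normalization only helps where a precomposed form
happens to exist.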

Thomas


-----BEGIN PGP SIGNATURE-----

iD8DBQFE0QYbLK5blCcjpWoRArZiAJ4mVulttOK6bafuCZLt2Ini2lx4JACgjdC7
1DH/6rvW8qaSzRX5W0i+7jk=
=2pt0
-----END PGP SIGNATURE-----


