std.string will get the boot

Sat Jan 30 19:31:42 PST 2010

On 2010-01-30 22:06:06 -0500, Lionello Lunesu <lio at lunesu.remove.com> said:

> On 30-1-2010 1:59, Andrei Alexandrescu wrote:
>> bearophile wrote:
>>> Andrei Alexandrescu:
>>>> Currently arrays of characters count as random-access ranges, which
>>>> is not true for arrays of char and wchar. I plan to make std.range
>>>> aware of that and only characterize char[] and wchar[] (and their
>>>> qualified versions) as bidirectional ranges.
>>> 
>>> 32 bits are not enough to represent certain "characters", they need
>>> more than one of such dchar. So dchar too may be a bidirectional range.
>> 
>> [citation needed]
> 
> I also doubt 32-bit is not enough. In fact, Unicode has 0x10FFFF
> as the highest code point.

32 bits are enough to cover all code points. But there are many combining
code points in Unicode, allowing you to combine a diacritic with various
other characters, such as an acute accent with a 'k'. Some of these
combinations also exist in precombined form, and the two forms are
considered equivalent. So if you want to count the number of characters
the user actually sees instead of counting code points, then you need to
take these combining code points into account.
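
To make that concrete, here is a small sketch. It is in Python rather
than D only because Python's standard library ships the `unicodedata`
tables; the helper name `visible_count` is my own invention, and it is a
rough approximation, not a real grapheme counter.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # U+00E9, the precombined form of the same letter

# The two strings are different code point sequences of different lengths.
assert decomposed != precomposed
assert len(decomposed) == 2 and len(precomposed) == 1

# Normalization maps the equivalent forms onto each other.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed


def visible_count(text):
    """Count code points that are not combining marks (hypothetical helper).

    This only approximates what the user sees: it skips every code point
    whose canonical combining class is non-zero.
    """
    return sum(1 for cp in text if unicodedata.combining(cp) == 0)


# An acute accent combined with a 'k': two code points, one visible letter.
assert len("k\u0301") == 2
assert visible_count("k\u0301") == 1
```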

But if you really wanted to iterate over "characters" instead of code
points, note that it can become quite hard once you take into account
double diacritics, which are combining marks placed across two letters.
So I think it's reasonable to have dchar, a code point, as the base
unit for iterating over a string.
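
A sketch of why this gets hard, again in Python with the standard
`unicodedata` module. The function `naive_clusters` is a made-up name
for the obvious first attempt: attach each combining mark to the code
point before it. It assumes nothing beyond the canonical combining
class, so it is not a proper segmentation algorithm.

```python
import unicodedata


def naive_clusters(text):
    """Group each code point with the combining marks that follow it.

    Hypothetical helper: a mark (non-zero combining class) is appended to
    the previous cluster, anything else starts a new cluster.
    """
    clusters = []
    for cp in text:
        if clusters and unicodedata.combining(cp) != 0:
            clusters[-1] += cp
        else:
            clusters.append(cp)
    return clusters


# Simple case: the acute accent stays with its 'k'.
assert naive_clusters("k\u0301a") == ["k\u0301", "a"]

# U+0361 COMBINING DOUBLE INVERTED BREVE is drawn across both 't' and 's',
# yet the naive grouping attaches it to 't' alone and leaves 's' separate.
assert naive_clusters("t\u0361s") == ["t\u0361", "s"]
```

Whether "t", the tie and "s" should count as one character or two is
exactly the kind of question that iterating by code point lets you avoid.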

http://en.wikipedia.org/wiki/Combining_character
http://en.wikipedia.org/wiki/Unicode_normalization

Another interesting case:
http://en.wikipedia.org/wiki/Combining_grapheme_joiner

Unicode, isn't it great?

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/