UTF-8 issues

Tue Sep 16 00:00:48 PDT 2008

Eldar Insafutdinov wrote:
> Benji Smith Wrote:
>> D has the array slice syntax, not possible with C++:
> 
>> char[] s1 = "hello world";
>> char[] s2 = s1[6 .. 11];	// s2 is "world"
> 
> So this example is only correct in case of latin chars, but in general it is wrong for UTF-8 strings.

It is not wrong for UTF-8 strings. It just won't work for arbitrary 
indices. But I don't think you will ever use arbitrary indices. All 
indices will be the result of other string functions (such as find) 
which behave correctly for UTF-8 strings. Moving an index forward or 
backward by one character can be done using std.utf or similar. UTF-8 
also makes it very easy to determine whether an arbitrary position in a 
UTF-8 sequence lies at the start or in the middle of a multi-byte 
encoded character: continuation bytes always have the bit pattern 
10xxxxxx, and no other byte does.
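
As a rough sketch of the above, assuming a current Phobos (where the 
search function is spelled std.string.indexOf) and a helper name, 
isCodePointStart, that is made up for this example: the index returned 
by the search is a byte offset, so slicing with it is safe, and the 
10xxxxxx test tells you whether any other offset is.

```d
import std.string : indexOf;
import std.utf : stride;

// Hypothetical helper: true if byte i of s begins a code point,
// i.e. it is not a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool isCodePointStart(const(char)[] s, size_t i)
{
    return (s[i] & 0xC0) != 0x80;
}

void main()
{
    string s = "héllo wörld";        // 'é' and 'ö' take two bytes each

    // indexOf returns a byte offset, so it is a valid slice boundary.
    auto i = s.indexOf("wörld");
    assert(i == 7);
    assert(s[i .. $] == "wörld");

    // stride gives the byte length of the code point starting at an index,
    // which is what you add to step forward by one character.
    assert(stride(s, 1) == 2);       // 'é'
    assert(stride(s, i) == 1);       // 'w'

    assert(isCodePointStart(s, 1));  // lead byte of 'é'
    assert(!isCodePointStart(s, 2)); // continuation byte of 'é'
}
```

Slicing at an offset for which isCodePointStart is false would cut a 
character in half, which is the only way the original example goes 
wrong.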

Indexing a UTF-8 string by character rather than by byte is horribly 
inefficient, because every lookup has to scan from the start of the 
string. As others have said, if you really need to do that, use 
dchar[] (1). Although, I've never personally come across a place where 
I needed that.
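
To illustrate the cost, here is a minimal sketch. The helper 
nthCodePoint is hypothetical and only there to show the linear scan; 
std.utf.toUTF32 is the Phobos conversion to a UTF-32 string in a 
current release.

```d
import std.utf : toUTF32;

// Hypothetical helper: n-th code point of a UTF-8 string.
// It has to decode from the beginning, so each call is O(n).
dchar nthCodePoint(const(char)[] s, size_t n)
{
    size_t k = 0;
    foreach (dchar c; s)   // foreach with a dchar variable decodes UTF-8
    {
        if (k++ == n)
            return c;
    }
    assert(0, "index out of range");
}

void main()
{
    string s = "héllo";
    assert(s.length == 6);            // length in bytes, not characters
    assert(s[1] == 0xC3);             // s[1] is only the lead byte of 'é'
    assert(nthCodePoint(s, 1) == 'é');

    // One-time O(n) conversion; after that, indexing is O(1) per lookup.
    dstring d = s.toUTF32;
    assert(d.length == 5);
    assert(d[1] == 'é');
}
```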

1) Be aware that you will need to make sure your data is in a composed 
Unicode normal form (such as NFC), otherwise a single grapheme could 
still be spread over several code points (2). Even then, graphemes 
that have no precomposed code point still take more than one.

2) A code point is a point in the Unicode codespace, which is what a 
dchar encodes.
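
A small sketch of footnote 1, assuming the std.uni module of a current 
Phobos (normalize, NFC and byGrapheme); graphemeCount is a made-up 
helper name for this example.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: number of graphemes (user-perceived characters).
size_t graphemeCount(const(dchar)[] s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points,
    // one grapheme.
    dstring decomposed = "e\u0301"d;
    assert(decomposed.length == 2);
    assert(graphemeCount(decomposed) == 1);

    // NFC composes the pair into the single code point U+00E9.
    dstring composed = normalize!NFC(decomposed);
    assert(composed == "\u00E9"d);

    // 'q' with U+0303 COMBINING TILDE has no precomposed form, so it
    // stays two code points even after NFC.
    dstring q = normalize!NFC("q\u0303"d);
    assert(q.length == 2);
    assert(graphemeCount(q) == 1);
}
```

So d[i] on a dchar[] gives you code point i, which matches grapheme i 
only when every grapheme in the text has a precomposed form and the 
text has been normalized to use it.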

-- 
Oskar