UTF-8 issues
    Oskar Linde 
    oskar.lindeREM at OVEgmail.com
       
    Tue Sep 16 00:00:48 PDT 2008
    
    
  
Eldar Insafutdinov wrote:
> Benji Smith Wrote:
>> D has the array slice syntax, not possible with C++:
> 
>> char[] s1 = "hello world";
>> char[] s2 = s1[6 .. 11];	// s2 is "world"
> 
> So this example is only correct in case of latin chars, but in general it is wrong for UTF-8 strings.

It is not wrong for UTF-8 strings. It just won't work for arbitrary
indices, since a byte offset may land in the middle of a multi-byte
character. But I don't think you will ever use arbitrary indices. All
indices will be the result of other string functions (such as find),
which behave correctly for UTF-8 strings. Incrementing/decrementing can
be done using std.utf or similar. UTF-8 also makes it very easy to
determine if an arbitrary position in a UTF-8 sequence lies at the start
or in the middle of a multi-byte encoded character: continuation bytes
always have the form 10xxxxxx.
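
As a rough sketch of the points above, assuming present-day D2 and Phobos
(std.string.indexOf standing in for find, and std.utf.stride for stepping),
something like the following should work. The helper startsCharacter is a
hypothetical name made up for this illustration, not a library function.

```d
// Sketch only: assumes D2 Phobos; startsCharacter is a hypothetical helper.
import std.string : indexOf;
import std.utf : stride;

// True if the code unit at byte offset i begins a character.
// UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
bool startsCharacter(const(char)[] s, size_t i)
{
    return (s[i] & 0xC0) != 0x80;
}

void main()
{
    string s = "héllo wörld";            // 'é' and 'ö' take two bytes each

    // The index comes from a search function, so it sits on a boundary.
    immutable found = s.indexOf("wörld");
    assert(found == 7);                  // a byte offset, not a character count
    size_t start = found;
    assert(s[start .. $] == "wörld");

    // Step forward one character at a time: stride gives its length in bytes.
    size_t next = start + stride(s, start);  // past 'w' (1 byte)
    next += stride(s, next);                 // past 'ö' (2 bytes)
    assert(s[start .. next] == "wö");

    assert(startsCharacter(s, next - 2));    // lead byte of 'ö'
    assert(!startsCharacter(s, next - 1));   // continuation byte of 'ö'
}
```

The slice indices here are byte offsets throughout; they stay valid because
each one was produced by a function that only ever stops on a character
boundary.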

Indexing a UTF-8 string by character rather than by byte index is
horribly inefficient, because finding the n-th character means scanning
from the start of the string. As others have said, if you really need to
do that, use dchar[] (1). Although I've personally never come across a
place where I needed that.
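
For comparison, a minimal sketch of the dchar[] route, again assuming D2
Phobos (std.conv.to for the conversion and std.range.walkLength for counting
code points). The conversion costs one pass and four bytes per code point,
after which indexing is constant time.

```d
// Sketch only: assumes D2 Phobos (std.conv.to, std.range.walkLength).
import std.conv : to;
import std.range : walkLength;

void main()
{
    string s = "héllo wörld";

    // In UTF-8 the array length is in bytes; counting characters
    // means decoding the whole string.
    assert(s.length == 13);
    assert(s.walkLength == 11);

    // One-time conversion to UTF-32: one array element per code point.
    dstring d = s.to!dstring;
    assert(d.length == 11);
    assert(d[1] == 'é');
    assert(d[6 .. 11] == "wörld"d);   // indices now count code points
}
```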

1) Be aware that you will need to make sure your data is in a composed
Unicode normal form (such as NFC); otherwise it could still use several
code points (2) to represent a single grapheme. Even then, graphemes
without a precomposed form will still take several code points.

2) A code point is a point in the Unicode codespace, which is what a
dchar encodes.
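
To illustrate footnote 1, here is a small sketch assuming the std.uni module
of present-day D2 Phobos (normalize, NFC and byGrapheme; none of these
existed when this was written). It composes a decomposed sequence and also
shows a grapheme that stays at two code points after normalization.

```d
// Sketch only: assumes D2 Phobos std.uni (normalize, NFC, byGrapheme).
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one grapheme, but two code points.
    dstring decomposed = "e\u0301"d;
    assert(decomposed.length == 2);

    // NFC composes the pair into the single code point U+00E9.
    dstring composed = normalize!NFC(decomposed);
    assert(composed.length == 1);
    assert(composed[0] == '\u00E9');

    // 'q' + U+0301 has no precomposed form, so it stays at
    // two code points even in NFC, while still being one grapheme.
    dstring stubborn = normalize!NFC("q\u0301"d);
    assert(stubborn.length == 2);
    assert(stubborn.byGrapheme.walkLength == 1);
}
```
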
-- 
Oskar
    
    