UTF-8 issues
Oskar Linde
oskar.lindeREM at OVEgmail.com
Tue Sep 16 00:00:48 PDT 2008
Eldar Insafutdinov wrote:
> Benji Smith wrote:
>> D has the array slice syntax, not possible with C++:
>
>> char[] s1 = "hello world";
>> char[] s2 = s1[6 .. 11]; // s2 is "world"
>
> So this example is only correct for Latin characters, but in general it is wrong for UTF-8 strings.

It is not wrong for UTF-8 strings. Slicing just won't work at arbitrary
indices. But I don't think you will ever use arbitrary indices. All
indices will be the result of other string functions (such as find),
which behave correctly for UTF-8 strings. Moving an index forward or
backward by one character can be done using std.utf or similar. UTF-8
also makes it very easy to determine whether an arbitrary position in a
UTF-8 sequence lies at the start or in the middle of a multi-byte
encoded character.
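
To make the points above concrete, here is a minimal sketch. It assumes a current D2 compiler and Phobos (where string literals are immutable, so string is used instead of char[]), with indexOf from std.string standing in for find. The helper isSequenceStart is a hypothetical name, not something from std.utf.

```d
import std.string : indexOf;
import std.utf : stride;

/// Hypothetical helper: true if byte index i in s is the first byte of an
/// encoded character. UTF-8 continuation bytes always look like 10xxxxxx.
bool isSequenceStart(const(char)[] s, size_t i)
{
    return (s[i] & 0xC0) != 0x80;
}

void main()
{
    string s = "héllo wörld";   // 'é' and 'ö' take two bytes each

    // indexOf returns a byte index that falls on a character boundary,
    // so slicing with it yields valid UTF-8.
    auto start = indexOf(s, "wörld");
    assert(start == 7);
    assert(s[start .. $] == "wörld");

    // stride gives the length in bytes of the character starting at an
    // index, which is what you add to step forward by one character.
    assert(stride(s, 1) == 2);

    assert(isSequenceStart(s, 1));    // first byte of 'é'
    assert(!isSequenceStart(s, 2));   // second byte of 'é'
}
```

Slicing at index 2 instead would cut 'é' in half, which is the "arbitrary index" case that does not work.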

Indexing a UTF-8 string by character rather than byte index is horribly
inefficient, because finding the n-th character means walking the string
from the start. As others have said, if you really need to do that, use
dchar[] (1). That said, I have never personally come across a place
where I needed that.
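
A rough sketch of the difference, again assuming current D2 and Phobos. byteOffsetOfChar is a hypothetical helper written for illustration, and dstring is the immutable form of dchar[].

```d
import std.conv : to;
import std.utf : stride;

/// Hypothetical helper: byte offset of the n-th character (0-based) in s.
/// It has to walk the string from the start, so the cost grows with n.
/// n must not exceed the number of characters in s.
size_t byteOffsetOfChar(const(char)[] s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
        i += stride(s, i);
    return i;
}

void main()
{
    string s = "héllo wörld";
    assert(s.length == 13);                 // length in bytes

    assert(byteOffsetOfChar(s, 2) == 3);    // 'h' is 1 byte, 'é' is 2

    // Converting once to UTF-32 gives one dchar per code point,
    // so indexing and slicing by character become constant-time.
    dstring d = s.to!dstring;
    assert(d.length == 11);
    assert(d[7] == 'ö');
    assert(d[6 .. 11] == "wörld"d);
}
```

The conversion itself is a full pass over the string and uses four bytes per code point, so it only pays off if you index by character repeatedly.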

1) Be aware that you will need to make sure your data is in a composed
Unicode normal form (such as NFC); otherwise it could still use several
code points (2) to represent a single grapheme.
2) A code point is a point in the Unicode codespace, which is what a
dchar encodes.
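
As an illustration of footnote 1, the sketch below assumes the std.uni module of current Phobos (normalize and byGrapheme), which did not exist in this form when this was written.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // 'e' followed by U+0301 (combining acute accent): one grapheme,
    // but two code points, so two dchars.
    dstring decomposed = "e\u0301"d;
    assert(decomposed.length == 2);
    assert(decomposed.byGrapheme.walkLength == 1);

    // After composing (NFC) the same text is the single code point U+00E9.
    dstring composed = normalize!NFC(decomposed);
    assert(composed.length == 1);
    assert(composed == "\u00E9"d);
}
```

Note that composition only helps where a precomposed code point exists. Some base-plus-combining-mark sequences have none and stay multi-code-point even in NFC.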
--
Oskar