UTF-8 issues
Eldar Insafutdinov
e.insafutdinov at mail.ru
Mon Sep 15 22:53:14 PDT 2008
Benji Smith Wrote:
> Eldar Insafutdinov wrote:
> > I faced some issues with utf-8 support in D.
>
> The important thing to remember is that a string is absolutely NOT an
> array of characters, and you can't treat it as such.
>
> As you've noticed, a char[] string is actually an array of UTF-8 encoded
> bytes. Iterating directly through that array is extremely touchy and
> error-prone. Instead, always use the standard library functions.
>
> D1/Tango:
>
> http://dsource.org/projects/tango/docs/current/tango.text.Util.html
> http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html
>
> D1/Phobos:
>
> http://digitalmars.com/d/1.0/phobos/std_utf.html
>
> D2/Phobos:
>
> http://digitalmars.com/d/2.0/phobos/std_utf.html
>
> Although the libraries do a decent job of hiding the ugly details, my
> opinion (which is not very popular around here) is that D's string
> processing is a major design flaw.
>
> > As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic characters, the length is wrong, and indexing and slicing don't work either.
>
> Indexing, slicing, and length calculation of D strings are based on
> byte position, not character position.
>
> Character-position indexing and slicing is only possible by iterating
> from the beginning of the string, decoding the characters on-the-fly,
> and keeping track of the number of bytes used by each character.
>
> That's what the standard library functions basically do.
>
> Calculating the actual character-length of the string is fundamentally
> the same as in C, where strings are null-terminated (e.g., you can't
> determine the actual length of the string until you've iterated from the
> beginning to the end).
>
> The Phobos & Tango libraries handle all of those details for you, but I
> think it's important to know what's going on behind the scenes, so that
> you have a rough idea of the true cost of each operation.
>
> --benji
Yeah, I know that these operations work on bytes rather than on characters. But it is stated explicitly at http://www.digitalmars.com/d/2.0/cppstrings.html that strings support slicing:
> D has the array slice syntax, not possible with C++:
> char[] s1 = "hello world";
> char[] s2 = s1[6 .. 11]; // s2 is "world"
So this example is only correct for Latin (single-byte) characters; in general it is wrong for UTF-8 strings, because the slice bounds are byte offsets.
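
To make the difference concrete, here is a minimal sketch for D2/Phobos. It assumes std.utf.count and std.utf.toUTFindex behave as documented; sliceByCodePoint is a hypothetical helper name of my own, not a Phobos function, and the Cyrillic test string is just an illustration.

```d
import std.stdio : writeln;
import std.utf : count, toUTFindex;

// Hypothetical helper (not part of Phobos): slice by code point
// positions [from, to) instead of by byte offsets.
string sliceByCodePoint(string s, size_t from, size_t to)
{
    // toUTFindex walks the string from the beginning and returns the
    // byte offset at which code point number n starts, so each call
    // costs time proportional to n.
    immutable lo = toUTFindex(s, from);
    immutable hi = toUTFindex(s, to);
    return s[lo .. hi];
}

void main()
{
    string latin = "hello world";
    string cyr = "привет мир";

    writeln(latin.length);  // 11: one byte per character
    writeln(cyr.length);    // 19: nine 2-byte letters plus one space
    writeln(count(cyr));    // 10: number of code points

    // Byte-based slicing is only right for the single-byte case.
    writeln(latin[6 .. 11]);                   // world

    // cyr[7 .. 10] would cut through the middle of a character and
    // yield invalid UTF-8, so slice on code point positions instead.
    writeln(sliceByCodePoint(cyr, 7, 10));     // мир
}
```

Note that this still counts code points, not what a reader would call characters when combining marks are involved, and that every call rescans the string from the start.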