UTF-8 issues

Mon Sep 15 22:53:14 PDT 2008

Benji Smith Wrote:

> Eldar Insafutdinov wrote:
> > I faced some issues with utf-8 support in D.
> 
> The important thing to remember is that a string is absolutely NOT an 
> array of characters, and you can't treat it as such.
> 
> As you've noticed, a char[] string is actually an array of UTF-8 encoded 
> bytes. Iterating directly through that array is extremely touchy and 
> error-prone. Instead, always use the standard library functions.
> 
> D1/Tango:
> 
> http://dsource.org/projects/tango/docs/current/tango.text.Util.html
> http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html
> 
> D1/Phobos:
> 
> http://digitalmars.com/d/1.0/phobos/std_utf.html
> 
> D2/Phobos:
> 
> http://digitalmars.com/d/2.0/phobos/std_utf.html
> 
> Although the libraries do a decent job of hiding the ugly details, my 
> opinion (which is not very popular around here) is that D's string 
> processing is a major design flaw.
> 
> > As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin (strictly, ASCII-only) strings. When a string contains, for example, Cyrillic characters, the length is wrong, and indexing and slicing don't work either.
> 
> Indexing, slicing, and length calculation of D strings are based on 
> byte position, not character position.
> 
> Character-position indexing and slicing is only possible by iterating 
> from the beginning of the string, decoding the characters on-the-fly, 
> and keeping track of the number of bytes used by each character.
> 
> That's what the standard library functions basically do.
> 
> Calculating the actual character-length of the string is fundamentally 
> the same as in C, where strings are null-terminated (e.g., you can't 
> determine the actual length of the string until you've iterated from the 
> beginning to the end).
> 
> The Phobos & Tango libraries handle all of those details for you, but I 
> think it's important to know what's going on behind the scenes, so that 
> you have a rough idea of the true cost of each operation.
> 
> --benji

Yeah, I know that these operations work on bytes rather than on characters. But http://www.digitalmars.com/d/2.0/cppstrings.html states explicitly that strings support slicing:

>D has the array slice syntax, not possible with C++:

>char[] s1 = "hello world";
>char[] s2 = s1[6 .. 11];	// s2 is "world"

So this example is correct only for ASCII (plain Latin) characters. In general it is wrong for UTF-8 strings, because the slice bounds are byte offsets, not character positions.
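
To make the difference concrete, here is a minimal sketch, assuming D2 with Phobos' std.utf (count and toUTFindex). The helper sliceByCodePoint is a hypothetical name of my own, not part of the library. It slices by character (code point) position by first converting each position to a byte offset, which means walking the string from the beginning:

```d
import std.stdio : writeln;
import std.utf : count, toUTFindex;

// Hypothetical helper (not part of Phobos): slice by code point
// positions instead of byte offsets.
string sliceByCodePoint(string s, size_t from, size_t to)
{
    // toUTFindex walks the string from the start and converts a
    // code point position into the corresponding byte offset.
    immutable begin = toUTFindex(s, from);
    immutable end = toUTFindex(s, to);
    return s[begin .. end];
}

void main()
{
    string latin = "hello world";
    string cyrillic = "привет мир";

    // .length counts UTF-8 bytes (code units), not characters.
    writeln(latin.length);      // 11: one byte per ASCII character
    writeln(cyrillic.length);   // 19: two bytes per Cyrillic letter, one for the space
    writeln(count(cyrillic));   // 10: number of code points

    // Byte-offset slicing happens to work for ASCII text...
    writeln(latin[6 .. 11]);                     // world

    // ...but for Cyrillic text the positions must be converted first.
    writeln(sliceByCodePoint(cyrillic, 7, 10));  // мир
}
```

Note that each call to toUTFindex is linear in the position, so this kind of slicing is not the constant-time operation that a plain byte slice is.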