UTF-8 issues

Benji Smith dlanguage at benjismith.net
Mon Sep 15 21:53:37 PDT 2008


Eldar Insafutdinov wrote:
> I faced some issues with utf-8 support in D.

The important thing to remember is that a string is absolutely NOT an 
array of characters, and you can't treat it as such.

As you've noticed, a char[] string is actually an array of UTF-8 encoded 
bytes. Iterating directly through that array is extremely touchy and 
error-prone. Instead, always use the standard library functions.

D1/Tango:

http://dsource.org/projects/tango/docs/current/tango.text.Util.html
http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html

D1/Phobos:

http://digitalmars.com/d/1.0/phobos/std_utf.html

D2/Phobos:

http://digitalmars.com/d/2.0/phobos/std_utf.html

Although the libraries do a decent job of hiding the ugly details, my 
opinion (which is not very popular around here) is that D's string 
processing is a major design flaw.

> As it stated in http://www.digitalmars.com/d/2.0/cppstrings.html strings support slicing and length-calculation. Since strings are char arrays this is correct only for latin strings. So when the strings for example cyrillic chars - length is wrong, indexing also doesn't work, and slicing too.

Indexing, slicing, and length-calculation of D strings are based on 
byte position, not character position.
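
To make that concrete, here is a minimal sketch, assuming D2/Phobos and
a source file saved as UTF-8. The sample string is my own illustration;
each of its six Cyrillic letters takes two bytes in UTF-8.

```d
import std.stdio;

void main()
{
    string s = "Привет"; // six Cyrillic letters, two UTF-8 bytes each

    // .length counts UTF-8 code units (bytes), not letters.
    assert(s.length == 12);

    // s[0] is the first byte of 'П' (encoded as D0 9F), not the letter.
    assert(s[0] == 0xD0);

    // Slice bounds are byte offsets too. This slice happens to fall on
    // a character boundary, so the result is still valid UTF-8.
    assert(s[0 .. 2] == "П");

    writeln(s.length);
}
```

A slice such as s[0 .. 1] would compile just as happily, but it cuts the
first letter in half and leaves you with an invalid UTF-8 sequence.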

Character-position indexing and slicing is only possible by iterating 
from the beginning of the string, decoding the characters on-the-fly, 
and keeping track of the number of bytes used by each character.

That's what the standard library functions basically do.
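
As a rough sketch of that loop (assuming D2/Phobos), std.utf.decode
reads one code point and advances a byte index past it. The helper name
byteOffset is hypothetical, not a library function, and "character"
here means a Unicode code point.

```d
import std.utf : decode;

// Hypothetical helper: byte offset at which the n-th code point
// (0-based) of s begins. Stops at s.length if s is shorter than that.
size_t byteOffset(string s, size_t n)
{
    size_t i = 0; // current byte offset into s
    while (n > 0 && i < s.length)
    {
        decode(s, i); // decodes one code point and advances i past it
        --n;
    }
    return i;
}

void main()
{
    string s = "Привет, мир";

    // Character-position slicing: convert character positions to byte
    // offsets first, then slice with the byte offsets.
    assert(s[0 .. byteOffset(s, 6)] == "Привет");
    assert(s[byteOffset(s, 8) .. $] == "мир");
}
```

Every call walks the string from the start, so the cost of
character-position indexing grows with the position you ask for.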

Calculating the actual character-length of the string is fundamentally 
the same as computing string length in C, where strings are 
null-terminated (i.e., you can't determine the actual length of the 
string until you've iterated from the beginning to the end).
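
For illustration, here are three ways to count code points, again
assuming D2/Phobos (walkLength is D2-only). All of them scan the whole
string, and the sample string is my own.

```d
import std.range : walkLength;
import std.utf : toUTF32;

void main()
{
    string s = "Привет";

    // foreach with a dchar loop variable decodes UTF-8 on the fly.
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    assert(n == 6);

    // walkLength steps through the string one code point at a time.
    assert(s.walkLength == 6);

    // Converting to UTF-32 gives one array element per code point, at
    // the cost of allocating a new array.
    assert(toUTF32(s).length == 6);

    // The byte length is a different number.
    assert(s.length == 12);
}
```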

The Phobos & Tango libraries handle all of those details for you, but I 
think it's important to know what's going on behind the scenes, so that 
you have a rough idea of the true cost of each operation.

--benji


