UTF-8 issues
Benji Smith
dlanguage at benjismith.net
Mon Sep 15 21:53:37 PDT 2008
Eldar Insafutdinov wrote:
> I faced some issues with utf-8 support in D.
The important thing to remember is that a string is absolutely NOT an
array of characters, and you can't treat it as such.
As you've noticed, a char[] string is actually an array of UTF-8 encoded
bytes. Iterating directly through that array is extremely touchy and
error-prone. Instead, always use the standard library functions.
D1/Tango:
http://dsource.org/projects/tango/docs/current/tango.text.Util.html
http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html
D1/Phobos:
http://digitalmars.com/d/1.0/phobos/std_utf.html
D2/Phobos:
http://digitalmars.com/d/2.0/phobos/std_utf.html
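As a rough illustration of what those functions buy you, here is a minimal D2/Phobos sketch. It assumes std.utf.toUTF32 and a UTF-8 source file; the sample string is only an example. After conversion, each array element is one whole code point, so indexing behaves the way you'd expect:

```d
import std.stdio;
import std.utf : toUTF32;

void main()
{
    string s = "привет";        // UTF-8: 6 Cyrillic letters stored in 12 bytes
    dstring d = toUTF32(s);     // UTF-32: one dchar per code point

    writeln(d.length);          // 6
    writeln(d[1]);              // р
}
```

The price is a full pass over the string plus a new allocation at four bytes per code point.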
Although the libraries do a decent job of hiding the ugly details, my
opinion (which is not very popular around here) is that D's string
processing is a major design flaw.
> As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic chars, the length is wrong, and indexing and slicing don't work either.
Indexing, slicing, and length calculation of D strings are based on
byte position, not character position.
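A small D2 sketch of what that means in practice (the sample string is just an illustration; each Cyrillic letter takes two bytes in UTF-8):

```d
import std.stdio;

void main()
{
    string s = "привет";    // 6 letters, 2 bytes each

    writeln(s.length);      // 12: the number of bytes, not letters

    // Slice bounds are byte offsets. s[0 .. 2] happens to cover exactly
    // the first letter; s[0 .. 1] would cut that letter in half and
    // leave an invalid UTF-8 fragment.
    writeln(s[0 .. 2]);     // п
}
```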
Character-position indexing and slicing is only possible by iterating
from the beginning of the string, decoding the characters on-the-fly,
and keeping track of the number of bytes used by each character.
That's what the standard library functions basically do.
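Here is a minimal sketch of that walk in D2, assuming std.utf.stride (which reports how many bytes the sequence starting at a given index occupies). The helpers byteOffset and charSlice are hypothetical names made up for this example, not library functions:

```d
import std.stdio;
import std.utf : stride;

// Hypothetical helper: byte offset of the character at position charIndex,
// found by walking from the start of the string.
size_t byteOffset(string s, size_t charIndex)
{
    size_t i = 0;                               // current byte offset
    for (size_t n = 0; n < charIndex && i < s.length; ++n)
        i += stride(s, i);                      // skip one encoded character
    return i;
}

// Hypothetical helper: slice by character positions [from, to).
string charSlice(string s, size_t from, size_t to)
{
    return s[byteOffset(s, from) .. byteOffset(s, to)];
}

void main()
{
    writeln(charSlice("привет", 1, 3));         // ри
}
```

Every call walks from the beginning, so character-position access costs time proportional to the position, unlike a plain byte index.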
Calculating the actual character-length of the string is fundamentally
the same as in C, where strings are null-terminated (i.e., you can't
determine the actual length of the string until you've iterated from the
beginning to the end).
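A sketch of that linear scan in D2, relying on the fact that a foreach over a string with a dchar loop variable decodes the UTF-8 as it goes. The function name charCount is made up for this example, and "character" here means a Unicode code point, so a combining sequence counts as more than one:

```d
import std.stdio;

// Hypothetical helper: count code points by decoding the whole string.
size_t charCount(string s)
{
    size_t n = 0;
    foreach (dchar c; s)    // decodes one code point per iteration
        ++n;
    return n;
}

void main()
{
    string s = "привет";
    writeln(s.length);      // 12 (bytes, constant time)
    writeln(charCount(s));  // 6  (code points, a full pass over the string)
}
```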
The Phobos and Tango libraries handle all of those details for you, but I
think it's important to know what's going on behind the scenes, so that
you have a rough idea of the true cost of each operation.
--benji