string is rarely useful as a function argument
Michel Fortin
michel.fortin at michelf.com
Sat Dec 31 12:48:14 PST 2011
On 2011-12-31 16:47:40 +0000, Sean Kelly <sean at invisibleduck.org> said:
> I don't know that Unicode expertise is really required here anyway. All one
> has to know is that UTF8 is a multibyte encoding and built-in string
> attributes talk in bytes. Knowing when one wants bytes vs characters isn't
> rocket science.
It's not bytes vs. characters, it's code units vs. code points vs.
user-perceived characters (grapheme clusters). One character can span
multiple code points, and can be represented in various ways depending
on which Unicode normalization you pick. But most people don't know
that.
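
To make the three levels concrete, here is a minimal sketch. It is written
in Python purely for illustration (an assumption of this example, not
anything taken from D or Phobos), and the helper name `describe` is made up.
It measures the same user-perceived character, "é", in two spellings:

```python
import unicodedata

# Two spellings of the same user-perceived character "é".
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

def describe(s):
    """Return (UTF-8 code units, UTF-16 code units, code points) for s."""
    utf8_units = len(s.encode("utf-8"))
    utf16_units = len(s.encode("utf-16-le")) // 2
    code_points = len(s)  # Python's str is indexed by code point
    return utf8_units, utf16_units, code_points

print(describe(precomposed))      # (2, 1, 1)
print(describe(decomposed))       # (3, 2, 2)
print(precomposed == decomposed)  # False: different code point sequences

# Normalization maps one spelling onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Both strings display as a single character, yet none of the three counts
agree between them until a normalization form is chosen.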
If you want to count the number of *characters*, counting code points
isn't really it, as you should avoid counting the combining ones. If
you want to search for a substring, you need to be sure both strings
use the same normalization first, and if not, normalize them
appropriately so that equivalent code point combinations are always
represented the same.
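
A rough sketch of both points, again in Python as an illustrative assumption.
The function names `count_characters` and `contains` are hypothetical.
Skipping combining marks only approximates grapheme clusters (full
segmentation as described in UAX #29 handles more cases), so treat this as a
simplification rather than a complete solution:

```python
import unicodedata

def count_characters(text):
    """Approximate the user-perceived character count by skipping
    code points that have a non-zero combining class."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

def contains(haystack, needle, form="NFC"):
    """Substring test after bringing both strings to the same
    normalization form."""
    haystack = unicodedata.normalize(form, haystack)
    needle = unicodedata.normalize(form, needle)
    return needle in haystack

word = "noe\u0308l"      # "noël" spelled with a combining diaeresis
query = "no\u00ebl"      # "noël" spelled with the precomposed ë

print(len(word))               # 5 code points
print(count_characters(word))  # 4 characters once the combining mark is skipped
print(query in word)           # False: the raw code point sequences differ
print(contains(word, query))   # True once both sides are normalized
```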
That said, if you are implementing an XML or JSON parser, since those
specs are defined in terms of code points you should probably write your
code in terms of code points (hopefully without decoding code points
when you don't need to). On the other hand, if you're writing something
that processes text (like counting the average number of *characters*
per word in a document), then you should be aware of combining
characters.
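
For the parser case, the "without decoding" part can be sketched as follows.
This is a hypothetical helper (`find_string_end` is not from any real
library), written in Python for illustration, and it assumes the input is
valid UTF-8. It relies on a property of UTF-8: every byte of a multi-byte
sequence is 0x80 or above, so an ASCII quote or backslash byte can never be
part of another character.

```python
def find_string_end(data: bytes, start: int) -> int:
    """Return the index of the closing quote of a JSON string.

    data[start] must be the opening quote. Only code units (bytes) are
    examined; no code point is ever decoded.
    """
    if data[start] != 0x22:          # 0x22 is '"'
        raise ValueError("expected an opening quote")
    i = start + 1
    while i < len(data):
        b = data[i]
        if b == 0x5C:                # backslash: skip the escaped byte too
            i += 2
        elif b == 0x22:              # unescaped quote ends the string
            return i
        else:
            i += 1                   # any other byte, ASCII or not
    raise ValueError("unterminated string")

doc = '"caf\u00e9 \\"au lait\\"" tail'.encode("utf-8")
end = find_string_end(doc, 0)
# Decode once, and only the slice that is actually needed.
print(doc[1:end].decode("utf-8"))
```

The scan touches each byte once and leaves decoding to the caller, who can
decode the slice only if its contents are actually needed.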
How to pack all this into an easy-to-use package is most challenging.
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/