string is rarely useful as a function argument

Sat Dec 31 12:48:14 PST 2011

On 2011-12-31 16:47:40 +0000, Sean Kelly <sean at invisibleduck.org> said:

> I don't know that Unicode expertise is really required here anyway.  All one
> has to know is that UTF8 is a multibyte encoding and built-in string
> attributes talk in bytes. Knowing when one wants bytes vs characters isn't
> rocket science.

It's not bytes vs. characters, it's code units vs. code points vs.
user-perceived characters (grapheme clusters). One character can span
multiple code points, and can be represented in various ways depending 
on which Unicode normalization you pick. But most people don't know 
that.
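
To make the three levels concrete, here is a minimal sketch. It is in Python
rather than D, purely for illustration, and the variable names are my own. It
measures the same user-perceived character, "é", in two equivalent spellings:

```python
import unicodedata

# "é" written two ways: one precomposed code point, or "e" plus a combining accent.
composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

for label, s in (("composed", composed), ("decomposed", decomposed)):
    code_units = len(s.encode("utf-8"))   # number of UTF-8 code units (bytes)
    code_points = len(s)                  # Python's len() counts code points
    print(label, code_units, code_points)

# Both spellings are one user-perceived character, yet they compare unequal
# until they are brought to the same normalization form.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The composed form is 2 code units and 1 code point; the decomposed form is 3
code units and 2 code points. Neither count tells you there is one character.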

If you want to count the number of *characters*, counting code points 
isn't really it, as you should avoid counting the combining ones. If 
you want to search for a substring, you need to be sure both strings 
use the same normalization first, and if not, normalize them 
appropriately so that equivalent code point combinations are always 
represented the same.
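
A rough sketch of both points, again in Python for illustration only. The
helper names `count_non_combining` and `contains_normalized` are hypothetical,
and skipping combining code points only approximates grapheme clusters; proper
segmentation follows Unicode's text segmentation rules and covers more cases.

```python
import unicodedata

def count_non_combining(text):
    """Count code points whose canonical combining class is 0.

    This only approximates user-perceived characters; full grapheme
    cluster segmentation handles more cases than combining marks.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

def contains_normalized(haystack, needle, form="NFC"):
    """Substring test after bringing both strings to the same normalization form."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)

word_nfd = "cafe\u0301"   # "café" spelled with a combining acute accent
word_nfc = "caf\u00e9"    # "café" spelled with a precomposed "é"

# Five code points, but four once the combining accent is skipped.
print(len(word_nfd), count_non_combining(word_nfd))

# A naive search misses the match; the normalized search finds it.
print(word_nfc in word_nfd, contains_normalized(word_nfd, word_nfc))
```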

That said, if you are implementing an XML or JSON parser, since those 
specs are defined in terms of code points you should probably write your 
code in terms of code points (hopefully without decoding code points 
when you don't need to). On the other hand, if you're writing something 
that processes text (like counting the average number of *characters* 
per word in a document), then you should be aware of combining 
characters.
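
On the parser side, "without decoding when you don't need to" can look like
the sketch below. It relies on a property of UTF-8: bytes below 0x80 never
occur inside a multibyte sequence, so an ASCII delimiter can be matched on the
raw code units. The function `find_closing_quote` is a hypothetical helper
written for this example, not part of any real JSON library, and it handles
only the simple backslash-escape case.

```python
def find_closing_quote(data: bytes, start: int) -> int:
    """Return the index of the first unescaped '"' byte at or after start, or -1.

    Operates on raw UTF-8 bytes: bytes below 0x80 never appear inside a
    multibyte sequence, so no code point needs to be decoded here.
    """
    i = start
    while i < len(data):
        b = data[i]
        if b == 0x5C:      # backslash: skip it and the byte it escapes
            i += 2
            continue
        if b == 0x22:      # double quote: end of the string literal
            return i
        i += 1
    return -1

# A JSON-style string literal containing a non-ASCII character and escaped quotes.
raw = '"caf\u00e9 \\"ok\\"" rest'.encode("utf-8")
end = find_closing_quote(raw, 1)

# Decode only the slice that is actually needed, and only once.
print(raw[1:end].decode("utf-8"))
```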

How to pack all this into an easy-to-use package is the most challenging part.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/