string is rarely useful as a function argument

Sat Dec 31 17:23:15 PST 2011

Sorry, I was simplifying. The distinction I was trying to make was between generic operations (in my experience the majority) vs. encoding-aware ones. 

Sent from my iPhone

On Dec 31, 2011, at 12:48 PM, Michel Fortin <michel.fortin at michelf.com> wrote:

> On 2011-12-31 16:47:40 +0000, Sean Kelly <sean at invisibleduck.org> said:
> 
>> I don't know that Unicode expertise is really required here anyway.  All one
>> has to know is that UTF8 is a multibyte encoding and built-in string
>> attributes talk in bytes. Knowing when one wants bytes vs characters isn't
>> rocket science.
> 
> It's not bytes vs. characters, it's code units vs. code points vs. user-perceived characters (grapheme clusters). One character can span multiple code points, and can be represented in various ways depending on which Unicode normalization you pick. But most people don't know that.
> 
> If you want to count the number of *characters*, counting code points isn't really it, as you should avoid counting the combining ones. If you want to search for a substring, you need to be sure both strings use the same normalization first, and if not normalize them appropriately so that equivalent code point combinations are always represented the same.
> 
> That said, if you are implementing an XML or JSON parser, since those specs are defined in terms of code points you should probably write your code in terms of code points (hopefully without decoding code points when you don't need to). On the other hand, if you're writing something that processes text (like counting the average number of *characters* per word in a document), then you should be aware of combining characters.
> 
> How to pack all this into an easy to use package is most challenging.
> 
> 
> -- 
> Michel Fortin
> michel.fortin at michelf.com
> http://michelf.com/
>
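
To make the code unit / code point / combining character distinction in the quoted message concrete, here is a minimal sketch. It is written in Python rather than D purely for illustration, and the function names are hypothetical, not taken from any library discussed in this thread. Skipping combining marks is only a rough stand-in for real grapheme cluster segmentation, which the standard library does not provide.

```python
import unicodedata


def count_code_units(text: str) -> int:
    """Number of UTF-8 code units (bytes) needed to encode the text."""
    return len(text.encode("utf-8"))


def count_code_points(text: str) -> int:
    """Number of Unicode code points (Python's str is indexed by code point)."""
    return len(text)


def count_non_combining(text: str) -> int:
    """Approximate count of user-perceived characters.

    Assumption: skipping code points with a nonzero canonical combining
    class is good enough for this example. It is NOT full grapheme
    cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def contains_normalized(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Substring search after bringing both strings to the same normalization form."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)


# The same word in two canonically equivalent spellings.
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # precomposed U+00E9

for label, word in (("decomposed", decomposed), ("composed", composed)):
    print(label,
          count_code_units(word),     # UTF-8 code units
          count_code_points(word),    # code points
          count_non_combining(word))  # code points minus combining marks

print(composed in decomposed)                     # naive search, no normalization
print(contains_normalized(decomposed, composed))  # search after normalizing both
```

The two spellings differ in both code unit and code point counts but agree once combining marks are skipped, and the naive substring test misses a match that the normalized one finds. Which of the three counts is the right one depends on the task, which is the point made in the quoted message.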