Unicode's proper level of abstraction? [was: Re: VLERange:...]

Thu Jan 13 05:53:40 PST 2011

On 2011-01-13 07:10:09 -0500, Jonathan M Davis <jmdavisProg at gmx.com> said:

> However, regardless of what the best way to handle unicode is in 
> general, I think that it's painfully clear that your average programmer 
> doesn't know much about unicode. Even understanding the nuances between 
> char, wchar, and dchar is more than your average programmer seems to 
> understand at first. The idea that a char wouldn't be guaranteed to be 
> an actual character is not something that many programmers take to
> immediately. It's quite foreign to how chars are typically dealt with
> in other languages, and many programmers never worry about unicode at
> all, only dealing with ASCII. So, not only is unicode a rather 
> disgusting problem, but it's not one that your average programmer 
> begins to grasp as far as I've seen. Unless the issue is abstracted 
> away completely, it takes a fair bit of explaining to understand how to 
> deal with unicode properly.

What's nice about Cocoa's way of handling strings is that even 
programmers who don't bother with it get things right most of the time. 
Strings are compared in their canonical form (graphemes), unless you 
request a literal comparison; and they are sorted and compared 
case-insensitively according to the user's locale, unless you specify 
your own locale settings. Its only major pitfall is that indexing is 
done on UTF-16 code units.
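
To make the distinction concrete, here is a minimal sketch in Python (not 
Cocoa) of literal versus canonical comparison, and of counting UTF-16 code 
units. The helper names canonically_equal and utf16_length are made up for 
illustration, and NFC normalization stands in for whatever Cocoa does 
internally, which this sketch does not try to reproduce.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (canonical equivalence).

    Hypothetical helper: it only illustrates the idea of a non-literal
    comparison and is not a model of any Cocoa API.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def utf16_length(s: str) -> int:
    """Return the number of UTF-16 code units needed to encode s."""
    return len(s.encode("utf-16-le")) // 2


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A literal comparison sees two different code point sequences.
literal_equal = precomposed == decomposed

# A canonical comparison treats them as the same text.
canonical_equal = canonically_equal(precomposed, decomposed)

# A character outside the BMP occupies two UTF-16 code units, so an index
# expressed in code units can land in the middle of it.
clef_units = utf16_length("\U0001D11E")  # MUSICAL SYMBOL G CLEF
```

Here the literal comparison fails while the canonical one succeeds, and the 
single clef character counts as two code units, which is the indexing 
pitfall mentioned above.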

The cost of this correctness is a small performance penalty, but I 
think it's the right path to take. When performance or access to 
code points is important, the programmer should still be able to go 
down one layer and play with code points directly.

That said, we need to make sure the performance drop is minimal. I 
somewhat doubt that spir's approach of storing strings as an array 
of piles of characters is the right approach for most usage scenarios, 
but this area would need a little more research. spir's approach is 
certainly the ultimate step in correctness as it allows O(1) indexing 
of graphemes, but personally I'd rather not have indexing and just do 
on-the-fly decoding at the grapheme level when performing various 
string operations.
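
The two options can be contrasted with a rough Python sketch. It is 
deliberately simplified: it groups a base code point with the combining 
marks that follow it, using only the canonical combining class, and does 
not implement the full grapheme cluster rules of Unicode Standard Annex 
#29. The names approx_graphemes and grapheme_count are hypothetical.

```python
import unicodedata
from typing import Iterator


def approx_graphemes(s: str) -> Iterator[str]:
    """Lazily yield approximate graphemes from s, decoding on the fly.

    Simplifying assumption: a grapheme is one code point with combining
    class 0 plus any following code points with a nonzero combining class.
    Real segmentation (UAX #29) has more rules than this.
    """
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            # A new base character starts, so the previous cluster is done.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


def grapheme_count(s: str) -> int:
    """Count approximate graphemes without materializing them."""
    return sum(1 for _ in approx_graphemes(s))


text = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"

# On-the-fly decoding: nothing is stored, each operation walks the string.
count = grapheme_count(text)

# Array of "piles": pay the decoding cost once and keep the result,
# which gives O(1) indexing at the price of extra memory.
piles = list(approx_graphemes(text))
first = piles[0]
```

The lazy version keeps the original compact storage and costs O(n) per 
operation; the list of piles trades memory for constant-time indexing, 
which is the trade-off at issue here.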

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/