Unicode's proper level of abstraction? [was: Re: VLERange:...]

Thu Jan 13 02:16:34 PST 2011

On Thursday 13 January 2011 01:49:31 spir wrote:
> On 01/13/2011 01:45 AM, Michel Fortin wrote:
> > On 2011-01-12 14:57:58 -0500, spir <denis.spir at gmail.com> said:
> >> On 01/12/2011 08:28 PM, Don wrote:
> >>> I think the only problem that we really have is that "char[]" and
> >>> "dchar[]" imply that code points are always the appropriate level of
> >>> abstraction.
> >> 
> >> I'd like to know when it happens that codepoint is the appropriate
> >> level of abstraction.
> > 
> > I agree with you. I don't see much use for code points.
> > 
> > One of these uses is writing a parser for a format defined in terms of
> > code points (XML for instance). But beyond that, I don't see one.
> 
> Actually, I once had a real use case for codepoint being the proper
> level of abstraction: a linguistic app of which one operational func
> counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you
> see what I mean.
> Once the text is properly NFD decomposed, each of those marks is coded
> as a codepoint. (But if it's not decomposed, then most of those marks
> are probably hidden by precomposed codes coding characters like "ä".) So
> even such an app benefits from a higher-level type basically
> operating on normalised (NFD) characters.
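
To make the quoted NFD example concrete, here is a minimal sketch. It is in
Python rather than D only because Python's standard library ships
normalization (unicodedata.normalize); count_marks is a hypothetical helper
name, not anything from Phobos or from spir's app.

```python
import unicodedata
from collections import Counter

def count_marks(text):
    """Count each code point of `text` after NFD decomposition.

    Hypothetical helper for illustration: after NFD, a base letter and
    its combining marks are separate code points, so each is counted
    on its own.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return Counter(decomposed)

precomposed = "\u00e4"            # "ä" as one precomposed code point
counts = count_marks(precomposed)

# Without normalization the 'a' is hidden inside the precomposed code point.
raw_counts = Counter(precomposed)
```

Here counts has one entry for 'a' and one for U+0308 (the combining
diaeresis), while raw_counts has neither: it only sees U+00E4.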

There's also the question of efficiency. On the whole, string operations can be 
very expensive - particularly when you're doing a lot of them. The fact that D's 
arrays are so powerful may reduce the problem in D, but in general, if you're 
doing a lot with strings, it can get costly, performance-wise.

The question then is what it costs to abstract strings to the point that they 
really are ranges of characters rather than of code units or code points. If 
the cost is large enough, then dealing with strings as arrays, as we currently 
do, and having the occasional Unicode issue could very well be worth it. As it 
is, there are plenty of people who don't want to have to care about Unicode in 
the first place, since the programs that they write only deal with ASCII 
characters. The fact that D makes it so easy to deal with Unicode code points 
is a definite improvement, but taking the abstraction to the point that you're 
always dealing with characters rather than code units or code points could be 
too costly.
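
As a rough illustration of the three levels and of where the extra work
comes from, here is a sketch in Python (again only because its standard
library has the needed tables). All three function names are made up for
this example, and approx_characters is a deliberate simplification: it
treats a base code point plus any following combining marks as one
character, which is not full grapheme segmentation.

```python
import unicodedata

def code_units(text):
    # Number of UTF-8 code units (bytes), the level a char[] exposes.
    return len(text.encode("utf-8"))

def code_points(text):
    # Number of code points, the level a dchar[] exposes.
    return len(text)

def approx_characters(text):
    # Simplified assumption: every code point with combining class 0
    # starts a new character. Needs a table lookup per code point.
    return sum(1 for cp in text if unicodedata.combining(cp) == 0)

text = "a\u0308bc"                # "äbc" with a decomposed "ä"

units = code_units(text)          # 5
points = code_points(text)        # 4
chars = approx_characters(text)   # 3
```

Even in this simplified form, the character count needs a full pass over the
string with a property lookup per code point, whereas the other two lengths
fall out of the encoding directly. That per-element work is the kind of cost
in question.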

Now, if it can be done efficiently, then having Unicode dealt with properly 
without the programmer having to worry about it would be a big boon. As it is, 
D's handling of Unicode is already a real win, even if it doesn't deal with 
graphemes and the like.

So, I think that we definitely should have an abstraction for Unicode which uses 
characters as the elements in the range and doesn't have to care about the 
underlying encoding of the characters (except perhaps picking whether char, 
wchar, or dchar is used internally, and therefore how much space it requires). 
However, I'm not at all convinced that such an abstraction can be done efficiently 
enough to make it the default way of handling strings.

- Jonathan M Davis