Unicode's proper level of abstraction? [was: Re: VLERange:...]

Thu Jan 13 04:10:09 PST 2011

On Thursday 13 January 2011 03:48:46 spir wrote:
> On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
> > On Thursday 13 January 2011 01:49:31 spir wrote:
> >> On 01/13/2011 01:45 AM, Michel Fortin wrote:
> >>> On 2011-01-12 14:57:58 -0500, spir<denis.spir at gmail.com>  said:
> >>>> On 01/12/2011 08:28 PM, Don wrote:
> >>>>> I think the only problem that we really have, is that "char[]",
> >>>>> "dchar[]" implies that code points is always the appropriate level of
> >>>>> abstraction.
> >>>> 
> >>>> I'd like to know when it happens that codepoint is the appropriate
> >>>> level of abstraction.
> >>> 
> >>> I agree with you. I don't see many uses for code points.
> >>> 
> >>> One of these uses is writing a parser for a format defined in terms of
> >>> code points (XML for instance). But beyond that, I don't see one.
> >> 
> >> Actually, I once had a real use case for codepoint being the proper
> >> level of abstraction: a linguistic app of which one operational func
> >> counts occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you
> >> see what I mean.
> >> Once the text is properly NFD decomposed, each of those marks is coded
> >> as a codepoint. (But if it's not decomposed, then most of those marks
> >> are probably hidden by precomposed codes coding characters like "ä".) So
> >> that even such an app benefits from a higher-level type basically
> >> operating on normalised (NFD) characters.
> > 
> > There's also the question of efficiency. On the whole, string operations
> > can be very expensive - particularly when you're doing a lot of them.
> > The fact that D's arrays are so powerful may reduce the problem in D,
> > but in general, if you're doing a lot with strings, it can get costly,
> > performance-wise.
> 
> D's arrays (even dchar[] & dstring) do not allow having correct results
> when dealing with UCS/Unicode text in the general case. See Michel's
> example (and several ones I posted on this list, and the text at
> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction
> for a very lengthy explanation).
> You and some other people seem to still mistake Unicode's low level
> issue of codepoint vs code unit, with the higher-level issue of codes
> _not_ representing characters in the common sense ("graphemes").
> 
> The above pointed text was written precisely to introduce to this issue
> because obviously no-one wants to face it... (Eg each time I evoke it on
> this list it is ignored, except by Michel, but the same is true
> everywhere else, including on the Unicode mailing list!). The core of
> the problem is the misleading term "abstract character" which
> deceivingly lets programmers believe that a codepoint codes a
> character, like in historic character sets -- which is *wrong*. No
> Unicode document AFAIK explains this. This is a case of unsaid lie.
> Compared to legacy charsets, dealing with Unicode actually requires *2*
> levels of abstraction... (one to decode codepoints from code units, one
> to construct characters from codepoints)
> 
> Note that D's stdlib currently provides no means to do this, not even on
> the fly. You'd have to interface with eg ICU (a C/C++/Java Unicode
> library) (good luck ;-). But even ICU, as well as supposedly unicode-aware
> types or libraries for any language, would not give you an abstraction
> producing correct results for Michel's example. For instance, Python3
> code fails as miserably as any other. AFAIK, D is the first and only
> language having such a tool (Text.d at
> https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).
> 
> > The question then is what is the cost of actually having strings
> > abstracted to the point that they really are ranges of characters rather
> > than code units or code points or whatever? If the cost is large enough,
> > then dealing with strings as arrays as they currently are and having the
> > occasional unicode issue could very well be worth it. As it is, there
> > are plenty of people who don't want to have to care about unicode in the
> > first place, since the programs that they write only deal with ASCII
> > characters. The fact that D makes it so easy to deal with unicode code
> > points is a definite improvement, but taking the abstraction to the
> > point that you're definitely dealing with characters rather than code
> > units or code points could be too costly.
> 
> When _manipulating_ text (indexing, search, changing), you have the
> choice between:
> * On the fly abstraction (composing characters on the fly, and/or
> normalising them), for each operation for each piece of text (including
> parameters, including literals).
> * Use of a type that constructs this abstraction once only for each
> piece of text.
> Note that a single count operation is forced to construct this
> abstraction on the fly for the whole text... (and for the searched
> snippet). Also note that optimisation is probably easier in the second
> case, for the abstraction operation is then standard.
> 
> > Now, if it can be done efficiently, then having unicode dealt with
> > properly without the programmer having to worry about it would be a big
> > boon. As it is, D's handling of unicode is a big boon, even if it
> > doesn't deal with graphemes and the like.
> 
> It has a cost at initial Text construction time. Currently, on my very
> slow computer, 1MB source text requires ~ 500 ms (decoding +
> decomposition + ordering + "piling" codes into characters). Decoding
> only using D's builtin std.utf.decode takes about 100 ms.
> The bottleneck is piling: 70% of the time on average, on a test case
> mixing texts from a dozen natural languages. We would be very glad to
> get the community's help in optimising this phase :-)
> (We have progressed very much already in terms of speed, but now reach
> limits of our competences.)
> 
> > So, I think that we definitely should have an abstraction for unicode
> > which uses characters as the elements in the range and doesn't have to
> > care about the underlying encoding of the characters (except perhaps
> > picking whether char, wchar, or dchar is used internally, and therefore
> > how much space it requires). However, I'm not at all convinced that such
> > an abstraction can be done efficiently enough to make it the default way
> > of handling strings.
> 
> If you only have ASCII, or if you don't manipulate text at all, then as
> said in a previous post any string representation works fine (whatever
> the encoding it possibly uses under the hood).
> D's builtin char/dchar/wchar and string/dstring/wstring are very nice
> and well done, but they are not necessary in such a use case. Actually,
> as shown by Steven's repeated complaints, they rather get in the way when
> dealing with non-unicode source data (IIUC, by assuming string elements
> are utf codes).
> 
> And they do not even try to solve the real issues one necessarily meets
> when manipulating unicode texts, which are due to UCS's coding format.
> Thus my previous statement: the level of codepoints is nearly never the
> proper level of abstraction.

I wasn't saying that code points are guaranteed to be characters. I was saying 
that in most cases they are, so if efficiency is an issue, then having properly 
abstract characters could be too costly. However, having a range type which 
properly abstracts characters and deals with whatever graphemes and 
normalization and whatnot that it has to would be a very good thing to have. The 
real question is whether it can be made efficient enough to even consider using it 
normally instead of just when you know that you're really going to need it.
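
To make the code point vs. character distinction concrete, here is a rough
sketch in Python (not D, and not spir's Text type). The pile() helper is a
hypothetical name of mine, and it is a deliberate simplification: it
normalizes to NFD and attaches every code point with a non-zero canonical
combining class to the preceding base. Real grapheme segmentation has more
rules than that, so treat it as an illustration only.

```python
import unicodedata

def pile(text):
    """Group NFD code points into base + combining-mark clusters.

    Simplified: only code points with a non-zero canonical combining
    class are attached to the preceding base code point.
    """
    clusters = []
    for cp in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(cp) != 0:
            clusters[-1] += cp      # attach the mark to its base
        else:
            clusters.append(cp)     # start a new cluster
    return clusters

precomposed = "\u00e4"    # 'ä' as a single precomposed code point
decomposed = "a\u0308"    # 'a' followed by COMBINING DIAERESIS

# At the code point level the two spellings differ in content and length.
print(precomposed == decomposed)              # False
print(len(precomposed), len(decomposed))      # 1 2

# After normalizing and piling, both are the same single "character".
print(pile(precomposed) == pile(decomposed))  # True
print(len(pile("na\u0308ive")))               # 5
```

The extra pass over the text (normalize, then pile) is exactly the cost
being debated here: it has to be paid either once per text or on every
operation.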

The fact that you're seeing such a large drop in performance with your Text type 
definitely would support the idea that it could be just plain too expensive to 
use such a type in the average case. Even something like a 20% drop in 
performance could be devastating if you're dealing with code which does a lot of 
string processing. Regardless though, there will obviously be cases where you'll 
need something like your Text type if you want to process unicode correctly.

However, regardless of what the best way to handle unicode is in general, I 
think that it's painfully clear that your average programmer doesn't know much 
about unicode. Even understanding the nuances between char, wchar, and dchar is 
more than your average programmer seems to understand at first. The idea that a 
char wouldn't be guaranteed to be an actual character is not something that many 
programmers take to immediately. It's quite foreign to how chars are typically 
dealt with in other languages, and many programmers never worry about unicode at 
all, only dealing with ASCII. So, not only is unicode a rather disgusting 
problem, but it's not one that your average programmer begins to grasp as far as 
I've seen. Unless the issue is abstracted away completely, it takes a fair bit 
of explaining before someone understands how to deal with unicode properly.
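
As a small illustration of why a char is not guaranteed to be a character,
the following Python sketch counts how many code units one code point takes
in UTF-8, UTF-16 and UTF-32 (the encodings behind D's char, wchar and
dchar). The code_units() helper is a hypothetical name used only for this
example.

```python
def code_units(text):
    """Return the number of code units `text` needs in each encoding."""
    return {
        "utf8": len(text.encode("utf-8")),            # 1 byte per unit
        "utf16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }

print(code_units("a"))           # {'utf8': 1, 'utf16': 1, 'utf32': 1}
print(code_units("\u00e4"))      # {'utf8': 2, 'utf16': 1, 'utf32': 1}
print(code_units("\U0001D11E"))  # {'utf8': 4, 'utf16': 2, 'utf32': 1}
```

Only in the 32-bit encoding does one code unit always hold a whole code
point, and even then a code point is not necessarily a whole character.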

- Jonathan M Davis