Unicode's proper level of abstraction? [was: Re: VLERange:...]

spir denis.spir at gmail.com
Thu Jan 13 03:48:46 PST 2011


On 01/13/2011 11:16 AM, Jonathan M Davis wrote:
> On Thursday 13 January 2011 01:49:31 spir wrote:
>> On 01/13/2011 01:45 AM, Michel Fortin wrote:
>>> On 2011-01-12 14:57:58 -0500, spir<denis.spir at gmail.com>  said:
>>>> On 01/12/2011 08:28 PM, Don wrote:
>>>>> I think the only problem that we really have, is that "char[]",
>>>>> "dchar[]" implies that code points are always the appropriate level of
>>>>> abstraction.
>>>>
>>>> I'd like to know when it happens that the codepoint is the appropriate
>>>> level of abstraction.
>>>
>>> I agree with you. I don't see many uses for code points.
>>>
>>> One of these uses is writing a parser for a format defined in terms of
>>> code points (XML for instance). But beyond that, I don't see one.
>>
>> Actually, I once had a real use case for the codepoint being the proper
>> level of abstraction: a linguistic app in which one operation counts
>> occurrences of "scripting marks" like 'a' & '¨' in "ä". Hope you see
>> what I mean.
>> Once the text is properly NFD-decomposed, each of those marks is coded
>> as a codepoint. (But if it's not decomposed, then most of those marks
>> are probably hidden by precomposed codes coding characters like "ä".)
>> So even such an app benefits from a higher-level type basically
>> operating on normalised (NFD) characters.
>
> There's also the question of efficiency. On the whole, string operations can be
> very expensive - particularly when you're doing a lot of them. The fact that D's
> arrays are so powerful may reduce the problem in D, but in general, if you're
> doing a lot with strings, it can get costly, performance-wise.

D's arrays (even dchar[] & dstring) do not give correct results 
when dealing with UCS/Unicode text in the general case. See Michel's 
example (and several others I posted on this list, and the text at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction 
for a very lengthy explanation).
You and some other people seem to still confuse Unicode's low-level 
issue of codepoints vs code units with the higher-level issue of codes 
_not_ representing characters in the common sense ("graphemes").
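
To make the failure concrete, here is a minimal sketch. It assumes a 
std.uni extended with normalize!NFD (canonical decomposition), 
byGrapheme (grapheme segmentation) and a combiningClass lookup -- 
Phobos has none of these today, ICU provides the equivalents:

import std.uni : normalize, NFD, byGrapheme, combiningClass;
import std.range : walkLength;
import std.algorithm : count;

void main()
{
    string s = normalize!NFD("ä");        // decomposed: 'a' + U+0308
    assert(s.walkLength == 2);            // codepoint view: 2 elements
    assert(s.byGrapheme.walkLength == 1); // character view: 1 element
    // Indexing, slicing, search and count on char[]/dchar[] all work
    // on the 2-element view, hence the wrong results.
    // (The mark-counting use case quoted above is the rare one where
    // the codepoint view is the right one:)
    assert(s.count!(c => combiningClass(c) != 0) == 1);
}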

The text pointed to above was written precisely to introduce this issue, 
because obviously no-one wants to face it... (E.g. each time I raise it 
on this list it is ignored, except by Michel, but the same is true 
everywhere else, including on the Unicode mailing list!). The core of 
the problem is the misleading term "abstract character", which 
deceptively lets programmers believe that a codepoint codes a 
character, as in historic character sets -- which is *wrong*. No 
Unicode document AFAIK explains this. It is a lie by omission.
Compared to legacy charsets, dealing with Unicode actually requires *2* 
levels of abstraction: one to decode codepoints from code units, one 
to construct characters from codepoints.
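
In code, the two levels look like this. Level 1 is D's existing 
std.utf.decode; level 2 again needs the assumed combiningClass data:

import std.utf : decode;          // level 1: this one Phobos has
import std.uni : combiningClass;  // level 2 lookup: assumed, as above

void main()
{
    string s = "a\u0308";          // NFD "ä": base letter + mark
    size_t i = 0;

    // Level 1: code units -> codepoints.
    dchar base = decode(s, i);     // 'a'
    dchar mark = decode(s, i);     // U+0308 (combining diaeresis)

    // Level 2: codepoints -> characters. The mark has a non-zero
    // combining class, so it piles onto the base letter: one
    // character made of two codepoints.
    assert(base == 'a' && combiningClass(mark) != 0);
}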

Note that D's stdlib currently provides no means to do this, not even on 
the fly. You'd have to interface with e.g. ICU (a C/C++/Java Unicode 
library) (good luck ;-). But not even ICU, nor the supposedly 
unicode-aware types or libraries of any other language, would give you 
an abstraction producing correct results for Michel's example. For 
instance, Python3 code fails as miserably as any other. AFAIK, D is the 
first and only language having such a tool (Text.d at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3).

> The question then is what is the cost of actually having strings abstracted to
> the point that they really are ranges of characters rather than code units or
> code points or whatever? If the cost is large enough, then dealing with strings
> as arrays as they currently are and having the occasional unicode issue could
> very well be worth it. As it is, there are plenty of people who don't want to
> have to care about unicode in the first place, since the programs that they write
> only deal with ASCII characters. The fact that D makes it so easy to deal with
> unicode code points is a definite improvement, but taking the abstraction to the
> point that you're definitely dealing with characters rather than code units or
> code points could be too costly.

When _manipulating_ text (indexing, searching, changing), you have the 
choice between:
* On-the-fly abstraction (composing characters on the fly, and/or 
normalising them), for each operation on each piece of text (including 
parameters, including literals).
* Use of a type that constructs this abstraction once only for each 
piece of text (see the sketch below).
Note that a single count operation is forced to construct this 
abstraction on the fly for the whole text... (and for the searched 
snippet). Also note that optimisation is probably easier in the second 
case, since the abstraction operation is then standard.
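
To illustrate the second option, here is a toy sketch in the spirit of 
Text.d -- not its actual code, and it again assumes the normalize!NFD / 
byGrapheme helpers from above:

import std.uni : normalize, NFD, byGrapheme, Grapheme;
import std.algorithm : count, equal;
import std.array : array;

struct Text
{
    private Grapheme[] piles;   // one element per true character

    this(string source)
    {
        // Pay for the abstraction once, at construction.
        piles = normalize!NFD(source).byGrapheme.array;
    }

    @property size_t length() { return piles.length; }

    Grapheme opIndex(size_t i) { return piles[i]; }

    // Note how the parameter needs the same treatment as the text
    // itself (parameters, literals...).
    size_t countChar(string character)
    {
        auto target = normalize!NFD(character).byGrapheme.array;
        assert(target.length == 1);   // a single character expected
        return piles.count!(p => p[].equal(target[0][]));
    }
}

Once constructed, length, indexing, search and count are plain array 
operations on whole characters.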

> Now, if it can be done efficiently, then having unicode dealt with properly
> without the programmer having to worry about it would be a big boon. As it is,
> D's handling of unicode is a big boon, even if it doesn't deal with graphemes
> and the like.

It has a cost at initial Text construction time. Currently, on my very 
slow computer, 1 MB of source text requires ~500 ms (decoding + 
decomposition + ordering + "piling" codes into characters). Decoding 
alone, using D's builtin std.utf.decode, takes about 100 ms.
The bottleneck is piling: 70% of the time on average, in a test case 
mixing texts from a dozen natural languages. We would be very glad to 
get the community's help in optimising this phase :-)
(We have already progressed very much in terms of speed, but now reach 
the limits of our competence.)
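
For the curious, the piling phase in isolation amounts to something 
like this -- a strong simplification of what Text.d does, with input 
assumed already decomposed and canonically ordered, and the same 
assumed combiningClass lookup as above:

import std.uni : combiningClass;   // assumed UCD data, as above

// Each starter (combining class 0) opens a new pile; each following
// mark joins the current pile. This loop is the 70% hot spot.
dchar[][] pile(const(dchar)[] nfdText)
{
    dchar[][] piles;
    foreach (c; nfdText)
    {
        if (piles.length == 0 || combiningClass(c) == 0)
            piles ~= [c];          // new character
        else
            piles[$ - 1] ~= c;     // mark extends the current character
    }
    return piles;
}

Presumably the per-pile allocations dominate; batching runs of starters 
(by far the common case) into pre-sized piles is the kind of 
optimisation we are after.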

> So, I think that we definitely should have an abstraction for unicode which uses
> characters as the elements in the range and doesn't have to care about the
> underlying encoding of the characters (except perhaps picking whether char,
> wchar, or dchar is used internally, and therefore how much space it requires).
> However, I'm not at all convinced that such an abstraction can be done efficiently
> enough to make it the default way of handling strings.

If you only have ASCII, or if you don't manipulate text at all, then as 
said in a previous post any string representation works fine (whatever 
encoding it possibly uses under the hood).
D's builtin char/wchar/dchar and string/wstring/dstring are very nice 
and well done, but they are not necessary in such a use case. Actually, 
as shown by Steven's repeated complaints, they rather get in the way 
when dealing with non-unicode source data (IIUC, by assuming string 
elements are UTF codes).

And they do not even try to solve the real issues one necessarily meets 
when manipulating unicode texts, which are due to UCS's coding format. 
Thus my previous statement: the level of codepoints is nearly never the 
proper level of abstraction.

> - Jonathan M Davis

Denis
_________________
vita es estrany
spir.wikidot.com


