Unicode's proper level of abstraction? [was: Re: VLERange:...]

spir denis.spir at gmail.com
Thu Jan 13 05:17:13 PST 2011


On 01/13/2011 01:10 PM, Jonathan M Davis wrote:
> I wasn't saying that code points are guaranteed to be characters. I was saying
> that in most cases they are, so if efficiency is an issue, then having properly
> abstract characters could be too costly.

The problem is then: how does a library or application programmer know, 
for sure, that all true characters (graphemes) in all source texts their 
software will ever deal with are coded with a single codepoint?
If you cope with ASCII only, now and forever, then you know it.
If you do not manipulate text at all, then the question vanishes.

Otherwise, you cannot know, I guess. The problem is partially masked 
because most of us currently process only western-language sources, for 
whose scripts there exist precomposed codes for every _predefined_ 
character, and text-producing software (like editors) usually uses 
precomposed codes when available. Hope I'm clear.
(I hope this use of precomposed codes will change, because the gain in 
space for western languages is negligible while the cost in processing 
is significant.)
In the future, all of this may change, so that the issue would more 
often be obvious to programmers dealing with international text. 
Note that even now nothing prevents a user (including a programmer in 
source code!), let alone text-producing software, from using decomposed 
coding (the right choice, imo). And there are true characters, and you 
can "invent" as many fancy characters as you like, for which no 
precomposed code is defined. All of this is valid Unicode and must be 
properly dealt with.
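To make this concrete, here is a small sketch. It is in Python, since its 
standard `unicodedata` module makes the point easy to show; the thread's 
topic is D, but the Unicode facts are language-independent. The same true 
character may exist in precomposed and decomposed codings, and some 
perfectly valid characters have no precomposed code at all:

```python
import unicodedata

precomposed = "\u00e9"   # "é" as one codepoint (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # "é" as "e" + COMBINING ACUTE ACCENT -- equally valid Unicode

# Same true character, different code sequences:
print(precomposed == decomposed)                                # False: raw codes differ
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True after normalisation

# An "invented" character with no precomposed code: "x" with a macron.
x_macron = "x\u0304"
# NFC cannot shrink it to one codepoint, because none is defined:
print(len(unicodedata.normalize("NFC", x_macron)))              # 2
```

So code that equates "one codepoint" with "one character" silently 
breaks on the decomposed spelling, and can never work for x-macron.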

> However, having a range type which
> properly abstracts characters and deals with whatever graphemes and
> normalization and whatnot that it has to would be a very good thing
> to have. The real question is whether it can be made efficient enough to
> even consider using it normally instead of just when you know that
> you're really going to need it.

Regarding ranges: we initially planned to expose a range interface in 
our type for iteration, instead of opApply, for better integration with 
the coming D2 style and its algorithms. But we had to drop it due to a 
few range bugs exposed in a previous thread (search for "range 
usability", IIRC).

> The fact that you're seeing such a large drop in performance with your Text type
> definitely would support the idea that it could be just plain too expensive to
> use such a type in the average case. Even something like a 20% drop in
> performance could be devastating if you're dealing with code which does a lot of
> string processing. Regardless though, there will obviously be cases where you'll
> need something like your Text type if you want to process unicode correctly.

The question of efficiency is not as you present it. If you cannot 
guarantee that every character is coded by a single codepoint (in all 
pieces of text, including parameters and literals), then you *must* 
construct an abstraction at the level of true characters -- and probably 
normalise them as well.
You have the choice of doing this on the fly for _every_ operation, or 
of using a tool like the Text type. In the latter case, not only is 
everything far simpler for client code, but the abstraction is 
constructed only once (and forever ;-).

In the former case, the cost is the same (or rather higher, because 
optimisation can probably be more effective for a single standard case 
than for varied operation cases), but _multiplied_ by the number of 
operations you need to perform on each piece of text. Thus, for a given 
operation, you get the slowest possible run: for instance, indexing is 
O(k*n), where k is the cost of "piling" a single character and n is the 
character count.

In the latter case, the efficiency issue arises only once, initially, 
for each piece of text. Then every operation is as fast as possible: 
indexing is indeed O(1).
But: this O(1) is slightly slower than with historic charsets, because 
characters are now represented by mini arrays of codes instead of 
single codes. The same point applies, even more, to every operation 
involving comparisons (search, count, replace). We cannot solve this: 
it is due to UCS's coding scheme.
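A minimal sketch of the "pile once, then operate" idea (Python again; 
`pile` is my name for it here, and the boundary rule below is a 
simplification of full UAX #29 grapheme segmentation -- a real Text 
type needs more):

```python
import unicodedata

def pile(text):
    """Group codepoints into approximate grapheme clusters: a new cluster
    starts at each non-combining codepoint.  (A simplification of full
    UAX #29 segmentation, enough to show the cost structure.)"""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # combining mark: pile onto the current character
        else:
            clusters.append(ch)  # base codepoint: start a new character
    return clusters

# Pay the O(n) piling cost once...
t = pile("e\u0301tude")          # "étude" with a decomposed "é"
# ...then indexing is a plain O(1) list access, on true characters:
print(t[0])                      # the whole two-codepoint "é"
print(len(t))                    # 5 characters, though the string holds 6 codepoints
```

Doing the same on the fly would repeat the piling walk inside every 
index, search, or compare -- the O(k*n) per operation described above.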

> However, regardless of what the best way to handle unicode is in general, I
> think that it's painfully clear that your average programmer doesn't know much
> about unicode.

True. Even those who think they are informed. Unicode's docs not only 
ignore the problem, but contribute to creating it by using the deceptive 
term "abstract character" (and, often worse, "character" alone) to 
denote what a codepoint codes. Every article I have ever read _about_ 
Unicode by a third party simply follows suit. Raising this issue on the 
Unicode mailing list usually results in plain silence.

> Even understanding the nuances between char, wchar, and dchar is
> more than your average programmer seems to understand at first. The idea that a
> char wouldn't be guaranteed to be an actual character is not something that many
> programmers take to immediately. It's quite foreign to how chars are typically
> dealt with in other languages, and many programmers never worry about unicode at
> all, only dealing with ASCII.

(average programmer? ;-)
It is not so much foreign to "how chars are typically dealt with in 
other languages" as to how characters were coded in historic charsets. 
Other languages ignore the issue, and thus run incorrectly on universal 
text, the same way D's built-in tools do.
About ASCII, note that the only kind of source it is able to encode is 
plain English text, without any bit of fancy thingy in it. A single 
non-breaking space, "≥", "×" (MULTIPLICATION SIGN, U+00D7), a letter 
borrowed from a foreign language as in "à la", likewise "αβγ", not to 
mention "©" & "®": any of these puts a text beyond ASCII's reach.
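For illustration (Python; any language would show the same): ASCII 
happily round-trips plain English and rejects everything else outright.

```python
# Plain English survives ASCII encoding unchanged...
assert "plain English text".encode("ascii") == b"plain English text"

# ...but a single imported letter is already out of reach:
try:
    "\u00e0 la carte".encode("ascii")            # "à la carte"
except UnicodeEncodeError as e:
    print("not encodable:", e.object[e.start])   # the offending "à"
```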

> So, not only is unicode a rather disgusting
> problem, but it's not one that your average programmer begins to grasp as far as
> I've seen. Unless the issue is abstracted away completely, it takes a fair bit
> of explaining to understand how to deal with unicode properly.

Please have a look at 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3, read 
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction, 
and try https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/Text.d
Any feedback is welcome (esp. on reformulating the text concisely ;-)

> - Jonathan M Davis

Denis
_________________
vita es estrany
spir.wikidot.com


