Unicode's proper level of abstraction? [was: Re: VLERange:...]
spir
denis.spir at gmail.com
Thu Jan 13 05:17:13 PST 2011
On 01/13/2011 01:10 PM, Jonathan M Davis wrote:
> I wasn't saying that code points are guaranteed to be characters. I was saying
> that in most cases they are, so if efficiency is an issue, then having properly
> abstract characters could be too costly.
The problem is then: how does a library or application programmer know,
for sure, that every true character (grapheme) in every source text
their software will ever deal with is coded with a single codepoint?
If you cope with ASCII only now & forever, then you know that.
If you do not manipulate text at all, then the question vanishes.
Otherwise, you cannot know, I guess. The problem is partially masked
because most of us currently process only western-language sources,
whose scripts have precomposed codes for every _predefined_ character,
and because text-producing software (like editors) usually uses
precomposed codes when available. Hope I'm clear.
(I hope this use of precomposed codes will change, because the gain in
space for western languages is negligible while the cost in processing
is significant.)
In the future, all of this may change, so that the issue would become
obvious to many more programmers dealing with international text.
Note that even now nothing prevents a user (including a programmer in
source code!), let alone text-producing software, from using decomposed
coding (the right choice imo). And there are true characters, for which
no precomposed code is defined at all, and you can "invent" as many
fancy characters as you like. All of this is valid Unicode and must be
properly dealt with.
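For instance, here is a minimal sketch in D of the two codings of a
single true character. It assumes normalisation and grapheme helpers
(normalize, byGrapheme) which Phobos does not provide today; building
that machinery by hand is precisely what Text is about:

    import std.uni : normalize, byGrapheme, NFC, NFD;
    import std.range : walkLength;

    void main()
    {
        // The same true character "é", coded in two valid ways:
        string precomposed = "\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        string decomposed  = "e\u0301";  // U+0065 + U+0301 COMBINING ACUTE ACCENT

        // Codepoint-level comparison sees two different strings...
        assert(precomposed != decomposed);

        // ...although both denote the same character, as normalisation shows.
        assert(normalize!NFC(decomposed) == precomposed);
        assert(normalize!NFD(precomposed) == decomposed);

        // A character with no precomposed code at all: "n" + COMBINING TILDE BELOW.
        string fancy = "n\u0330";
        assert(fancy.walkLength == 2);             // two codepoints...
        assert(fancy.byGrapheme.walkLength == 1);  // ...but one true character
    }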
> However, having a range type which
> properly abstracts characters and deals with whatever graphemes and
> normalization and whatnot that it has to would be a very good thing
> to have. The real question is whether it can be made efficient enough to
> even consider using it normally instead of just when you know that
> you're really going to need it.
Regarding ranges: we initially planned to expose a range interface in
our type for iteration, instead of opApply, for better integration with
the coming D2 style and with algorithms. But we had to drop it due to a
few range bugs exposed in a previous thread (search for "range
usability" IIRC).
> The fact that you're seeing such a large drop in performance with your Text type
> definitely would support the idea that it could be just plain too expensive to
> use such a type in the average case. Even something like a 20% drop in
> performance could be devastating if you're dealing with code which does a lot of
> string processing. Regardless though, there will obviously be cases where you'll
> need something like your Text type if you want to process unicode correctly.
The question of efficiency is not as you present it. If you cannot
guarantee that every character is coded by a single code (in all pieces
of text, including parameters and literals), then you *must* construct
an abstraction at the level of true characters, and probably even
normalise them.
You have the choice of doing it on the fly for _every_ operation, or of
using a tool like the Text type. In the latter case, not only is
everything far simpler for client code, but the abstraction is
constructed only once (and forever ;-).
In the former case, the cost is the same (or rather higher, because
optimisation can probably be more effective for a single standard
construction than for a variety of operation-specific cases), but it is
_multiplied_ by the number of operations you need to perform on each
piece of text. Thus, for a given operation, you get the slowest possible
run: for instance, indexing is O(k*n), where k is the cost of "piling" a
single character and n the character count...
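Sketched in D (again assuming a byGrapheme helper that Phobos does not
provide), the on-the-fly option amounts to this:

    import std.uni : byGrapheme;
    import std.range : drop;

    // On-the-fly indexing: every call re-scans and re-piles the graphemes
    // from the start, so fetching character n costs O(k*n).
    // (Assumes n is less than the character count.)
    auto nthCharacter(string s, size_t n)
    {
        return s.byGrapheme.drop(n).front;
    }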
In the latter case, the cost is paid only once, when each piece of text
is constructed. Then, every operation is as fast as possible: indexing
is indeed O(1).
But this O(1) is slightly slower than with historic charsets, because
characters are now represented by small arrays of codes instead of
single codes. The same point applies, even more, to every operation
involving comparisons (search, count, replace). We cannot solve this: it
is due to UCS's coding scheme.
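For contrast, here is a toy sketch of the idea behind Text: pile the
graphemes once, up front, then index plain arrays afterwards. This is
not the actual Text code, and it again leans on normalize / byGrapheme /
Grapheme helpers that Phobos does not offer today:

    import std.array : array;
    import std.uni : byGrapheme, normalize, Grapheme, NFD;

    // Pay the decoding/normalising/piling cost a single time, then every
    // access is a plain array operation.
    struct PiledText
    {
        Grapheme[] piles;   // one entry per true character

        this(string s)
        {
            piles = s.normalize!NFD.byGrapheme.array;
        }

        // O(1): an array access, no re-scanning of the code sequence.
        Grapheme opIndex(size_t i) { return piles[i]; }

        // The character count is known once and for all.
        size_t length() const { return piles.length; }
    }

    unittest
    {
        import std.algorithm : equal;
        auto t = PiledText("a\u00E9z");   // "aéz", with a precomposed é
        assert(t.length == 3);            // three true characters
        auto g = t[1];
        assert(g[].equal("e\u0301"));     // the é pile, in decomposed form
    }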
> However, regardless of what the best way to handle unicode is in general, I
> think that it's painfully clear that your average programmer doesn't know much
> about unicode.
True. Even those who think they are informed. Because Unicode's docs
not only ignore the problem, but contribute to creating it by using the
deceptive term "abstract character" (and often worse, "character" alone)
to denote what a codepoint codes. All the articles I have ever read
_about_ Unicode by third parties simply follow suit. Raising this issue
on the Unicode mailing list usually results in plain silence.
> Even understanding the nuances between char, wchar, and dchar is
> more than your average programmer seems to understand at first. The idea that a
> char wouldn't be guaranteed to be an actual character is not something that many
> programmers take to immediately. It's quite foreign to how chars are typically
> dealt with in other languages, and many programmers never worry about unicode at
> all, only dealing with ASCII.
(average programmer? ;-)
It is foreign not so much to "how chars are typically dealt with in
other languages" as to how characters were coded in historic charsets.
Other languages ignore the issue, and thus behave incorrectly on
universal text, the same way D's builtin tools do.
About ASCII, note that the only kind of source it is able to encode is
plain English text, without anything fancy in it. A single non-breaking
space, a "≥", a "×" (multiplication sign, U+00D7), a letter borrowed
from a foreign language as in "à la", the same for "αβγ", not to mention
"©" & "®", and the text is already beyond its reach...
> So, not only is unicode a rather disgusting
> problem, but it's not one that your average programmer begins to grasp as far as
> I've seen. Unless the issue is abstracted away completely, it takes a fair bit
> of explaining to understand how to deal with unicoder properly.
Please have a look at
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3, read
https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/U%20missing%20level%20of%20abstraction,
and try https://bitbucket.org/denispir/denispir-d/src/a005424f60f3/Text.d
Any feedback welcome (esp on reformulating the text concisely ;-)
> - Jonathan M Davis
Denis
_________________
vita es estrany
spir.wikidot.com