std.algorithm.remove and principle of least astonishment

spir denis.spir at gmail.com
Mon Nov 22 04:25:34 PST 2010


On Sun, 21 Nov 2010 17:56:15 -0800
Jonathan M Davis <jmdavisProg at gmx.com> wrote:

> We could always define an abstract Character (or whatever you want to call it) 
> which holds a character - regardless of whether it uses a grapheme or not - and 
> make it relatively easy to iterate over Characters rather than dchars.

This is not a solution: it would force constructing graphemes anew for each routine applied to a given text. You need to do it only once.

> It would 
> be nice if they abolished graphemes though...

What is the alternative? For a given set of base characters (say ASCII letters, cardinal NC) and a given set of "combining marks" (say Latin diacritics, cardinal ND), what is the number of combinations? If I'm right, the answer is NC * 2^ND, since each mark is either present or absent on a given base (in other words, an astronomical number). We would need thousands of bits for each code point ;-)
Also, we cannot predict the future. Consider that for each new diacritic, you must double the number of precomposed characters, simply by adding this diacritic to every already existing combination. We cannot know what will be needed in a few years.
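
A quick sketch of that arithmetic, in Python for convenience. The two sizes are assumptions chosen only for illustration (52 ASCII letters, and a guessed 20 Latin diacritics), and the function name is hypothetical:

```python
# Hypothetical sizes, chosen only to illustrate the NC * 2^ND formula.
NC = 52  # base characters: ASCII letters a-z and A-Z
ND = 20  # combining marks: an assumed number of Latin diacritics

UNICODE_CODE_SPACE = 0x110000  # number of code points, U+0000..U+10FFFF


def precomposed_count(nc, nd):
    """Count base+marks combinations when each mark is present or absent."""
    return nc * 2 ** nd


total = precomposed_count(NC, ND)
print(total)                       # combinations for the assumed sizes
print(total > UNICODE_CODE_SPACE)  # compare with the whole code space

# Adding one more diacritic doubles the count.
print(precomposed_count(NC, ND + 1) == 2 * total)
```

Even with these modest assumed numbers, the count already exceeds the whole Unicode code space, and this ignores the order and repetition of marks.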

The error UCS & Unicode have made is the opposite one: to silently pretend that code points represent characters. (I cannot believe that choosing the term "abstract character" to denote what is coded by a code point was innocent; it can only introduce confusion.) They should have said that a code point represents, say, an "abstract mark", and made clear that a character, meaning a logical text element, is represented by a mini-array of code points (what I call a code stack, see other post for why).
This would have avoided confusion from the start, and encouraged programmers to design proper, correct text representations -- at least for text processing. Now, and only because of that, everybody seems to discover the resulting issues 20 years too late. Even in Unicode circles: I have tried to raise this on the Unicode mailing list several times in past years, with almost no response at all. People do not *want* to hear of it.
I think this has been a deliberate marketing choice for the UCS/Unicode standard. Probably they were afraid of reactions from programming communities if they had made clear that dealing with universal text requires adding *2* levels of abstraction over plain ASCII. Another error was to promote using code units for space efficiency. Otherwise, there would be only 1 new level.
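
To make the levels concrete, here is a small sketch in Python (not D) using only the standard library. The sample text is my own choice: a single visible character, "e" with an acute accent, written as a base letter plus a combining mark.

```python
import unicodedata

# One visible character: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
text = "e\u0301"

# Level of code units: what is actually stored, here in UTF-8 and UTF-16.
utf8_units = text.encode("utf-8")
utf16_units = text.encode("utf-16-le")  # 2 bytes per UTF-16 code unit here

# Level of code points: what a dchar-like type iterates over.
code_points = [ord(c) for c in text]

print(len(utf8_units))        # number of UTF-8 code units
print(len(utf16_units) // 2)  # number of UTF-16 code units
print(len(code_points))       # number of code points

# The second code point is a combining mark, not a character of its own.
print(unicodedata.name(text[1]))
print(unicodedata.combining(text[1]) != 0)
```

None of the three counts equals 1, the number of characters a reader sees: the character level sits above both code units and code points.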

> It is quite possible that while 
> D's handling of unicode is a huge improvement over other languages, by treating 
> dchar as a full character essentially everywhere, we're opening ourselves up for 
> a variety of bugs caused by graphemes which will be subtle and hard to find.
> But I'm not sure what the correct solution to that is.

There is one general solution as long as efficiency is considered irrelevant: a text is represented as a string of graphemes. There is no efficient solution, because the cases for which this is overkill are the most common ones (as of now, but this will change with the growth of computing in Asian countries).
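
A minimal sketch of that representation, in Python rather than D, assuming a simplified rule: a grapheme is one base code point plus the combining marks (general category M*) that follow it. Real segmentation has more rules than this (see Unicode's text segmentation annex), and the name `to_stacks` is hypothetical.

```python
import unicodedata


def to_stacks(text):
    """Group code points into stacks: a base plus its following marks.

    Simplified rule (an assumption): any code point whose general
    category starts with "M" is attached to the preceding stack.
    """
    stacks = []
    for ch in text:
        if stacks and unicodedata.category(ch).startswith("M"):
            stacks[-1] += ch   # attach the mark to the current stack
        else:
            stacks.append(ch)  # start a new stack with a base code point
    return stacks


# Built once, then reused by every routine that works on the text.
text = "e\u0301a\u0302\u0323b"  # 6 code points, 3 characters
stacks = to_stacks(text)
print(len(text), len(stacks))

# Removing the second character removes its marks along with it.
del stacks[1]
print("".join(stacks) == "e\u0301b")
```

The cost is visible here: one small string per character, which is wasted on text that never contains a combining mark.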

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com


