std.algorithm.remove and principle of least astonishment

spir denis.spir at gmail.com
Mon Nov 22 03:21:24 PST 2010


On Sun, 21 Nov 2010 21:26:53 -0500
Michel Fortin <michel.fortin at michelf.com> wrote:

> On 2010-11-21 20:21:27 -0500, Andrei Alexandrescu 
> <SeeWebsiteForEmail at erdani.org> said:
> 
> > That design, with which I experimented for a while, had two drawbacks:
> > 
> > 1. It had the default reversed, i.e. most often you want to regard a 
> > char[] or a wchar[] as a range of code points, not as an array of code 
> > units.
> > 
> > 2. It had the unpleasant effect that most algorithms in std.algorithm 
> > and beyond did the wrong thing by default, and the right thing only if 
> > you wrapped everything with byDchar().

Hello Michel,

> Well, basically these two arguments are the same: iterating by code 
> unit isn't a good default. And I agree. But I'm unconvinced that 
> iterating by dchar is the right default either. For one thing it has 
> more overhead, and for another it still doesn't represent a character.

This is an issue evoked in a previous thread some weeks ago. More on it below.

> Now, add graphemes to the equation and you have a representation that 
> matches the user-perceived character concept, but for that you add 
> another layer of decoding overhead and a variable-size data type to 
> manipulate (a grapheme is a sequence of code points). And you have to 
> use Unicode normalization when comparing graphemes. So is that a good 
> default? Probably not. It might be "correct" in some sense, but it's 
> totally overkill for most cases.

It is not possible, as the writer of a text-processing lib or Text type, to define a single right level of abstraction (code unit, code point, or grapheme) that is both usually efficient and free of unexpected failures under "naive" use of the tool.
The only safe level in 99% of cases is the highest-level one, namely the grapheme. Only then can one be sure that, for instance, text.count("ä") will actually count the "ä"'s in the source text. But in most cases this is overkill. It depends on what the text actually contains, *and* on what the programmer knows about it (texts may be plain ASCII, so that even unsigned byte strings would do the job, but the programmer has to know that in advance).
The tool writer cannot guess any of this.
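
To make the danger concrete, here is a small sketch (in Python for brevity; D's char[]/dchar ranges raise the same questions at the code-unit and code-point levels): the "same" user-perceived text gives different answers depending on the level at which one counts.

```python
import unicodedata

# The same user-perceived character 'ä', stored two ways:
precomposed = "\u00e4"      # one code point (U+00E4)
decomposed = "a\u0308"      # 'a' + combining diaeresis: two code points

print(precomposed == decomposed)          # False at the code-point level
print(len(precomposed), len(decomposed))  # 1 2 -- code-point counts differ
print(len(precomposed.encode("utf-8")),
      len(decomposed.encode("utf-8")))    # 2 3 -- UTF-8 code-unit counts differ

# Only a grapheme-level view (approximated here via normalization) agrees:
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

So a naive count or find at the code-unit or code-point level silently misses one of the two forms.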

> My thinking is that there is no good default. If you write an XML 
> parser, you'll probably want to work at the code point level; if you 
> write a JSON parser, you can easily skip the overhead and work at the 
> UTF-8 code unit level until you start parsing a string; if you write 
> something to count the number of user-perceived characters or want to 
> map characters to a font then you'll want graphemes...

At least 3 factors must be taken into account:

1. The actual content of the source texts. For instance, 99.999% of all texts will never hold a code point above 0xFFFF. This tells which size should be used for code units, the safe general choice being 32 bits.

2. The normalization form of the graphemes: whether they are decomposed (the right choice), in unknown or possibly mixed forms, or as precomposed as possible. In the latter case (by far the most common one for western-language texts), if one can assert that every grapheme in every source text to be dealt with has a fully precomposed form (= 1 single code *point*), then the level of code points is safe enough.

3. Whether text is just transferred through an app or is also processed. Many apps just use bits of input text (files, user input, literals) as is, without any processing, and often output some of them, possibly concatenated. This is safe whatever the abstraction level of the text representation; one can concatenate plain UTF-8 holding composite graphemes in decomposed form.
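
A minimal illustration of point 3 (again a Python sketch; the same holds for D's UTF-8 strings): byte-level concatenation of valid UTF-8 fragments stays valid UTF-8, whatever normalization form the graphemes are in.

```python
# Pure pass-through: concatenate UTF-8 byte strings with no processing at all.
part1 = "a\u0308".encode("utf-8")    # decomposed 'ä': 'a' + combining diaeresis
part2 = " ok".encode("utf-8")
joined = part1 + part2               # raw byte-level concatenation

# The result is still valid UTF-8 and the content is preserved:
print(joined.decode("utf-8") == "a\u0308 ok")   # True
```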

But as soon as any text-processing routine is used (indexing, slicing, find, count, replace...), questions arise about the correctness of the app.

And, as said already, to be able to safely choose any lower level of representation, the programmer must know about the content, its properties, and its UCS coding. For instance, imagine you need to write an app dealing with texts containing phonetic symbols (IPA). How do you know which is the lowest safe level?
* What is the common coding of IPA graphemes in UCS?
* Can they be coded in various ways? (yes! too bad...)
* What is the highest code point ever possibly needed? (==> are utf8 or utf16 code units enough to hold each code point?)
* Do all graphemes have a fully precomposed form?
* Can I be sure that all texts will actually be coded in precomposed form (this depends on the text-producing tools), "for ever"?
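
Some of these questions can at least be probed empirically on a sample. A Python sketch (the sample IPA string below is made up purely for illustration):

```python
import unicodedata

sample = "\u0251\u02D0 \u0283a\u0361\u0283"   # a few IPA code points, e.g. "ɑː ʃa͡ʃ"

# Highest code point used: does it fit a 16-bit code unit without surrogates?
highest = max(ord(c) for c in sample)
print(hex(highest))        # 0x361 here, well below 0xFFFF

# Do all graphemes collapse to single precomposed code points under NFC?
nfc = unicodedata.normalize("NFC", sample)
print(any(unicodedata.combining(c) for c in nfc))
# True: U+0361 (combining double inverted breve) has no precomposed form,
# so this sample is NOT safe at the one-code-point-per-grapheme level.
```

Of course, a sample can only describe the texts at hand; it cannot guarantee anything about future inputs "for ever", which is exactly the problem.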

> Perhaps there should be simply no default; perhaps you should be forced 
> to choose explicitly at which layer you want to operate each time you 
> apply an algorithm on a string... and to make this less painful we 
> could have functions in std.string acting as a thin layer over similar 
> ones in std.algorithm that would automatically choose the right 
> representation for the algorithm depending on the operation.
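
The explicit-layer idea might look something like the following (a Python sketch with made-up names; nothing like this exists in std.string, it only illustrates choosing the level per operation):

```python
import unicodedata

def count_points(text, needle):
    """Count at the code-point level: exact code-point subsequence match."""
    return text.count(needle)

def count_graphemes(text, needle):
    """Count at an approximated grapheme level: normalize both sides first."""
    return unicodedata.normalize("NFC", text).count(
        unicodedata.normalize("NFC", needle))

s = "a\u0308 \u00e4"                 # 'ä' in decomposed, then precomposed form
print(count_points(s, "\u00e4"))     # 1 -- only the precomposed occurrence
print(count_graphemes(s, "\u00e4"))  # 2 -- both occurrences
```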

My next project should be to write a Text type dealing at the highest level -- if only to showcase the issues involved by the "missing level of abstraction" in common tools supposed to deal with universal text.
This is much easier in D thanks to the proper string types and the availability of tools to cope with lower-level issues, mainly decoding/encoding and validity checking (I do not know yet how practical said tools are).
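
As a rough illustration of what such a grapheme-level Text type might look like (a hypothetical Python sketch; the segmentation rule used here -- a base character plus its following combining marks -- is a simplification, a real implementation would follow the full Unicode segmentation algorithm of UAX #29):

```python
import unicodedata

class Text:
    def __init__(self, s):
        s = unicodedata.normalize("NFD", s)    # decomposed form, as argued above
        self.graphemes = []
        for c in s:
            if self.graphemes and unicodedata.combining(c):
                self.graphemes[-1] += c        # attach mark to its base character
            else:
                self.graphemes.append(c)

    def __len__(self):
        return len(self.graphemes)             # length in graphemes

    def __getitem__(self, i):
        return self.graphemes[i]               # indexing by grapheme

    def count(self, needle):
        # Normalize the needle so both spellings of a grapheme match.
        return self.graphemes.count(unicodedata.normalize("NFD", needle))

t = Text("ba\u0308c\u00e4")        # 'ä' in two forms, mixed in one string
print(len(t))                       # 4 graphemes
print(t.count("\u00e4"))            # 2 -- both forms are found
```

Every operation then answers in user-perceived characters, at the cost of the normalization and segmentation overhead discussed above.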


denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com



More information about the Digitalmars-d mailing list