std.algorithm.remove and principle of least astonishment

Michel Fortin michel.fortin at michelf.com
Sun Nov 21 18:26:53 PST 2010


On 2010-11-21 20:21:27 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> That design, with which I experimented for a while, had two drawbacks:
> 
> 1. It had the default reversed, i.e. most often you want to regard a 
> char[] or a wchar[] as a range of code points, not as an array of code 
> units.
> 
> 2. It had the unpleasant effect that most algorithms in std.algorithm 
> and beyond did the wrong thing by default, and the right thing only if 
> you wrapped everything with byDchar().

Well, basically these two arguments are the same: iterating by code 
unit isn't a good default. And I agree. But I'm unconvinced that 
iterating by dchar is the right default either. For one thing it adds 
decoding overhead, and for another a dchar still doesn't represent a 
user-perceived character.

Now, add graphemes to the equation and you have a representation that 
matches the user-perceived character concept, but for that you add 
another layer of decoding overhead and a variable-size data type to 
manipulate (a grapheme is a sequence of code points). And you have to 
use Unicode normalization when comparing graphemes. So is that a good 
default? Probably not. It might be "correct" in some sense, but it's 
totally overkill for most cases.
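
To make the three layers concrete, here is a minimal sketch. It is written in Python rather than D so that it does not presume any particular Phobos API, and every function name in it is illustrative only. The grapheme count is a deliberately crude approximation (it just skips combining marks), not real grapheme segmentation as defined by Unicode.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def code_units(s):
    """Number of UTF-8 code units (bytes) needed to store s."""
    return len(s.encode("utf-8"))


def code_points(s):
    """Number of code points; a Python str is a sequence of code points."""
    return len(s)


def approx_graphemes(s):
    """Crude approximation of user-perceived characters: count the code
    points that are not combining marks. NOT full grapheme segmentation."""
    return sum(1 for c in s if unicodedata.combining(c) == 0)


def same_text(a, b):
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Both strings display as the same character, yet they differ in code unit count, in code point count, and under plain `==`; only the normalized comparison treats them as equal, which is the extra cost mentioned above.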

My thinking is that there is no good default. If you write an XML 
parser, you'll probably want to work at the code point level; if you 
write a JSON parser, you can easily skip the overhead and work at the 
UTF-8 code unit level until you start parsing a string; if you write 
something to count the number of user-perceived characters or want to 
map characters to a font then you'll want graphemes...

Perhaps there should simply be no default; perhaps you should be forced 
to choose explicitly at which layer you want to operate each time you 
apply an algorithm to a string... and to make this less painful we 
could have functions in std.string acting as a thin layer over similar 
ones in std.algorithm, automatically choosing the right 
representation for each operation.
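
A hypothetical sketch of what such a thin layer might look like, again in Python rather than D and with invented function names: each string function picks the cheapest layer that still gives a correct answer for its operation. Substring search can stay at the code unit level, because a valid UTF-8 needle can only match a valid UTF-8 haystack at code point boundaries; counting code points needs to look at sequence boundaries but not fully decode.

```python
def contains(text: bytes, needle: bytes) -> bool:
    """Substring test on UTF-8 text at the code unit level (no decoding).
    Assumes both arguments are valid UTF-8."""
    return needle in text


def count_code_points(text: bytes) -> int:
    """Count code points in UTF-8 text by counting the bytes that are
    not continuation bytes (those of the form 10xxxxxx)."""
    return sum(1 for b in text if (b & 0xC0) != 0x80)


sample = "naïve café".encode("utf-8")
```

The caller never names a layer here; the choice is made once, inside the wrapper, per operation.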

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
