std.algorithm.remove and principle of least astonishment

Jonathan M Davis jmdavisProg at gmx.com
Sun Nov 21 17:52:23 PST 2010


On Sunday 21 November 2010 17:21:27 Andrei Alexandrescu wrote:
> On 11/21/10 7:00 PM, Jonathan M Davis wrote:
> > Actually, the better implementation would probably be to provide wrapper
> > ranges for ranges of char and wchar so that you could access them as
> > ranges of dchar. Doing otherwise would make it so that you couldn't
> > access them directly as ranges of char or wchar, which would be
> > limiting, and since it's likely that anyone actually wanting strings
> > would just use strings, there's a good chance that in the majority of
> > cases, what you'd want would really be a range of char or wchar anyway.
> > Regardless, it's quite possible to access containers of char or wchar as
> > ranges of dchar if you need to.
> 
> I agree except for the majority of cases part. In fact the original
> design of range interfaces for char[] and wchar[] was to require
> byDchar() to get a bidirectional interface over the arrays of code units.
> 
> That design, with which I experimented for a while, had two drawbacks:
> 
> 1. It had the default reversed, i.e. most often you want to regard a
> char[] or a wchar[] as a range of code points, not as an array of code
> units.
> 
> 2. It had the unpleasant effect that most algorithms in std.algorithm
> and beyond did the wrong thing by default, and the right thing only if
> you wrapped everything with byDchar().
> 
> The second iteration of the design, which is currently in use, was to
> define in std.range the primitives such that char[] and wchar[] offer by
> default the bidirectional range interface. I have gone through all
> algorithms in std.algorithm and std.string and noticed with amazed
> satisfaction that they most always did the right thing, and that I could
> tweak the few that didn't to complete a satisfactory implementation.
> (indexOf has slipped through the cracks.) I think that experience with
> the current design is speaking in its favor.
> 
> One thing could be done to drive the point home: a function byCodeUnit()
> could be added that actually does iterate a char[] or a wchar[] one code
> unit at a time (and consequently restores their behavior as T[]). That
> function could be simply a cast to ubyte[]/ushort[], or it could
> introduce a random-access range.

Well, I don't know for certain whether people would normally want to iterate 
over Array!char as a char range or a dchar range. However, when thinking about 
the likely uses, it seems to me that if you really want a string, you'd 
likely be using a string rather than Array!char, so I figure that the most likely 
use case for Array!char would be to iterate over a range of char. But I could be 
totally wrong about that.

As for character arrays, I do think that the normal use case is to want to see 
them as ranges of dchar rather than char or wchar. However, that can get a bit 
funny due to the fact that while the _programmer_ almost always views them that 
way, the _algorithms_ vary quite a bit more in whether they really want dchar or 
whether char or wchar works just fine. I do agree, though, that the current design 
works quite well overall.
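
To make the code unit versus code point distinction concrete, here is a minimal
sketch. It assumes a recent Phobos, where std.utf.byCodeUnit exists (it did not
at the time of this discussion), and the sample string is only an illustration:

```d
import std.range : ElementType, front, walkLength;
import std.utf : byCodeUnit;

void main()
{
    // U+00E9 takes two UTF-8 code units: 5 code points stored in 6 chars.
    string s = "h\u00E9llo";

    // The array itself is measured in code units.
    assert(s.length == 6);

    // The range primitives decode, so the range element type is dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'h');
    assert(s.walkLength == 5);

    // A plain foreach still walks code units unless dchar is requested.
    size_t units;
    foreach (c; s)
        ++units;
    size_t points;
    foreach (dchar c; s)
        ++points;
    assert(units == 6 && points == 5);

    // byCodeUnit opts back into code-unit iteration explicitly.
    auto cu = s.byCodeUnit;
    assert(cu.length == 6);
    assert(cu[1] == 0xC3); // first byte of the two-byte sequence
}
```

An algorithm that only compares ASCII code units can safely work on the
byCodeUnit view; one that counts or reverses characters needs the decoded view.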

If I were to change it, I'd probably make strings into structs which have an 
array property (giving access to the char[] or wchar[] array if you need it) and 
give the struct a range interface which was over dchar. To really make that 
work, though, you'd need uniform function call syntax (or things like 
str.splitlines() would quit working), and there could be other reasons why it 
would fall apart. However, it would quickly and easily make dchar iteration the 
default while still allowing access to the interior char[] or wchar[]. But since 
you'd still have to special case functions which actually wanted the char[] or 
wchar[], I'm not sure if you ultimately gain much - though it does fix the 
foreach error where it defaults to char or wchar.
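
A rough sketch of what such a struct could look like follows. The type name
String and its array property are hypothetical and are not part of Phobos;
decoding is done with std.utf.decode and std.utf.stride, which is just one way
to write it:

```d
import std.range : ElementType, isInputRange, walkLength;
import std.utf : decode, stride;

// Hypothetical wrapper: not a Phobos type, only an illustration of the idea.
struct String
{
    private string data;

    // Direct access to the underlying code units when they are needed.
    @property string array() { return data; }

    // Input range primitives that present the data as dchar.
    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i); // decodes the first code point
    }

    void popFront()
    {
        // Skip however many code units the first code point occupies.
        data = data[stride(data, 0) .. $];
    }
}

void main()
{
    static assert(isInputRange!String);
    static assert(is(ElementType!String == dchar));

    auto s = String("h\u00E9llo");
    assert(s.array.length == 6); // code units
    assert(s.walkLength == 5);   // code points

    // foreach uses the range interface, so it sees dchar by default.
    size_t n;
    foreach (c; s)
    {
        static assert(is(typeof(c) == dchar));
        ++n;
    }
    assert(n == 5);
}
```

Note that s.walkLength is itself a free function called with member syntax,
which is exactly the uniform function call syntax this approach depends on.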

Overall, what we have works quite well. It _is_ a bit convoluted at times, but 
it's generally convoluted because of the nature of Unicode rather than how we're 
implementing it. It's not perfect (Unicode is too disgusting for perfection to 
be possible anyway), but it works _far_ better than in any other language that I've 
used, and I actually understand Unicode and its issues far better than I did 
before messing around with D strings.

- Jonathan M Davis

