input validation

Wed Mar 4 09:39:22 PST 2009

Andrei Alexandrescu wrote:
> Sean Kelly wrote:
>> Georg Wrede wrote:
>>> The distinction is not whether you or others write stuff. It's about 
>>> whether it is for debugging *only*, as opposed to general input 
>>> validation.
>>
>> So I guess the real question is whether a function is expected to 
>> validate its parameters.  I'd argue that it isn't, but then I'm from a 
>> C/C++ background.  For me, validation is a debugging tool, or at least 
>> an optional feature for applications that want the added insurance.
> 
> Interesting. My policy is to favor validation whenever it doesn't impact 
> performance. Imagine for example that strlen() validated its input for 
> non-null. Would that show on the profiling chart of any C application? 
> No, unless the application's core loop only called strlen() on a 
> 1-character string or so.

Interesting.  So the inexpensive checks would go in the function body 
itself, with the exhaustive extra stuff in contracts.  That does seem 
reasonable, though I still like the visual separation that the 'in' 
clause provides, and I'd love to be able to use the proposed inheritance 
feature of contracts, which seems like it might necessitate duplicating 
these inexpensive checks in the contract and in the function body itself.

> One simple case that clarifies the necessary tradeoff is binary search. 
> That assumes the range to be searched is sorted. If you actually checked 
> for that, it would render binary search useless as a linear search would 
> be in fact faster. So you need to assume. One way to do so is in the 
> documentation. You write in the docs that findSorted expects a sorted 
> range. Another way is to encode this information in the type of the 
> sorted range. But that's onerous as most of the time you have an array 
> you just sorted, not a SortedArray value.
> 
> The approach I took with the new phobos is:
> 
> int[] haystack;
> int[] needle;
> ...
> auto pos1 = find(haystack, needle); // linear
> sort(haystack);
> auto pos2 = find(assumeSorted(haystack), needle);
> 
> The assumeSorted function wraps the haystack in an AssumeSorted!(int[]) 
> type without adding members or running extra code. It's there to clarify 
> to everyone what's going on. And it's usable with other arguments or 
> functions too, e.g.
> 
> auto pos3 = find(haystack, assumeSorted(needle));
> setIntersection(assumeSorted(haystack), assumeSorted(needle));
> 
> Interestingly, assumeSorted can actually do checking without impacting 
> the complexity of the search. In debug mode, it can arrange to run 
> random isSorted tests every 1/N calls, where N is the average length of 
> the incoming arrays, then its complexity impact is amortized constant.

One thing I've always really liked about pointer arguments is that they 
tend to document what's happening at the call-side as well (because of 
the address-of operator typically needed to obtain the address of a 
variable).  I tend to avoid boolean parameters for similar reasons, 
unless the meaning can be communicated clearly at the call point.  It 
seems like this serves a similar purpose, and I like it despite the 
potential for a user accidentally calling the slow overload when he 
could actually use the fast one--better it be correct than fast, after all.

I'm not terribly fond of the added verbosity however, or that this seems 
like I couldn't use the property form:

     assumeSorted("abcd").find('c')

Truth be told, my initial inclination would be to repackage the binary 
search as a one-liner with a different name, which kind of sabotages the 
whole idea.  But I'll try to resist this urge.