avoid toLower in std.algorithm.sort compare alias

H. S. Teoh hsteoh at quickfur.ath.cx
Sat Apr 21 18:26:42 PDT 2012


On Sat, Apr 21, 2012 at 05:45:35PM -0700, Jonathan M Davis wrote:
> On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> > Jonathan M Davis:
> > > I'm not sure if it's an order of magnitude worse than your
> > > solution though, since it's been a while since I studied Big(O)
> > > notation (doing the conversion only once is still more expensive
> > > than not converting, but I'm not sure how much more expensive - it
> > > might cost less than sort such that it actually doesn't matter as
> > > far as Big(O) goes though).
> > 
> > Performing the toLower every time the cmp function is called doesn't change
> > the O complexity.
> 
> Yes it does. It adds a loop to each comparison (two loops actually,
> but since they're not nested, Big(O) only cares about the one), since
> toLower has to loop over all of the characters. As sort loops over
> each of the strings to compare them for moving them into a sorted
> position or not, it calls toLower, which adds a nested loop, so it
> increases the Big(O) complexity. Something like this
> 
> foreach(str; strings)
>     str < otherStr;
> 
> becomes
> 
> foreach(str; strings)
> {
>     foreach(dchar c; str)
>         //lower characters
> 
>     foreach(dchar c; otherStr)
>         //lower characters
> 
>     str < otherStr;
> }
> 
> though that's obviously very pseudo-code-ish and not exactly what sort
> does.  Regardless, those extra loops when the comparison happens are
> nested and therefore increase the Big(O) complexity of the overall
> algorithm.
[...]

Actually, I don't think the nested loops affect Big-O complexity at all.
The O(l) complexity (where l = string length) is already inherent in the
string comparison "str < otherStr", and the extra toLower passes run
alongside that comparison rather than nested inside it. Adding them just
turns O(l) into 3*O(l), which is the same as O(l).

However, always remember that Big-O notation hides constant factors and
lower-order terms. These are negligible for a sufficiently large problem
size, but for small problem sizes the hidden constants are very much
significant. An O(n log n) algorithm may actually take n*log(10*n) steps
or 1000*n*log(n) steps; both are O(n log n), so asymptotically they fall
in the same class, but that hidden factor of 1000 has a very significant
impact on observed performance.
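
In this particular case, the usual way to shave that constant is to pay
for toLower once per string instead of once per comparison, e.g. with
std.algorithm.schwartzSort. A rough sketch, again with a made-up word
list:

	import std.algorithm : schwartzSort;
	import std.uni : toLower;

	void main()
	{
		auto words = ["Banana", "apple", "Cherry"];

		// toLower runs once per string up front; the O(n log n)
		// comparisons then work on the cached lowered keys, so the
		// lowering cost is no longer multiplied by the number of
		// comparisons.
		schwartzSort!(a => toLower(a))(words);
	}

The number of comparisons is still O(n log n); only the constant cost
per comparison shrinks.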

That's why optimizing the constant factors is sometimes necessary
(provided you've already minimized the Big-O complexity as far as it
will go, since otherwise you're just pinching pennies while freely
spending $100 bills). Inner-loop optimizations such as strength
reduction and loop-invariant hoisting are examples of where constant
factors get optimized. If you have a loop:

	real x;
	for (size_t i = 0; i < n; i++) {
		// bigComplexCalculation is independent of i, so this value
		// is recomputed identically on every iteration
		real y = bigComplexCalculation(x);

		// make use of y in some way
	}

Then moving the bigComplexCalculation() call outside the loop improves
observed performance significantly even though the Big-O complexity is
unchanged: a small change from 10*n to 8*n steps is a 20% improvement,
yet the algorithm still degrades linearly with problem size just like
before.
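
For comparison, the hoisted version of that same made-up fragment (same
hypothetical names as above) would be:

	real x;
	// hoisted: computed once instead of once per iteration
	real y = bigComplexCalculation(x);

	for (size_t i = 0; i < n; i++) {
		// make use of y in some way
	}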


T

-- 
Microsoft is to operating systems & security ... what McDonalds is to gourmet cooking.

