Found on proggit: Krug, a new experimental programming language, compiler written in D

H. S. Teoh hsteoh at quickfur.ath.cx
Tue May 1 18:46:20 UTC 2018


On Tue, May 01, 2018 at 05:13:13PM +0000, IntegratedDimensions via Digitalmars-d wrote:
[...]
> The point of O is for the most dominant rate of growth (asymptotic
> behavior).  In mathematics, one only cares about n as it approaches
> infinity and so any constant term will eventually be dwarfed. So
> technically, in the theory, it is "incorrect" to have any extraneous
> terms.

Well, yes.  Of course the whole idea behind big O is asymptotic
behaviour, i.e., behaviour as n becomes arbitrarily large.
Unfortunately, as you point out below, this is not an accurate depiction
of the real world:


> In CS, it is used to approximate the time for large n to be able to
> compare different algorithms in the "long run". Since computers cannot
> process infinite n, n will be finite and generally relatively
> small (e.g., less than 10^100, which is quite small compared to
> infinity).

Exactly, and therefore the model does not quite match reality; when
the scale of n matters in practice, the conclusions you draw from big
O will be inaccurate at best, outright wrong at worst.
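
Just to put rough numbers on that (the cost models below are
completely made up, purely to illustrate the point): an algorithm with
a small constant but worse growth can win over a surprisingly long
stretch of realistic n before the asymptotically better one takes
over.  A quick D sketch:

import std.stdio;
import std.math : log2;

void main()
{
    // Hypothetical cost models, not measurements: A is O(n^2) with a
    // small constant, B is O(n log n) with a large constant.
    static double costA(double n) { return 2.0 * n * n; }
    static double costB(double n) { return 100.0 * n * log2(n); }

    foreach (n; [10, 100, 1_000, 10_000])
    {
        writefln("n = %6d   A = %12.0f   B = %12.0f   cheaper: %s",
                 n, costA(n), costB(n),
                 costA(n) < costB(n) ? "A (the \"worse\" big O)" : "B");
    }
}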


> So the failure you are pointing out is really because of the application.
> In some cases the constant term may be applicable and same cases it
> isn't.  Since it depends on n and we cannot use the simplification
> that n is infinity, it matters what n is. This is why it is also
> important to know which algorithms do better for small n because if n
> is small during program use one might be using the wrong algorithm.

Exactly.  Yet this important point is often overlooked, or simply
never mentioned, when big O is taught to CS students.

I'm not saying big O is useless -- it has its uses, but people need to
be aware of its limitations rather than blindly assuming (or worse,
being told by an authority) that it's a magical pill that will solve
everything.


> But once one goes down this road one then has to not count ideal
> cycles but real cycles and include all the other factors involved. Big
> O was not meant to give real world estimates because it is a
> mathematical domain. It may or may not work well depending on how
> poorly it is used, sorta like statistics.  Generally though, they are
> such a great simplification tool that it works for many processes that
> are well behaved.

I agree that big O is a wonderful simplification in many cases.  But
this comes with caveats, and my complaint was that said caveats are more
often than not overlooked or neglected.

To use a concrete example: traditional big O analysis says a hashtable
is fastest, being O(1), especially when the hash function minimizes
collisions. Minimal collisions mean short linear search chains when
multiple entries fall into the same bucket, i.e., we stay close to
O(1) instead of the worst case of O(n) (or O(log n), depending on how
you implement your buckets). In certain situations, however, it may
actually be advantageous to *increase* collisions with a
locality-sensitive hash function, because doing so increases the
likelihood that the next few lookups will already be in cache and
therefore won't incur the cost of yet another cache miss and RAM/disk
roundtrip.  The buckets are bigger, and according to big O analysis
"slower", because each lookup incurs the cost of an O(n) or O(log n)
search within a bucket.  In practice, however, it's faster, because
it's expensive to load a bucket into the cache (or incur disk I/O to
read a bucket from disk).  If lookups are clustered around similar
keys and end up in the same few buckets, then once those buckets are
cached any subsequent lookups become very cheap.  Large buckets
actually work better here because instead of paying the cost of
loading k small buckets, you pay once for one large bucket containing
many of the entries you will access in the near future.  (And larger
buckets are more likely to contain entries you will need soon.)
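
As a rough sketch of what I mean (the names, the clustering factor and
the sizes below are all made up for illustration, not taken from any
real hashtable implementation):

import std.stdio;

struct Entry { ulong key; int value; }

// Deliberately locality-sensitive bucket function: nearby keys land in
// the same bucket, so a run of lookups over similar keys keeps hitting
// buckets that are already hot in cache.
size_t bucketOf(ulong key, size_t nBuckets)
{
    enum keysPerBucket = 64;   // made-up clustering factor
    return cast(size_t)(key / keysPerBucket) % nBuckets;
}

struct ClusteredTable
{
    Entry[][] buckets;

    this(size_t nBuckets) { buckets = new Entry[][](nBuckets); }

    void insert(ulong key, int value)
    {
        buckets[bucketOf(key, buckets.length)] ~= Entry(key, value);
    }

    // O(bucket size) linear scan: "worse" on paper, but the bucket is
    // one contiguous array that spans only a few cache lines.
    int* lookup(ulong key)
    {
        foreach (ref e; buckets[bucketOf(key, buckets.length)])
            if (e.key == key)
                return &e.value;
        return null;
    }
}

void main()
{
    auto table = ClusteredTable(1024);
    foreach (ulong k; 0 .. 10_000)
        table.insert(k, cast(int) k);

    // Clustered lookups: keys 4992..5055 all map to the same bucket,
    // so after the first access that bucket stays cached.
    foreach (ulong k; 4_992 .. 5_056)
        assert(*table.lookup(k) == cast(int) k);
    writeln("clustered lookups OK");
}

The whole trick is that bucketOf throws away the low-order bits of the
key, so lookups over nearby keys stay inside one already-cached bucket
instead of scattering across the table.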

Also, doing an O(n) linear search within small-ish buckets may
actually be faster than fancy O(log n) binary trees, due to the CPU's
hardware prefetcher.  A linear scan is an access pattern the
prefetcher easily recognizes, so it can load the consecutive cache
lines ahead of time, amortizing away the RAM roundtrip costs; with a
fancy binary tree, the next memory access is hard to predict (or may
have no predictable pattern at all), so the prefetcher can't help you,
and you pay for the RAM roundtrips.  When n gets large, of course, the
binary tree will overtake the linear search.  But the way big O is
taught in CS courses gives the wrong impression that an O(n) linear
search is always inferior, and therefore bad and to be avoided.
Students need to be told that this is not always the case, and that
there are times when O(n) is actually better than O(log n).
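
If you want to see this effect for yourself, a quick-and-dirty
comparison is easy to write (using binary search over a sorted array
as a stand-in for the tree; the array size, lookup pattern and
iteration count below are arbitrary, and where the break-even point
actually lies depends entirely on your hardware):

import std.stdio;
import std.array : array;
import std.range : iota, assumeSorted;
import std.algorithm.searching : canFind;
import std.datetime.stopwatch : StopWatch, AutoStart;

void main()
{
    // Small sorted array: the regime where a plain linear scan often
    // competes with binary search despite the "worse" big O.
    enum int n = 64;
    auto data = iota(0, n * 2, 2).array;   // 0, 2, 4, ..., 126
    auto sorted = assumeSorted(data);

    enum iterations = 1_000_000;
    size_t linearHits, binaryHits;

    // O(n) linear scan: sequential access, prefetcher-friendly.
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. iterations)
        linearHits += data.canFind((i * 7) % (n * 2));
    auto linearTime = sw.peek;

    // O(log n) binary search: fewer comparisons, jumpier accesses.
    sw.reset();
    foreach (i; 0 .. iterations)
        binaryHits += sorted.contains((i * 7) % (n * 2));
    auto binaryTime = sw.peek;

    writefln("linear: %s   binary: %s   (hits: %s vs %s)",
             linearTime, binaryTime, linearHits, binaryHits);
}

Whichever way it comes out on a given machine, the break-even point is
exactly the kind of detail that big O cannot tell you.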


> Ideally it would be better to count the exact number of cycles used by
> an algorithm and have them normalized to some standard cycle that
> could be compared across different architectures. Machines can do the
> accounting easily. Even then, much anomalous behavior will generally
> be seen, but it would be more accurate, although possibly not much more
> informative, than Big O.
[...]

Sorry, counting cycles does not solve the problem.  That may have worked
back in the days of the 8086, but CPU architectures have moved on a long
way since then.  These days, cache behaviour is arguably far more
important than minimizing cycles, because your cycle-minimized
algorithm will do you no good if the instruction pipeline has to be
stalled every other cycle because of branch hazards, or because of a
cache miss that entails a RAM roundtrip.

Furthermore, due to the large variety of cache structures out there,
it's unrealistic to expect a single generalized cycle-counting model to
work for all CPU architectures. You'd be drowning in nitty-gritty
details instead of getting useful results from your analysis.  CPU
instruction pipelines, out-of-order execution, speculative execution,
etc., will complicate the analysis so much that a unified model that
works across all CPUs would pretty much be impossible.

A more promising approach that has been pursued in recent years is
the cache-oblivious model, where the algorithm is designed to take
advantage of a cache hierarchy, but *without* depending on any
specific one. I.e., it is assumed that accessing N elements
sequentially across blocks of size B will be faster than N random
accesses, but the algorithm is designed in such a way that it does
not depend on specific values of B and N, and does not need to be
tuned to any specific value of B and N.  This model has shown a lot
of promise in algorithms that may in theory have "poor" big O
behaviour, but in practice operate measurably faster because they
take advantage of the modern CPU cache hierarchy.
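
For a minimal sketch of what such an algorithm looks like, here is the
classic recursive matrix transpose (the function name and base-case
cutoff are mine, chosen only for illustration).  Note that the block
size B never appears anywhere in the code:

import std.stdio;

// Cache-oblivious transpose: keep splitting the larger dimension until
// the sub-block is small, so at *some* recursion depth the working set
// fits in whatever cache level exists -- without naming a block size.
void transpose(const double[] src, double[] dst, size_t n,
               size_t r0, size_t r1, size_t c0, size_t c1)
{
    enum baseCase = 16;   // small cutoff, not tuned to any cache
    immutable rows = r1 - r0, cols = c1 - c0;

    if (rows <= baseCase && cols <= baseCase)
    {
        foreach (r; r0 .. r1)
            foreach (c; c0 .. c1)
                dst[c * n + r] = src[r * n + c];
    }
    else if (rows >= cols)
    {
        immutable mid = r0 + rows / 2;
        transpose(src, dst, n, r0, mid, c0, c1);
        transpose(src, dst, n, mid, r1, c0, c1);
    }
    else
    {
        immutable mid = c0 + cols / 2;
        transpose(src, dst, n, r0, r1, c0, mid);
        transpose(src, dst, n, r0, r1, mid, c1);
    }
}

void main()
{
    enum n = 512;
    auto a = new double[](n * n);
    auto b = new double[](n * n);
    foreach (i; 0 .. n * n)
        a[i] = i;

    transpose(a, b, n, 0, n, 0, n);

    assert(b[3 * n + 5] == a[5 * n + 3]);   // spot check
    writeln("transpose OK");
}

The recursion is what lets the same code degrade gracefully as the
data spills from L1 to L2 to RAM (or to disk): some level of the
recursion is automatically the "right" size for each level of the
hierarchy.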

As an added bonus, the cache hierarchy analysis naturally extends to
include secondary storage like disk I/O, and the beauty of
cache-oblivious algorithms is that they can automatically take
advantage of this without needing massive redesign.  This is unlike
the earlier situation where you had to know beforehand whether your
code was going to be CPU-bound or disk-bound, and might have to use
completely different algorithms in either case, or hand-tune the
parameters of your algorithm (such as optimizing for specific disk
block sizes).  A cache-oblivious algorithm can Just Work(tm) without
further ado, and without sudden performance hits.


T

-- 
One reason that few people are aware there are programs running the internet is that they never crash in any significant way: the free software underlying the internet is reliable to the point of invisibility. -- Glyn Moody, from the article "Giving it all away"

