Vectorized Laziness

Tue Nov 10 15:01:33 PST 2009

I've seen this good Google Talk video from Jan 2008, "MonetDB/X100: a (very) fast column-store", about a more efficient column-oriented DBMS:
http://www.youtube.com/watch?v=yrLd-3lnZ58

The efficiency of this DBMS (something like 50 times faster) is produced by few things:
- It's column-wise, this helps in several other successive optimizations too (row-oriented DBMS have their purpose still, the column-oriented are good only if you want to perform statistics on most of your data, certain kinds of data mining, etc).
- At 13.57 it shows that instead of yielding single tuples (or single items, it's column-oriented), it yields arrays of about 100 fields. This allows to create primitives that are much more efficient. And the CPU can process them better, using SSE instruction too. Such small arrays are designed to fit in the CPU cache (probably L2). Such vectorized operations are also pipelined in some way.
- The filtering operations often don't produce new vectors, they just mark the items as not present any more inside an array. This helps reducing the number of allocatations of new arrays.
- On disk data is kept compressed, the compression is column-wise, and decompression is done only just-in-time to reduce the amount of data transfers. The number compression takes in account that often data is sorted, so it's delta-compressed, and then such delta is compressed only in a range of the Gaussian-like residual distribution (outliers are encoded directly). This compression also allows to keep large indexes in RAM, that speeds up things more. Such compression is I/O bound, the CPU performs it at max speed.
- They even shown vectorized hashing, but I have not understood how they work.
- Reading from the disks is done in a merged way, to avoid reading the same data many times for similar queries.

(The good thing is that it's not hard to understand most things shown in this video. But I'd like to see the C code they use as "reference", that's just 3 times faster than their DBMS. Such C code may be improved).

Inside, DBMS work as the lazy operations that are getting more common in D2 code, that are common in Python3, and even more in Haskell.

So in D2 to reduce the overhead of the lazy operations it may be useful to use a vectorized strategy (that's meant to be almost transparent for the D programmer), so items can be produced and moved in groups, inside arrays, like in that video. (The DBMS has an interpreter overhead, it's made of fast primitives used by lazy interpreted code). This idea can be applied in other languages too.

Lazy filtering is a bit less common in D2 code, so I don't know if the idea of not creating new blocks (and just marking items as absent, for example using a bit array attached to the items array) can be useful in D2 too, as in that X100 DBMS.

In this discussion you have to think about pieces of code like:
xfilter(xchain(xmap(_, xrange(_)), xfilter(xmap(_))))
or things even more complex, and not something simpler, that LDC may be able to just fully inline.

(Time ago I have posted related comments on the Python newsgroup, but it was ignored.)

(Partially unrelated note: years after starting to program in D, I still miss a lot strict/lazy array comprehensions in D. Maybe Walter will eventually see how handy they are.)

Bye,
bearophile