Numerical code

Thu Apr 29 08:33:09 PDT 2010

I have recently written some "numerical code", some floating point scientific code I don't usually need. I have seen that this kind of code often contains kernels, that is small chunks of code inside the inner loops, that good compilers translate in even just 8-15 asm instructions.

There are situations where even just two or three asm instructions added or removed from such kernels can determine the difference from a slow code and an acceptably fast program. In other programs the kernel is a little longer but good register spills become essential (FFT). So the programmer, the language, and the optimization stages of the compiler must do all they can to pull away as much as possible from such inner loops. Even a moderate form of supercompilation can be appealing. Then, what's inside the other parts of the (smallish) program is often not important. Some times it can even be written in Python with only a moderate difference in the total running time. (This is why they have created a Python package named CorePy, that is a kind of dynamic assembler that gives a good help in the creation of computational kernels written in good modern asm).

[Such kernels are often inside other outer loops. Usually such code is at least O(n ln n) and often O(n^2). Today I see more and more algorithms almost O(n), used online and in real time on huge stream of real time data, used for example often by Google. Such algorithms can have a kernel, but on them plain I/O effects play an important role.]

Beside the contents of such inner loops, there are two more things the programmer & compiler have to take into account (and they will need to take them even more in account in the close future), that is the effects of the secondary CPU caches and the localization of data in the memory of the single CPUs (Both such problems are taken in account quite well by the Chapel language. Their designers have predicted well the future hardware).

1) The cache problems are not hard to explain: the kernels of the inner loops must use CPU registers very well (seeing them like yet another tiny level of cache) and the L1 CPU cache. (One way to do this is to use some form of memory tiling, that is the algorithm designer has to find a way to process a limited chunk of the input data, and then step to the next chunk. Often this designing needs to be done even when there are cache-oblivious algorithms). But around them there are one or two or more outer loops, so if the programmer wants an efficient program then such outer loops must take into account the L2 and L3 caches too. So the outer loops too need some form of "higher level" tiling. In practice often numerical programs need to be organized in a pyramidal way (recently invented image processing algorithms are often designed this way). Creating such code that has two or three levels of tiling is possible, but it requires time and it's error-prone (this is why cache-oblivious algorithms are more appreciated, but they often aren't enough or are not available). So here you want some help from the programming language, to help in the partially automatic definition of such hierarchically tiled code. Chapel language helps in this.

2) I have read that when CPUs have more than about 16 cores (and AMD has announced a 12 core CPU), the current structure of a single shared RAM doesn't seem to work well any more. What's needed seems to be Core-local memory, plus communications across cores (and no shared RAM). In this situation the code must be quite local. The thread-local default of D is a step in the right direction, but more abstractions can become necessary. Chapel already gives means to write this kind of code with its Distributions:
http://chapel.cray.com/tutorials/DC2010/DC08-DISTRIBUTIONS.pdf

I have found an update of the Chapel tutorial:
http://chapel.cray.com/tutorials/DC2010/DC-tutorial.tar.gz

(In a future post I'd like to describe better the Chapel Domains, because they seem an abstraction quite easy to understand and useful for more than just numerical code.)

Bye,
bearophile