LLVM IR influence on compiler debugging

bearophile bearophileHUGS at lycos.com
Thu Jun 28 23:04:36 PDT 2012


This is a very easy-to-read article about the design of LLVM:
http://www.drdobbs.com/architecture-and-design/the-design-of-llvm/240001128

It explains what the IR is:

>The most important aspect of its design is the LLVM Intermediate 
>Representation (IR), which is the form it uses to represent code 
>in the compiler. LLVM IR [...] is itself defined as a first 
>class language with well-defined semantics.<

>In particular, LLVM IR is both well specified and the only 
>interface to the optimizer. This property means that all you 
>need to know to write a front end for LLVM is what LLVM IR is, 
>how it works, and the invariants it expects. Since LLVM IR has a 
>first-class textual form, it is both possible and reasonable to 
>build a front end that outputs LLVM IR as text, then uses UNIX 
>pipes to send it through the optimizer sequence and code 
>generator of your choice. It might be surprising, but this is 
>actually a pretty novel property to LLVM and one of the major 
>reasons for its success in a broad range of different 
>applications. Even the widely successful and relatively 
>well-architected GCC compiler does not have this property: its 
>GIMPLE mid-level representation is not a self-contained 
>representation.<
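To make that textual form concrete, here is a tiny hand-written function in LLVM IR (my own sketch, not taken from the article):

```llvm
; sum.ll - a minimal LLVM IR module defining one function
; that adds two 32-bit integers.
define i32 @sum(i32 %a, i32 %b) {
entry:
  %result = add i32 %a, %b
  ret i32 %result
}
```

Because this is plain text, it can be sent through the tools with ordinary UNIX pipes, roughly like `opt -S -O2 sum.ll | llc` (optimize the IR, then hand the result to the code generator), just as the article describes.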

That IR goes a long way toward making the compiler simpler to 
debug. I think this is important (and I think it partially 
explains why Clang was created so quickly):

>Compilers are very complicated, and quality is important, 
>therefore testing is critical. For example, after fixing a bug 
>that caused a crash in an optimizer, a regression test should be 
>added to make sure it doesn't happen again. The traditional 
>approach to testing this is to write a .c file (for example) 
>that is run through the compiler, and to have a test harness 
>that verifies that the compiler doesn't crash. This is the 
>approach used by the GCC test suite, for example. The problem 
>with this approach is that the compiler consists of many 
>different subsystems and even many different passes in the 
>optimizer, all of which have the opportunity to change what the 
>input code looks like by the time it gets to the previously 
>buggy code in question. If something changes in the front end or 
>an earlier optimizer, a test case can easily fail to test what 
>it is supposed to be testing. By using the textual form of LLVM 
>IR with the modular optimizer, the LLVM test suite has highly 
>focused regression tests that can load LLVM IR from disk, run it 
>through exactly one optimization pass, and verify the expected 
>behavior. Beyond crashing, a more complicated behavioral test 
>wants to verify that an optimization is actually performed. 
>[...] While this might seem like a really trivial example, this 
>is very difficult to test by writing .c files: front ends often 
>do constant folding as they parse, so it is very difficult and 
>fragile to write code that makes its way downstream to a 
>constant folding optimization pass. Because we can load LLVM IR 
>as text and send it through the specific optimization pass we're 
>interested in, then dump out the result as another text file, it 
>is really straightforward to test exactly what we want, both for 
>regression and feature tests.<
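A regression test of the kind described might look roughly like this. This is a sketch in the style of LLVM's lit/FileCheck test suite; the specific pass (`instcombine`, which does constant folding among other things) and the CHECK directives are my choice of illustration, not from the article:

```llvm
; RUN: opt -instcombine -S < %s | FileCheck %s

; The add of two constants should be folded away, leaving
; only a return of the folded constant.
define i32 @fold_me() {
  %r = add i32 2, 3
  ret i32 %r
}
; CHECK: ret i32 5
```

The test loads this one file, runs exactly one pass over it, and checks the textual output, which is precisely the "highly focused" property the article is praising: no front end, and no earlier passes, can disturb what reaches the buggy code.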

Bye,
bearophile
