Misc questions: licensing, VC++ IDE compatibility, GPGPU, LTCG

bearophile bearophileHUGS at lycos.com
Sun May 16 13:12:50 PDT 2010


Walter Bright:

Thank you for your answers and explanations.

> The back end can be improved for floating point, but for integer/pointer work it 
> is excellent.

Only practical experiments can tell whether you are right.
(A ray tracer uses a lot of floating-point ops; my small ray tracers compiled with dmd are a few times slower than the same code compiled with LDC.)


> It does not do link time code generation nor profile guided optimization, 
> although in my experiments such features pay off only in a small minority of 
> cases.

I agree that profile-guided optimization with GCC usually pays off little, so I usually don't use it. My theory is that GCC is not yet using the profiling information well enough. Reading the asm output of Java HotSpot (which is not easy to do) has shown me that HotSpot performs some optimizations that GCC doesn't do yet, and in numerical programs they give a good performance increase. Here I have shown one of the simpler and most effective optimizations HotSpot performs thanks to the profile information it continuously collects:
http://llvm.org/bugs/show_bug.cgi?id=5540
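
To give the idea in code (this is a generic sketch of that flavor of optimization, with invented names, not the specific case in that report): when run-time profiling shows that a virtual call nearly always lands on one concrete type, HotSpot can inline that target behind a cheap type check, something a static compiler without profile data rarely does.

// Hypothetical D sketch of profile-guided devirtualization.
abstract class Shape {
    abstract double area();
}

final class Sphere : Shape {
    double r = 1.0;
    override double area() { return 4.0 * 3.14159265 * r * r; }
}

double totalArea(Shape[] shapes) {
    double sum = 0.0;
    foreach (s; shapes)
        sum += s.area();  // virtual call; if profiling shows nearly every
                          // `s` is a Sphere, a JIT can inline Sphere.area
                          // behind a single guard on the type.
    return sum;
}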

Link-time optimization, as done by LDC, has given a good speedup in several of my programs; I like it quite a bit. It allows the compiler to apply all its other optimizations more effectively, and it's able to decrease program size too.
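
A minimal sketch of why it helps (file names invented): when modules are compiled separately, a call into another module can't be inlined because only the declaration is visible at that point; link-time code generation sees both bodies and can inline and simplify across the module boundary.

// util.d -- compiled separately; without LTO, app.d sees only the
// declaration of clamp01, so the call below stays a real call.
double clamp01(double x) {
    return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x);
}

// app.d -- link-time code generation sees both bodies, so it can
// inline clamp01 here and fold the branches away entirely.
import util;
void main() {
    auto y = clamp01(1.5);  // with LTO this can reduce to y = 1.0
}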


> In my experiments on vector array operations, the improvement from the 
> CPU's vector instructions is disappointing. It always seems to get hopelessly 
> bottlenecked by memory bandwidth.

dmd array operations are currently not so useful.
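
For reference, the built-in array operations look like this; a minimal sketch, assuming the usual elementwise semantics:

// dmd's built-in array operations: whole-slice elementwise expressions
// that the compiler may or may not map onto SSE instructions.
void axpy(double[] y, const(double)[] x, double a) {
    assert(x.length == y.length);
    y[] += x[] * a;  // y[i] += x[i] * a for every element, one statement
}
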
But there are several ways to vectorize code that can give very large speedups (up to 10-16 times) on numerical code.

This is one kind of vectorization, the polyhedral loop optimization frameworks:
http://gcc.gnu.org/wiki/Graphite
http://wiki.llvm.org/Polyhedral_optimization_framework

Another kind of optimization is performing up to three levels of loop tiling (when the implemented algorithm is not cache-oblivious).
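
A minimal sketch of one level of tiling on a matrix multiply (the block size is a made-up value that you would tune to the cache):

import std.algorithm : min;

enum B = 64;  // hypothetical block size, tuned so a block fits in L1

// One level of loop tiling on an n*n matrix multiply (row-major,
// flattened arrays): blocks of b and c stay in cache while reused.
void matmulTiled(size_t n, double[] a, const(double)[] b, const(double)[] c) {
    a[] = 0.0;
    for (size_t ii = 0; ii < n; ii += B)
    for (size_t kk = 0; kk < n; kk += B)
    for (size_t jj = 0; jj < n; jj += B)
        foreach (i; ii .. min(ii + B, n))
        foreach (k; kk .. min(kk + B, n)) {
            const bik = b[i * n + k];
            foreach (j; jj .. min(jj + B, n))
                a[i * n + j] += bik * c[k * n + j];
        }
}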

Another kind of vectorization is the usage of all the lanes of an SSE (and future AVX) register in parallel. Doing this well seems very hard for compilers, I don't know why: LLVM is not able to do it, GCC does it a bit in some situations, and I don't know what the Intel compiler does here (I think it performs this only if the C code is written in a specific way that you often have to find by time-consuming trial and error). So this optimization is often done manually, writing asm by hand; if you look at the hand-written asm in video decoders you can see it's many times faster than the asm produced from C by the best compilers.
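
When the compiler can't do it, the lanes can also be used explicitly from source. A hedged sketch, assuming core.simd-style vector types with lane-parallel operators are available on the target:

import core.simd;

// Explicit use of all four float lanes of an SSE register: each
// statement below processes four floats at once (one mulps + addps
// pair per four elements instead of four scalar pairs).
void saxpy4(float4[] y, const(float4)[] x, float a) {
    float4 va = a;  // broadcast the scalar into all four lanes
    foreach (i, ref yi; y)
        yi += x[i] * va;
}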

Then there are true parallel optimizations, meaning the use of more than one CPU core to perform the operations. Examples of this are Cilk, parallel fors (done in a simple way, or in a more refined way as in the Chapel language), the various message-passing implementations, etc.
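
A small sketch of the simple kind of parallel for, assuming a std.parallelism-style task pool (Cilk and Chapel express the same idea at the language level):

import std.parallelism;

// The simple kind of parallel for: chunks of the iteration space are
// distributed over a pool of worker threads.
void scaleAll(double[] data) {
    foreach (ref x; parallel(data))
        x *= 2.0;  // each worker scales its own chunk of the array
}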

If you have heavy numerical code and you combine all those things, you can often get code 40 times faster or more. To perform all such optimizations you need smart compilers and/or a language that gives a lot of semantic information to the back-end (as Cilk, Chapel and Fortress do).

Bye,
bearophile

