D 50% slower than C++. What I'm doing wrong?

Marco Leise Marco.Leise at gmx.de
Tue Apr 24 07:02:21 PDT 2012


Am Sat, 14 Apr 2012 21:05:36 +0200
schrieb "ReneSac" <reneduani at yahoo.com.br>:

> I have this simple binary arithmetic coder in C++ by Mahoney, 
> translated to D by Maffi. I added "nothrow", "final" and "pure" 
> and "GC.disable" where possible, but that didn't make much 
> difference. Adding "const" to Predictor.p() (as in the C++ 
> version) gave 3% higher performance. Here are the two versions:
> 
> http://mattmahoney.net/dc/  <-- original zip
> 
> http://pastebin.com/55x9dT9C  <-- Original C++ version.
> http://pastebin.com/TYT7XdwX  <-- Modified D translation.
> 
> The problem is that the D version is 50% slower:
> 
> test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)
> 
> Lang| Comp  | Binary size | Time (lower is better)
> C++  (g++)  -      13kb   -  2.42s  (100%)   -O3 -s
> D    (DMD)  -     230kb   -  4.46s  (184%)   -O -release -inline
> D    (GDC)  -    1322kb   -  3.69s  (152%)   -O3 -frelease -s
> 
> 
> The only difference I could see between the C++ and D versions is 
> that C++ has hints to the compiler about which functions to 
> inline, and I couldn't find anything similar in D. So I manually 
> inlined the encode and decode functions:
> 
> http://pastebin.com/N4nuyVMh  - Manual inline
> 
> D    (DMD)  -     228kb   -  3.70s  (153%)   -O -release -inline
> D    (GDC)  -    1318kb   -  3.50s  (144%)   -O3 -frelease -s
> 
> Still, the D version is slower. What makes this speed difference? 
> Is there any way to side-step it?
> 
> Note that this simple C++ version can be made more than 2 times 
> faster with algorithmic and I/O optimizations, (ab)using 
> templates, etc. So I'm not asking for generic speed 
> optimizations, but only for things that may make the D code "more 
> equal" to the C++ code.

I noticed the thread just now. I ported fast paq8 (fp8) to D, and with some careful D-ification and optimization it runs a bit faster than the original C program when compiled with GCC on Linux x86_64, Core 2 Duo. As others said, the files are cached in RAM anyway if there is enough available, so you should not be bound by hard drive speed.
I don't know about the version of paq you ported the coder from, but I'll try to give you some hints based on what I did to optimize my code.

- time portions of your main():
  is the time actually spent at start-up or in the compression?
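  As a sketch of that kind of timing (the StopWatch API lives in
  std.datetime.stopwatch in current compilers; older releases expose a
  similar type in std.datetime, so adjust the import to your version):

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    // ... set up models, allocate tables ...
    writefln("setup:    %s ms", sw.peek.total!"msecs");

    sw.reset();
    // ... run the actual compression loop ...
    writefln("compress: %s ms", sw.peek.total!"msecs");
}
```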

- use structs where classes don't make your code cleaner
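  For illustration (the Counter type and its fields are made up): a struct
  gives you plain value semantics and inline storage, whereas the same
  thing as a class would put every element behind a GC-allocated reference:

```d
// Plain data as a struct: no per-object heap allocation, no GC
// involvement, fields laid out inline.
struct Counter { ushort n0, n1; }

// One contiguous 16 KiB block, zero allocations. As a class, each
// Counter would be a separate heap object behind a pointer.
Counter[4096] table;
```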

- wherever you have large arrays that you don't need initialized to .init, write:
  int[<large number>] arr = void;
  double[<large number>] arr = void;
  This disables default initialization, which may help you in inner loops.
  Remember that C++ doesn't default-initialize at all, so this is an obvious
  way to lose performance against that language.
  Also keep in mind that the .init for floating-point types is NaN:
  struct Foo { double[999999] bar; }
  is not a block of binary zeroes and hence cannot be stored in a .bss
  section of the executable, where it would not take any space at all.
  struct Foo { double[999999] bar = void; }
  on the other hand will not bloat your executable by 7.6 MB!
  Be cautious with:
  class Foo { double[999999] bar = void; }
  A class's .init doesn't go into .bss either way; another reason to use a
  struct where appropriate. (WARNING: use of .bss on Linux/Mac OS is currently
  broken in the compiler front end. You'll only see the effect on Windows.)
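  A small self-contained sketch of the = void idiom (the buffer size and
  the checksum computation are arbitrary, just there to show that every
  element is written before it is read):

```d
double checksum()
{
    // Large scratch buffer: skip the default NaN initialization,
    // since every element is written before it is read.
    double[4096] scratch = void;
    foreach (i; 0 .. scratch.length)
        scratch[i] = i * 0.5;

    double sum = 0;
    foreach (x; scratch)
        sum += x;
    return sum;
}
```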

- Mahoney used an Array class in my version of paq, which allocates via calloc.
  Do this as well; you can't win otherwise. Read up a bit on calloc if you want:
  it generally 'allocates' a special zeroed-out memory page multiple times.
  No matter how much memory you ask for, it won't really allocate anything until
  you *write* to it, at which point new memory is allocated for you and the
  zero page is copied into it.
  The D GC, on the other hand, allocates that memory and writes zeroes to it
  immediately.
  The effect is twofold: first, the calloc version will use much less RAM if
  the 'allocated' buffers aren't fully used (e.g. you compressed a small file);
  second, the D GC version is slowed down by writing zeroes to all that memory.
  At high compression levels, paq8 uses ~2 GB of memory that is calloc'ed. You
  should _not_ try to use GC memory for that.
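  A minimal sketch of such a calloc-backed array in D; the CArray name and
  interface are my own for illustration, not the Array class from paq:

```d
import core.stdc.stdlib : calloc, free;

// calloc-backed array: the OS hands out zeroed pages lazily, so memory
// is only committed when it is written to -- the GC never touches it.
struct CArray(T)
{
    private T* ptr;
    size_t length;

    this(size_t n)
    {
        ptr = cast(T*) calloc(n, T.sizeof);
        assert(ptr !is null, "calloc failed");
        length = n;
    }

    ~this() { free(ptr); }          // free(null) is a no-op

    @disable this(this);            // forbid copies: avoid double-free

    ref T opIndex(size_t i)
    {
        assert(i < length);
        return ptr[i];
    }
}
```

  Used as, say, `auto counts = CArray!uint(1 << 22);` — only the pages you
  actually write to end up occupying physical memory.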

- If there are data structures that are designed to fit into a CPU cache-line
  (I had one of those in paq8), make sure it still has the correct size in your
  D version. "static assert(Foo.sizeof == 64);" helped me find a bug there that
  resulted from switching from C bitfields to the D version (which is a library
  solution in Phobos).
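  A sketch of that kind of check using the Phobos bitfields mixin; the
  field names and widths here are invented, only the technique is real:

```d
import std.bitmanip : bitfields;

// A record meant to fill one 64-byte cache line. The bitfields mixin
// packs the three fields into a single backing uint (10+6+16 = 32 bits).
struct CacheRecord
{
    mixin(bitfields!(
        uint, "count", 10,
        uint, "state",  6,
        uint, "hash",  16));
    uint[15] history;
}

// Catches layout bugs (padding, wrong bitfield widths) at compile time.
static assert(CacheRecord.sizeof == 64);
```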

I hope that gives you some ideas about what to look for. Good luck!

-- 
Marco



More information about the Digitalmars-d-learn mailing list