D 50% slower than C++. What am I doing wrong?

Sat Apr 14 12:05:36 PDT 2012

I have this simple binary arithmetic coder in C++ by Mahoney, 
translated to D by Maffi. I added "nothrow", "final", "pure" 
and "GC.disable" where possible, but that didn't make much 
difference. Adding "const" to Predictor.p() (as in the C++ 
version) gave 3% higher performance. Here are the two versions:

http://mattmahoney.net/dc/  <-- original zip

http://pastebin.com/55x9dT9C  <-- Original C++ version.
http://pastebin.com/TYT7XdwX  <-- Modified D translation.

The problem is that the D version is at least 50% slower:

test.fpaq0 (16562521 bytes) -> test.bmp (33159254 bytes)

Lang (Comp) - Binary size -  Time (lower is better)  - Flags
C++  (g++)  -      13kb   -  2.42s  (100%)   -O3 -s
D    (DMD)  -     230kb   -  4.46s  (184%)   -O -release -inline
D    (GDC)  -    1322kb   -  3.69s  (152%)   -O3 -frelease -s

The only difference I could see between the C++ and D versions is 
that C++ has hints to the compiler about which functions to 
inline, and I couldn't find anything similar in D. So I manually 
inlined the encode and decode functions:

http://pastebin.com/N4nuyVMh  - Manual inline

D    (DMD)  -     228kb   -  3.70s  (153%)   -O -release -inline
D    (GDC)  -    1318kb   -  3.50s  (144%)   -O3 -frelease -s
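
To make clear what I mean by "inline hint" versus "manual inline", here is a minimal C++ sketch. It is not the code from the pastebin links: ToyEncoder, run_with_call and run_manual are hypothetical names, the probability is fixed at 2048/4096, and the range split is only loosely modeled on an arithmetic coder. Version A calls a member function marked "inline"; version B has the same body pasted into the loop by hand.

```cpp
#include <cstdint>

// Hypothetical toy coder state, for illustration only.
struct ToyEncoder {
    std::uint32_t x1 = 0;
    std::uint32_t x2 = 0xffffffffu;
    void encode(int bit, std::uint32_t p12);  // defined below
};

// `inline` is only a hint; the compiler is free to ignore it.
inline void ToyEncoder::encode(int bit, std::uint32_t p12) {
    // Split the range [x1, x2] in proportion to p12 / 4096.
    const std::uint32_t xmid = x1 + (x2 - x1) / 4096 * p12;
    if (bit) x2 = xmid; else x1 = xmid + 1;
}

// Version A: call the hinted member function inside the loop.
inline std::uint32_t run_with_call(const int* bits, int n) {
    ToyEncoder e;
    for (int i = 0; i < n; ++i) e.encode(bits[i], 2048);
    return e.x2 - e.x1;
}

// Version B: the same body written directly in the loop ("manual inline").
inline std::uint32_t run_manual(const int* bits, int n) {
    std::uint32_t x1 = 0, x2 = 0xffffffffu;
    for (int i = 0; i < n; ++i) {
        const std::uint32_t xmid = x1 + (x2 - x1) / 4096 * 2048;
        if (bits[i]) x2 = xmid; else x1 = xmid + 1;
    }
    return x2 - x1;
}
```

Both versions compute the same result; the only thing that can differ is whether the compiler removes the call overhead in version A on its own.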

Still, the D version is slower. What causes this speed difference? 
Is there any way to side-step it?

Note that this simple C++ version can be made more than 2 times 
faster with algorithmic and I/O optimizations, (ab)using 
templates, etc. So I'm not asking for generic speed 
optimizations, only for things that may make the D code "more 
equal" to the C++ code.