DMD now incorporates a disassembler

max haughton maxhaton at gmail.com
Sun Jan 9 06:04:25 UTC 2022


On Sunday, 9 January 2022 at 02:58:43 UTC, Walter Bright wrote:
>
> I've never seen one. What's the switch for gcc to do the same 
> thing?
>

For GCC/Clang you'd want -S (and then -masm=intel to make the 
output ~~beautiful to nobody but the blind~~ readable). This 
dumps the output to a file, which isn't exactly the same as what 
-vasm does, but I have already begun piping the -vasm output to a 
file, since (say) hello world yields a thousand lines of output, 
which is much easier to consume in a text editor.

To do it with ldc the flag is `--output-s`. I have opened a PR to 
make the ldc dmd-compatibility wrapper (`ldmd2`) mimic -vasm.

Intel (and to a lesser extent Clang) actually annotate the 
generated assembly with comments intended to be read by humans.

e.g.

Intel C++ (which is in the process of being replaced with Clang 
relabeled as Intel C++) prints its (hopeless unless you are 
using PGO, but still) estimates of the branch probabilities.
```

test      al, al                                        #5.8
je        ..B1.4        # Prob 22%                      #5.8
                         # LOE rbx rbp r12 r13 r14 r15
                         # Execution count [7.80e-01]
```

You can also ask the compiler to generate an optimization report 
inline with the assembly code. This *is* useful when tuning since 
you can tell what the compiler is or isn't getting right (e.g. 
find which roads to force the loop unrolling down). The Intel 
Compiler also has a reputation for having an arsenal of dirty 
tricks to make your code "faster", which it will deploy in the 
hope that you (say) don't notice that your floating point numbers 
are now less precise.

`-qopt-report-phase=vec` yields:
```
         # optimization report
         # LOOP WITH UNSIGNED INDUCTION VARIABLE
         # LOOP WAS VECTORIZED
         # REMAINDER LOOP FOR VECTORIZATION
         # MASKED VECTORIZATION
         # VECTORIZATION HAS UNALIGNED MEMORY REFERENCES
         # VECTORIZATION SPEEDUP COEFFECIENT 3.554688
         # VECTOR TRIP COUNT IS ESTIMATED CONSTANT
         # VECTOR LENGTH 16
         # NORMALIZED VECTORIZATION OVERHEAD 0.687500
         # MAIN VECTOR TYPE: 32-bits integer
vpcmpuq   k1, zmm16, zmm18, 6                           #5.5
vpcmpuq   k0, zmm16, zmm17, 6                           #5.5
vpaddq    zmm18, zmm18, zmm19                           #5.5
vpaddq    zmm17, zmm17, zmm19                           #5.5
kunpckbw  k2, k0, k1                                    #5.5
vmovdqu32 zmm20{k2}{z}, ZMMWORD PTR [rcx+r8*4]          #7.9
vpxord    zmm21{k2}{z}, zmm20, ZMMWORD PTR [rax+r8*4]   #7.9
vmovdqu32 ZMMWORD PTR [rcx+r8*4]{k2}, zmm21             #7.9
add       r8, 16                                        #5.5
cmp       r8, rdx                                       #5.5
jb        ..B1.15       # Prob 82%                      #5.5
```

People don't seem to care about SPEC numbers too much anymore, 
but the Intel Compilers still have many features for gaming 
standard test scores.

http://www.spec.org/cpu2006/results/res2007q3/cpu2006-20070821-01880.html If you looked at this, you'd think Intel had just managed a huge increase on `libquantum` which we could all get on our own code, but it turns out they worked out that they could just tell the compiler to automagically parallelize the code while still nominally reporting only one process.

See https://stackoverflow.com/questions/61016358/why-can-gcc-only-do-loop-interchange-optimization-when-the-int-size-is-a-compile for more overfitting.

> Compilers that take a detour through an assembler to generate 
> code are inherently slower.

Certainly, although in my experience not by much. Time spent in 
the assembler is dominated by time spent in the linker, and just 
about everywhere else in the compiler (especially when you turn 
optimizations on). Hello World is about 4ms in the assembler on 
my machine.

GCC and Clang have very different architectures in this regard 
but end up being pretty similar in terms of compile times. The 
linker is an exception to that rule of thumb, however, in that the 
LLVM linker is much faster than any current GNU offering.

>> It doesn't have a distinct IR like LLVM does but the final 
>> stage of the RTL is basically a 1:1 representation of the 
>> instruction set:
>
> That looks like intermediate code, not assembler.

It is the (final) intermediate code, but it's barely intermediate 
at this stage, i.e. these are effectively just the target 
instructions printed with LISP syntax.

It is, unhelpfully, quite obfuscated: some of that is technical 
baggage, and some of it is due to the way that GCC was explicitly 
directed to be difficult to consume.

I'm __not__ suggesting any normal programmer should use it, just 
showing what GCC does since I mentioned LLVM.



Anyway, I've been playing with -vasm and I think it seems pretty 
good so far. There are some formatting issues which shouldn't be 
hard to fix at all (this is why we asked for some basic tests of 
the shape of the output), but I think I've only found one (touch 
wood) situation where it actually gets the instruction *wrong* so 
far.

Testing it has led to me finding some bugs in the dmd 
inline assembler, which I am in the process of filing.


More information about the Digitalmars-d-announce mailing list