dmd -run speed trends

Sun Dec 17 06:40:33 UTC 2023

On Saturday, 16 December 2023 at 03:31:05 UTC, Walter Bright 
wrote:
> On 12/7/2023 12:39 PM, Witold Baryluk wrote:
>> Inspecting output of `dmd -v`, shows that a lot of time is 
>> spend on various helpers of `writefln`. Changing `writefln` to 
>> `writeln` (and adjusting things so the output is still the 
>> same), speeds things a lot:
>> 
>> 729.5 ms -> 431.8 ms (dmd 2.106.0)
>> 896.6 ms -> 638.7 (dmd 2.098.1)
>> 
>> Considering that most script like programs will need to do 
>> some IO, `writefln` looks a little bloated (slow to compile). 
>> Having string interpolation would probably help a little, but 
>> even then 431 ms is not that great.
>
> It would be illuminating to compare using printf rather than 
> writefln.

Hi Walter.

Ok, added `extern(C) int printf(const char *format, ...);`, and 
measured various variants.

(no `version` was used, just direct editing of the code for 
various variants).

Minimums of about 250 runs (in batches of 60, and various 
orders). As before time is `compile time + run time`. Run is 
essentially same for all variants, and about 2.5 ms minimum (with 
6ms average).

* DMD 2.106.0
* gdc 13.2.0
* ldc2 1.35.0 (DMD v2.105.2, LLVM 16.0.6)
* ld 2.41.50.20231202
* mold 2.3.3 and mold 2.4.0, segfault, when using with dmd or 
ldc2.
* gold 1.16 (binutils 2.41.50.20231202)

| variant                           | min time [ms] | Notes |
|-----------------------------------|--------------:|-------|
| `3×writefln+split`                | 723 ms        | |
| `3×writeln+split`                 | 432 ms        | (from 
previous tests) |
| `2×writefln+1×printf+split`       | 722 ms        | |
| `1×writefln+2×printf+split`       | 715 ms        | |
| `0×writefln+3×printf+split`       | 396 ms        | with unused 
`import std.stdio : writefln` |
| `0×writefln+3×printf+split`       | 389 ms        | without 
`import std.stdio` |
| `3×printf`                        | 278 ms        | without 
`import std.array` either |
| `3×printf` using gdc              | 158 ms        | ditto, 
`gdmd -run` |
| `3×printf` using ldc2             | 129 ms        | ditto, 
`ldmd2 -run` |
| `3×printf` + gold                 | 153 ms        |  |
| `3×printf` using gdc + gold       | 146 ms        |  |
| `3×printf` using ldc2 + gold      | 132 ms        |  |
| `3×printf` using gdc + mold       | 125 ms        |  |

(I also tried, `-frelease`, `-O0`, etc, no huge difference)

"with unused `import std.stdio : writefln`" - imports `std.stdio` 
(and all unconditional transitive imports), but no function from 
`std.stdio` is actually compiled, which is good.

"without `import std.stdio`, but still with `std.array`" - for 
`split`, doesn't import `std.stdio` transitively, but still 
imports a lot of crap, like `core.time`, `std.format.*`, 
`core.stdc.config`, and dozen other unnecessary things. Most of 
it is not compiled but still, parsing this is not free. As of the 
weird stuff being compiled due to `split`, are things like 
`core.checkedint.mulu`, `std.algorithm.comparison.max`, 
`std.exception.enforce`, `std.utf._utfException!Flag.no`, and few 
more.

"without `import std.array` either" - is clean, but compiler 
still decides to do few things that are dubious a little, i.e. 
`core.internal.array.equality.__equals` (which is the only things 
that I cannot find to map to anything in the source code, but 
could be something implicit about `[string]bool` associative 
array).

Another observation, it looks quite a bit of overhead is due to 
work spent in kernel. I see about 50% in user space, and 50% in 
kernel space (i.e. reading directories, files, etc).

For fastest run: User: 70.5 ms, System: 46.0 ms

For slowest run: User: 392.5 ms, System: 363.2 ms

`strace`ing the dmd itself, it is not too crazy, but some 
sequences can be optimized:

```
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.di", 0x7ffecfc2a0f0) = -1 ENOENT (No such file or directory)
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.d", {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.d", {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
openat(AT_FDCWD, 
"/usr/include/dmd/druntime/import/core/internal/array/comparison.d", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
read(3, "/**\n * This module contains comp"..., 7333) = 7333
close(3)
```

Each of these syscalls is about 15μs on my system (when stracing, 
probably little less in real run without `strace` overheads)

There should be a way to reduce this in half with smarter 
sequencing (i.e. do `open` first instead of `stat + open + 
fstat`).

But the `cc` (used by dmd to do final linking), does a lot more 
stupid things in this respect, including repeating exactly same 
syscall on a file again and again (i.e. like seek to the end). In 
total `cc` and childs (but without final executable running), 
does about 49729 syscalls (so easily 250-400ms), with command 
line and object file produced by dmd. "only" 8400 syscalls, with 
command line and object file produced by ldc2.

Using `gdmd -pipe -run`, is minimally faster than without (155.2 
ms vs 157.8 ms), but it is close to noise in measurement.

For reference, the last variant:

```d
#!/usr/bin/env -S dmd -run

extern(C) int printf(const char *format, ...);

struct Person {
   string name;
   int age;
}

auto people = [
   Person("John", 45),
   Person("Kate", 30),
];

void main() {
   // import std.stdio : writefln;
   foreach (person; people) {
     printf("%.*s is %d years old\n", person.name.length, 
person.name.ptr, person.age);
     // writefln("%s is %d years old", person.name, person.age);
   }

   static auto oddNumbers(T)(T[] a) {
     return delegate int(int delegate(ref T) dg) {
       foreach (x; a) {
         if (x % 2 == 0) continue;
         auto ret = dg(x);
         if (ret != 0) return ret;
       }
       return 0;
     };
   }

   foreach (odd; oddNumbers([3, 6, 9, 12, 15, 18])) {
     printf("%d\n", odd);
     // writefln("%d", odd);
   }

   static auto toLookupTable(string data) {
     // import std.array : split;
     bool[string] result;
     // foreach (w; data.split(';')) {
     //   result[w] = true;
     // }
     return result;
   }

   enum data = "mov;btc;cli;xor;afoo";
   enum opcodes = toLookupTable(data);

   foreach (o, k; opcodes) {
     printf("%.*s\n", o.length, o.ptr);
     // writefln("%s", o);
   }
}
```