dmd -run speed trends
Witold Baryluk
witold.baryluk at gmail.com
Sun Dec 17 06:40:33 UTC 2023
On Saturday, 16 December 2023 at 03:31:05 UTC, Walter Bright
wrote:
> On 12/7/2023 12:39 PM, Witold Baryluk wrote:
>> Inspecting output of `dmd -v`, shows that a lot of time is
>> spend on various helpers of `writefln`. Changing `writefln` to
>> `writeln` (and adjusting things so the output is still the
>> same), speeds things a lot:
>>
>> 729.5 ms -> 431.8 ms (dmd 2.106.0)
>> 896.6 ms -> 638.7 (dmd 2.098.1)
>>
>> Considering that most script like programs will need to do
>> some IO, `writefln` looks a little bloated (slow to compile).
>> Having string interpolation would probably help a little, but
>> even then 431 ms is not that great.
>
> It would be illuminating to compare using printf rather than
> writefln.
Hi Walter.
Ok, added `extern(C) int printf(const char *format, ...);`, and
measured various variants.
(no `version` was used, just direct editing of the code for
various variants).
Minimums of about 250 runs (in batches of 60, and various
orders). As before time is `compile time + run time`. Run is
essentially same for all variants, and about 2.5 ms minimum (with
6ms average).
* DMD 2.106.0
* gdc 13.2.0
* ldc2 1.35.0 (DMD v2.105.2, LLVM 16.0.6)
* ld 2.41.50.20231202
* mold 2.3.3 and mold 2.4.0, segfault, when using with dmd or
ldc2.
* gold 1.16 (binutils 2.41.50.20231202)
| variant | min time [ms] | Notes |
|-----------------------------------|--------------:|-------|
| `3×writefln+split` | 723 ms | |
| `3×writeln+split` | 432 ms | (from
previous tests) |
| `2×writefln+1×printf+split` | 722 ms | |
| `1×writefln+2×printf+split` | 715 ms | |
| `0×writefln+3×printf+split` | 396 ms | with unused
`import std.stdio : writefln` |
| `0×writefln+3×printf+split` | 389 ms | without
`import std.stdio` |
| `3×printf` | 278 ms | without
`import std.array` either |
| `3×printf` using gdc | 158 ms | ditto,
`gdmd -run` |
| `3×printf` using ldc2 | 129 ms | ditto,
`ldmd2 -run` |
| `3×printf` + gold | 153 ms | |
| `3×printf` using gdc + gold | 146 ms | |
| `3×printf` using ldc2 + gold | 132 ms | |
| `3×printf` using gdc + mold | 125 ms | |
(I also tried, `-frelease`, `-O0`, etc, no huge difference)
"with unused `import std.stdio : writefln`" - imports `std.stdio`
(and all unconditional transitive imports), but no function from
`std.stdio` is actually compiled, which is good.
"without `import std.stdio`, but still with `std.array`" - for
`split`, doesn't import `std.stdio` transitively, but still
imports a lot of crap, like `core.time`, `std.format.*`,
`core.stdc.config`, and dozen other unnecessary things. Most of
it is not compiled but still, parsing this is not free. As of the
weird stuff being compiled due to `split`, are things like
`core.checkedint.mulu`, `std.algorithm.comparison.max`,
`std.exception.enforce`, `std.utf._utfException!Flag.no`, and few
more.
"without `import std.array` either" - is clean, but compiler
still decides to do few things that are dubious a little, i.e.
`core.internal.array.equality.__equals` (which is the only things
that I cannot find to map to anything in the source code, but
could be something implicit about `[string]bool` associative
array).
Another observation, it looks quite a bit of overhead is due to
work spent in kernel. I see about 50% in user space, and 50% in
kernel space (i.e. reading directories, files, etc).
For fastest run: User: 70.5 ms, System: 46.0 ms
For slowest run: User: 392.5 ms, System: 363.2 ms
`strace`ing the dmd itself, it is not too crazy, but some
sequences can be optimized:
```
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.di", 0x7ffecfc2a0f0) = -1 ENOENT (No such file or directory)
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.d", {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
stat("/usr/include/dmd/druntime/import/core/internal/array/comparison.d", {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
openat(AT_FDCWD,
"/usr/include/dmd/druntime/import/core/internal/array/comparison.d", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7333, ...}) = 0
read(3, "/**\n * This module contains comp"..., 7333) = 7333
close(3)
```
Each of these syscalls is about 15μs on my system (when stracing,
probably little less in real run without `strace` overheads)
There should be a way to reduce this in half with smarter
sequencing (i.e. do `open` first instead of `stat + open +
fstat`).
But the `cc` (used by dmd to do final linking), does a lot more
stupid things in this respect, including repeating exactly same
syscall on a file again and again (i.e. like seek to the end). In
total `cc` and childs (but without final executable running),
does about 49729 syscalls (so easily 250-400ms), with command
line and object file produced by dmd. "only" 8400 syscalls, with
command line and object file produced by ldc2.
Using `gdmd -pipe -run`, is minimally faster than without (155.2
ms vs 157.8 ms), but it is close to noise in measurement.
For reference, the last variant:
```d
#!/usr/bin/env -S dmd -run
extern(C) int printf(const char *format, ...);
struct Person {
string name;
int age;
}
auto people = [
Person("John", 45),
Person("Kate", 30),
];
void main() {
// import std.stdio : writefln;
foreach (person; people) {
printf("%.*s is %d years old\n", person.name.length,
person.name.ptr, person.age);
// writefln("%s is %d years old", person.name, person.age);
}
static auto oddNumbers(T)(T[] a) {
return delegate int(int delegate(ref T) dg) {
foreach (x; a) {
if (x % 2 == 0) continue;
auto ret = dg(x);
if (ret != 0) return ret;
}
return 0;
};
}
foreach (odd; oddNumbers([3, 6, 9, 12, 15, 18])) {
printf("%d\n", odd);
// writefln("%d", odd);
}
static auto toLookupTable(string data) {
// import std.array : split;
bool[string] result;
// foreach (w; data.split(';')) {
// result[w] = true;
// }
return result;
}
enum data = "mov;btc;cli;xor;afoo";
enum opcodes = toLookupTable(data);
foreach (o, k; opcodes) {
printf("%.*s\n", o.length, o.ptr);
// writefln("%s", o);
}
}
```
More information about the Digitalmars-d
mailing list