Requesting Help with Optimizing Code
Max Haughton
maxhaton at gmail.com
Thu Apr 8 03:27:12 UTC 2021
On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:
> Hello all. I have been working on learning computational
> photography and have been using D to do that. I recently got
> some code running that performs [chromatic
> adaptation](https://en.wikipedia.org/wiki/Chromatic_adaptation)
> (white balancing). The output is still not ideal (image is
> overexposed) but it does correct color casts. The issue I have
> is with performance. With a few optimizations found through
> profiling I have been able to drop processing time from ~10.8
> to ~6.2 seconds for a 16-megapixel test image. That still feels
> too long, however; image editing programs are usually much
> faster.
>
> The optimizations that I've implemented:
> * Remove `immutable` from constants. The type mismatch between
> constants (`immutable(double)`) and pixel values (`double`)
> caused time-consuming checks for compatible types in mir
> operations and triggered run-time type conversions and memory
> allocations (sorry if I butchered this description).
> * Use `mir.math.common.pow` in place of `std.math.pow`.
> * Use `@optmath` for linearization functions
> (https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L192 and https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L318).
>
> Is there anything else I can do to improve performance?
>
> I tested the code under the following conditions:
> * Compiled with `dub build --build=release --compiler=ldmd2`
> * dub v1.23.0, ldc v1.24.0
> * Intel Xeon W-2170B 2.5GHz (4.3GHz turbo)
> * [Test
> image](https://user-images.githubusercontent.com/25495787/113943277-52054180-97d0-11eb-82be-934cf3d22112.jpg)
> * Test code:
> ```d
> #!/usr/bin/env dub
> /+ dub.sdl:
> name "photog-test"
> dependency "photog" version="~>0.1.1-alpha"
> dependency "jpeg-turbod" version="~>0.2.0"
> +/
>
> import std.datetime.stopwatch : AutoStart, StopWatch;
> import std.file : read, write;
> import std.stdio : writeln, writefln;
>
> import jpeg_turbod;
> import mir.ndslice : reshape, sliced;
>
> import photog.color : chromAdapt, Illuminant, rgb2Xyz;
> import photog.utils : imageMean, toFloating, toUnsigned;
>
> void main()
> {
>     const jpegFile = "image-in.jpg";
>     auto jpegInput = cast(ubyte[]) jpegFile.read;
>
>     auto dc = new Decompressor();
>     ubyte[] pixels;
>     int width, height;
>     bool decompressed = dc.decompress(jpegInput, pixels, width, height);
>
>     if (!decompressed)
>     {
>         dc.errorInfo.writeln;
>         return;
>     }
>
>     auto image = pixels.sliced(height, width, 3).toFloating;
>
>     int err;
>     double[] srcIlluminant = image
>         .imageMean
>         .reshape([1, 1, 3], err)
>         .rgb2Xyz
>         .field;
>     assert(err == 0);
>
>     auto sw = StopWatch(AutoStart.no);
>
>     sw.start;
>     auto ca = chromAdapt(image, srcIlluminant, Illuminant.d65).toUnsigned;
>     sw.stop;
>
>     auto timeTaken = sw.peek.split!("seconds", "msecs");
>     writefln("%d.%03d seconds", timeTaken.seconds, timeTaken.msecs);
>
>     auto c = new Compressor();
>     ubyte[] jpegOutput;
>     bool compressed = c.compress(ca.field, jpegOutput, width, height, 90);
>
>     if (!compressed)
>     {
>         c.errorInfo.writeln;
>         return;
>     }
>
>     "image-out.jpg".write(jpegOutput);
> }
> ```
>
> Functions found through profiling to be taking most time:
> * Chromatic adaptation:
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L354
> * RGB to XYZ:
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L142
> * XYZ to RGB:
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L268
>
> A profile for the test code is
> [here](https://github.com/kyleingraham/photog/files/6274974/trace.zip). The trace.log.dot file can be viewed with xdot. The PDF version is [here](https://github.com/kyleingraham/photog/files/6275358/trace.log.pdf). The profile was generated using:
>
> * Compiled with `dub build --build=profile --compiler=ldmd2`
> * Visualized with profdump: `dub run profdump -- -f -d -t 0.1 trace.log trace.log.dot`
>
> The branch containing the optimized code is here:
> https://github.com/kyleingraham/photog/tree/up-chromadapt-perf
> The corresponding release is here:
> https://github.com/kyleingraham/photog/releases/tag/v0.1.1-alpha
>
> If you've gotten this far thank you so much for reading. I hope
> there's enough information here to ease thinking about
> optimizations.
I am away from a proper dev machine at the moment so I can't
really delve into this in much detail, but some thoughts:

Are you making the compiler aware of your machine? The obvious
point here is vector width (you have AVX-512 from what I can see,
although I'm not sure whether that is actually a win on
Skylake-W), but without this information the compiler is also
forced to use a completely generic scheduling model, which may or
may not yield good code for your processor. For LDC, you'll want
`-mcpu=native`. Also enable cross-module inlining if you aren't
already; a sketch follows.
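For your single-file script that would just be a couple of extra
lines in the embedded dub recipe, something like the following (I
*believe* the cross-module inlining flag is spelled
`-enable-cross-module-inlining`, but double-check it against your
LDC version; the `platform="ldc"` guard keeps the flags away from
other compilers):

```d
/+ dub.sdl:
name "photog-test"
dependency "photog" version="~>0.1.1-alpha"
dependency "jpeg-turbod" version="~>0.2.0"
dflags "-mcpu=native" platform="ldc"
dflags "-enable-cross-module-inlining" platform="ldc"
+/
```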
I also notice that you are using function contracts in your hot
code. When contracts are being checked, LDC actually does use
them to optimize the code; in release builds, however, they are
turned off and that information is lost. If you are happy to
assume that they hold in release builds (I believe you should, at
least here), you can give the compiler this information back via
an intrinsic or similar (with LDC I just use inline IR; on GDC
you have `__builtin_unreachable()`). LDC *will* assume asserts in
release builds as far as I have tested, but each assert still
invokes a branch (just not one to a proper assert handler).
Via `pragma(inline, true)` you can wrap this idea into a little
"function" like this:
```D
pragma(LDC_inline_ir)
R inlineIR(string s, R, P...)(P);

pragma(inline, true)
void assume()(bool condition)
{
    // Flesh this out if you want it to actually do
    // something in debug builds.
    if (!condition)
    {
        inlineIR!("unreachable;", void)();
    }
}
```
You call this with facts you know about your code, and the
optimizer will assume that the condition is true. In your case it
looks like the arrays are supposed to be of length 3, so if you
let the optimizer assume this it can yield a much, much better
function. For an arbitrary length, compilers these days will
usually generate a big unrolled loop using the full vector width
of the machine, plus some little loops that (put your modular
arithmetic hat on) make up the difference for the whole array,
whereas in your case you may just want three movs. Using this
method can also unlock much more aggressive autovectorization
for some loops.
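As a rough sketch of how that looks in practice, using the
`assume` helper above (this loop and its names are made up for
illustration, not taken from photog):

```D
// Hypothetical per-pixel kernel over interleaved RGB channels.
void scaleChannels(double[] pixel, const double[] gains)
{
    // Hand the optimizer the fact the contract used to carry:
    // these arrays always hold exactly 3 channels.
    assume(pixel.length == 3);
    assume(gains.length == 3);

    foreach (i; 0 .. pixel.length)
        pixel[i] *= gains[i];
}
```

With the length pinned to 3, the generic vector loop and its
scalar tail can disappear entirely in favour of a few
straight-line instructions.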
Helping the optimizer in this manner will often *not* yield
dramatic performance increases, at least in throughput, but it
will yield code that is smaller and has a simpler control flow
graph, and those two facts are kind to latency and to your
instruction cache. (The I$ is quite big in 2021, but processors
now also utilise a so-called L0 cache for decoded micro-ops,
which is not very big and can yield counterintuitive results on
admittedly contrived loop examples, because the tricks the CPU
does to make typical code faster cease to apply.) I think AMD
Zen has some interesting loop alignment requirements too, but I
could be misremembering.
As for the usual performance traps, you'll need to look at the
program running in perf, or VTune if you can be bothered. VTune
is like a rifle to perf's air pistol, but it is enormous on your
disk; it does, however, understand D code pretty well, so I am
finding it quite interesting to play with.
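Something like the following is enough to get started with perf
(assuming your dub output binary is named `photog-test`; build
with debug info, e.g. dub's `release-debug` build type, so the
samples map back to your source):

```
perf record -g ./photog-test   # sample the run, recording call stacks
perf report                    # browse hotspots interactively
perf annotate                  # inspect the generated assembly per symbol
```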