Requesting Help with Optimizing Code

Thu Apr 8 03:25:39 UTC 2021

On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:
> The issue I have is with performance.

This belongs in the "Learn" forum, I think.

> With a few optimizations found with profiling I have been able
> to drop processing time from ~10.8 to ~6.2 seconds for a 16
> megapixel test image. That still feels like too long however.
> Image editing programs are usually much faster.

I agree, intuitively that sounds like there is a lot of room for 
further optimization.

> Functions found through profiling to be taking most time:
> * Chromatic adaptation: 
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L354
> * RGB to XYZ: 
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L142
> * XYZ to RGB: 
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L268

This is very high-level code, of a sort that depends heavily on 
inlining, loop unrolling and/or auto-vectorization, and a few 
other key optimizations to perform well. So I really can't tell 
which parts are likely to compile down to good ASM just from 
skimming it.

Personally, I try to stick to `static foreach` and normal runtime 
loops for the inner loops of really CPU-intensive stuff like 
this, so that it's easier to see from the source code what the 
ASM will look like. With inlining and value range propagation 
it's definitely possible to get good performance out of 
range-based or functional-style code, but it's harder because the 
D source code is that much further away from what the CPU will 
actually do.
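
To illustrate the style I mean, here is a minimal sketch. The `Pixel` type and `applyGain` function are hypothetical names made up for this example, not anything from photog. The outer loop is a plain runtime loop over pixels, and the inner per-channel loop is a `static foreach`, so it is unrolled at compile time into three straight-line multiplies:

```d
// Hypothetical pixel type, for illustration only.
struct Pixel
{
    float[3] c; // three channels, e.g. R, G, B
}

/// Multiplies each channel of every pixel by a per-channel gain.
void applyGain(Pixel[] pixels, const float[3] gain) pure nothrow @nogc @safe
{
    foreach (ref p; pixels) // ordinary runtime loop over the pixels
    {
        // Expanded at compile time: no loop counter or branch remains
        // in the inner body, just three multiplies.
        static foreach (i; 0 .. 3)
            p.c[i] *= gain[i];
    }
}
```

With code shaped like this, what you read in the source is roughly what you should expect to find in the disassembly, which makes it easier to spot when the optimizer did not do what you hoped.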

All that said, I can offer a few tips which may or may not help 
here:

The only concrete floating-point type I see mentioned in those 
functions is `double`, which is almost always extreme overkill 
for color calculations. You should probably be using `float` (32 
bits per channel), instead. Even half-precision (16 bits per 
channel) is more than the human eye can distinguish. The extra 13 
bits of precision offered by `float` should be more than 
sufficient to absorb any rounding errors.
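
As a rough sanity check on those numbers, the sketch below uses D's built-in `.mant_dig` property and a hypothetical `bufferBytes` helper. It assumes three channels per pixel, no padding, and takes "16 megapixel" to mean 16,000,000 pixels:

```d
// Significand bits, including the implicit leading bit.
static assert(float.mant_dig == 24);
static assert(double.mant_dig == 53);

/// Hypothetical helper: bytes needed for a 3-channel buffer of element type T.
size_t bufferBytes(T)(size_t pixels) pure nothrow @nogc @safe
{
    return pixels * 3 * T.sizeof;
}

enum pixels = 16_000_000; // assumed size of the 16 megapixel test image

// Switching from double to float halves the memory the loops have to stream.
static assert(bufferBytes!float(pixels) == 192_000_000);
static assert(bufferBytes!double(pixels) == 384_000_000);
```

Besides halving memory traffic, `float` also lets each SIMD register hold twice as many values as `double`.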

Structure your loops (or functional equivalents) such that the 
"for each pixel" loop is the outermost, or as far out as you can 
get it. Do as much as you can with each pixel while it is still 
in registers or L1 cache. Otherwise, you may end up bottlenecked 
by memory bandwidth.
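
A minimal sketch of that loop structure follows. `stageA` and `stageB` are hypothetical stand-ins for the real per-pixel steps (RGB to XYZ, chromatic adaptation, XYZ to RGB); the point is only the shape of the loop, not the arithmetic:

```d
alias Vec3 = float[3];

// Hypothetical per-pixel stages, standing in for the real conversions.
Vec3 stageA(Vec3 v) pure nothrow @nogc @safe
{
    foreach (ref x; v)
        x *= 2.0f;
    return v;
}

Vec3 stageB(Vec3 v) pure nothrow @nogc @safe
{
    foreach (ref x; v)
        x += 1.0f;
    return v;
}

/// One pass over memory: each pixel goes through every stage
/// before the next pixel is loaded.
void processFused(Vec3[] image) pure nothrow @nogc @safe
{
    foreach (ref px; image)
        px = stageB(stageA(px));
}
```

The alternative, running `stageA` over the whole image and then `stageB` over the whole image, reads and writes the full buffer once per stage, and a 16 megapixel buffer is far too large to stay in cache between passes.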

Make sure you have optimizations enabled, especially cross-module 
inlining, `-O3`, the most recent SIMD instruction set you are 
comfortable requiring (almost everyone in the developed world now 
has Intel SSE4 or ARM Neon), and fused multiply-add/subtract.

There will always be three primary colors throughout the entire 
maintenance life of your code, so just go ahead and specialize 
for that. You don't need a generic matrix multiplication 
algorithm, for instance. A specialized 3x3 or 4x4 version could 
be branchless and SIMD accelerated.
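
For example, a specialized 3x3 matrix times 3-vector product could look something like the sketch below. The `Vec3` and `Mat3` aliases and the `mul` function are hypothetical names for illustration, and the row-major layout is an assumption:

```d
alias Vec3 = float[3];
alias Mat3 = float[3][3]; // assumed row-major: m[row][col]

/// Multiplies a 3x3 matrix by a 3-component column vector.
Vec3 mul(const ref Mat3 m, const Vec3 v) pure nothrow @nogc @safe
{
    Vec3 r;
    // Unrolled at compile time: nine multiplies and six adds,
    // with no loop counters or branches.
    static foreach (row; 0 .. 3)
        r[row] = m[row][0] * v[0] + m[row][1] * v[1] + m[row][2] * v[2];
    return r;
}
```

This version does not use SIMD explicitly; it just gives the optimizer fixed-size, straight-line code that it has a fair chance of vectorizing and turning into fused multiply-adds. Whether it actually does is something to check in the generated ASM.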