Requesting Help with Optimizing Code
tsbockman
thomas.bockman at gmail.com
Thu Apr 8 03:25:39 UTC 2021
On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:
> The issue I have is with performance.
This belongs in the "Learn" forum, I think.
> With a few optimizations found with profiling I have been able
> to drop processing time from ~10.8 to ~6.2 seconds for a 16
> megapixel test image. That still feels like too long however.
> Image editing programs are usually much faster.
I agree; intuitively, that sounds like there is a lot of room for
further optimization.
> Functions found through profiling to be taking most time:
> * Chromatic adaptation:
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L354
> * RGB to XYZ:
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L142
> * XYZ to RGB:
> https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L268
This is very high-level code, of a sort which depends heavily on
inlining, loop unrolling and/or auto-vectorization, and a few
other key optimizations to perform well. So, I really can't tell
which parts are likely to compile down to good ASM just from
skimming it.
Personally, I try to stick to `static foreach` and normal runtime
loops for the inner loops of really CPU-intensive stuff like
this, so that it's easier to see from the source code what the
ASM will look like. With inlining and value range propagation
it's definitely possible to get good performance out of
range-based or functional-style code, but it's harder because the
D source code is that much further away from what the CPU will
actually do.
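To illustrate the difference, here is a minimal sketch of the same three-element dot product written both ways. The function names `dotRanges` and `dotUnrolled` are made up for this example and are not taken from photog. Both return the same result; the point is only that the second version makes the expected ASM (three multiplies and two adds) obvious from the source, while the first relies on the optimizer to inline its way there.

```d
import std.algorithm : map, sum;
import std.range : zip;

// Functional style: correct, but the generated code depends on how well
// zip, map and sum get inlined and simplified by the optimizer.
float dotRanges(const float[3] a, const float[3] b)
{
    return zip(a[], b[]).map!(t => t[0] * t[1]).sum;
}

// Explicit style: static foreach is unrolled at compile time, so the body
// is three straight-line multiply-adds with no loop counter or branch.
float dotUnrolled(const float[3] a, const float[3] b) pure nothrow @nogc @safe
{
    float acc = 0;
    static foreach (i; 0 .. 3)
        acc += a[i] * b[i];
    return acc;
}
```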
All that said, I can offer a few tips which may or may not help
here:
The only concrete floating-point type I see mentioned in those
functions is `double`, which is almost always extreme overkill
for color calculations. You should probably be using `float` (32
bits per channel), instead. Even half-precision (16 bits per
channel) is more than the human eye can distinguish. The extra 13
bits of significand that `float` has over half-precision should be
more than sufficient to absorb any rounding errors.
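The bit counts can be checked directly with D's built-in type properties. The sketch below assumes IEEE 754 binary16 for half-precision (D has no built-in half type, so its width is hard-coded), and `toUnit` / `fromUnit` are hypothetical helpers, not part of photog. They show that a 16-bit integer channel should survive a round trip through `float` unchanged.

```d
// Significand widths, counting the implicit leading bit.
enum halfMantDig = 11; // IEEE 754 binary16: 10 stored bits + 1 implicit
static assert(float.mant_dig == 24);
static assert(double.mant_dig == 53);
static assert(float.mant_dig - halfMantDig == 13); // the "extra 13 bits"

// Map a 16-bit channel value to the range [0, 1] as a float.
float toUnit(ushort v) pure nothrow @nogc @safe
{
    return v / 65535.0f;
}

// Map a float in [0, 1] back to a 16-bit channel value, rounding to nearest.
ushort fromUnit(float f) pure nothrow @nogc @safe
{
    return cast(ushort)(f * 65535.0f + 0.5f);
}
```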
Structure your loops (or functional equivalents) such that the
"for each pixel" loop is the outermost, or as far out as you can
get it. Do as much as you can with each pixel while it is still
in registers or L1 cache. Otherwise, you may end up bottlenecked
by memory bandwidth.
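As a rough sketch of what I mean, assuming the processing can be expressed as independent per-pixel stages: `Pixel`, `stageA` and `stageB` below are placeholders invented for this example, standing in for whatever the real conversion steps are. The image is walked once, and each pixel passes through every stage before the loop moves on, instead of running one full pass over the image per stage.

```d
struct Pixel
{
    float r, g, b;
}

// Placeholder stages; the real ones would be the color conversions.
Pixel stageA(Pixel p) pure nothrow @nogc @safe
{
    return Pixel(p.r * 2, p.g * 2, p.b * 2);
}

Pixel stageB(Pixel p) pure nothrow @nogc @safe
{
    return Pixel(p.r + 1, p.g + 1, p.b + 1);
}

// One pass over memory: every stage runs while the pixel is still hot.
void processFused(Pixel[] image) pure nothrow @nogc @safe
{
    foreach (ref p; image)
        p = stageB(stageA(p));
}
```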
Make sure you have optimizations enabled, especially cross-module
inlining, `-O3`, the most recent SIMD instruction set you are
comfortable requiring (almost everyone in the developed world now
has Intel SSE4 or ARM Neon), and fused multiply add/subtract.
There will always be three primary colors throughout the entire
maintenance life of your code, so just go ahead and specialize
for that. You don't need a generic matrix multiplication
algorithm, for instance. A specialized 3x3 or 4x4 version could
be branchless and SIMD accelerated.
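For example, a 3x3 matrix times 3-vector multiply specialized for three channels might look something like the sketch below. `Vec3`, `Mat3` and `mulMatVec` are names invented for this example, not photog's API. The code is branchless and fully unrolled; whether the compiler actually emits SIMD instructions for it depends on the compiler and flags, so check the generated ASM rather than assuming.

```d
alias Vec3 = float[3];
alias Mat3 = float[3][3]; // indexed as m[row][column]

// Fixed-size multiply: no runtime dimensions, no loop counters, no branches.
Vec3 mulMatVec(ref const Mat3 m, const Vec3 v) pure nothrow @nogc @safe
{
    Vec3 result;
    static foreach (row; 0 .. 3)
        result[row] = m[row][0] * v[0] + m[row][1] * v[1] + m[row][2] * v[2];
    return result;
}
```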