Of possible interest: fast UTF8 validation
Ethan
gooberman at gmail.com
Wed May 16 17:28:41 UTC 2018
On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
> I used to do things like that a simpler way. 3 functions would
> be created:
>
> void FeatureInHardware();
> void EmulateFeature();
> void Select();
> void function() doIt = &Select;
>
> I.e. the first time doIt is called, it calls the Select
> function which then resets doIt to either FeatureInHardware()
> or EmulateFeature().
>
> It costs an indirect call, but if you move it up the call
> hierarchy a bit so it isn't in the hot loops, the indirect
> function call cost is negligible.
>
> The advantage is there was only one binary.
It certainly sounds reasonable enough for 99% of use cases. But
I'm definitely the 1% here ;-)
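For anyone skimming, the quoted pattern comes out roughly like this
in D. This is only a sketch: the popcount "feature", the function
signatures, and the use of core.cpuid.hasPopcnt as the selection
test are stand-ins I've assumed for illustration, not anything from
Walter's actual code.

```d
import core.bitop : popcnt;
import core.cpuid : hasPopcnt;

// Assumed "hardware" path: core.bitop.popcnt, which a compiler may
// lower to the POPCNT instruction when the target allows it.
uint featureInHardware(uint x)
{
    return cast(uint) popcnt(x);
}

// Assumed fallback: clear the lowest set bit until none remain.
uint emulateFeature(uint x)
{
    uint n = 0;
    for (; x != 0; x &= x - 1)
        ++n;
    return n;
}

// First call lands here: pick an implementation, repoint doIt at
// it, then forward the call so the caller still gets a result.
uint select(uint x)
{
    doIt = hasPopcnt ? &featureInHardware : &emulateFeature;
    return doIt(x);
}

// Starts out pointing at select. Module-level variables in D are
// thread-local by default, so each thread selects once for itself.
uint function(uint) doIt = &select;
```

Callers only ever write doIt(value); after the first call the
pointer goes straight to the chosen implementation, at the cost of
one indirect call each time.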
Indirect calls invoke the wrath of the branch predictor on
XB1/PS4 (i.e. an AMD Jaguar processor). But there's certainly some
more interesting non-processor behaviour, at least on MSVC
compilers. The provided auto-DLL loading in that environment
performs a call to your DLL-boundary-crossing function, which
actually winds up in a jump table that performs a jump
instruction to actually get to your DLL code. I suspect this is
more costly than the indirect jump at a "write a basic test"
level. Doing an indirect call as the only action in a for-loop is
guaranteed to bring out the costly branch predictor on the
Jaguar. Without getting in and profiling a bunch of stuff, I'm
not entirely sure which I'd prefer as a general approach.
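A "write a basic test" level comparison might look something like
the sketch below. The names addOne, viaPointer and timeCalls are
made up for illustration, and it only covers the direct versus
indirect call, not the DLL jump table. An optimising compiler may
well inline or fold the direct loop entirely, so treat any numbers
from it as a starting point for real profiling rather than a result.

```d
import core.time : Duration;
import std.datetime.stopwatch : AutoStart, StopWatch;

// Trivial callee so the call overhead dominates the loop body.
uint addOne(uint x) { return x + 1; }

// Module-level function pointer used as the indirect call target.
uint function(uint) viaPointer = &addOne;

// Returns [time for the direct loop, time for the indirect loop].
Duration[2] timeCalls(uint iterations)
{
    uint a = 0, b = 0;

    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. iterations)
        a = addOne(a);          // direct call
    Duration direct = sw.peek();

    sw.reset();                 // keeps running, restarts from zero
    foreach (i; 0 .. iterations)
        b = viaPointer(b);      // indirect call through the pointer
    Duration indirect = sw.peek();

    // Use both results so the loops are not trivially dead code.
    assert(a == b);
    return [direct, indirect];
}
```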
Certainly, as far as this particular thread goes, every general
purpose function of a few lines that I write that uses intrinsics
is forced inline. No function calls, indirect or otherwise. And
on top of that, the inlined code usually pushes the branches out
across the byte boundary lines just far enough that only the
simple branch predictor is ever invoked.
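In D terms, "forced inline" for a small intrinsic wrapper can be
sketched with pragma(inline, true). The helper below is a
hypothetical example, not one of my actual functions; it wraps
core.bitop.bsr, which compilers typically treat as an intrinsic.

```d
import core.bitop : bsr;

// Hypothetical helper: index of the highest set bit, or -1 if the
// input is zero (bsr's result is undefined for zero, hence the guard).
pragma(inline, true)
int highestSetBit(uint x)
{
    return x == 0 ? -1 : bsr(x);
}
```

With the pragma in place the compiler is asked to always expand the
body at the call site, so a loop calling highestSetBit contains no
call instruction for it at all.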
(Related: one feature I'd really really really love for linkers
to implement is the ability to mark up certain functions to only
ever be linked at a certain byte boundary. And that's purely
because Jaguar branch prediction often made my profiling tests
non-deterministic between compiles. A NOP is a legit optimisation
on those processors.)