Of possible interest: fast UTF8 validation

Ethan gooberman at gmail.com
Wed May 16 17:28:41 UTC 2018


On Wednesday, 16 May 2018 at 16:54:19 UTC, Walter Bright wrote:
> I used to do things like that a simpler way. 3 functions would 
> be created:
>
>   void FeatureInHardware();
>   void EmulateFeature();
>   void Select();
>   void function() doIt = &Select;
>
> I.e. the first time doIt is called, it calls the Select 
> function which then resets doIt to either FeatureInHardware() 
> or EmulateFeature().
>
> It costs an indirect call, but if you move it up the call 
> hierarchy a bit so it isn't in the hot loops, the indirect 
> function call cost is negligible.
>
> The advantage is there was only one binary.

It certainly sounds reasonable enough for 99% of use cases. But 
I'm definitely the 1% here ;-)

Indirect calls invoke the wrath of the branch predictor on 
XB1/PS4 (i.e. an AMD Jaguar processor). But there's certainly some 
more interesting non-processor behaviour, at least on MSVC 
compilers. The provided auto-DLL loading in that environment 
performs a call to your DLL-boundary-crossing function, which 
actually winds up in a jump table that performs a jump 
instruction to actually get to your DLL code. I suspect this is 
more costly than the indirect jump at a "write a basic test" 
level. Doing an indirect call as the only action in a for-loop is 
guaranteed to bring out the costly branch predictor on the 
Jaguar. Without getting in and profiling a bunch of stuff, I'm 
not entirely sure which one I'd prefer as a general 
approach.
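
For reference, a minimal sketch of the quoted pattern together with 
the "indirect call as the only action in a for-loop" test. The 
names featureInHardware, emulateFeature, select and sumBits are 
hypothetical stand-ins, and popcount is only an assumed example 
feature; core.cpuid.hasPopcnt and core.bitop.popcnt are the real 
druntime calls. This is not a benchmark harness, just the shape of 
the code being discussed.

```d
import core.bitop : popcnt;      // druntime bit-count helper
import core.cpuid : hasPopcnt;   // druntime CPU feature query

// Hypothetical "hardware" path (assumed example feature: popcount).
int featureInHardware(uint x)
{
    return popcnt(x);
}

// Hypothetical software fallback computing the same result.
int emulateFeature(uint x)
{
    int n = 0;
    for (; x != 0; x &= x - 1)   // clear the lowest set bit each pass
        ++n;
    return n;
}

// Starts out pointing at select; the first call repoints it.
int function(uint) doIt = &select;

int select(uint x)
{
    doIt = hasPopcnt ? &featureInHardware : &emulateFeature;
    return doIt(x);
}

// The loop shape in question: an indirect call is the only work done.
int sumBits(uint n)
{
    int total = 0;
    foreach (uint i; 0 .. n)
        total += doIt(i);
    return total;
}
```

Hoisting the pointer test out of sumBits (one indirect call to a 
whole-loop implementation instead of one per iteration) is the 
"move it up the call hierarchy" variant from the quote.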

Certainly, as far as this particular thread goes, every general 
purpose function of a few lines that I write that uses intrinsics 
is forced inline. No function calls, indirect or otherwise. And 
on top of that, the inlined code usually pushes the branches far 
enough apart across the byte boundary lines that only the simple 
branch predictor is ever invoked.
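
In D, forcing a small intrinsic wrapper inline could look roughly 
like the following. highestBit is a hypothetical helper invented 
for illustration; pragma(inline, true) and core.bitop.bsr are the 
real language and druntime features. Whether the call is actually 
inlined still depends on the compiler and its flags.

```d
import core.bitop : bsr;   // bit scan reverse, a compiler intrinsic

// Hypothetical helper: index of the highest set bit, or -1 for zero.
// bsr is undefined for 0, so that case is handled explicitly.
pragma(inline, true)
int highestBit(uint x)
{
    return x ? bsr(x) : -1;
}
```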

(Related: one feature I'd really really really love for linkers 
to implement is the ability to mark up certain functions to only 
ever be linked at a certain byte boundary. And that's purely 
because Jaguar branch prediction often made my profiling tests 
non-deterministic between compiles. A NOP is a legit optimisation 
on those processors.)


More information about the Digitalmars-d mailing list