Standard way to supply hints to branches

Fri Sep 13 20:59:01 UTC 2024

On 14/09/2024 8:50 AM, claptrap wrote:
> On Friday, 13 September 2024 at 18:53:11 UTC, Walter Bright wrote:
>> On 9/13/2024 4:56 AM, Timon Gehr wrote:
>>> Well, it is the attribute being associated with the program path 
>>> being ill-defined that is being criticized in that blog post. The 
>>> difference is that for path-associated, you are saying that a 
>>> specific statement is likely or unlikely to be executed, for 
>>> branch-associated, you are saying in which direction a specific 
>>> branch is likely to go.
>>
>> Ok, thanks for the explanation. The branch predictor on CPUs defaults 
>> to a forward branch being considered unlikely, and a backwards branch 
>> being considered likely.
> 
> That was pretty much only the Pentiums; older AMDs just assumed branch 
> not taken if it wasn't in the BTB already. For newer CPUs (Core2 onwards, 
> Zen), nobody seems to know for sure what they do, but the Intel SDMs do 
> state that the Core architecture doesn't use static prediction. I think 
> Agner Fog says it's essentially random.

https://www.agner.org/optimize/microarchitecture.pdf

Not quite random, but the design has certainly become significantly more 
complicated since the 1990s.

"
3.8 Branch prediction in Intel Haswell, Broadwell, Skylake, and other Lakes
The branch predictor appears to have been redesigned in the Haswell and 
later Intel processors, but the design is undocumented.
Reverse engineering has revealed that the branch prediction is using 
several tables of local and global histories of taken branches 
[Yavarzadeh, 2023].
The measured throughput for jumps and branches varies between one branch 
per clock cycle and one branch per two clock cycles for jumps and 
predicted taken branches.
Predicted not taken branches have an even higher throughput of up to two 
branches per clock cycle.
The high throughput for taken branches of one per clock was observed for 
up to 128 branches with no more than one branch per 16 bytes of code.
The throughput is reduced to one jump per two clock cycles if there is 
more than one branch instruction per 16 bytes of code. If there are more 
than 128 branches in the critical part of the code, and if they are 
spaced by at least 16 bytes, then apparently the first 128 branches
have the high throughput and the remaining have the low throughput.
These observations may indicate that there are two branch prediction 
methods: a fast method tied to the µop cache and the instruction cache, 
and a slower method using a branch target buffer.
"
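
To make the path-associated vs branch-associated distinction from earlier in the thread concrete, here is a minimal C++ sketch. The function names and the EXPECT_FALSE macro are made up for illustration. __builtin_expect is a GCC/Clang extension, and [[unlikely]] needs C++20. As far as I know, hints like these mainly influence how the compiler lays out the code rather than directly programming the hardware predictor described above, so treat this as an illustration of the two syntaxes and not as a performance recipe.

```cpp
// Sketch only: helper names are hypothetical. Hints never change results,
// they only tell the compiler which way a branch is expected to go.
#include <cstddef>
#include <cstdint>

// Branch-associated hint: the expectation is attached to the condition.
// __builtin_expect is a GCC/Clang extension, so fall back to a no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define EXPECT_FALSE(cond) (__builtin_expect(!!(cond), 0))
#else
#define EXPECT_FALSE(cond) (cond)
#endif

// Sums the bytes, treating 0xFF as a rare error marker that aborts the sum.
std::int64_t sum_branch_hint(const std::uint8_t* data, std::size_t n) {
    std::int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (EXPECT_FALSE(data[i] == 0xFF)) {  // "this branch is rarely taken"
            return -1;
        }
        total += data[i];
    }
    return total;
}

// Path-associated hint: the C++20 attribute is attached to the statement,
// saying that this path is unlikely to be executed.
std::int64_t sum_path_hint(const std::uint8_t* data, std::size_t n) {
    std::int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] == 0xFF) [[unlikely]] {
            return -1;
        }
        total += data[i];
    }
    return total;
}
```

Both functions return the same values for the same input; only the place where the hint is attached differs. Whether either version runs faster depends on the compiler and the CPU, so it needs to be measured.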