Standard way to supply hints to branches
Richard (Rikki) Andrew Cattermole
richard at cattermole.co.nz
Fri Sep 13 20:59:01 UTC 2024
On 14/09/2024 8:50 AM, claptrap wrote:
> On Friday, 13 September 2024 at 18:53:11 UTC, Walter Bright wrote:
>> On 9/13/2024 4:56 AM, Timon Gehr wrote:
>>> Well, it is the attribute being associated with the program path
>>> being ill-defined that is being criticized in that blog post. The
>>> difference is that for path-associated, you are saying that a
>>> specific statement is likely or unlikely to be executed, for
>>> branch-associated, you are saying in which direction a specific
>>> branch is likely to go.
>>
>> Ok, thanks for the explanation. The branch predictor on CPUs defaults
>> to a forward branch being considered unlikely, and a backwards branch
>> being considered likely.
>
> That was pretty much only the Pentiums; older AMDs just assumed branch
> not taken if it wasn't in the BTB already. Newer CPUs, Core2 onwards, Zen,
> nobody seems to know for sure what they do, but the Intel SDMs do state
> that the Core architecture doesn't use static prediction. I think Agner
> Fog says it's essentially random.
https://www.agner.org/optimize/microarchitecture.pdf
Not quite random, but the design has certainly become significantly more
complicated since the 90s.
"
3.8 Branch prediction in Intel Haswell, Broadwell, Skylake, and other Lakes
The branch predictor appears to have been redesigned in the Haswell and
later Intel processors, but the design is undocumented.
Reverse engineering has revealed that the branch prediction is using
several tables of local and global histories of taken branches
[Yavarzadeh, 2023].
The measured throughput for jumps and branches varies between one branch
per clock cycle and one branch per two clock cycles for jumps and
predicted taken branches.
Predicted not taken branches have an even higher throughput of up to two
branches per clock cycle.
The high throughput for taken branches of one per clock was observed for
up to 128 branches with no more than one branch per 16 bytes of code.
The throughput is reduced to one jump per two clock cycles if there is
more than one branch instruction per 16 bytes of code. If there are more
than 128 branches in the critical part of the code, and if they are
spaced by at least 16 bytes, then apparently the first 128 branches
have the high throughput and the remaining have the low throughput.
These observations may indicate that there are two branch prediction
methods: a fast method tied to the µop cache and the instruction cache,
and a slower method using a branch target buffer.
"
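
For reference, the path-associated versus branch-associated distinction
quoted at the top can be sketched in C++. This is only an illustration,
not a proposal for D syntax: the function names are hypothetical,
`__builtin_expect` is a GCC/Clang extension (assumed available), and the
`[[unlikely]]` attribute requires C++20.

```cpp
// Sketch only: function names are hypothetical.
// Assumes GCC or Clang (for __builtin_expect) and C++20 (for [[unlikely]]).

// Branch-associated hint: annotates the condition itself, stating which
// direction this particular branch is expected to go.
int checked_div_branch_hint(int a, int b) {
    if (__builtin_expect(b == 0, 0)) {  // "b == 0" is expected to be false
        return 0;                       // rare error path
    }
    return a / b;
}

// Path-associated hint: annotates a statement, stating that execution
// is unlikely to reach it.
int checked_div_path_hint(int a, int b) {
    if (b == 0) [[unlikely]] {
        return 0;                       // rare error path
    }
    return a / b;
}
```

Both functions return the same results; the hints only affect how the
compiler may lay out the code, and neither is guaranteed to change what
the hardware predictor described above does at run time.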