Computed gotos on Reddit

Tue Jul 24 23:11:05 PDT 2012

On 7/24/2012 10:04 PM, Dmitry Olshansky wrote:
> On 25-Jul-12 03:31, Walter Bright wrote:
>> On 7/24/2012 3:33 PM, Dmitry Olshansky wrote:
>>> There is no code just jump [ecx]. The offset is included already.
>>
>> I'm not seeing where "code" is in the asm code you presented.
>
>> Sorry, I have no idea why it is not the same. jumptable is a static
>> array, and so does not need to be loaded into a register. And code[]
>> needs to be loaded from somewhere, and it looks like it's omitted from
>> your example.
>
> Code was assumed to be in ebx obviously.

It's got to load it somehow.

> BTW Static array for jump table is all
> good and well but does this trick work with PIC code?

The jump table can be in the code segment, which is not possible for a user 
generated array.

> And last but not least - the jump
> target has to be _read_ from jump table
>   and then jumped to it isn't it?

And it has to be read from code[] and jumped to. No difference.

> OK I've taken your comments into account.
> Now I think I finally got it right:
>
> mov ecx, [ebx] ; ecx = code[pc]
> inc ebx ; pc ++
> jmp ecx ; goto code[pc], as ecx is already a pointer

Nope, ecx is an opcode, not a pointer. You need another indirection.

>
> vs
>
> mov ecx, [ebx] ; ecx = code[pc]
> inc ebx ; ; inc pc
> jump jump_table[ecx]; ; switch jump to it
>
> or in english, ommiting PC increment:
> 1.
> read x from array
> jump to it

It's

     pc => opcode => address

not

     pc => address

> 2.
>   read x from array
>   read jump address from 2nd static array at offset x
>   jump to it
>
> So to sum it all up: 1 vs 2 memory accesses, same register use, same (macro)
> instruction count.
> Still I think that not having to touch extra static table is bonus and jump
> jump_table[ecx] could be less efficient on some processors then plain jump ecx
> (need to checks this).

In both cases, you must pull the address out of an array and jump to it.

>>> Now the same thing with switches.
>>> And before you ask -
>>> NO! Single switch won't do as it would have successful branch
>>> prediction rate of
>>> as bad as about 2% (see the article!). The more opcode types the worse
>>> prediction is.
>>
>> I see what you mean, but this could be done with final switch by the
>> compiler, as I explained to bearophile.
>>
> Great but I still try to show that they are less efficient, see above.

No, it is the same code for each. The trouble is you're omitting one of the 
indirections needed for the computed goto case. You must:

    programcounter => opcode => address

2 indirections required. You keep skipping one of them!

>>> //GLOBALS
>>> size_t[] bytecode = firstPassCompile("my_source.ds"); //say it's
>>> DMDscript ;)
>>>
>>> void run(size_t pc)
>>> {
>>>      //here we don't JIT to real addresses beforehand
>>>      //as jump tables somehow should be good enough
>>>      // Okay...
>>>
>>>      //interpret:
>>>      switch(bytecode[pc])
>>>      {
>>>      L_op1:
>>>      case 0:
> [snip]
>>>      L_opN:
>>>      case N-1:
>>>      ...
>>>      //DISPATCH:
>>>      //here comes trouble, Nth table ... time to count overhead
>>>          switch(bytecode[pc])
>>>          {
>>>          case 0:    goto L_op1;
>>>          case 1: goto L_op2;
>>>          ...
>>>          }
>>>      }//end of giant switch
>>> }
>>
>> I see what you mean here, too. Thanks for the explanation. It never
>> occurred to me that one could write code like that, but I see the point,
>> and doing jump table merging could be done fairly easily. No new
>> language feature is required.
>
> Superb! I actually tried the code above, generating common things with a help of
> string mixins, of course currently it only gets slightly slower.
>
> Should I file an enhancement request?

For the jump table merging, yes please.