Implement the "unum" representation in D ?

Wed Sep 16 12:40:47 PDT 2015

On Wednesday, 16 September 2015 at 19:21:59 UTC, deadalnix wrote:
> No you don't. Because the streamer still need to load the unum 
> one by one. Maybe 2 by 2 with a fair amount of hardware 
> speculation (which means you are already trading energy for 
> performances, so the energy argument is weak). There is no way 
> you can feed 256+ cores that way.

You can continuously load 64 bytes at a time as a stream, decode
them to your internal format, and push them into the scratchpads
of other cores. You could even do this in hardware.
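
To make the streaming idea concrete, here is a rough Python
sketch. The record layout (one length byte followed by a
little-endian payload) is a stand-in I made up for illustration,
not the real unum bit layout, and the "scratchpads" are plain
lists. The names encode and stream_decode are hypothetical too.
It only shows the shape of the pipeline: fixed 64-byte loads,
decode to a fixed-width internal value, deal the results out to
the cores.

```python
from io import BytesIO

CHUNK = 64  # bytes fetched per load, as in the post


def encode(values):
    """Pack non-negative ints into a toy format: 1 length byte + payload."""
    out = bytearray()
    for v in values:
        n = max(1, (v.bit_length() + 7) // 8)
        out.append(n)
        out += v.to_bytes(n, "little")
    return bytes(out)


def stream_decode(stream, n_cores):
    """Read fixed 64-byte chunks, decode records, deal them out round-robin."""
    scratchpads = [[] for _ in range(n_cores)]  # one list per simulated core
    pending = bytearray()  # holds a record split across chunk boundaries
    count = 0
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            break
        pending += chunk
        pos = 0
        # Decode every record that is completely present in the buffer.
        while pos < len(pending) and pos + 1 + pending[pos] <= len(pending):
            n = pending[pos]
            value = int.from_bytes(pending[pos + 1:pos + 1 + n], "little")
            scratchpads[count % n_cores].append(value)
            count += 1
            pos += 1 + n
        del pending[:pos]  # keep only the incomplete tail, if any
    return scratchpads


values = [3, 70000, 2**40, 5, 9]
pads = stream_decode(BytesIO(encode(values)), n_cores=2)
```

The loader never needs to know where records start before it
fetches; only the decode step deals with variable length, and
everything behind it sees fixed-width values.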

If you look at the ubox brute-force method, you run many
calculations over the same data, because you solve spatially, not
by timesteps. So you can run a great many parallel computations
over the same data.
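
A minimal sketch of what "solving spatially" can look like, under
my own simplifying assumptions: exact rational intervals
(fractions.Fraction) stand in for uboxes, and the problem is just
finding where x*x = 2 on a 1D range. The function names are
hypothetical. Every box is tested on its own, with no dependency
on any other box, which is what makes the work easy to spread
over many cores. The thread pool is only there to show that
independence, not to claim a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from fractions import Fraction


def square_interval(lo, hi):
    """Exact range of x*x for x in [lo, hi]."""
    candidates = (lo * lo, hi * hi)
    low = Fraction(0) if lo <= 0 <= hi else min(candidates)
    return low, max(candidates)


def may_contain_root(box):
    """True if x*x - 2 could be zero somewhere inside the box."""
    lo, hi = box
    low, high = square_interval(lo, hi)
    return low <= 2 <= high


def brute_force(lo, hi, n_boxes):
    """Split [lo, hi] into equal boxes and test every box independently."""
    width = Fraction(hi - lo, n_boxes)
    boxes = [(lo + i * width, lo + (i + 1) * width) for i in range(n_boxes)]
    with ThreadPoolExecutor() as pool:
        keep = list(pool.map(may_contain_root, boxes))
    return [box for box, ok in zip(boxes, keep) if ok]


survivors = brute_force(0, 2, 1024)
```

The surviving boxes bracket the solution; refining means
splitting them again and repeating the same independent test.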

> To gives you a similar example, x86 decoding is often the 
> bottleneck on an x86 CPU. The number of ALUs in x86 over the 
> past decade decreased rather than increased, because you simply 
> can't decode fast enough to feed them. Yet, x86 CPUs have a 64 
> ways speculative decoding as a first stage.

That's because we use dumb compilers that do not prefetch
intelligently. If you are writing for a tile-based VLIW CPU, you
preload. These calculations are highly iterative, so I'd rather
think of it as a co-processor solving a single equation
repeatedly than as something that runs the whole program. You can
run the larger program on a regular CPU or a few cores.

> The problem is not transistor it is wire. Because the damn 
> thing is variadic in every ways, pretty much every bit as input 
> can end up anywhere in the functional unit. That is a LOT of 
> wire.

I haven't seen a design, so I cannot comment. But keep in mind 
that the CPU does not have to work with the format, it can use a 
different format internally.

We'll probably see FPGA implementations that can run on FPGA
cards for PCs within a few years. I read somewhere that a group
in Singapore was working on it.