Implement the "unum" representation in D ?

deadalnix via Digitalmars-d digitalmars-d at puremagic.com
Wed Sep 16 13:06:42 PDT 2015


On Wednesday, 16 September 2015 at 19:40:49 UTC, Ola Fosheim 
Grøstad wrote:
> You can load continuously 64 bytes in a stream, decode to your 
> internal format and push them into the scratchpad of other 
> cores. You could even do this in hardware.
>

1/ If you always load the worst-case size, then your power 
advantage is gone.
2/ If you load these one by one, how do you expect to feed 256+ 
cores?

Obviously you can make this in hardware. And obviously this is 
not going to be able to feed 256+ cores. Even with a chip at a 
low frequency, let's say 800MHz or so, you have about 80 cycles 
to access memory. With the feed serialized across 256 cores, 
that means you need 20 000+ cycles of work to do per core per 
unum.

That's a simple back-of-the-envelope calculation. Your proposal 
is simply ludicrous. It's a complete non-starter.
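To restate the arithmetic in code (a sketch using the post's 
round figures, not measurements): if the feed is serialized 
because record sizes are unknown in advance, each core must have 
enough independent work to hide everyone else's accesses.

```c
#include <assert.h>

/* Back-of-the-envelope sketch with the round figures above (~80
 * cycles per memory access at ~800MHz, 256 cores). If unums must be
 * fetched and decoded one after another, a given core receives a new
 * unum only once every cores * mem_latency cycles, so it needs
 * roughly that many cycles of work per unum to avoid starving.
 * Illustrative numbers only. */
long cycles_per_unum(long cores, long mem_latency)
{
    return cores * mem_latency; /* 256 * 80 = 20480, i.e. 20 000+ */
}
```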

You can make this in hardware. Sure you can, no problem. But you 
won't because it is a stupid idea.

>> To give you a similar example, x86 decoding is often the 
>> bottleneck on an x86 CPU. The number of ALUs in x86 over the 
>> past decade decreased rather than increased, because you 
>> simply can't decode fast enough to feed them. Yet, x86 CPUs 
>> have 64-way speculative decoding as a first stage.
>
> That's because we use a dumb compiler that does not prefetch 
> intelligently.

You know, when you have no idea what you are talking about, you 
can just move on to something you understand.

Prefetching would not change anything here. The problem comes 
from the variable-size encoding and the challenge it poses for 
hardware. You can have a 100% L1 hit rate and still have the 
same problem.

No sufficiently smart compiler can fix that.
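A toy illustration of why (my own example format, not the actual 
unum encoding): with length-prefixed records, the start of record 
i+1 is a data dependency on record i, so decoding is inherently 
serial no matter how warm the cache is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variable-size format, for illustration: each record is
 * one length byte followed by that many payload bytes. Finding where
 * record i+1 starts requires having read record i's length byte first,
 * so the scan below cannot be parallelized by slicing the buffer at
 * fixed offsets the way a fixed-size format allows. */
size_t count_records(const uint8_t *buf, size_t len)
{
    size_t off = 0, n = 0;
    while (off < len) {
        uint8_t payload_len = buf[off]; /* must read this before we... */
        off += 1 + payload_len;         /* ...know where the next starts */
        n++;
    }
    return n;
}
```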

> If you are writing for a tile based VLIW CPU you preload. These 
> calculations are highly iterative so I'd rather think of it as 
> a co-processor solving a single equation repeatedly than 
> running the whole program. You can run the larger program on a 
> regular CPU or a few cores.
>

That's irrelevant. The problem is not the kind of CPU, it is how 
you feed it at a fast enough rate.

>> The problem is not transistors, it is wire. Because the damn 
>> thing is variadic in every way, pretty much every input bit 
>> can end up anywhere in the functional unit. That is a LOT of 
>> wire.
>
> I haven't seen a design, so I cannot comment. But keep in mind 
> that the CPU does not have to work with the format, it can use 
> a different format internally.
>
> We'll probably see FPGA implementations that can be run on FPGA 
> cards for PCs within a few years. I read somewhere that a group 
> in Singapore was working on it.

That's hardware 101.

When you have a floating point unit, out of your 32 bits, 23 go 
into the mantissa FU and 8 into the exponent FU. For instance, 
if you multiply floats, you send the 2 exponents into an adder, 
you send the 2 mantissas into a 24-bit multiplier (after 
restoring the implicit leading 1), and you xor the sign bits.

You take the carry out of the multiplier and bump the exponent, 
or you count the leading zeros of the 48-bit product, shift by 
that amount, and add the shift to the exponent.

If you get a carry in the exponent adder, you saturate and emit 
an infinity.
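That fixed routing can be sketched in software (an illustration 
of the data paths, not a production FPU: it truncates instead of 
rounding and ignores subnormals, zeros, and NaNs):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a binary32 multiply done field by field, mirroring the
 * fixed wiring described above. Truncating, no subnormal/zero/NaN
 * handling -- illustration only. */
static float fmul_fields(float a, float b)
{
    uint32_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);

    /* Fixed routing: sign bits to an xor, exponent fields to an adder,
     * mantissa fields (implicit leading 1 restored) to a 24x24 -> 48
     * bit multiplier. */
    uint32_t sign = (ua ^ ub) & 0x80000000u;
    int32_t  exp  = (int32_t)((ua >> 23) & 0xFF)
                  + (int32_t)((ub >> 23) & 0xFF) - 127;
    uint64_t prod = (uint64_t)((ua & 0x7FFFFFu) | 0x800000u)
                  * (uint64_t)((ub & 0x7FFFFFu) | 0x800000u);

    /* Normalize: two mantissas in [1,2) multiply to [1,4), so the top
     * set bit of the 48-bit product is bit 47 (carry out: bump the
     * exponent) or bit 46. */
    if (prod & (1ull << 47)) { prod >>= 24; exp += 1; }
    else                     { prod >>= 23; }

    if (exp >= 255) /* carry in the exponent adder: saturate */
        return sign ? -INFINITY : INFINITY;

    uint32_t ur = sign | ((uint32_t)exp << 23)
                       | ((uint32_t)prod & 0x7FFFFFu);
    float r;
    memcpy(&r, &ur, sizeof r);
    return r;
}
```

Note that every field has one fixed destination; no bit's routing 
depends on the value of any other bit.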

Each bit goes into a given functional unit. That means you need 
one wire from each input bit to the functional unit it goes to. 
Same for the results.

Now, if the format is variadic, you need to wire all bits to all 
functional units, because they can potentially end up there. 
That's a lot of wire; in fact, the number of wires grows 
quadratically with the width of that joke.

The author keeps repeating that wire has become the expensive 
thing, and he is right. Meaning a solution with quadratic wiring 
is not going to cut it.
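The scaling argument in numbers (a hedged estimate, not a 
gate-level count): a fixed format needs roughly one wire per 
input bit, while fully variable field boundaries need something 
like an n-to-n crossbar.

```c
#include <assert.h>

/* Rough wiring estimate, illustrative only. With fixed fields each of
 * the n input bits has one destination, so wire count is O(n). If any
 * bit can land in any bit position of any functional unit, each of the
 * n bits needs a path to n possible destinations: an n x n crossbar,
 * O(n^2) wires. */
int wires_fixed(int n)    { return n; }
int wires_variable(int n) { return n * n; }
```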


