Implement the "unum" representation in D?

deadalnix via Digitalmars-d digitalmars-d at puremagic.com
Wed Sep 16 01:38:24 PDT 2015


On Wednesday, 16 September 2015 at 08:17:59 UTC, Don wrote:
> On Tuesday, 15 September 2015 at 11:13:59 UTC, Ola Fosheim 
> Grøstad wrote:
>> On Tuesday, 15 September 2015 at 10:38:23 UTC, ponce wrote:
>>> On Tuesday, 15 September 2015 at 09:35:36 UTC, Ola Fosheim 
>>> Grøstad wrote:
>>>> http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf
>>>
>>> That's a pretty convincing case. Who does it :)?
>
> I'm not convinced. I think they are downplaying the hardware 
> difficulties. Slide 34:
>
> Disadvantages of the Unum Format
> * Non-power-of-two alignment. Needs packing and unpacking, 
> garbage collection.
>
> I think that disadvantage is so enormous that it negates most 
> of the advantages. Note that in the x86 world, unaligned memory 
> loads of SSE values still take longer than aligned loads. And 
> that's a trivial case!
>
> The energy savings are achieved by using a primitive form of 
> compression. Sure, you can reduce the memory bandwidth required 
> by compressing the data. You could do that for *any* form of 
> data, not just floating point. But I don't think anyone thinks 
> that's worthwhile.
>

GPUs do it a lot, especially (but not exclusively) on mobile. Not to
reduce misses; a miss is pretty much guaranteed: you run 32 threads at
once in a shader core, and each of them needs at least 8 texels for a
bilinear texture lookup with mipmapping, and that's the bare minimum.
That's 256 memory accesses at once, and one of those texels WILL miss
and stall all 32 threads. It is not a latency issue, but a bandwidth
and energy one.

But yeah, in the general case random access is preferable; memory
alignment, and the fact that you don't need to do as much bookkeeping,
are very significant.

Also, a predictable size means you can split your dataset and process
it in parallel, which is impossible if sizes are random.
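
To illustrate (a toy D sketch, not actual unum code): with fixed-width
elements, chunk boundaries are pure index arithmetic, so std.parallelism
can split the work up front. With variable-width values you would have
to decode the stream just to find where element N starts.

import std.parallelism : parallel;
import std.range : chunks;

void scaleAll(double[] data)
{
    // Every element is the same size, so each chunk's start offset is
    // computable from its index alone; no decoding pass before splitting.
    foreach (chunk; parallel(data.chunks(4096)))
    {
        foreach (ref x; chunk)
            x *= 2.0;
    }
}

void main()
{
    auto data = new double[](1_000_000);
    data[] = 1.0;
    scaleAll(data);
}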

> The energy comparisons are plain dishonest. The power required 
> for accessing from DRAM is the energy consumption of a *cache 
> miss* !! What's the energy consumption of a load from cache? 
> That would show you what the real gains are, and my guess is 
> they are tiny.
>

The energy comparison is bullshit. Until you have loaded the data, you 
don't know how wide it is. That means you either have to be pessimistic 
and load for the worst-case width, or do 2 round trips to memory.
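
To make that concrete, here is a toy variable-width format
(hypothetical, nothing to do with the real unum bit layout): one length
byte followed by that many payload bytes. The reader cannot know how
much to fetch until it has seen the length byte, and it cannot even
locate element N without walking past elements 0..N-1.

import std.stdio : writeln;

ulong decodeAt(const(ubyte)[] buf, size_t index, out size_t offset)
{
    offset = 0;
    foreach (i; 0 .. index)          // skip the preceding variable-width values
        offset += 1 + buf[offset];   // length byte + payload
    immutable len = buf[offset];     // first access: how wide is this value?
    ulong value = 0;
    foreach (i; 0 .. len)            // second, dependent access: the payload
        value = (value << 8) | buf[offset + 1 + i];
    return value;
}

void main()
{
    ubyte[] buf = [1, 0x2A, 2, 0x01, 0x00, 1, 0x07]; // encodes 42, 256, 7
    size_t off;
    writeln(decodeAt(buf, 1, off)); // prints 256
}

A hardware decoder has the same dependency: fetch the worst-case width
every time, or pay for two dependent accesses.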

The author also leans a lot on the wire vs transistor cost, and how it 
has evolved. He is right about that. Except that you won't cram more 
wire into the CPU at runtime; the CPU needs the wiring for the 
worst-case scenario, always.

The hardware is likely to be slower, as you'll need way more wiring 
than for regular floats, and wire is not only a cost but also time: 
longer wires mean longer delays.

That being said, even a hit in L1 is very energy hungry. Think about 
it: you need to do an 8-way fetch (so you end up reading 4k of data 
from the cache) in parallel with address translation (usually 16-way) 
in parallel with snooping the load and store buffers.

If the load is not aligned and crosses a cache line boundary, you 
pretty much have to multiply all of this by 2.
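
For what it's worth, the 2x for a straddling access is easy to see with
a small helper (assuming 64-byte lines; purely illustrative):

size_t cacheLinesTouched(size_t addr, size_t n, size_t lineSize = 64)
{
    // Line index of the last byte minus line index of the first byte, plus one.
    return ((addr + n - 1) / lineSize) - (addr / lineSize) + 1;
}

unittest
{
    assert(cacheLinesTouched(0,  8) == 1); // aligned: one line, one tag/TLB lookup
    assert(cacheLinesTouched(60, 8) == 2); // straddles 64: two lines, roughly double the work
}

Tightly packed variable-width values will cross line boundaries far
more often than naturally aligned floats do.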

I'm not sure what his numbers represent, but hitting L1 is indeed quite 
power hungry. He is right on that one.

> So:
> * I don't believe the energy savings are real.
> * There is no guarantee that it would be possible to implement 
> it in hardware without a speed penalty, regardless of how many 
> transistors you throw at it (hardware analogue of Amdahl's Law)
> * but the error bound stuff is cool.

Yup, that's pretty much what I get out of it as well.


