Sargon component library now on Dub

Wed Dec 17 03:08:14 PST 2014

On Wednesday, 17 December 2014 at 09:11:22 UTC, Don wrote:
> So am I, the halffloat is much faster than any other 
> implementation I've seen. The fast path for the conversion 
> functions involves only a few machine instructions.
>
> I had an extra speedup for it that made it optimal, but it 
> requires a language primitive to dump excess hidden precision. 
> We still need this, it is a fundamental operation (C tries to 
> do it implicitly using "sequence points", but they don't 
> actually work properly).

The F16C intrinsics _mm_cvtph_ps and _mm_cvtps_ph convert 4 
floats/halffloats at a time, with a latency of 4 clock cycles and 
a throughput of 1 per cycle on Haswell.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/
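
For anyone who wants to try it, here is a rough sketch in C of how the half-to-float direction could be wrapped. The names half4_to_float4 and half_to_float_scalar are made up for this example and have nothing to do with Sargon's API. The intrinsic path is only compiled when the compiler defines __F16C__ (e.g. with -mf16c on GCC/Clang); otherwise it falls back to a plain bit-twiddling conversion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__F16C__)
#include <immintrin.h>
#endif

/* Hypothetical scalar fallback: decode one IEEE 754 binary16 value. */
static float half_to_float_scalar(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                       /* signed zero */
        } else {
            /* Subnormal half: shift the mantissa up until it is normalized. */
            int shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 + 1 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13); /* infinity or NaN */
    } else {
        bits = sign | ((exp + 127 - 15) << 23) | (man << 13); /* normal */
    }

    float f;
    memcpy(&f, &bits, sizeof f);               /* bit-cast without aliasing issues */
    return f;
}

/* Hypothetical wrapper: convert 4 halffloats to 4 floats. */
static void half4_to_float4(const uint16_t in[4], float out[4])
{
    assert(in != NULL && out != NULL);
#if defined(__F16C__)
    /* Load 4 x 16 bits into the low half of an XMM register, then convert. */
    __m128i h = _mm_loadl_epi64((const __m128i *)in);
    _mm_storeu_ps(out, _mm_cvtph_ps(h));
#else
    for (int i = 0; i < 4; i++)
        out[i] = half_to_float_scalar(in[i]);
#endif
}
```

Both paths should give the same results for the same inputs, so the scalar version doubles as a reference when testing on a machine without F16C.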