[Dlang-internal] I need cent and ucent to be implemented soon

Elmar chrehme at gmx.de
Sun Apr 18 23:31:12 UTC 2021


On Wednesday, 18 September 2019 at 18:42:32 UTC, Joseph Rushton 
Wakeling wrote:
> On Saturday, 4 May 2019 at 19:08:10 UTC, Murilo wrote:
>> BigInt takes too long. I need something as fast as the 
>> primitive types.
>
> What hardware are you planning on running your programs on? I'm 
> not sure how good a speed you can guarantee without native 
> hardware support.

Hello together.
You can also run D code with 64-bit integers on 32-Bit 
architectures, so where should be problems to have 128-bit on 
32-Bit architectures or particularly 64-bit architecture?

In 32-Bit ARM you would do 128-Bit arithmetics like this:

```
c = a + b  // r0..r3 = a, r4..r7 = b, r8..r11 = c

ADD r8, r0, r4
ADC r9, r1, r5
ADC r10, r2, r6
ADC r11, r3, r7

c = a - b

SUB r8, r0, r4
SBC r9, r1, r5
SBC r10, r2, r6
SBC r11, r3, r7
//this would work for any multiple of 32-Bit arithmetics and even 
for any
//multiple of 8-Bit arithmetics since ARM has zero/sign extend 
instructions for 8-bit

//negation, logical operations, shifts and rotations are 
similarly easy to implement for 128-bit

c = a * b   // result depends on the type of a, b, c actually but 
assume ucent

UMULL r9, r8, r0, r4  @ c[0..64] = a[0..32] * b[0..32]
UMULL r11, r10, r0, r6  @ c[64..128] = a[0..32] * b[64..96]
UMLAL r11, r10, r1, r5  @ c[64..128] += a[32..64] * b[32..64]
UMLAL r11, r10, r2, r4  @ c[64..128] += a[64..96] * b[0..32]
MLA r11, r0, r7 @ c[96..128] += a[0..32] * b[96..128]
MLA r11, r1, r6 @ c[96..128] += a[32..64] * b[64..96]
MLA r11, r2, r5 @ c[96..128] += a[64..96] * b[32..64]
MLA r11, r0, r7 @ c[96..128] += a[96..128] * b[0..32]
UMLAL r10, r9, r0, r5  // c[32..96] += a[0..32] * b[32..64]
ADC r11, r11, #0  // consider overflow
UMLAL r10, r9, r1, r4  // c[32..96] += a[32..64] * b[0..32]
ADC r11, r11, #0  // consider overflow
```

You maybe can even reduce the multiplication code by one 
instruction with a smarter solution. Well, I almost suggested 
using godbolt.org to check, what Assembly code is generated by C# 
when using 128-Bit but I see that C# is not supported there (but 
D is :-) ).

Only the division is slightly more complicated. You'd probably 
inverse the divisor and multiply it to the divident `c = a * 
(1/b)` (and only calculating the upper 128-bit of the 256-bit 
product).

If b is 32-Bit then

```
2^n * 1/b = floor(0xFF..FF/b) + (0xFF..FF % b +1) * 1/b
<=>  //using x = floor(0xFF..FF/b), y = 0xFF..FF % b + 1
q = a/b = a/2^n * (x + y/2^n * (x + y/2^n * ( ... )))

//for n = 32 (step size) calculate UQ128 q as follows
q = x << 96
q += y * x << 64
q += y² * x << 32
q += y³ * x + (y⁴ * x >> 32) + (y⁵ * x >> 64) ... //until nothing 
changes
q = (q * a)[128..256]  //getting the upper 128-bit of 256-bit 
result

1/b = x * (1 + y/2^n + (y/2^n)² + (y/2^n)³ + ...)
   = x * ((y/2^n)^-m - 1) / ((y/2^n)^-1 - 1)

//q = r0..r3
UDIV(x, 0xFF..FF, divisor)  //x = floor(0xFF..FF / b)
MLS(y, x, divisor, x)  //= y = x - x * divisor = 0xFF..FF % b
ADD(y, y, 1)
...  //multiplications and additions
```

This algorithm is simple but has Worst-case execution time O(n) 
where n is the bit length which are a lot of multiplications. The 
result of 1/b is not perfectly accurate since it divides 0.FF..FF 
as divident and not 1.0 and the bigger y, the slower is this 
algorithm. But as I look at the worst case, I notice optimization 
potential:

```
q = x * ((1 << 32) + y) * ((1 << 64) + (y *= y) )  // 32-bit x 
32-bit x 64-bit
q += (q * (y *= y) ) >> 128  // upper 128-bit of 128-bit x 128-bit
q += (q * (y *= y) ) >> 256  // upper 128-bit of 128-bit x 256-bit
...  //another 3 times in the worst case
// the algorithm can stop in between, if y >> 2^n has become = 0
q = (q * a)[128..256]
```

I basically (re)found the Goldschmidt Division. The individual 
multiplications of x and y can be optimized because register 
parts of x and y can be entirely zero so that multiplications 
need only 32x32 bit multiplication in the best case.

For divisors of more than 32-bit, one can try to enlarge the rest 
of the division.
```
b'   // highest non-zero word in the 128-bit integer b
x = floor(0xFF..FF / b') << ((3 - n)*32)  // n is the 
Least-significant-first index of word b' in b
y = 0xFF..FF % b' + 1  // y now can be up to 127 bit large in the 
first step!

q = x   // x is ucent, three words of it are zero
q += (q * y) >> 128  // upper 128 bit of 128x128 multiplication
q += (q * (y *= y)) >> 256  // upper 128 bit of 128x256 
multiplication
...
```
The only difference is that the first value of y can become quite 
large already and x is already 128-bit where 3 words of it are 
zero.

A division of 128-bit by a constant is very easy, just a constant 
multiplication of the precalculated inverse and taking the upper 
128-bits of the result.


The only reason I see for not implementing it is low priority and 
calculation with reals is sufficient in most cases and faster (at 
least the register pressure will go down significantly). The only 
disadvantage of reals is the limited precision but except for 
extremely high precision applications and cryptographic-related 
things, I don't know a need for 128-bit (at least there is only 
few performance gain of using 128-bit for parallel operations). 
Oftentimes you can replace 128-bit arithmetics in cryptographic 
and multimedia with SIMD instructions which are used for parallel 
arithmetics in processors.

We actually need to make us aware of what 128-bit actually means! 
You can store any timestamp you likely ever would need already in 
a 80-Bit integer (about the order of Femtoseconds in a year if I 
remember right, so just 90-Bit would give you Femtoseconds in one 
millenium). 128-bit numbers can store integers up to 300*10¹² !! 
This is said by physicists to be far more than the number of 
atoms in our universe, so most of the values which can be stored 
in 128-Bit are not even numbers anymore in the original sense.

Probably they are waiting for 128-bit architectures but hm...
You can translate the upper code into a template for your ASM 
language (like AMD64) and there you go.


More information about the Dlang-internal mailing list