New floating-point arithmetic

sweety439 · 2022-04-30, 03:53

Double: 64 bits

bit 63: sign (0 = positive, 1 = negative)

bit 62 to bit 52: exponent (bit 62 = e_10, bit 61 = e_9, bit 60 = e_8, ..., bit 53 = e_1, bit 52 = e_0) (the exponent uses two's complement representation, the range is -1024 (e_10,e_9,e_8,...,e_1,e_0 = 10000000000) through +1023 (e_10,e_9,e_8,...,e_1,e_0 = 01111111111))

bit 51 to bit 0: significand precision (bit 51 = s_51, bit 50 = s_50, bit 49 = s_49, ..., bit 1 = s_1, bit 0 = s_0)

the value of the number is

(-1)^sign*(1.s_51,s_50,s_49,...,s_1,s_0)₂*2^(e_10,e_9,e_8,...,e_1,e_0)₂ (in decimal (base 10) scientific notation the integer part can be 1, 2, 3, ..., 8, 9, but in binary (base 2) it must be 1, so this leading 1 does not need to be stored)

Special cases:

* sign = 0, e_10,e_9,e_8,...,e_1,e_0 = 01111111111 (+1023), s_51,s_50,s_49,...,s_1,s_0 are all 1 --> +∞
* sign = 1, e_10,e_9,e_8,...,e_1,e_0 = 01111111111 (+1023), s_51,s_50,s_49,...,s_1,s_0 are all 1 --> -∞
* sign = 0, e_10,e_9,e_8,...,e_1,e_0 = 10000000000 (-1024), s_51,s_50,s_49,...,s_1,s_0 are all 0 --> 0
* sign = 1, e_10,e_9,e_8,...,e_1,e_0 = 10000000000 (-1024), s_51,s_50,s_49,...,s_1,s_0 are all 0 --> NaN

(I think this double floating-point is better than the original one, since it is bijective)
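A minimal sketch of how the layout above could be decoded, assuming nothing beyond the field positions and the four special patterns listed in this post. The names `decode`, `EXP_BITS` and `SIG_BITS` are made up for illustration, and exact rationals are used so that no host floating-point rounding gets in the way.

```python
from fractions import Fraction

EXP_BITS = 11   # bits 62..52
SIG_BITS = 52   # bits 51..0


def decode(bits: int):
    """Decode a 64-bit pattern under the proposed layout.

    Returns a Fraction for ordinary values and zero, or one of the
    strings "+inf", "-inf", "NaN" for the other reserved patterns.
    """
    sign = bits >> 63
    exp_field = (bits >> SIG_BITS) & ((1 << EXP_BITS) - 1)
    sig = bits & ((1 << SIG_BITS) - 1)

    # Two's complement: a set top bit means the exponent is negative.
    if exp_field >> (EXP_BITS - 1):
        exp = exp_field - (1 << EXP_BITS)
    else:
        exp = exp_field

    # The four special cases from the post.
    if exp == 1023 and sig == (1 << SIG_BITS) - 1:
        return "-inf" if sign else "+inf"
    if exp == -1024 and sig == 0:
        return "NaN" if sign else Fraction(0)

    # Ordinary value: implicit leading 1, then the stored fraction bits.
    magnitude = (1 + Fraction(sig, 1 << SIG_BITS)) * Fraction(2) ** exp
    return -magnitude if sign else magnitude


print(decode(0x0000000000000000))  # sign 0, exponent 0, fraction 0
print(decode(0x7FF0000000000000))  # exponent field 0x7ff is -1
```

Every 64-bit input takes exactly one branch here, which is one way to read the "bijective" remark: no two patterns decode to the same value.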

sweety439 · 2022-04-30, 03:57

Examples:

Code:

0000 0000 0000 0000   = 1
8010 0000 0000 0000   = −2
0057 8000 0000 0000   = 47
8085 5000 0000 0000   = −341
3fff ffff ffff fffe   = 2¹⁰²⁴−2⁹⁷² (Max double)
4000 0000 0000 0001   = 2⁻¹⁰²⁴+2⁻¹⁰⁷⁶ (Min double)
7ff0 0000 0000 0000   = 1/2
7fe5 5555 5555 5555   ≈ 1/3
4000 0000 0000 0000   = 0
c000 0000 0000 0000   = NaN
3fff ffff ffff ffff   = ∞
bfff ffff ffff ffff   = −∞
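The examples above can be cross-checked with a small encoder. This is only a sketch under the layout described in the first post; `encode` and `hex_groups` are hypothetical helper names, and the function only accepts values that fit exactly (so 1/3 is rejected rather than rounded).

```python
from fractions import Fraction

SIG_BITS = 52
ZERO, NAN = 0x4000000000000000, 0xC000000000000000
POS_INF, NEG_INF = 0x3FFFFFFFFFFFFFFF, 0xBFFFFFFFFFFFFFFF


def hex_groups(bits: int) -> str:
    """Format 64 bits as four groups of four hex digits."""
    s = f"{bits:016x}"
    return " ".join(s[i:i + 4] for i in range(0, 16, 4))


def encode(value) -> str:
    """Encode an exactly representable rational in the proposed layout."""
    x = Fraction(value)
    if x == 0:
        return hex_groups(ZERO)
    sign = 1 if x < 0 else 0

    # Normalise |x| to m * 2**exp with 1 <= m < 2.
    m, exp = abs(x), 0
    while m >= 2:
        m /= 2
        exp += 1
    while m < 1:
        m *= 2
        exp -= 1

    frac = (m - 1) * (1 << SIG_BITS)
    if frac.denominator != 1 or not -1024 <= exp <= 1023:
        raise ValueError("not exactly representable")

    # exp & 0x7FF yields the 11-bit two's complement exponent field.
    bits = (sign << 63) | ((exp & 0x7FF) << SIG_BITS) | frac.numerator
    if bits in (ZERO, NAN, POS_INF, NEG_INF):
        raise ValueError("collides with a reserved pattern")
    return hex_groups(bits)


for v in (1, -2, 47, -341, Fraction(1, 2)):
    print(f"{encode(v)}   = {v}")
```

Note that ±2^-1024 and ±(2^1024 - 2^971) are rejected, because their natural encodings are the patterns reserved for 0, NaN and ±∞.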

sweety439 · 2022-04-30, 04:08

single: 32 bits, including 1 sign bit, 8 exponent bits, 23 significand precision bits
double: 64 bits, including 1 sign bit, 11 exponent bits, 52 significand precision bits
long double: 80 bits, including 1 sign bit, 16 exponent bits, 63 significand precision bits
quadruple: 128 bits, including 1 sign bit, 15 exponent bits, 112 significand precision bits
octuple: 256 bits, including 1 sign bit, 16 exponent bits, 239 significand precision bits (I think this is better: 239 significand precision bits means 239+1 = 240 significant bits, and 240 is a highly composite number, i.e. it has many divisors; 240 bits ≈ 72 significant decimal digits, and 72 also has many divisors (72 is known as the smallest Achilles number). The current octuple-precision format has 1 sign bit, 19 exponent bits, 236 significand precision bits)

super: 65536 bits, including 1 sign bit, 256 exponent bits, 65279 significand precision bits

Using my scheme (two's complement exponent, and exactly one bit combo for each of +∞, -∞, 0, NaN), super-precision has 65279+1 = 65280 significant bits (≈19651 decimal digits); its maximum number is 2^(2^255)-2^(2^255-65279), and its minimum positive nonzero number is 2^(-2^255)+2^(-2^255-65279)
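As a sanity check on the widths listed in this post, the arithmetic can be sketched as below. The `FORMATS` table and `summarize` are illustrative names; for "super" the stored significand count is taken as 65279, following the "65279+1 = 65280" remark, which is an assumption made so that the three fields total 65536 bits.

```python
import math

# (name, exponent bits, stored significand bits)
FORMATS = [
    ("single", 8, 23),
    ("double", 11, 52),
    ("long double", 16, 63),
    ("quadruple", 15, 112),
    ("octuple", 16, 239),
    ("super", 256, 65279),  # assumed stored count, see lead-in
]


def summarize(exp_bits: int, sig_bits: int):
    """Return (total width, significant bits, whole decimal digits)."""
    total = 1 + exp_bits + sig_bits          # sign + exponent + stored bits
    precision = sig_bits + 1                 # plus the implicit leading 1
    decimal_digits = math.floor(precision * math.log10(2))
    return total, precision, decimal_digits


for name, e, s in FORMATS:
    total, precision, digits = summarize(e, s)
    print(f"{name:12} width={total:6}  precision={precision:6} bits"
          f"  (~{digits} decimal digits)")
```

The decimal-digit figure here is floor(precision × log10(2)), one common convention; other conventions can differ by one digit.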

retina · 2022-04-30, 04:25

The lack of negative zero will be a problem for some algorithms.

And if NaN always, or never, faults then it could be cumbersome. The existing QNaN vs SNaN allows for some nice efficiencies. And the extra bits in the QNaN/SNaN encoding provide good debugging opportunities.

I don't care about denormals, so whatever.

But -0 == NaN? Why? Extra circuitry/code for what gain?
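To illustrate the negative-zero point with the existing IEEE 754 doubles that Python floats use, here is a small sketch. It only shows current behaviour on ordinary floats and says nothing about how the proposed format would handle these cases.

```python
import math

pos, neg = 0.0, -0.0

# The two zeros compare equal as values...
print(pos == neg)

# ...but the sign is still observable and some functions depend on it.
print(math.copysign(1.0, neg))   # sign of negative zero
print(math.atan2(0.0, pos))      # angle when approaching from the right
print(math.atan2(0.0, neg))      # angle when approaching from the left
```

Code that relies on `atan2` or on the sign of an underflowed result to pick a branch is the kind of algorithm that would need rethinking without a signed zero.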

sweety439 · 2022-04-30, 04:33

In super-precision:

Numbers are rounded to 65280 binary significant digits using Gaussian rounding (round half to even), i.e. (the first digit shown in each pattern is the 65280th binary significant digit)

* ...00... --> ...0
* ...01...1... --> ...1
* ...01000... (the digits after the only 1 in the 65281st bit are all 0) --> ...0
* ...11... --> ...(+1)0
* ...10...0... --> ...1
* ...10111... (the digits after the only 0 in the 65281st bit are all 1) --> ...(+1)0

(remember: 0.999... = 1)
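The rounding rule above can be sketched on plain integers. `round_to_sig_bits` is a hypothetical helper, and a small precision such as 4 stands in for 65280 so the cases are easy to follow; the inputs are finite, so the infinite-tail patterns (such as ...10111...) are not modelled.

```python
def round_to_sig_bits(n: int, p: int) -> int:
    """Round a non-negative integer to p significant bits, ties to even."""
    if n < 0 or p < 1:
        raise ValueError("expects n >= 0 and p >= 1")
    drop = n.bit_length() - p          # how many low bits must go
    if drop <= 0:
        return n                       # already fits in p bits
    kept = n >> drop                   # the p leading bits
    remainder = n & ((1 << drop) - 1)  # the discarded tail
    half = 1 << (drop - 1)
    # Round up when above half, or exactly half with an odd last kept bit.
    if remainder > half or (remainder == half and kept & 1):
        kept += 1                      # may carry into a new leading bit
    return kept << drop


for n in (0b10000, 0b10001, 0b10011, 0b100011, 0b11111):
    print(f"{n:#b} -> {round_to_sig_bits(n, 4):#b}")
```

The last input shows the carry case: all kept bits are 1 and the tail is exactly half, so the result gains a leading bit, which is the situation behind the overflow-to-+∞ rule.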

These calculations return "+∞":

* the result is >= 2^(2^255) (in fact, >= 2^(2^255)-2^(2^255-65281), since we must use Gaussian rounding to round to 65280 binary significant digits: 2^(2^255)-2^(2^255-65281) (which has 65281 consecutive 1's after the "0" in the 2^(2^255) bit) becomes 2^(2^255) and thus +∞, since the 2^(2^255-65280) digit is 1 and the digits after it are exactly a half, so it is rounded up)
* (+∞) + (x) (except the cases x = -∞ and x = NaN)
* (+∞) - (x) (except the cases x = +∞ and x = NaN)
* (x) - (-∞) (except the cases x = -∞ and x = NaN)
* (+∞) * (x) when x > 0 (including x = +∞)
* (-∞) * (x) when x < 0 (including x = -∞)
* (+∞) / (x) when x >= 0 (except x = +∞)
* (-∞) / (x) when x < 0 (except x = -∞)
* (x) / (0) when x > 0 (including x = +∞)
* (+∞) ^ (x) when x > 0 (including x = +∞)
* (x) ^ (+∞) when x > 1 (including x = +∞)
* (x) ^ (-∞) when 0 <= x < 1

These calculations return "0":

* the result is between -2^(-2^255) and 2^(-2^255) inclusive (in fact, between -2^(-2^255)-2^(-2^255-65280) and 2^(-2^255)+2^(-2^255-65280) inclusive, since we must use Gaussian rounding to round to 65280 binary significant digits: 2^(-2^255)+2^(-2^255-65280) (which has 65279 consecutive 0's after the "1" in the 2^(-2^255) bit, followed by a single 1) becomes 2^(-2^255), whose bit combo is 0, since the 2^(-2^255-65279) digit is 0 and the digits after it are exactly a half, so it is rounded down)
* (x) / (+∞)
* (x) / (-∞)
* (x) ^ (+∞) when 0 <= x < 1
* (x) ^ (-∞) when x > 1 (including x = +∞)

These calculations return "NaN":

* at least one number is NaN
* the result number is complex number, e.g. (-1)^(1/2)
* (+∞) + (-∞)
* (+∞) - (+∞)
* (+∞) * (0)
* (+∞) / (+∞)
* (0) / (0)
* (0) ^ (0)
* (+∞) ^ (0)
* 1 ^ (+∞)
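The "^" rules from the three lists above could be sketched as follows, with Python floats standing in for the proposed type. `proposed_pow` is a hypothetical name; combinations the post does not list simply fall through to Python's own `**`, which is an assumption of this sketch and not part of the proposal.

```python
import math


def proposed_pow(x: float, y: float) -> float:
    """Power function following the special cases listed in the post."""
    x, y = float(x), float(y)
    nan, inf = math.nan, math.inf

    if math.isnan(x) or math.isnan(y):
        return nan
    # Forms the post maps to NaN: 0^0, (+inf)^0, 1^(+inf).
    if y == 0 and (x == 0 or x == inf):
        return nan
    if x == 1 and y == inf:
        return nan
    # Infinite exponent with a non-negative base (x == 1 handled above).
    if y == inf and x >= 0:
        return inf if x > 1 else 0.0
    if y == -inf and x >= 0 and x != 1:
        return 0.0 if x > 1 else inf
    # (+inf)^x for x > 0.
    if x == inf and y > 0:
        return inf
    # A complex result, e.g. (-1)^(1/2), becomes NaN.
    if x < 0 and math.isfinite(y) and not y.is_integer():
        return nan
    # Unlisted cases: defer to Python (may raise, e.g. for 0.0 ** -1.0).
    return x ** y


print(proposed_pow(0, 0), 0.0 ** 0.0)
print(proposed_pow(1, math.inf), 1.0 ** math.inf)
```

The two printed pairs show where these rules differ from what Python floats currently do: Python returns 1.0 for `0.0 ** 0.0` and `1.0 ** math.inf`, while the rules above give NaN.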

sweety439 · 2022-04-30, 04:33

Quote:

Originally Posted by retina

The lack of negative zero will be a problem for some algorithms.

And if NaN always, or never, faults then it could be cumbersome. The existing QNaN vs SNaN allows for some nice efficiencies. And the extra bits in the QNaN/SNaN encoding provide good debugging opportunities.

I don't care about denormals, so whatever.

But -0 == NaN? Why? Extra circuitry/code for what gain?

The value of -0 is the same as +0, thus no need to use a bit combo for -0

retina · 2022-04-30, 04:48

Quote:

Originally Posted by sweety439

The value of -0 is the same as +0, thus no need to use a bit combo for -0

I mean the way it is handled in the machine, not the value.

You need something to manipulate the bits to do the computations. If you make the bit patterns hard to deal with then it is no fun to use. Too many special cases in the code or the circuitry.


