Cornell University
Electrical Engineering 4760
Short Floating Point mathematical functions
in GCC and assembler
Introduction
Floating point arithmetic on the AVR Mega series is fairly fast in GCC if the library libm.a is linked in. An IEEE floating point multiply takes about 125 cycles and an IEEE floating point add takes 75 to 275 cycles. The variability in the add times is due to the need to do extensive normalization shifting if either: (a) the two input values are much different in magnitude or (b) a subtraction of two almost equal values leaves a small result. At the other extreme of arithmetic, a fixed point multiply in 8:8 format (16 bits with 8 bits of fraction) takes about 16 cycles and the 8:8 add takes two cycles. It would be nice to have a faster floating point for performing factored second-order-section digital filtering, or for video games, where IEEE precision is not usually needed, but where fixed point is too restrictive.
Two systems were implemented: 16 bit mantissa and 8 bit mantissa. Overall, I would say that the 16 bit version is more useful. It is only slightly slower, but much more accurate.
16 bit mantissa float representation
The floating format with 16 bits of mantissa, 7 bits of exponent, and a sign bit, is stored in the space of a 32-bit long integer. This format gives a factor of 2.5-3 speed up in multiplication (over IEEE) and a speed up of about a factor of 1.3-4.0 for addition. The speed for the multiply is about 35 cycles. The speed for the add is 35-106 cycles. My short float operations do not support overflow, denorm, or infinity detection (but underflow is detected and the value set to zero).
This section will concentrate on numbers stored as 32-bit long ints. The lower 16 bits are the mantissa (more properly, significand). The mantissa value is considered a binary fraction with values 0.5<=mantissa<1.0. The top 8 bits are the exponent, but the top bit is used for overflow during the calculation, so the exponent range is 0x00 to 0x7f, or about 10^-18 to 10^18. The sign bit is stored in bit 23 (the high bit of the 3rd byte). The high order bit of the significand is always one (unless the actual value is zero), because there are no denorms allowed. Typical numbers are shown below.
Examples:
Decimal Value | Short float Representation
0.0 | 0x0000_0000
1.0 | 0x3f00_8000
1.5 | 0x3f00_c000
10000 | 0x4c00_9c40
1.0001 | 0x3f00_8003
-1.0 | 0x3f80_8000
-1.5 | 0x3f80_c000
1e-18 | 0x0300_9392
-1e-18 | 0x0380_9392
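To make the bit layout concrete, here is a minimal C sketch that decodes a 16 bit mantissa short float into a double. It is not part of the original code. The function name sfp32_to_double is hypothetical, and the exponent offset of 0x3e is an assumption inferred from the table (1.0 is stored as exponent 0x3f with mantissa 0.5, so 0x3f must mean 2^1).

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper, not from the original page.
   Assumed layout: bits 31..24 exponent (top bit unused in storage),
   bit 23 sign, bits 15..0 mantissa as a binary fraction 0.5 <= m < 1.0.
   Assumed exponent offset: 0x3e, inferred from 1.0 == 0x3f00_8000. */
double sfp32_to_double(uint32_t x)
{
    uint32_t mant = x & 0xffffu;          /* 16-bit significand */
    int exp = (int)((x >> 24) & 0x7fu);   /* 7-bit offset-binary exponent */
    int negative = (int)((x >> 23) & 1u); /* sign lives in bit 23 */

    if (mant == 0) {
        return 0.0;                       /* all-zero mantissa means zero */
    }
    /* value = (mant / 2^16) * 2^(exp - 0x3e) */
    double v = ldexp((double)mant / 65536.0, exp - 0x3e);
    return negative ? -v : v;
}
```

Under these assumptions, sfp32_to_double(0x4c009c40) gives exactly 10000, and sfp32_to_double(0x3f008003) gives 1.000091552734375, which shows the roughly 4.5 decimal digits of precision available in a 16-bit significand.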
16 bit float code
Test program: This program includes float to short-float (fp2sfp), short-float to float (sfp2fp), and negate (neg_sfp).
It is used to check accuracy and performance of the short-float operations.
The typedefs:
    typedef unsigned long sfp;
    // declare mult routine
    sfp mult_sfp(sfp, sfp);
    // declare add routine
    sfp add_sfp(sfp, sfp);
    // sfp format is:
    // top byte is exponent, range +63/-64 (7 bits, offset binary)
    // third byte has sign bit in top bit
    // lower two bytes are mantissa fraction, normalized so that
    // the top mantissa bit is ALWAYS one, unless the value is zero
    // A zero is represented by all zero mantissa
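As an illustration of the format described in these comments, the following C sketch packs a double into the 32-bit short float layout. It is not the fp2sfp routine from the test program, which is not reproduced here. The name double_to_sfp, the use of frexp, the exponent offset of 0x3e, and truncation (rather than rounding) of the mantissa are all assumptions, chosen because they reproduce the example table.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical encoder, not the page's fp2sfp.
   Assumes exponent offset 0x3e and mantissa truncation.
   Overflow, infinity and NaN are not handled, matching the
   text's statement that overflow is not detected. */
uint32_t double_to_sfp(double v)
{
    int e;
    double m;
    int exp_field;
    uint32_t mant, sign;

    if (v == 0.0) {
        return 0;                         /* zero is an all-zero word */
    }
    m = frexp(fabs(v), &e);               /* fabs(v) == m * 2^e, 0.5 <= m < 1.0 */
    exp_field = e + 0x3e;                 /* offset-binary exponent */
    if (exp_field < 0) {
        return 0;                         /* underflow is flushed to zero */
    }
    mant = (uint32_t)(m * 65536.0);       /* keep the top 16 fraction bits */
    sign = (v < 0.0) ? 0x00800000u : 0u;  /* sign bit is bit 23 */
    return ((uint32_t)exp_field << 24) | sign | mant;
}
```

Because frexp already returns a fraction in the range 0.5<=m<1.0, the top mantissa bit comes out set without any explicit normalization loop.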
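multiply routine: This assembler routine multiplies two short-floats.
The mantissa product (mantissa_a)x(mantissa_b) is formed first, and must be 0.25<=product<1.0.
If the product is 0.5 or greater, it is already normalized and the result exponent is exp_input_a + exp_input_b - 0x3e.
Otherwise the product is normalized with product<<1 and the result exponent is exp_input_a + exp_input_b - 0x3f.
The result sign is (sign_a) xor (sign_b).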
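The multiply steps can be modeled in C as a rough sketch. This is not the assembler routine itself. The name mult_sfp_model is hypothetical, and the zero check, the truncation of the low product bits, and the underflow test are assumptions about details the description does not spell out.

```c
#include <stdint.h>

/* C model of the multiply steps, for illustration only.
   Assumed layout: bits 31..24 exponent, bit 23 sign, bits 15..0 mantissa.
   Exponent overflow is not checked, as the text says it is unsupported. */
uint32_t mult_sfp_model(uint32_t a, uint32_t b)
{
    uint32_t ma = a & 0xffffu;
    uint32_t mb = b & 0xffffu;
    uint32_t prod, sign;
    int exp;

    if (ma == 0 || mb == 0) {
        return 0;                          /* anything times zero is zero */
    }
    /* Both mantissas have their top bit set, so as a fraction of 2^32
       the product satisfies 0.25 <= product < 1.0. */
    prod = ma * mb;
    exp = (int)((a >> 24) & 0x7fu) + (int)((b >> 24) & 0x7fu);

    if (prod & 0x80000000u) {
        exp -= 0x3e;                       /* product >= 0.5, already normalized */
    } else {
        prod <<= 1;                        /* product < 0.5, shift up one bit */
        exp -= 0x3f;
    }
    if (exp < 0) {
        return 0;                          /* underflow is set to zero */
    }
    sign = (a ^ b) & 0x00800000u;          /* sign_a xor sign_b */
    return ((uint32_t)exp << 24) | sign | (prod >> 16);
}
```

For example, 1.5 x 1.5 is 0x3f00_c000 times itself: the mantissa product 0.75 x 0.75 = 0.5625 needs no shift, the exponent becomes 0x3f + 0x3f - 0x3e = 0x40, and the result 0x4000_9000 is 0.5625 x 4 = 2.25.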
add routine: This assembler routine adds two short-floats.
When both inputs have the same sign, the sum of the aligned mantissas is in the range 0.5<sum<2.0.
8 bit mantissa float representation
The floating format with 8 bits of mantissa, 7 bits of exponent, and a sign bit, is stored in the space of a 16-bit unsigned integer. The speed for the multiply is about 30 cycles. The speed for the add is 30-80 cycles. My short float operations do not support overflow, denorm, or infinity detection (but underflow is detected and the value set to zero).
This section will concentrate on numbers stored as 16-bit ints. The lower 8 bits are the mantissa (more properly, significand). The mantissa value is considered a binary fraction with values 0.5<=mantissa<1.0. The top 8 bits are the exponent, but the top bit is used for the sign bit, so the exponent range is 0x00 to 0x7f, or about 10^-18 to 10^18. The high order bit of the significand is always one (unless the actual value is zero), because there are no denorms allowed. Typical numbers are shown below.
Examples:
Decimal Value | Short float Representation
0.0 | 0x0000
1.0 | 0x3f80
1.5 | 0x3fc0
10000 | 0x4c9c
1.01 | 0x3f81
-1.0 | 0xbf80
-1.5 | 0xbfc0
1e-17 | 0x06b8
-1e-18 | 0x8393
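The 16-bit layout can be decoded the same way. The sketch below is not part of the original code. The name sfp16_to_double is hypothetical, and the exponent offset of 0x3e is again an assumption inferred from the table (1.0 is stored as 0x3f80).

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper, not from the original page.
   Assumed layout: bit 15 sign, bits 14..8 exponent (offset 0x3e),
   bits 7..0 mantissa as a binary fraction 0.5 <= m < 1.0. */
double sfp16_to_double(uint16_t x)
{
    unsigned mant = x & 0xffu;            /* 8-bit significand */
    int exp = (x >> 8) & 0x7f;            /* 7-bit offset-binary exponent */
    double v;

    if (mant == 0) {
        return 0.0;                       /* all-zero mantissa means zero */
    }
    /* value = (mant / 2^8) * 2^(exp - 0x3e) */
    v = ldexp((double)mant / 256.0, exp - 0x3e);
    return (x & 0x8000u) ? -v : v;
}
```

Decoding the table entries shows how coarse the 8-bit significand is: under these assumptions 0x4c9c (listed as 10000) decodes to 9984, and 0x3f81 (listed as 1.01) decodes to 1.0078125.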
8-bit float code
Copyright Cornell University
June 10, 2011
Bruce R. Land