AVR floating point

Cornell University
Electrical Engineering 4760
Short Floating Point mathematical functions
in GCC and assembler

Introduction

Floating point arithmetic on the AVR Mega series is fairly fast in GCC if the library libm.a is linked in. An IEEE floating point multiply takes about 125 cycles and an IEEE floating point add takes 75 to 275 cycles. The variability in the add times is due to the need to do extensive normalization shifting if either: (a) the two input values differ greatly in magnitude or (b) a subtraction of two almost equal values leaves a small result. At the other extreme of arithmetic, a fixed point multiply in 8:8 format (16 bits with 8 bits of fraction) takes about 16 cycles and the 8:8 add takes two cycles. It would be nice to have a faster floating point for performing factored second-order-section digital filtering, or for video games, where IEEE precision is not usually needed, but where fixed point is too restrictive.
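
For comparison, the 8:8 fixed point operations mentioned above can be sketched in C as follows. This is only an illustration of the format; the names fix8_8, fix_add and fix_mul are hypothetical and are not part of the code on this page.

```c
#include <stdint.h>

/* Hypothetical 8:8 type: 8 integer bits and 8 fraction bits,
   so 1.0 is stored as 0x0100 and 1.5 as 0x0180. */
typedef int16_t fix8_8;

/* Add: a plain 16-bit integer add, no scaling needed. */
fix8_8 fix_add(fix8_8 a, fix8_8 b) {
    return (fix8_8)(a + b);
}

/* Multiply: form the 32-bit product (16 fraction bits), then drop
   8 fraction bits. Right-shifting a negative value is
   implementation-defined in C; GCC uses an arithmetic shift. */
fix8_8 fix_mul(fix8_8 a, fix8_8 b) {
    return (fix8_8)(((int32_t)a * (int32_t)b) >> 8);
}
```

A signed 8:8 value only covers -128 to just under +128 in steps of 1/256, which is the restriction that motivates a short floating point format.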

Two systems were implemented: 16 bit mantissa and 8 bit mantissa. Overall, I would say that the 16 bit version is more useful. It is only slightly slower, but much more accurate.

16 bit mantissa float representation

The floating format with 16 bits of mantissa, 7 bits of exponent, and a sign bit, is stored in the space of a 32-bit long integer. This format gives a factor of 2.5-3 speed up in multiplication (over IEEE) and a speed up of about a factor of 1.3-4.0 for addition. The speed for the multiply is about 35 cycles. The speed for the add is 35-106 cycles. My short float operations do not support overflow, denorm, or infinity detection (but underflow is detected and the value set to zero).

This section will concentrate on numbers stored as 32-bit long ints. The lower 16 bits are the mantissa (more properly, significand). The mantissa value is considered a binary fraction with values 0.5<=mantissa<1.0. The top 8 bits are the exponent, but the top bit is used for overflow during the calculation, so the exponent range is 0x00 to 0x7f, or about 10^-18 to 10^18. The sign bit is stored in bit 23 (the high bit of the third byte). The high order bit of the significand is always one (unless the actual value is zero), because there are no denorms allowed. Typical numbers are shown below.

Examples:

Decimal Value	Short float Representation
`0.0`	`0x0000_0000`
`1.0`	`0x3f00_8000`
`1.5`	`0x3f00_c000`
`10000`	`0x4c00_9c40`
`1.0001`	`0x3f00_8003`
`-1.0`	`0x3f80_8000`
`-1.5`	`0x3f80_c000`
`1e-18`	`0x0300_9392`
`-1e-18`	`0x0380_9392`
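
The table entries can be checked with a small decoder. The sketch below is a desktop C model, not part of the test program; the name sfp_to_double is hypothetical, and the exponent offset of 0x3e is inferred from the table (1.0 is stored as mantissa 0.5 with exponent 0x3f). It uses uint32_t because unsigned long is 32 bits on the AVR but may be wider elsewhere.

```c
#include <stdint.h>

typedef uint32_t sfp;   /* 32 bits, like unsigned long on the AVR */

/* Hypothetical helper: convert a 16-bit-mantissa short float to double. */
double sfp_to_double(sfp x) {
    uint16_t mant = (uint16_t)(x & 0xffffu);   /* fraction, 0.5 <= m < 1.0 */
    int exp = (int)((x >> 24) & 0x7f);         /* 7-bit offset exponent */
    int neg = (int)((x >> 23) & 1u);           /* sign bit, bit 23 */
    int shift = exp - 0x3e;                    /* offset inferred from the table */
    double v = mant / 65536.0;

    /* Scale by powers of two; each step is exact in binary floating point. */
    while (shift > 0) { v *= 2.0; shift--; }
    while (shift < 0) { v /= 2.0; shift++; }
    return neg ? -v : v;
}
```

For instance, sfp_to_double(0x4c009c40) gives exactly 10000.0, since 0x9c40/65536 times 2^14 is 10000.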

16 bit float code

Test program: This program includes float to short-float (fp2sfp), short-float to float (sfp2fp), and negate (neg_sfp).
It is used to check accuracy and performance of the short-float operations.
The typedefs:

typedef unsigned long sfp;
// declare mult routine
sfp mult_sfp(sfp, sfp);
// declare add routine
sfp add_sfp(sfp, sfp);
// sfp format is:
//   top byte is exponent, range +63/-64 (7 bits, offset binary)
//   third byte has sign bit in top bit
//   lower two bytes are mantissa fraction, normalized so that
//   the top mantissa bit is ALWAYS one, unless the value is zero
// A zero is represented by all zero mantissa
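
As a rough illustration of the format described in the comments, a conversion into short-float can be sketched in portable C. This is not the fp2sfp routine from the test program; the name double_to_sfp, the truncation of the mantissa, and the saturation on overflow are all assumptions made for the example. It expects a non-NaN input.

```c
#include <stdint.h>

typedef uint32_t sfp;   /* 32 bits, like unsigned long on the AVR */

/* Hypothetical encoder: double to 16-bit-mantissa short float. */
sfp double_to_sfp(double x) {
    sfp sign = 0;
    int exp = 0x3e;              /* exponent code for a value in [0.5, 1.0) */
    uint16_t mant;

    if (x == 0.0) return 0;      /* zero is an all-zero mantissa */
    if (x < 0.0) { sign = (sfp)1 << 23; x = -x; }

    /* Normalize so that 0.5 <= x < 1.0, adjusting the exponent. */
    while (x >= 1.0 && exp < 0x7f) { x /= 2.0; exp++; }
    if (x >= 1.0)                /* too large: saturate (an assumption) */
        return ((sfp)0x7f << 24) | sign | 0xffffu;
    while (x < 0.5) { x *= 2.0; exp--; }
    if (exp < 0) return 0;       /* underflow is set to zero */

    mant = (uint16_t)(x * 65536.0);   /* keep the top 16 bits, truncating */
    return ((sfp)exp << 24) | sign | mant;
}
```

With truncation, this reproduces the entries in the examples table, for example 0x3f00_8003 for 1.0001 and 0x0300_9392 for 1e-18.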

multiply routine: This assembler routine multiplies two short-floats.

If the sum of the input exponents is less than 0x3f then the exponent will underflow and the product is zero.
If the exponents don't underflow:
1. The result, (mantissa_a)x(mantissa_b), must be 0.25<=product<1.0
2. Then if the product has the high-order bit set, the output mantissa is the product and the output exponent is exp_input_a + exp_input_b - 0x3e.
3. Otherwise the second bit of the product will be set, and the output mantissa is the product<<1
  and the output exponent is exp_input_a + exp_input_b - 0x3f.
4. The sign of the product is (sign_a) xor (sign_b)
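
The steps above can be modeled in C for checking results on a desktop compiler. This is a sketch, not the assembler routine: the name mult_sfp_c, the uint32_t typedef, and the explicit test for a zero operand are assumptions made for illustration.

```c
#include <stdint.h>

typedef uint32_t sfp;   /* 32 bits, like unsigned long on the AVR */

/* Illustrative C model of the multiply steps. */
sfp mult_sfp_c(sfp a, sfp b) {
    uint32_t ma = a & 0xffffu, mb = b & 0xffffu;
    int ea = (int)((a >> 24) & 0x7f);
    int eb = (int)((b >> 24) & 0x7f);
    sfp sign = (a ^ b) & ((sfp)1 << 23);    /* step 4: xor of the signs */
    uint32_t prod;
    int exp;

    if (ma == 0 || mb == 0) return 0;       /* assumed: zero operand gives zero */
    if (ea + eb < 0x3f) return 0;           /* exponent underflow gives zero */

    prod = ma * mb;                         /* step 1: 0.25 <= product < 1.0 */
    if (prod & 0x80000000u) {               /* step 2: high-order bit set */
        exp = ea + eb - 0x3e;
    } else {                                /* step 3: normalize by one bit */
        prod <<= 1;
        exp = ea + eb - 0x3f;
    }
    /* Overflow is not detected; the exponent simply wraps to 7 bits. */
    return ((sfp)(exp & 0x7f) << 24) | sign | (prod >> 16);
}
```

As a check, 1.5 x 1.5 is 0x3f00_c000 times itself: the mantissa product is 0.5625 with the high bit set, so the result is 0x4000_9000, which is 0.5625 x 2^2 = 2.25.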

add routine: This assembler routine adds two short-floats.

If either input is zero, the output is the other input.
Determine which input is bigger, which smaller (absolute value) by first comparing the exponents, then the mantissas if necessary.
Determine the difference in the exponents and shift the smaller input mantissa right by the difference.
But if the exponent difference is greater than 15 then just output the bigger input.
If the signs of the inputs are the same, add the bigger and (shifted) smaller mantissas.
The result must be 0.5<sum<2.0.
If the result is greater than or equal to one, shift the mantissa sum right one bit and increment the bigger exponent.
The sign is the sign of either input.
If the signs of the inputs are different, subtract the bigger and (shifted) smaller mantissas so that the result is always positive.
The result must be 0.0<=difference<1.0 (zero when the inputs have equal magnitude).
Shift the mantissa left until the high bit is set, while decrementing the bigger exponent.
The sign is the sign of the bigger input.
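
The add steps can be modeled the same way in C. Again this is only a sketch for checking results, not the assembler routine: the name add_sfp_c, the single masked compare used to order the inputs, and returning zero on exact cancellation are assumptions made for illustration. It also assumes bits 16 to 22 of each input are zero.

```c
#include <stdint.h>

typedef uint32_t sfp;           /* 32 bits, like unsigned long on the AVR */

#define SFP_SIGN 0x00800000u    /* sign: top bit of the third byte */
#define SFP_MAG  0x7f00ffffu    /* exponent and mantissa, sign removed */

/* Illustrative C model of the add steps. */
sfp add_sfp_c(sfp a, sfp b) {
    sfp hi = a, lo = b;         /* hi = bigger magnitude, lo = smaller */
    uint32_t mhi, mlo, sum;
    int exp, diff;

    if ((a & 0xffffu) == 0) return b;    /* zero input: return the other */
    if ((b & 0xffffu) == 0) return a;

    /* The exponent sits above the mantissa, so one compare orders both. */
    if ((a & SFP_MAG) < (b & SFP_MAG)) { hi = b; lo = a; }

    exp  = (int)((hi >> 24) & 0x7f);
    diff = exp - (int)((lo >> 24) & 0x7f);
    if (diff > 15) return hi;            /* smaller input is negligible */

    mhi = hi & 0xffffu;
    mlo = (lo & 0xffffu) >> diff;        /* align the binary points */

    if (((a ^ b) & SFP_SIGN) == 0) {     /* same signs: add */
        sum = mhi + mlo;
        if (sum & 0x10000u) { sum >>= 1; exp++; }   /* sum >= 1.0 */
    } else {                             /* different signs: subtract */
        sum = mhi - mlo;
        if (sum == 0) return 0;          /* assumed: exact cancel gives zero */
        while (!(sum & 0x8000u)) { sum <<= 1; exp--; }
        if (exp < 0) return 0;           /* underflow is set to zero */
    }
    /* Overflow is not detected; the exponent simply wraps to 7 bits. */
    return ((sfp)(exp & 0x7f) << 24) | (hi & SFP_SIGN) | sum;
}
```

For example, 10000 + 1.0 shifts the smaller mantissa right by 13 bits (0x8000 becomes 0x0004), giving 0x4c00_9c44, which is 10001.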

8 bit mantissa float representation

The floating format with 8 bits of mantissa, 7 bits of exponent, and a sign bit, is stored in the space of a 16-bit unsigned integer. The speed for the multiply is about 30 cycles. The speed for the add is 30-80 cycles. My short float operations do not support overflow, denorm, or infinity detection (but underflow is detected and the value set to zero).

This section will concentrate on numbers stored as 16-bit ints. The lower 8 bits are the mantissa (more properly, significand). The mantissa value is considered a binary fraction with values 0.5<=mantissa<1.0. The top 8 bits hold the exponent, but the top bit is used as the sign bit, so the exponent range is 0x00 to 0x7f, or about 10^-18 to 10^18. The high order bit of the significand is always one (unless the actual value is zero), because there are no denorms allowed. Typical numbers are shown below.

Examples:

Decimal Value	Short float Representation
`0.0`	`0x0000`
`1.0`	`0x3f80`
`1.5`	`0x3fc0`
`10000`	`0x4c9c`
`1.01`	`0x3f81`
`-1.0`	`0xbf80`
`-1.5`	`0xbfc0`
`1e-17`	`0x06b8`
`-1e-18`	`0x8393`
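
These entries can be checked with a small decoder as well. The sketch below is a desktop C model, not part of the test program; the names sfp8 and sfp8_to_double are hypothetical, and the exponent offset of 0x3e is inferred from the table (1.0 is stored as mantissa 0.5 with exponent 0x3f).

```c
#include <stdint.h>

typedef uint16_t sfp8;   /* hypothetical name for the 16-bit storage type */

/* Hypothetical helper: convert an 8-bit-mantissa short float to double. */
double sfp8_to_double(sfp8 x) {
    uint8_t mant = (uint8_t)(x & 0xffu);     /* fraction, 0.5 <= m < 1.0 */
    int exp = (int)((x >> 8) & 0x7f);        /* 7-bit offset exponent */
    int neg = (int)((x >> 15) & 1u);         /* sign bit, top bit of the word */
    int shift = exp - 0x3e;                  /* offset inferred from the table */
    double v = mant / 256.0;

    /* Scale by powers of two; each step is exact in binary floating point. */
    while (shift > 0) { v *= 2.0; shift--; }
    while (shift < 0) { v /= 2.0; shift++; }
    return neg ? -v : v;
}
```

Decoding 0x4c9c gives 9984 rather than 10000, and 0x3f81 gives 1.0078125 rather than 1.01, which shows how coarse the 8 bit mantissa is compared with the 16 bit version.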

8-bit float code

Test program

multiply routine

add routine