Cornell University ECE4760

Floating Point Arithmetic

Pi Pico RP2040

**Floating point on RP2040**

There is full support in the ROM for IEEE 32-bit floating point (fp32). What we are attempting to do here is to see whether 16-bit floating-point formats offer useful range and resolution with faster execution. For speed comparison, a fixed-point format, s15x16, is included.

The system implemented is 16-bit floating point (fp16), with a bit layout similar to the IEEE standard FP16, but without infinities, NaNs, or denorms. The reason for doing this annoying exercise is to see if the Lattice-Boltzmann solver runs better in limited-precision (and therefore faster) floating point than in fixed point. The 16-bit floats have a dynamic range of 1e5 and a resolution of 1e-4. The bit layout is

[sign 1-bit | exponent 5-bits | mantissa 11-bits (10-bits actually stored)]

The sign bit is one for negative numbers. The exponent is in offset binary. The maximum exponent, 2^{15}, is 0b11111, and the minimum, 2^{-15}, is 0b00000. Because there are no denorms, the most significant bit of the mantissa is always one and does not need to be stored. The mantissa is therefore a value between 1 and 2, with 10-bit resolution. We implemented:

- add/subtract
- multiply
- divide
- inverse sqrt
- compare
- shift -- This is an unusual float operator equivalent to mult/div by powers of two
- negate
- absolute value
- integer <> fp16 and IEEE <> fp16
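As a rough illustration of the bit layout and the shift operator listed above, here is a minimal sketch in C. It is not the code from the ZIP file. The names `fp16_to_float` and `fp16_shift` are hypothetical, the exponent offset of 15 is taken from the isqrt section, and the decode rule value = (1 + mantissa/1024) * 2^(exponent - 15) is an assumption. How zero is encoded is not stated in the text, so the sketch leaves it out, and clamping the exponent on overflow is likewise an assumption.

```c
#include <stdint.h>
#include <math.h>

#define FP16_BIAS      15   /* assumed exponent offset */
#define FP16_MANT_BITS 10   /* stored mantissa bits */

/* Hypothetical helper: decode a 1-5-10 bit pattern into a C float.
   The hidden leading one is always present because there are no denorms.
   Zero is not handled here (its encoding is not given in the text). */
static float fp16_to_float(uint16_t h) {
    int sign = (h >> 15) & 1;
    int exp  = (h >> FP16_MANT_BITS) & 0x1F;
    int mant = h & 0x3FF;
    float v  = ldexpf(1.0f + (float)mant / 1024.0f, exp - FP16_BIAS);
    return sign ? -v : v;
}

/* Hypothetical helper: the "shift" operator. Multiplying or dividing by
   2^n only changes the exponent field; sign and mantissa are untouched.
   Assumption: the exponent is clamped to 0..31 rather than wrapping. */
static uint16_t fp16_shift(uint16_t h, int n) {
    int exp = ((h >> FP16_MANT_BITS) & 0x1F) + n;
    if (exp < 0)  { exp = 0; }
    if (exp > 31) { exp = 31; }
    return (uint16_t)((h & 0x83FF) | (exp << FP16_MANT_BITS));
}
```

Under these assumptions the pattern 0x3C00 (exponent field 15, mantissa 0) decodes to 1.0, and `fp16_shift(0x3C00, 2)` yields the pattern for 4.0 with no multiply at all, which is why the shift operator is cheap.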

A change in exponent bias shifts the range of the 16-bit floats to +/-2, with a resolution of 1e-9. The exponent is in offset binary. The maximum exponent, 2^{0}, is 0b11111, and the minimum, 2^{-31}, is 0b00000.
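The effect of the bias on the representable magnitudes can be checked with a few lines of C. This is only a sketch: `fp16_limits` is a hypothetical helper, and it assumes the value is (1 + mantissa/1024) * 2^(exponent_field - bias) with exponent fields 0 through 31 all usable.

```c
#include <math.h>

/* Hypothetical helper: smallest and largest magnitudes of a 1-5-10 float
   with no denorms, for a given exponent bias (assumed decode rule:
   value = (1 + mant/1024) * 2^(exp_field - bias)). */
static void fp16_limits(int bias, float *smallest, float *largest) {
    /* exponent field 0, mantissa exactly 1.0 */
    *smallest = ldexpf(1.0f, 0 - bias);
    /* exponent field 31, stored mantissa bits all ones */
    *largest  = ldexpf(2.0f - 1.0f / 1024.0f, 31 - bias);
}
```

With a bias of 31 this gives a largest magnitude just under 2 and a smallest magnitude of 2^{-31}, about 4.7e-10, which matches the range and resolution quoted above.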

The s15x16 format gives good dynamic range for animation and high accuracy for fast cutoff IIR filters. The format s15x16 means 16 bits to the left of the binary point, one of which is the sign bit, and 16 bits to the right. The range is +32767.9999 to -32768.0 and the resolution is 2^{-16}=1.5e-5. The range is enough for addressing pixels on a VGA screen, for example, without worrying too much about overflow.
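A minimal sketch of s15x16 arithmetic in C follows. The type name `fix16` and the helper names are hypothetical and are not taken from the course code. The sketch assumes that right-shifting a negative signed integer is an arithmetic shift, which holds for GCC on the RP2040 but is implementation-defined in C.

```c
#include <stdint.h>

typedef int32_t fix16;   /* hypothetical name: s15x16 stored in 32 bits */

/* Conversions: one unit of the integer is 2^-16. */
static inline fix16 float_to_fix16(float a) { return (fix16)(a * 65536.0f); }
static inline float fix16_to_float(fix16 a) { return (float)a / 65536.0f; }
static inline fix16 int_to_fix16(int a)     { return (fix16)(a * 65536); }
static inline int   fix16_to_int(fix16 a)   { return a >> 16; } /* assumes arithmetic shift */

/* Add and subtract are ordinary integer operations on fix16 values. */

/* Multiply: form the 64-bit product, then drop the extra 16 fraction bits. */
static inline fix16 fix16_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * (int64_t)b) >> 16);
}

/* Divide: pre-scale the numerator by 2^16 so the quotient keeps 16 fraction bits. */
static inline fix16 fix16_div(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * 65536) / b);
}
```

Addition being a single integer add is consistent with fixed point being much faster than floating add, where exponents have to be aligned first.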

**Execution speeds**

Execution speeds are estimates because the execution path through floating routines, particularly floating add, is variable. Fp32 and fp16 addition have about the same execution speed, while s15x16 is ten times faster. Fp16 and s15x16 have about the same multiply speed, and are about three times faster than fp32. For speed, fixed point is the clear winner. If memory footprint matters, e.g. for Lattice-Boltzmann, fp16 may have the advantage.

**Inverse-square-root (isqrt)**

There is a weird bit-twiddling trick that will extract the isqrt of numbers in fp32 or fp16 format. It is explained in https://en.wikipedia.org/wiki/Fast_inverse_square_root. The summary is that the bits of a number stored in fp32 or fp16 format, read as an integer, are a reasonable approximation of the (scaled and offset) log of the number. Therefore, shifting it right by one bit makes it an approximation of the log of the square root. Subtracting it from a constant negates the log, which inverts the value, while the constant does a bit more correction to the estimate. Treating the literal bits of the fp16 as an integer i,

i = 22971 - ( i >> 1 )

gives an estimate of the log of the isqrt. Reinterpreting this integer as an fp16 and running one Newton iteration to refine the estimate gives the final isqrt. The 'magic number' 22971 is

1.5 * 2^(num_mantissa_bits) * ((exp_offset)-0.045)

For fp16, exp_offset = 15 and num_mantissa_bits = 10, so the value is 1.5 * 1024 * 14.955 = 22970.88, which rounds to 22971.

See particularly the section "First approximation of the result" in the Wikipedia article.

**FP16 code**

C code, cmakelist, protothreads, **ZIP of all files**

Copyright Cornell University June 5, 2022