Cornell University ECE4760
Fixed Point Arithmetic
Pi Pico RP2040
Fixed point on RP2040
The fixed point page by Hunter Adams gives a good introduction to the general implementation and motivation for fixed point. Here we time the performance of two specific fixed point formats. As on most small systems without hardware floating point, fixed point arithmetic on the RP2040 is faster than floating point, and it often has enough dynamic range and accuracy for animation and DSP. The two formats are optimized for different computing goals.
The s15x16 format gives good dynamic range for animation and high accuracy for fast-cutoff IIR filters. The name s15x16 means 16 bits to the left of the binary point, one of which is the sign bit, and 16 bits to the right, for 32 bits in total. The range is +32767.9999 to -32768.0 and the resolution is 2^-16 ≈ 1.5e-5. The range is large enough to address pixels on a VGA screen, for example, without worrying much about overflow. However, s15x16 is not as fast as s1x14: the product of two 32-bit numbers is 64 bits wide, but the hardware multiplier produces only a 32-bit result, so a full multiply requires several 32-by-32 multiplies.
The s1x14 format gives significantly higher multiply speed, at the cost of range: s1x14 means two bits to the left of the binary point, one of which is the sign bit, and 14 bits to the right. The range is +1.9999 to -2.0 and the resolution is 2^-14 ≈ 6.1e-5. This format is most useful when the dynamic range of values is small and predictable. Analog-to-digital converter (ADC) input is an obvious application. ADC samples are strictly limited to 12 bits, of which the two LSBs are somewhat noisy and can often be ignored, so a 12-bit sample fits nicely into the 14-bit fraction with a couple of bits to spare. Since the format fits into 16 bits, the product of two s1x14 numbers fits in 32 bits, so a multiply takes just one cycle of the hardware integer multiplier and is fast. This format is good for implementing FFTs on ADC data, or for FIR filters. These algorithms usually multiply by numbers less than one, and so do not overflow the limited dynamic range.
Execution speeds
Execution speeds were measured at the standard 125 MHz clock rate with compiler optimization -Ofast. The microsecond core timer was used to time execution of loops. The times include realistic loop overhead, as well as the time to store and retrieve variables. Three operations were timed: multiply-and-add (MAC), divide, and square root. The MAC operation is used widely in filters and FFTs. Each calculation was iterative, and had to be chosen carefully so as not to overflow s1x14. Raw data is shown below, then a table of speed ratios compared to floating point. The speedup over floating point is significant for all operations, and very good for the s1x14 MAC operation.
Iteration count: 100
Each line gives the time in microseconds, then the final loop value, for checking accuracy.
MAC
fptime 138 fp_mac 1.352407
fix16time 40 fix_mac 1.350418
fix14time 9 fix_mac 1.340637
DIVIDE
fptime 84 fp_div 0.369712
fix16time 53 fix_div 0.369431
fix14time 29 fix_div 0.369568
SQRT
fptime 303 fp_sqrt 2502.377197
fix16time 217 fix_sqrt 2502.373779
Operation (speedup vs. float) | float | s15x16 | s1x14 |
---|---|---|---|
MAC | 1 | 3.4 | 15 |
divide | 1 | 1.6 | 2.9 |
sqrt | 1 | 1.4 | --- |
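Each ratio is the floating point time divided by the fixed point time for the same 100-iteration loop, so a larger number means faster fixed point code. For the MAC row, using the raw times listed above:

$$
\text{speedup} = \frac{t_{\text{float}}}{t_{\text{fixed}}}, \qquad
\frac{138\ \mu\text{s}}{40\ \mu\text{s}} \approx 3.45\ (\text{s15x16}), \qquad
\frac{138\ \mu\text{s}}{9\ \mu\text{s}} \approx 15.3\ (\text{s1x14})
$$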
Applications
The
Copyright Cornell University March 12, 2022