Cornell University ECE4760
Speech compression/playback
IIR MEL filterbank to DDS synth

Pi Pico RP2040

Speech and the channel vocoder
Human speech is complex and varied, but can be modeled as a time-varying source (vocal cord waveform, noise) feeding a time-varying filter (throat, tongue, nose, lips). The filter characteristic varies very little over the fundamental period of the source, and so approximates a linear, time-invariant filter. Here we use second-order IIR filters approximating a MEL scale spectrum to estimate the output of the vocal filter for compression and re-synthesis.

Microphone input of the speech waveform is digitized at 12.8 kHz. This rate is about right for the speech spectrum (100-3500 Hz). The speech waveform is directed to 32 second-order IIR filters distributed over the frequency range. Each filter output is rectified at the full input rate, then lowpass filtered to get an average power output with a time constant of a few milliseconds. The result is an estimate of the vocal tract filter function. The 32 average power values are sampled every 20 ms and used as input to a DDS synthesizer. The sampling produces an 8:1 compression ratio: (12.8 samples/ms) / (32 values per 20 ms) = 12.8/1.6 = 8. The DDS synth produces 32 sine waves, each at the center frequency of one of the input filters. Each sine wave is multiplied by the average power of the speech input at the corresponding frequency, as measured by the matching IIR filter, and then all 32 are added to approximate the original waveform.

The filters are second order, each with Q=20, with center frequencies (in Hz) of { 100, 175, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1700, 1850, 2000, 2200, 2500, 2700, 3000, 3200, 3350, 3500 }. This is approximately a MEL scale over the speech frequency range. The analysis, compression, and synthesis all take place in an ISR running at 12.8 kHz. Some pre-emphasis is applied at higher frequencies, since the natural speech waveform carries less energy at higher frequencies.

The following audio is me saying the digits: one, two, ..., nine, zero; recorded and compressed as above.
digits.

Project ZIP file.
C code


Copyright Cornell University June 26, 2024