Speech generation
on an Atmel Mega32
ECE 476 Cornell University

Introduction

There are several ways of making a computer talk. The simplest is to record whatever you want to say and play it back at about 8000 samples/sec. The problem with this approach is that it takes a lot of memory and is not very flexible. You could use a dedicated voice recording chip like the ISD ChipCorder to record/play back segments of speech. You could also use a SpeakJet or Winbond chip, which synthesizes arbitrary speech from fragments of English (called allophones).

It would be cheaper (in hardware cost) to have the MCU make the speech directly. One way is the PICtalker approach, an allophone-synthesis software system (based on the obsolete SPO256-AL2 chip). I have ripped the binary allophone file into a Matlab program (see below). The main problem with this code is that it requires a 64 kbyte table, which would fill all of flash on the Mega32, but the table could be put into serial dataflash (a separate chip). Another problem with this scheme is that the speech designer/programmer has to be very good at stringing sounds together in order to make understandable speech.

Another way is to compress speech so that the MCU can directly do the decompression on the fly.

Approaches:

DPCM

A version of the DPCM algorithm can be implemented using very little processing time. A 2-bit/sample compressor/decompressor was written in Matlab to encode, to make a packed C header file, and then to do a test decode. Note that the quantization breakpoints and reconstruction values were made up by me. You can change them, but you must be consistent between the encoder and decoder. An optimization (program + function) based on the histogram of first derivatives suggests that quantization breakpoints of [-0.05, 0, 0.05] and reconstruction values of [-0.16, -0.026, 0.026, 0.16] are about right for the demo wav file given below. A decoder written in Codevision for the Mega32 uses the packed code format to generate speech. Each second of speech takes 2 kbytes of flash.
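The encode/decode loop described above can be sketched as follows. The breakpoints and reconstruction values are the ones quoted in the text; the loop structure (the encoder tracking the decoder's reconstruction so the two stay consistent) is the essential part, and a real Mega32 decoder would use scaled integers rather than floats.

```c
#include <stddef.h>

/* reconstruction values for each 2-bit code, per the text above */
static const float recon[4] = { -0.16f, -0.026f, 0.026f, 0.16f };

/* quantize one prediction error d into a 2-bit code using the
   breakpoints [-0.05, 0, 0.05] */
static unsigned char dpcm_quantize(float d)
{
    if (d < -0.05f) return 0;
    if (d <  0.0f ) return 1;
    if (d <  0.05f) return 2;
    return 3;
}

/* encode n samples; 'pred' tracks the DECODER's reconstruction so
   encoder and decoder never drift apart */
void dpcm_encode(const float *x, unsigned char *code, size_t n)
{
    float pred = 0.0f;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = dpcm_quantize(x[i] - pred);
        code[i] = c;
        pred += recon[c];          /* decoder-identical update */
    }
}

/* decode: accumulate the reconstruction value for each code */
void dpcm_decode(const unsigned char *code, float *y, size_t n)
{
    float pred = 0.0f;
    for (size_t i = 0; i < n; i++) {
        pred += recon[code[i]];
        y[i] = pred;
    }
}
```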

To use this system:

  1. If you want to have the Mega32 just speak the numerical digits, skip this list and use the code in the next paragraph.
  2. Get some clean, noise-free speech. You could record your own voice or use this TextToSpeech demo.
  3. Make sure the audio sample rate is 8kHz and save it in a wav file. This little matlab program downsamples a wav file by 2:1. If you use the text-to-speech demo in step (2) you will need to downsample.
  4. Run the Matlab compressor on the wav file. The compressor output file will be a table in C header format. You could, of course, have several short compressed tables in flash, or you could index into a long table to say just one word.
  5. Resynthesize on Mega32.
    1. Include the compressor output file from step (4) in your c program.
    2. Attach PORTB.3 to a low-pass filter, and then to an audio amplifier. A 2k resistor and 0.1 µF capacitor will work for the low-pass filter, but you will get cleaner sound if you use an active, 2-pole, Chebyshev filter with a cutoff frequency of 2.5 to 3 kHz.
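Since each byte of the packed header holds four 2-bit codes, the decoder needs to extract the right pair of bits for sample i. A sketch of that step is below; the packing order (lowest-order bits first) is an assumption and must match whatever the Matlab compressor actually emits. On the Mega32 the table lives in flash and each reconstructed sample is scaled into the PWM (OCR0) range.

```c
/* Return the 2-bit DPCM code for sample i from the packed table.
   ASSUMES four codes per byte, packed low bits first. */
unsigned char dpcm_get_code(const unsigned char *table, unsigned int i)
{
    return (table[i >> 2] >> ((i & 3) * 2)) & 3;
}
```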

The file DPCMAllDigits.h has a Codevision flash array for the digits zero to nine. If you include this in a test program, you have available all the spoken digits. The sample index boundaries for the digits in the array are given below. Using this table you can speak individual digits by decompressing only part of the flash array.

Digit boundary   Time (sec)   Sample #   Index in DPCMAllDigits.h
0 - 1            0.85          6800       1700
1 - 2            1.45         11600       2900
2 - 3            2.0          16000       4000
3 - 4            2.75         22000       5500
4 - 5            3.32         26560       6640
5 - 6            4.0          32000       8000
6 - 7            4.75         38000       9500
7 - 8            5.5          44000      11000
8 - 9            6.05         48400      12100
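The boundary indices above (byte index = sample number / 4, four 2-bit codes per byte) give each digit's slice of the packed flash array. A hypothetical helper that turns a digit into its byte range might look like this; `table_len` is the total array size in bytes.

```c
/* Byte-index boundaries from the table above; entry d is the first
   byte of digit d in DPCMAllDigits.h. */
static const unsigned int digit_start[10] =
    { 0, 1700, 2900, 4000, 5500, 6640, 8000, 9500, 11000, 12100 };

/* Compute the half-open byte range [*lo, *hi) of digit d (0..9);
   digit 9 runs to the end of the array. */
void digit_range(unsigned char d, unsigned int table_len,
                 unsigned int *lo, unsigned int *hi)
{
    *lo = digit_start[d];
    *hi = (d == 9) ? table_len : digit_start[d + 1];
}
```

The decoder then decompresses only the bytes in that range to speak one digit.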

DPCMAllDigits.h is based on the TextToSpeech demo page using the simulated voice "Claire". Commas were placed between the digit names for synthesis. The raw synthesis result (wav at 16 Ksamples/sec) and reduced rate result (wav at 8 Ksamples/sec) are included for reference.


Sine Wave Synthesis (SWS) In Progress... No usable Mega32 code below here.

SWS from Haskins Lab is a synthesis/compression scheme based on playing back only the few loudest sine waves in the time-dependent Fourier transform of the speech signal. An example of synthesized speech and SWS results are below. There are more examples at Yale. Parameter files have an entry for each time step, in which each sine wave component has a frequency and an amplitude.

You will probably notice that the SWS speech is understandable, in a weird way. The 2-sine version is bad, but the 3-sine version is almost as good as the 4- or 5-sine versions. For the 3-sine version, the compression ratio (assuming an 8 kHz, 8-bit wav file) is better than 20:1. Every 20 mSec there are three sine frequencies and three sine amplitudes to specify. If each parameter can be reduced to one byte, then we only need 300 bytes/sec!

To use this technique:

  1. Get some clean, noise-free speech. You could record your own voice or use this TextToSpeech demo. Save the speech waveform as a wav file.
  2. Get the matlab routines from the Haskins web site, or a local copy. Unzip and add the destination folder to the matlab path.
  3. Run the SWS routine, which opens a GUI.
  4. In the GUI:
    1. choose File menu, Extract Parameters... and find your wav file. Note that the extraction algorithm tends to fail for very clean, low noise, synthesized speech. You may need to add a bit of noise to the wav file, then resave it.
    2. play the result using Data menu, Play all.
    3. save the compressed file as a wav or save the parameters using File menu, Save parameters... and use the swi option.
  5. Open the swi file and inspect the format. There is a header with the number of sine components at each time step, followed by the time steps, each tagged with the time in mSec and followed by each sine component's frequency and amplitude. You can check the swi file by running it through a reconstruction program which reads the frequencies and amplitudes, interpolates them, and generates sine wave sums.
  6. Convert the swi file into C source code using ... not done yet!
  7. Include the C source code into a test program and ... not done yet!
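For step 6, the conversion would amount to packing each frequency/amplitude pair into bytes, which is what the 300 bytes/sec estimate above assumes (three components, two bytes each, every 20 mSec). A sketch of the packing is below; the scale factors (16 Hz per frequency step, 8-bit linear amplitude) are assumptions, not a fixed format.

```c
/* Pack a frequency in Hz into one byte; 16 Hz resolution covers
   0..4080 Hz, enough for 8 kHz speech. ASSUMED scale factor. */
unsigned char pack_freq(float f_hz)
{
    return (unsigned char)(f_hz / 16.0f + 0.5f);
}

/* Pack an amplitude in 0.0..1.0 into one byte. ASSUMED linear scale. */
unsigned char pack_amp(float a)
{
    return (unsigned char)(a * 255.0f + 0.5f);
}
```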

 


In Progress... No usable Mega32 code below here.

ADPCM

ADPCM takes advantage of the high sample-to-sample similarity of speech waveforms to compress speech. More to come...
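To illustrate the adaptive idea, here is a toy 2-bit scheme (not the full IMA algorithm in adpcm.c from the references): the step size grows when the code saturates and shrinks when the prediction is tracking well, so loud and quiet passages both quantize reasonably. All constants here are made-up assumptions.

```c
typedef struct { int pred; int step; } adpcm_state;

/* Encode one sample to a 2-bit code (sign bit | magnitude bit) and
   update the state exactly as the decoder would, so the two stay in
   lock step. */
unsigned char adpcm_encode_step(adpcm_state *s, int x)
{
    int d = x - s->pred;
    unsigned char code = (d < 0) ? 0 : 2;     /* bit 1: sign      */
    if (d < 0) d = -d;
    if (d >= s->step) code |= 1;              /* bit 0: magnitude */

    /* decoder-identical reconstruction */
    int delta = (code & 1) ? s->step + s->step / 2 : s->step / 2;
    s->pred += (code & 2) ? delta : -delta;

    /* adapt: double the step on big errors, halve it on small ones */
    s->step = (code & 1) ? s->step * 2 : s->step / 2;
    if (s->step < 1)   s->step = 1;
    if (s->step > 128) s->step = 128;
    return code;
}
```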

LPC

The method we are going to use here is to run an LPC encoder on a PC and the decoder on the MCU. Most of the code (except for customizing for Mega32) came from Dan Ellis. Using this method, we can trade off quality and compression. At reasonable compression, the quality is quite good.

The steps to doing this are:

  1. Get some clean, noise-free speech. You could record your own voice or use this TextToSpeech demo.
  2. Make sure the sample rate is as low as possible, I suggest 8kHz. This matlab program downsamples 2:1.
  3. Extract the LPC filter parameters in Matlab and save them as Codevision source. A higher order filter gives better speech, but requires a bigger table. I suggest an 8th order approximation, but 12 is a maximum. Resynthesize in matlab to check quality.
    1. Test input voice file Generated using the TTS demo site above. Size is 24,160 8-bit sound samples.
    2. lpcfit function Directly from Dan Ellis.
    3. main program This program calls the lpcfit function and the resynthesis function
    4. resynthesis function This version has been modified to be "C-like" and not use matlab internal functions
  4. Resynthesize on Mega32.
    1. LPC table as C source code example (from test input above) Size is 3384 bytes. Compression is about 7:1.
    2. C code --NOT done yet-- I can't make it go fast enough.
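The Mega32-side resynthesis step the missing C code would perform can be sketched as follows: each output sample is the excitation e (a pulse train for voiced frames, noise for unvoiced) pushed through the all-pole filter 1/A(z) built from the lpcfit coefficients, here at the suggested order 8. This float version shows the structure; a Mega32 version would need fixed-point arithmetic to run fast enough.

```c
#define ORDER 8

/* One filter step: y[n] = g*e[n] - sum_{k=1..ORDER} a[k-1] * y[n-k].
   'hist' holds the previous ORDER output samples, newest first. */
float lpc_step(const float a[ORDER], float hist[ORDER], float g, float e)
{
    float y = g * e;
    for (int k = 0; k < ORDER; k++)
        y -= a[k] * hist[k];
    for (int k = ORDER - 1; k > 0; k--)    /* shift output history */
        hist[k] = hist[k - 1];
    hist[0] = y;
    return y;
}
```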

PICtalker

The code to read and segment out the allophones:

  1. The allophone data file
  2. the allophone starting point file
  3. the program loads the allophone data file and starting point files, then attempts to synthesize "Mega32 speech". Refer to the following allophone description for the meaning of the address segment numbers.
  4. SPO256 allophone description. Numbers on the left are used in the matlab code to identify specific allophones.
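The concatenation the Matlab program performs reduces to this: the starting-point file gives each allophone's offset into the data file, so a word is "spoken" by copying the allophones' sample ranges back to back. The names below are hypothetical, and on the Mega32 each byte would go to the PWM instead of a buffer.

```c
/* Concatenate the sample ranges of the allophones in 'seq' into 'out'.
   'start' holds one offset per allophone plus a final end offset, so
   allophone a occupies data[start[a]] .. data[start[a+1]-1].
   Returns the number of samples produced. */
unsigned int say_allophones(const unsigned char *data,
                            const unsigned int *start,
                            const unsigned char *seq,
                            unsigned int nseq,
                            unsigned char *out)
{
    unsigned int n = 0;
    for (unsigned int i = 0; i < nseq; i++)
        for (unsigned int k = start[seq[i]]; k < start[seq[i] + 1]; k++)
            out[n++] = data[k];    /* one sample every 1/8000 s */
    return n;
}
```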

Using this synthesis style, the challenge is to map the predefined library of 59 allophone sounds into the best approximation of words. An interesting project would be to LPC-encode each of the allophones, store the LPC tables on the Mega32, then expand them on the fly. This would make a compact, flexible system.


References:

  1. Ellis links to Haskins Lab where there are some weird sinewave speech examples. These examples achieve high compression, but sound too strange (to me) for routine use. They might make good sound effects, however. Paper describing the work.
  2. Speech Compression and software
  3. adpcm.c
  4. ADPCM description

Copyright Cornell University 2005