Speech generation
on an Atmel Mega644 in GCC
ECE 4760 Cornell University
Introduction
There are several ways of making a computer talk. The simplest is to record whatever you want to say and play it back at about 8000 samples/sec. The problem with this approach is that it takes a lot of memory and is not very flexible. You could use a dedicated voice recording chip like the ISD ChipCorder to record and play back segments of speech. You could also use a SpeakJet or Winbond chip, which synthesizes arbitrary speech from fragments of English (called allophones).
It would be cheaper (in hardware cost) to have the MCU make speech directly. One way is the PICtalker approach, which is an allophone synthesis software system (based on the obsolete SP0256-AL2 chip). I have ripped the binary allophone file to a Matlab program (see below). The main problem with this code is that it requires a 64 kbyte table, which would fill all of flash on the Mega644, although the table could be put into serial dataflash (a separate chip). Another problem with this scheme is that the speech designer/programmer has to be very good at stringing together sounds in order to make understandable speech.
Another way is to compress speech so that the MCU can do the decompression on the fly. I used differential pulse-code modulation (DPCM). The motivation for sending samples of the first derivative of the speech signal, rather than the signal itself, is that the sample-to-sample difference is usually small compared with the full range of the signal, so fewer bits are required. I implemented a DPCM scheme with 4:1 compression (relative to 8-bit samples) which sends 2-bit derivative samples. It sounds acceptable, but a little scratchy. I also implemented 8:1 compression (1-bit derivatives). The quality is lower, but still understandable most of the time.
DPCM (2-bit samples)
A version of the DPCM algorithm can be implemented using very little processing time. A 2-bit/sample compressor/decompressor was written in Matlab to encode the speech, to make a packed C header file, and then to do a test-decode. Note that I made up the quantization breakpoints and reconstruction values. You can change them, but you must use the same values in the encoder and the decoder. An optimization (program + function) based on the histogram of first derivatives suggests that quantization breakpoints of [-0.05, 0, 0.05] and reconstruction values of [-0.16, -0.026, 0.026, 0.16] are about right for the demo wav file given below. A decoder written in GCC for the Mega644 uses the packed code format to generate speech. Each second of speech takes 2 kByte of flash (8000 samples/sec at 2 bits/sample).
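As a rough illustration of the scheme described above, the following C sketch quantizes sample-to-sample differences with the breakpoints and reconstruction values quoted in the text, packs four 2-bit codes per byte, and decodes them again. It is not the Matlab encoder or the Mega644 decoder from this page. The function names, the bit order within each byte (first sample in the low two bits), the zero starting value, the clamping to [-1, 1], and the choice to have the encoder track the decoder's reconstructed value are all assumptions. It uses floating point for readability; a decoder running on the MCU would more likely use scaled integers.

```c
#include <stddef.h>
#include <stdint.h>

/* Breakpoints and reconstruction values quoted in the text. */
static const float kBreak[3] = { -0.05f, 0.0f, 0.05f };
static const float kRecon[4] = { -0.16f, -0.026f, 0.026f, 0.16f };

/* Keep the running value inside the nominal signal range (assumption). */
static float clamp_unit(float x)
{
    if (x > 1.0f)  return 1.0f;
    if (x < -1.0f) return -1.0f;
    return x;
}

/* Map a difference to a 2-bit code 0..3 using the three breakpoints. */
static uint8_t quantize(float diff)
{
    uint8_t code = 0;
    while (code < 3 && diff >= kBreak[code]) {
        code++;
    }
    return code;
}

/* Encode n samples (range -1..1) into packed bytes, four codes per byte.
 * Assumed bit order: first sample in bits 1:0, fourth sample in bits 7:6.
 * out must hold (n + 3) / 4 bytes. Returns the number of bytes written. */
size_t dpcm2_encode(const float *in, size_t n, uint8_t *out)
{
    size_t nbytes = (n + 3) / 4;
    float pred = 0.0f;                 /* what the decoder will have so far */

    for (size_t i = 0; i < nbytes; i++) {
        out[i] = 0;
    }
    for (size_t i = 0; i < n; i++) {
        uint8_t code = quantize(in[i] - pred);
        out[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
        pred = clamp_unit(pred + kRecon[code]);
    }
    return nbytes;
}

/* Decode n samples from the packed bytes by accumulating the
 * reconstruction value selected by each 2-bit code. */
void dpcm2_decode(const uint8_t *in, size_t n, float *out)
{
    float acc = 0.0f;

    for (size_t i = 0; i < n; i++) {
        uint8_t code = (uint8_t)((in[i / 4] >> (2 * (i % 4))) & 0x03);
        acc = clamp_unit(acc + kRecon[code]);
        out[i] = acc;
    }
}
```

Having the encoder compare each sample against the decoder's running value, rather than against the previous input sample, is one way to keep quantization error from accumulating. Whatever choice is made, the encoder and decoder must agree on the tables and the bit order.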
To use this system:
The file DPCMAllDigits.h has a GCC flash array for the digits zero to nine. If you include this in a test program, you have available all the spoken digits. The sample index boundaries for the digits in the array are given below. Using this table you can speak individual digits by decompressing only part of the flash array.
Digit Boundary | Time in sec | Sample # | Sample index in |
0 - 1 | 0.85 | 6800 | 1700 |
1 - 2 | 1.45 | 11600 | 2900 |
2 - 3 | 2.0 | 16000 | 4000 |
3 - 4 | 2.75 | 22000 | 5500 |
4 - 5 | 3.32 | 26560 | 6640 |
5 - 6 | 4.0 | 32000 | 8000 |
6 - 7 | 4.75 | 38000 | 9500 |
7 - 8 | 5.5 | 44000 | 11000 |
8 - 9 | 6.05 | 48400 | 12100 |
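The table can be turned into a small lookup that returns the part of the array holding one digit. This is a sketch under assumptions: the last column is taken to be a byte index into the packed array (each value is the sample number divided by 4, which matches four 2-bit samples per byte), and `digit_range`, `kDigitBoundary`, and the `total_bytes` parameter are hypothetical names. The total array length is not given in the text, so the caller has to supply it.

```c
#include <stdint.h>

/* Byte index of each digit boundary, copied from the last table column.
 * Entry k is the boundary between digit k and digit k + 1. */
static const uint16_t kDigitBoundary[9] = {
    1700, 2900, 4000, 5500, 6640, 8000, 9500, 11000, 12100
};

/* Find the half-open byte range [*start, *end) for one spoken digit.
 * total_bytes is the length of the packed array (supplied by the caller).
 * Returns 0 on success, -1 if digit is not in 0..9. */
int digit_range(uint8_t digit, uint16_t total_bytes,
                uint16_t *start, uint16_t *end)
{
    if (digit > 9) {
        return -1;
    }
    *start = (digit == 0) ? 0 : kDigitBoundary[digit - 1];
    *end   = (digit == 9) ? total_bytes : kDigitBoundary[digit];
    return 0;
}
```

A decoder that starts in the middle of the array also needs a starting value for its accumulator. Resetting it to the rest level at each boundary is a reasonable guess if the boundaries fall in the pauses between digits, but that is an assumption, not something stated here.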
DPCMAllDigits.h is based on the TextToSpeech demo page using the simulated voice "Claire". Commas were placed between the digit names for synthesis. The original synthesis result (wav at 16 ksamples/sec) and the reduced-rate result (wav at 8 ksamples/sec) used as input to the compressor are included for reference.
DPCM (1-bit samples)
A version of the encoder was written that simply sends one bit/sample depending on the sign of the first derivative. The reconstructed speech has noticeably higher noise than the 2-bit version, but is still understandable. The 8 ksample/sec speech waveform (from the TextToSpeech demo page using the simulated voice "Mike") is compressed with a Matlab program to produce a C header file, which is included in a Mega644 test program. About 60 seconds of speech should fit into flash on a Mega644 (1 kByte per second of speech in 64 kByte of flash). Attach PORTB.3 to a low-pass filter, and then to an audio amplifier. The low-pass filter should cut off at about 18,000 radians/sec (roughly 3000 Hz). Sometimes you can skip the low-pass filter and rely on the input characteristics of the audio amp to do the filtering.
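The 1-bit decode step can be sketched in a few lines of C. This is only an illustration of the idea, not the test program from this page: the name `dpcm1_decode`, the fixed step size `DELTA_STEP`, the mid-scale starting value of 128, and the bit order (first sample in bit 0 of each byte) are assumptions. The output is an 8-bit value of the kind that could be written to a PWM duty-cycle register.

```c
#include <stddef.h>
#include <stdint.h>

#define DELTA_STEP 4   /* hypothetical step size per sample */

/* Decode n one-bit samples: a 1 bit steps the output up, a 0 bit steps
 * it down. Eight samples are packed per byte, first sample in bit 0
 * (assumed order). Output values are clamped to 0..255. */
void dpcm1_decode(const uint8_t *in, size_t n, uint8_t *out)
{
    int16_t acc = 128;                       /* start at mid-scale */

    for (size_t i = 0; i < n; i++) {
        uint8_t bit = (uint8_t)((in[i / 8] >> (i % 8)) & 0x01);
        acc += bit ? DELTA_STEP : -DELTA_STEP;
        if (acc > 255) acc = 255;
        if (acc < 0)   acc = 0;
        out[i] = (uint8_t)acc;
    }
}
```

With a fixed step the decoder cannot follow fast changes in the waveform and hunts around flat regions, which is one plausible source of the extra noise compared with the 2-bit version.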
Allophone synthesis
The code to read and segment out the allophones:
Using this synthesis style, the challenge is to map the predefined library of 59 allophone sounds into the best approximation of words.
Old Mega32 info is here.
Copyright Cornell University 22-Feb-2010