FPGA Speech Vocoder

João Pedro Carvão (jc2697), Justin Joco (jaj263), and Thinesiya Krishnathasan (tk455)

Wednesday, May 15, 2019

The goal of this project was to design a real-time speech vocoder on an FPGA. The result is a highly parallel design built on the foundations of digital signal processing and CPU design.

Introduction

Our final project for ECE 5760: Advanced Microcontroller Design and System-on-Chip is a highly parallel hardware vocoder for real-time speech synthesis and visualization on a monitor through a VGA interface. We designed and implemented the vocoder for a DE1-SoC Development Kit. The entire system was built on the board's Cyclone V FPGA. That is, audio input, analysis, synthesis, output, and visualization were all done on the FPGA.

The vocoder takes sound from any audio source with an aux connection to the board through the audio bus master, passes the input through several IIR filters to generate the mel cepstrum of the input stream for analysis, and applies a few more stages of filtering for quality control before finally reconstructing the sound. Once the coded voice is ready, it is sent back through the audio bus master to be output to any speaker with an aux connection.

For data visualization, we implemented a basic GPU as part of our FPGA system that reads the processed data and displays a spectrogram and the voice's magnitude on a VGA screen in real time.

The result is a lyrical reconstruction of a human voice, resembling that of a robot. Such tonal qualities are commonplace in electronic music.


High Level Design

Design of a vocoder on an FPGA required filter banks similar to those common in much of the established digital signal processing (DSP) literature. DSP block diagrams map naturally onto parallel operations, where everything happens simultaneously in real time, rather than onto the sequential operations typical of software environments. Because of this parallel nature, implementing the vocoder in hardware was a natural choice.

Our basic design is influenced by the mel cepstrum, which uses filter banks to deconstruct an audio input, such as a human's voice, into several frequency bins. To reconstruct the voice, we performed additional filtering to recreate a voice whose sounds are discernible to human ears.

In parallel with the real-time voice modulation, the FPGA reads the audio input and graphs its waveform on the top half of the monitor. Afterwards, the FPGA writes the modulated audio's 32-frequency-bin spectrogram onto the bottom half of the screen.


Hardware Design


Vocoder Design and Implementation

Our first task was to get input from a sound source. The natural choice for audio input for this project was a microphone, which Bruce Land generously donated to our group from a bin of old lab parts. Before considering any signal processing on the board, we first needed to get input from this microphone through the FPGA and out to speakers as an audio loopback test. To do this, we used Altera's audio bus master to bring input from the microphone into the FPGA and then out again. Once we could hear sound, we knew we could begin designing and building a filter bank for analysis. One important modification we needed to make to the bus master, using Intel's QSys bus design tool, was to take audio input from the board's mic-in port rather than line-in. The default line-in port has no internal amplification on the board, so any sound output through line-out at the end of the loopback test would be inaudible.
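A minimal sketch of such a loopback stage is shown below. The handshake signal names (in_valid, out_ready) are assumptions for illustration, not the audio bus master's actual port list:

    // Hypothetical loopback sketch: copy mic samples straight to the output.
    // Signal names are assumptions, not the bus master's real interface.
    module audio_loopback (
        input  wire               clk,
        input  wire               reset,
        input  wire               in_valid,   // a new mic sample is available
        input  wire signed [15:0] mic_left,
        input  wire signed [15:0] mic_right,
        input  wire               out_ready,  // output FIFO can accept a sample
        output reg                out_valid,
        output reg  signed [15:0] spk_left,
        output reg  signed [15:0] spk_right
    );
        always @(posedge clk) begin
            if (reset) begin
                out_valid <= 1'b0;
            end else if (in_valid && out_ready) begin
                spk_left  <= mic_left;        // pass-through: mic to speakers
                spk_right <= mic_right;
                out_valid <= 1'b1;
            end else begin
                out_valid <= 1'b0;
            end
        end
    endmodule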

For speech analysis, we started with the IIR filter example on the DSP page of the ECE 5760 website, along with the accompanying MATLAB code, which we used to generate the filter coefficients for these bandpass filters. Figure 1 shows an amplitude plot for each filter, each specified at a certain center frequency. The edges of each bandpass filter were selected to be the frequencies at which the amplitude dropped to about 50% of the peak at the center frequency.
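For reference, one second-order bandpass section could look like the sketch below. This is a direct-form sketch in an assumed 2.16 signed fixed-point format, not the course's FSM-based implementation, and the coefficient values would come from the MATLAB script (symmetric bandpass designs have b1 = 0 and b2 = -b0):

    // One second-order IIR bandpass section (direct form I), 18-bit 2.16
    // signed fixed point (format assumed). Coefficients come from MATLAB.
    module iir2 #(
        parameter signed [17:0] B0 = 18'sd0,
        parameter signed [17:0] B2 = 18'sd0,  // typically -B0; b1 = 0
        parameter signed [17:0] A1 = 18'sd0,
        parameter signed [17:0] A2 = 18'sd0
    )(
        input  wire clk,
        input  wire strobe,                   // one pulse per audio sample
        input  wire signed [17:0] x,
        output reg  signed [17:0] y
    );
        reg signed [17:0] x1 = 0, x2 = 0, y1 = 0;
        // 2.16 fixed-point multiply: keep bits [33:16] of the full product
        function signed [17:0] fmul(input signed [17:0] a,
                                    input signed [17:0] b);
            reg signed [35:0] p;
            begin
                p = a * b;
                fmul = p[33:16];
            end
        endfunction
        always @(posedge clk) begin
            if (strobe) begin
                // y[n] = b0*x[n] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
                y  <= fmul(B0, x) + fmul(B2, x2) - fmul(A1, y) - fmul(A2, y1);
                x2 <= x1;  x1 <= x;
                y1 <= y;
            end
        end
    endmodule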

Figure 1: 32 bandpass filter profile

We edited this MATLAB script to compute the filter coefficients and generate the corresponding Verilog code for both sets of 32 filters. Since the audio bus master required stereo input, we had to include a left and a right filter for each frequency channel; otherwise, the audio bus master would not properly output our synthesized voice, resulting in silence.

The IIR filter Verilog code found on the DSP page includes a finite state machine. We added additional states to this FSM to perform some preprocessing before the pitch shift, as well as the pitch shift itself. Figure 2 shows an overview of all the stages of the vocoder.

Figure 2: Vocoder Overview

The bandpass filters separate the input voice from the microphone into 32 different frequency channels. The first stage after them takes the absolute value of each filter's output, and the next low-pass filters the absolute-valued signal. The low-pass-filtered signal is then multiplied by a sine wave whose frequency equals the center frequency of the IIR filter times a scalar; this performs the pitch shifting. The scalar is the same for all 64 low pass filters, which preserves the content of the voice and keeps it understandable to human ears. Finally, the modulated outputs from all the left-channel filters are summed together, as are the modulated outputs from all the right-channel filters.
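A per-channel sketch of this datapath (the naming is ours; the low-pass update uses the shift form derived in the next section):

    // Post-bandpass processing for one channel:
    // |bandpass| -> one-pole low-pass -> multiply by pitch-shifted sine.
    module channel_post #(
        parameter ALPHA_SHIFT = 9                // alpha' from the LPF section
    )(
        input  wire clk,
        input  wire strobe,                      // one pulse per audio sample
        input  wire signed [17:0] bp_out,        // bandpass filter output
        input  wire signed [17:0] sine_val,      // pitch-shifted sine from DDS
        output wire signed [17:0] chan_out
    );
        wire signed [17:0] mag = bp_out[17] ? -bp_out : bp_out;  // |bp_out|
        reg  signed [17:0] env = 18'sd0;         // low-pass-filtered envelope
        always @(posedge clk)
            if (strobe) env <= env + ((mag - env) >>> ALPHA_SHIFT);
        // amplitude-modulate the sine by the envelope (2.16 fixed point)
        wire signed [35:0] mixed = env * sine_val;
        assign chan_out = mixed[33:16];
    endmodule

The 64 chan_out signals (32 left, 32 right) are then summed per side to form the stereo output.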


Optimization and Tuning

When implementing the 64 filters, we needed one multiply for the IIR bandpass filter and another for the modulation with the pitch-shifted sine wave; this brought the hardware limitations of the board to our attention. Specifically, we were concerned about the number of DSP blocks available on the board to perform hardware multiplies of our signals. To keep the DSP block usage feasible, we multiplexed a single multiplier, since the two multiplies happen in different states of the finite state machine.
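A sketch of the sharing, assuming a two-state selector (the real FSM has more states; the names here are ours):

    // One signed multiplier (one DSP block) time-shared across FSM states.
    module shared_mul (
        input  wire clk,
        input  wire mix_state,                   // 0: IIR multiply, 1: sine mix
        input  wire signed [17:0] coeff, sample, // IIR operands
        input  wire signed [17:0] env, sine_val, // mixing operands
        output reg  signed [35:0] product
    );
        wire signed [17:0] a = mix_state ? env      : coeff;
        wire signed [17:0] b = mix_state ? sine_val : sample;
        always @(posedge clk)
            product <= a * b;                    // the single shared multiply
    endmodule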

We implemented low pass filtering with the relation:

y_{out}[n] = \alpha F_{out} + (1-\alpha)y_{out}[n-1]

where F_{out} is the absolute value of the bandpass filter output, \alpha sets the filter time constant, and y_{out}[n] is the low-pass-filtered signal, whose update is a function of the previous time step. To conserve hardware multiply operations, we massaged this equation so it could be implemented with an arithmetic right shift by \alpha' (where \alpha = 2^{-\alpha'}, i.e., \alpha' is the negated base-2 logarithm of \alpha) instead:

y_{out}[n] = \alpha F_{out} + (1-\alpha)y_{out}[n-1] = \alpha(F_{out} - y_{out}[n-1]) + y_{out}[n-1]

= ((F_{out} - y_{out}[n-1]) >>> \alpha') + y_{out}[n-1]

Without low pass filtering, humans can hear background noise from the lower center frequencies. We spent some time experimenting with different values of \alpha to determine which yielded the best psychoacoustic results, selecting them with the board's switches. A right shift by 9 (\alpha' = 9) gave us the best results; however, we still noticed a low-frequency motorboat sound in the background. To dampen the noise, we increased the shift amounts for the lowest six filters: all filters share a base value of \alpha', and starting from filter 6 and going down we added 1, 2, 3, 4, 5, and 6, respectively.
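A sketch of how the per-channel shift amounts could be wired up, assuming channels 0 through 5 are the six lowest center frequencies (our indexing, not necessarily the report's):

    // Per-channel alpha' selection: the six lowest channels get extra
    // smoothing (+6 down to +1) to damp the motorboat artifact.
    module alpha_select (
        input  wire [3:0]      alpha_base,   // from SW[9:6]; 9 worked best
        output wire [32*5-1:0] alpha_flat    // 32 packed 5-bit shift amounts
    );
        genvar i;
        generate
            for (i = 0; i < 32; i = i + 1) begin : sel
                assign alpha_flat[5*i +: 5] =
                    (i < 6) ? {1'b0, alpha_base} + (6 - i)
                            : {1'b0, alpha_base};
            end
        endgenerate
    endmodule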

We used direct digital synthesis (DDS) with a sine ROM to compute the various sine waves for mixing. The frequency was set by an increment value that steps through the ROM.
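A minimal DDS sketch; the ROM depth, sample width, and accumulator width are assumptions, and the table contents would be generated offline (e.g., by MATLAB) into a hex or .mif file:

    // DDS: a 32-bit phase accumulator steps through a 256-entry sine ROM.
    // Output frequency = increment * f_sample / 2^32.
    module dds (
        input  wire        clk,
        input  wire        strobe,       // one pulse per audio sample
        input  wire [31:0] increment,    // center frequency x pitch scalar
        output wire signed [15:0] sine_val
    );
        reg [31:0] phase = 32'd0;
        reg signed [15:0] rom [0:255];
        initial $readmemh("sine_rom.hex", rom);  // table generated offline
        always @(posedge clk)
            if (strobe) phase <= phase + increment;
        assign sine_val = rom[phase[31:24]];     // top 8 bits address the ROM
    endmodule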


User Control

Switches 6 through 9 on the board control the time constant, \alpha, for the low pass filters. Switch 0 turns the vocoder on (routing the signal through the filter bank). When this switch is low, the input from the microphone passes directly to the speakers; when it is high, the input passes through the vocoder. Switches 1 through 3 set the pitch shift frequencies by setting the pitch shift constant. In hardware, pitch shifting is performed by mixing the input voice with a sine wave at a frequency equal to the center frequency of the bandpass filter scaled by the pitch shift constant. This constant is the same for the two channels of the 32-filter banks.
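A sketch of this switch mapping (the routing mux is ours; the bit assignments follow the text):

    // Decode the user-facing switches and route audio accordingly.
    module user_ctrl (
        input  wire [9:0] SW,
        input  wire signed [15:0] mic_l, mic_r,          // raw microphone
        input  wire signed [15:0] vocoder_l, vocoder_r,  // synthesized voice
        output wire signed [15:0] out_l, out_r,
        output wire [2:0] pitch_sel,                     // pitch-shift constant
        output wire [3:0] alpha_base                     // base LPF alpha'
    );
        wire vocoder_en = SW[0];      // 0: mic pass-through, 1: vocoder
        assign pitch_sel  = SW[3:1];
        assign alpha_base = SW[9:6];
        assign out_l = vocoder_en ? vocoder_l : mic_l;
        assign out_r = vocoder_en ? vocoder_r : mic_r;
    endmodule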


Vocoder Testing

Since the structure of each of the 64 filters is identical, we troubleshot by outputting the data from one filter at each stage of the process to the speakers, using the switches to cycle through the stages. We also hooked an oscilloscope up to the audio output (line-out) of the FPGA to view the waveform.
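A sketch of the kind of probe mux this implies (which switches select the stage, and the exact tap points, are assumptions):

    // Route one channel's intermediate signals to the speakers for debugging.
    module stage_probe (
        input  wire [1:0] sel,              // stage-select switches
        input  wire signed [17:0] bp_out,   // after bandpass
        input  wire signed [17:0] mag,      // after absolute value
        input  wire signed [17:0] env,      // after low-pass filter
        input  wire signed [17:0] mixed,    // after sine modulation
        output wire signed [17:0] probe     // drive this to the audio output
    );
        assign probe = (sel == 2'd0) ? bp_out :
                       (sel == 2'd1) ? mag    :
                       (sel == 2'd2) ? env    : mixed;
    endmodule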


QSys Bus Configuration

Though our QSys bus design was initially configured with only an audio bus master, we modified it to add a VGA bus master in order to write to a 640 x 480 pixel, 8-bit color monitor in real time. Our design is as follows:

Figure 3: QSys bus layout


GPU Design and Implementation

To write to the monitor through the VGA interface in parallel with the real-time voice modulation, we ran the GPU state machine on the FPGA's audio clock rather than the 50 MHz clock. This lets us sample audio at a rate of 24 kHz. To optimize writing speed, the FPGA writes directly to the monitor via the VGA bus master using the bus design detailed above. On each state machine cycle, the FPGA reads the audio input and writes the corresponding waveform sample to the top half of the monitor; the height of each drawn pixel is proportional to the input audio's magnitude. Afterwards, the FPGA writes the modulated audio's 32-frequency-bin spectrogram onto the bottom half of the screen. Each waveform and spectrogram update occurs in real time and clears or overwrites previously rendered points.
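As an example of the addressing involved, here is a sketch of the pixel-address arithmetic, assuming the course convention of 1024-byte rows for the 640 x 480, 8-bit frame buffer (the base address comes from the QSys design):

    // addr = base + y*1024 + x for an 8-bit frame buffer with padded rows.
    module pixel_addr (
        input  wire [31:0] base,   // frame buffer base address (assumed given)
        input  wire [9:0]  x,      // 0..639
        input  wire [9:0]  y,      // 0..479
        output wire [31:0] addr
    );
        assign addr = base + {12'b0, y, 10'b0} + {22'b0, x};
    endmodule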

The GPU's state machine is shown in Figure 4; its states are defined below, and a Verilog skeleton follows the list:

Figure 4: GPU State Machine
  • KEY[0] reset: Initialize registers and set drawing to the left side of the screen
  • S0: Read audio input
  • S1: Calculate y coord based on audio data amplitude
  • S2: If the y coordinate represents a negative amplitude, shift this y coordinate to the top quarter of the screen
  • S3: Write a waveform pixel in white onto the top half of the monitor at the calculated y coordinate
  • S4: Wait 1 cycle
  • S5: Set the pixel point drawn 100 cycles before the current point to black. Skip this if we have not drawn 100 points yet.
  • S6: Read first filter's low-pass-filtered (LPF) power
  • S10: Map power to a color for writing based on log scaling
  • S7: Write the spectrogram bin for the current filter onto the bottom half of the monitor, filling 7 subsequent pixels row-wise in the given column based on the color mapping. If we haven't drawn all 32 filters' LPF powers, set up to read the next filter's LPF power; if we have, go to S9
  • S8: Read a filter's LPF power based on index
  • S9: Move to the next pixel column and set the state machine to read the first filter's LPF power on its next cycle
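A skeleton of this state machine (the state encoding and counter names are our assumptions; the pixel writes and bus handshakes are elided):

    // GPU FSM skeleton: waveform pixel, erase, then 32 spectrogram bins.
    module gpu_fsm (
        input wire clk,        // the audio clock, per the text above
        input wire reset_n     // KEY[0]
    );
        localparam S0=4'd0, S1=4'd1, S2=4'd2, S3=4'd3, S4=4'd4,  S5=4'd5,
                   S6=4'd6, S7=4'd7, S8=4'd8, S9=4'd9, S10=4'd10;
        reg [3:0] state = S0;
        reg [9:0] x     = 10'd0;   // current screen column
        reg [4:0] bin   = 5'd0;    // current filter (0..31)
        always @(posedge clk) begin
            if (!reset_n) begin
                state <= S0;  x <= 10'd0;  bin <= 5'd0;  // back to screen left
            end else case (state)
                S0:  state <= S1;                        // read audio input
                S1:  state <= S2;                        // amplitude -> y coord
                S2:  state <= S3;                        // fold negative half up
                S3:  state <= S4;                        // write waveform pixel
                S4:  state <= S5;                        // wait one cycle
                S5:  begin bin <= 5'd0; state <= S6; end // erase 100-old pixel
                S6:  state <= S10;                       // read filter 0's power
                S10: state <= S7;                        // log-scale -> color
                S7:  state <= (bin == 5'd31) ? S9 : S8;  // draw bin; next/done
                S8:  begin bin <= bin + 5'd1; state <= S10; end // next power
                S9:  begin x <= x + 10'd1; state <= S0; end     // next column
            endcase
        end
    endmodule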

Note that frequency increases from the middle to the bottom of the screen, and time increases from left to right.


GPU Testing

To test both the waveform and the spectrogram, we fed different audio samples into the FPGA's audio input, including our own voices and audio playback. We asked ourselves the following:

  • Did input sounds result in higher waveform amplitudes than silence?
  • Did the VGA spectrogram change relative to different pitch frequencies?
  • Did the duration of changes in audio input and the VGA plots match?
  • Did the power reading on the VGA spectrogram at a given time match the oscilloscope's frequency-domain reading?

This helped with debugging the GPU implementation.


Results

Overall, the project's results successfully met our expectations for a real-time hardware vocoder. Our full setup is shown in Figure 5; it includes the FPGA, the monitor display, speakers, and the oscilloscope.

Figure 5: Overall setup with and without vocoding enabled

Video Demo

This video demonstrates our vocoder. We show voice modulation, pitch-shifting, and GPU graph writing.


Discussion

We modulated input audio with the 32 IIR filters and altered the input's pitch in real time using the FPGA switches. With 32 filters, our modulated voices sounded metallic. By shifting the input's pitch, we raised our voices' frequency above our original tones. Combining both results in speech like that of a shrill-voiced robot. In addition, the FPGA successfully clears waveform pixels written 100 cycles before the current point.

On the GPU side, we note that it took the FPGA about 10-12 seconds to write across the screen at our specified audio sampling rate of 24 kHz. The following figures show our monitor in various states.


Figure 6: Monitor with no input audio
Figure 7: Monitor with one voice sample in 10 seconds
Figure 8: Monitor with many voice samples in 10 seconds
Figure 9: Monitor with continuous input audio at various pitches

In the VGA color mapping, blue represented low low-pass-filtered energy and red represented high energy; as energy increased, the colors shifted from blue toward red.

According to Figures 7 and 8, the duration of a spectrogram change matched that of brief audio inputs, specifically those shorter than 50 ms. In Figure 9, we also noted that the FPGA was good at detecting changes in pitch for continuous audio, as the spectrogram curve matched that of a speaker continuously whistling at different frequencies. In each of these cases, the audio waveform appears sparse because of the magnitude of the audio input; notice that in Figure 6 the waveform looks less sparse due to the lack of audio input.

Usability

Our design is very usable: a user only needs to speak into the mic and set the switches to enable or disable modulation and change the output pitch.


Conclusions

A vocoder is a very natural system to implement in hardware on an FPGA, as the parallel nature of FPGA design lends itself to real-time digital signal processing techniques. In addition to the real-time benefits, this project gave us exposure to audio bus master and GPU design through the audio interfaces and VGA functions we implemented. We also needed to consider the resource constraints of the FPGA, particularly the DSP blocks available for our hardware multiplies; this required sharing resources between channels as well as careful design of our hardware's mathematical operations.

This project also catered to our love of digital signal processing and the tonal qualities of sound, resulting in a very enjoyable design experience.

Future Work

With more time on this project, we could add more settings to our user interface, including pitch-shifting to lower frequencies and setting different power allocations per channel.

Standards

No standard was involved in this project.

Intellectual Property Considerations

The development environment, Quartus II, was developed by Intel/Altera using compilers only available from them. The QSys bus design tool is also Intel's intellectual property. Some of our modules were modified from existing code found on the ECE 5760 homepage, including the audio bus master, VGA GPU, and IIR filter Verilog module.

Legal Considerations

This project is not subject to any legal considerations.

Acknowledgements

We would like to thank Professor Bruce Land and our TAs, Josh Diaz, Ryan Hornung, and Adam Weld for their continuous help during the development of this project. Without their support, this project would not have been possible.

Appendix

Appendix A: Permissions

The group approves this report for inclusion on the course website.

The group approves the video for inclusion on the course Youtube channel.

Appendix B: Program Listings

An up-to-date version of all of the code written for this project can be found in the following Github repository: https://github.com/jc2697/ece5760_final_project

Additionally, you may download a local copy here: code.zip (5/13/2019). This is the version of the code used during the project demonstration.

Appendix C: Work Distribution

Joao Pedro: Hardware design and implementation: Filter design, audio modulation, and sound mixing. Report writing: Introduction, High Level Design, Hardware Design, Conclusions.

Justin: Hardware design and implementation: VGA audio waveform/spectrogram, QSys bus design. Report writing: High Level Design, Hardware Design, Results.

Thinesiya: Hardware design and implementation: Filter design, audio modulation, and sound mixing. Report writing: Hardware Design.
