RF Signal Modulation Predictor

By Parker Miller (plm93), Yunyun Zhang (yz2625), and Peter Oh (jo299)

Introduction

We created a radio modulation classifier that predicts the modulation scheme of received wireless signals with a Convolutional Neural Network implemented on the DE1-SoC.

This project used a Software-Defined Radio (an RTL-SDR), attached to the ARM processor via USB, to obtain local radio signals. The radio samples are then sent to the FPGA, where a CNN classifies their modulation scheme (AM-SSB, WBFM, or GFSK). In addition, a spectrogram of the Fast Walsh-Hadamard Transform of the signal over time is plotted on a VGA screen to visualize the received signals.


This system can be seen in the photo below. The silver USB dongle is the RTL-SDR, which is connected to an adjustable dipole antenna. The board it is plugged into is the DE1-SoC. The ADALM-PLUTO on the right is another SDR, which was used for transmitting wireless test signals. Not pictured is a VGA display, which is also connected.

High-Level Design


Source of Idea

The idea for this project was inspired by the class competition of another course (ECE 4200), where the goal was to design a machine learning algorithm that could identify the modulation of various RF signals when given their complex time-domain (quadrature) values. Our group was also interested in neural networks, so we decided to implement one of the more successful algorithms from that competition (a convolutional neural network) in hardware on an FPGA. Since just feeding in test data from a dataset wasn't a true application, we decided that the system should receive live radio data from a software-defined radio. While this was relatively complex, we also wanted a way to visualize the received signals so we could try to tune the radio to find a signal with no prior knowledge. Originally we wanted to use an FFT spectrogram, but we realized it would require too many multiplies to compute in real time, so we chose the Fast Walsh-Hadamard Transform instead to get an alternative frequency-domain representation that relies only on addition and subtraction.


Background Math

Quadrature (I/Q) Signals

In this report we refer to complex, quadrature, or IQ samples of a signal. This is a way a signal can be decomposed into two values from which its amplitude and phase can be found. I refers to the in-phase component and Q refers to the quadrature component (90 degrees out of phase). This convention allows the signal to be represented as the complex value I + jQ, and from this complex number the amplitude and phase follow directly: amplitude = sqrt(I^2 + Q^2) and phase = atan2(Q, I). In this project we won't be using the amplitude and phase; instead, we feed the I and Q data into the CNN and let it determine ideal features for distinguishing the differently modulated signals. This is reasonable because when IQ data is plotted on the complex plane, it forms an image that varies depending on the modulation used. Convolutional neural networks have been shown to perform well on image data, so we should expect reasonable results here too.
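
As a quick illustration (not part of the FPGA design), recovering the amplitude and phase from a single I/Q pair takes only a couple of lines of Python; the numbers here are made up:

    import numpy as np

    i, q = 0.6, -0.8
    sample = complex(i, q)            # represent the pair as I + jQ
    amplitude = abs(sample)           # sqrt(I^2 + Q^2) -> 1.0 for this pair
    phase = np.angle(sample)          # atan2(Q, I), in radians
    print(amplitude, phase)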

Fast Walsh-Hadamard Transform

The Walsh-Hadamard Transform is a generalized Fourier transform which decomposes a signal into a set of orthogonal signals. This is similar to a Fourier transform, which uses sinusoids, except that the Walsh-Hadamard Transform uses square/rectangular signals as its basis. Just as the Fourier transform has the FFT for signals whose lengths are powers of 2, the Walsh-Hadamard Transform has the FWHT, which performs a similar decomposition to speed up computation on power-of-2-length signals. The actual transform is easily understood with the following image (from https://en.wikipedia.org/wiki/Fast_Walsh-Hadamard_transform):

A way to perform this recursive algorithm can be demonstrated with the following Python example code (from https://en.wikipedia.org/wiki/Fast_Walsh-Hadamard_transform):

 
    def fwht(a) -> None:
        """In-place Fast Walsh-Hadamard Transform of array a (length must be a power of 2)."""
        h = 1
        while h < len(a):
            # Combine pairs of elements h apart (one "butterfly" per pair), doubling h each stage
            for i in range(0, len(a), h * 2):
                for j in range(i, i + h):
                    x = a[j]
                    y = a[j + h]
                    a[j] = x + y
                    a[j + h] = x - y
            h *= 2
        

Using this program and some formatted print statements, we were able to generate the Verilog additions and subtractions required to apply the transform to a 128-sample signal. One thing excluded in this example code is a normalization factor of 1/sqrt(2) for each calculation. As this is applied to every calculation, it can be factored out and even removed if desired, since it scales the entire output equally (assuming you only care about the relative values of the transform outputs).
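
A minimal sketch of that generation script (the register name fwht_reg and the exact Verilog formatting are our illustrative choices here, not the code actually used):

    # Prints one Verilog non-blocking assignment per FWHT butterfly,
    # stage by stage, for a 128-sample signal.
    N = 128
    h = 1
    stage = 0
    while h < N:
        print(f"// FWHT stage {stage}")
        for i in range(0, N, h * 2):
            for j in range(i, i + h):
                print(f"fwht_reg[{j}] <= fwht_reg[{j}] + fwht_reg[{j + h}];")
                print(f"fwht_reg[{j + h}] <= fwht_reg[{j}] - fwht_reg[{j + h}];")
        h *= 2
        stage += 1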

Weights Conversion

Once we had a model in TensorFlow that was simple and small enough for the FPGA, we stored all of the weights/parameters in a local .h5 file. Once this file was exported, we imported it into a Jupyter notebook and used Python to format these weights into signed 18-bit (6.12) fixed point.

The .h5 file had 2334 total weights, all as floating point values.

We used a function float2fix(val, width, precision) that we found here: https://stackoverflow.com/questions/41590009/how-to-convert-a-float-point-number-to-a-fixed-point-number-with-a-certain-width

This function allowed us to format our floating point values into 6.12 fixed point.
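
A minimal version of that conversion for our signed 6.12 format (this is our own sketch, not the exact Stack Overflow function):

    def float_to_fix(val, width=18, frac_bits=12):
        """Convert a float to a signed fixed-point bit pattern (two's complement)."""
        fixed = int(round(val * (1 << frac_bits)))       # scale by 2^12 and round
        max_val = (1 << (width - 1)) - 1                 # saturate to the signed 18-bit range
        min_val = -(1 << (width - 1))
        fixed = max(min(fixed, max_val), min_val)
        return fixed & ((1 << width) - 1)                # two's-complement bit pattern

    print(f"{float_to_fix(1.5):018b}")    # -> 000001100000000000 (1.5 in 6.12)
    print(f"{float_to_fix(-0.25):018b}")  # -> 111111110000000000 (-0.25 in 6.12)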

Convolutional Neural Network (CNN)

Convolution Layer Calculation: The convolution model we used in this project is a 2D convolution. First, we added [1,2] zero padding before and after the 128 samples to avoid information loss. The filter size in the convolution is [2,1], which means each neuron in the convolution layer takes the dot product of two samples with the weight vector and adds a bias term. For the nth neuron in the convolution layer:

[neuron_I, neuron_Q]_n = [w_I · sample_I(n), w_Q · sample_Q(n)] + [bias_n, bias_n]

where sample_I(n) and sample_Q(n) are the samples covered by the filter at position n in the I and Q channels, and w_I and w_Q are the corresponding weight vectors.

ReLu Function: ReLu stands for Rectified Linear Unit; the function is f(x) = max{0, x}.

Flatten Layer Structure: The flatten layer reshapes the 3D matrix from the convolution output into a 1D vector for the dense layer calculation. It flattens the kernel dimension first, then the I/Q channel dimension, and lastly the signal-sample dimension.

Dense Layer Calculation: The input to the dense layer is the 1D flattened vector, and the output of the dense layer is 3 neurons, which represent the classification results for the 3 classes in this project. The mathematical representation of the dense layer is Y = ReLu(weight · input + bias) for each class.
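
A rough NumPy sketch of this forward pass under one reading of the layer shapes (the random weights, helper names, and exact output lengths here are illustrative only; the real model's neuron counts are given in the implementation section):

    import numpy as np

    np.random.seed(0)
    num_kernels = 3
    x = np.random.randn(2, 128)                   # rows: I and Q; 128 samples each
    x = np.pad(x, ((0, 0), (1, 2)))               # [1,2] zero padding -> shape (2, 131)

    # Convolution: each kernel slides a 2-tap filter along each channel, plus a bias
    w = np.random.randn(num_kernels, 2) * 0.1
    b = np.random.randn(num_kernels) * 0.1
    positions = x.shape[1] - 1
    conv = np.zeros((num_kernels, 2, positions))  # kernel x channel x position
    for k in range(num_kernels):
        for c in range(2):
            for n in range(positions):
                conv[k, c, n] = w[k] @ x[c, n:n + 2] + b[k]
    conv = np.maximum(conv, 0)                    # ReLU

    # Flatten, then one dense neuron per class
    flat = conv.flatten()
    W_dense = np.random.randn(3, flat.size) * 0.1
    b_dense = np.random.randn(3) * 0.1
    scores = np.maximum(W_dense @ flat + b_dense, 0)
    print("predicted class:", int(np.argmax(scores)))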


Logical Structure

Our system has a few different parts which all need to communicate with each other to make sure processing completes fast enough such that a signal is received, processed, and displayed before the next 128 samples are ready.

This can be broken down into a few different components:

  • RTL-SDR USB radio receiver
  • HPS Linux System
    • Our C program
    • rtl_tcp program
  • FPGA
    • Main State Machine for Graphics and Data Reads
    • FWHT Module
    • CNN Module with Convolution and Dense Layers

The RTL-SDR is responsible for tuning to a specific RF radio frequency and sampling downconverted time domain IQ data.

The rtl_tcp program is responsible for controlling the RTL-SDR and sending those samples to a TCP client.

The C program we wrote on the HPS is responsible for retrieving the samples and writing them to the FPGA SRAM. It also is responsible for collecting user input for gain, sample rate and frequency settings for the SDR and sending them to rtl_tcp over TCP. Finally, it is responsible for resetting the FPGA as well as writing text to the VGA display.

The top level FPGA state machine is responsible for reading out the samples written to SRAM to a register bank and feeding them into the FWHT and CNN modules. Once the FWHT is complete, this state machine draws the output on the VGA display. After that, when the CNN is complete, it outputs the predictions to the HPS over a PIO port. Finally the state machine resets to start the process over for the next 128 samples.


Hardware/Software Trade-off

We utilized software to accomplish tasks that were not possible on the FPGA, as well as tasks that simplified the design without slowing down real-time processing. Tasks that were parallelizable or needed to be done in real time were delegated to the FPGA hardware.

Specifically the parts of our system written in software were the TCP client to retrieve data from the RTL-SDR and forward to the FPGA, the terminal user interface for changing SDR settings, and the VGA text code for displaying what the CNN is currently predicting.

The system elements written in hardware were the VGA display system (QSYS IP), VGA display of the FWHT, the FWHT itself, and the CNN. The FWHT and CNN had parallelized math operations while the VGA display had to operate quickly to maintain near real-time display updates.

Program and Hardware Design


Software for User Interaction and Data Forwarding

The C program running on the HPS (ARM Linux system) performs a few different tasks. First, it memory-maps pointers so we can access PIO ports on the FPGA, SRAM on the FPGA, and memory for the VGA subsystem. After that, it tries to stop any running instances of rtl_tcp before starting a new rtl_tcp process in the background. The rtl_tcp program is required to control the RTL-SDR's settings; it collects data samples, stores them in linked-list buffers, and sends them to a local TCP port once a client connects. Our program then connects to the TCP socket that rtl_tcp opened. At this point, the FPGA is reset through a PIO port, and the HPS clears the VGA display and draws some text to the screen. The program then requests SDR settings from the user and sends them to the SDR over TCP. A separate user-input function is then run as a non-blocking thread so the user can enter new settings at any time. Finally, the main function for collecting data from the SDR is run.

This data collection and forwarding loop is the most important part of the program. At the start of the loop, it retrieves a block of data from the SDR over TCP. It then iterates through this buffer 256 times (128 I and 128 Q samples). Since the data arrives as interleaved I and Q, the code toggles an IQ variable so it can write to the appropriate location in SRAM (offset 1 for I, offset 129 for Q). In addition, the values are offset by -128 to convert from unsigned 8-bit values to signed 8-bit values centered at 0. Once 256 samples have been written, a 1 is written to position 0 in SRAM to flag to the FPGA that the data is valid and ready. The program then waits until the FPGA writes that value back to 0, indicating that it is ready for the next samples. This process repeats until the system is stopped. If it ever runs too slowly to keep up with the samples from the SDR, the rtl_tcp program maintains a buffer to hold them. Even if some samples were lost it would be okay, since only 128 samples are analyzed at a time, and if I and Q are swapped the CNN should still predict properly (since that's just a phase shift).
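
In Python-like pseudocode, that inner loop behaves roughly as follows (the function name and the plain-list stand-in for SRAM are ours; the real code does this in C through a memory-mapped pointer):

    # raw_block: bytes from rtl_tcp, interleaved as I0, Q0, I1, Q1, ...
    # sram: stand-in for FPGA SRAM (index 0 = ready flag, 1..128 = I, 129..256 = Q)
    def forward_block(raw_block, sram):
        i_idx, q_idx = 1, 129
        for n, byte in enumerate(raw_block[:256]):
            value = byte - 128              # unsigned 8-bit -> signed, centered at 0
            if n % 2 == 0:                  # even positions are I samples
                sram[i_idx] = value
                i_idx += 1
            else:                           # odd positions are Q samples
                sram[q_idx] = value
                q_idx += 1
        sram[0] = 1                         # flag the FPGA that the data is ready
        while sram[0] != 0:                 # wait for the FPGA to clear the flag
            pass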


Fast Walsh-Hadamard Transform

The FWHT was implemented in place on 8-bit signals that were 128 samples long. The signal in question was the in-phase (I) component of the live signal coming in from the SDR. This was chosen instead of the signal magnitude because it didn't require extra computation (finding the magnitude sqrt(I^2 + Q^2) is expensive on an FPGA) and looked better for discerning modulation types. As this transform had to be completed in real time, the simplest way to accomplish it was to hardcode all of the additions and subtractions in a state machine. A 128-sample FWHT takes 128*log_2(128) additions and subtractions, so there were 7 computation states which each had 64 additions and 64 subtractions. An important thing to note is that after each operation, the output is shifted right by 1 (a divide by 2) to prevent overflow. This is acceptable as the FWHT typically includes a normalization factor, which this shift is proportional to. When the latest samples were loaded into the FPGA registers, the FWHT module was flagged to begin operation. This loaded a copy of the signal into a new register bank on which all of these computations were performed in place. After iterating through the state machine and completing computation, the transform output was marked as valid and displayed on the VGA screen.
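
For reference, the hardware behaves like this scaled variant of the earlier Python routine, with an arithmetic shift right by 1 after every operation (so the final output is 1/128 of the unnormalized FWHT):

    def fwht_scaled(a):
        """In-place FWHT with a divide-by-2 after every butterfly, mirroring the
        overflow-safe FPGA version."""
        h = 1
        while h < len(a):
            for i in range(0, len(a), h * 2):
                for j in range(i, i + h):
                    x, y = a[j], a[j + h]
                    a[j] = (x + y) >> 1      # shift right = divide by 2 (rounds toward -inf)
                    a[j + h] = (x - y) >> 1
            h *= 2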


VGA Signal Visualization and Main State Machine

The state machine found in our top level code is responsible for a few different operations including reading data from SRAM (written by HPS), plotting FWHT to the VGA display, and waiting until the CNN is complete to send the prediction to the HPS.

The first few states are dedicated to resetting the modules and reading from SRAM until a 1 is read in the first position of SRAM. This indicates that all samples have been written by the HPS to SRAM. After this, the state machine cycles between 2 states (1 read and 1 wait) until all of the IQ samples have been read from SRAM and written to a register bank. When this is complete, the FWHT module is flagged to start calculating. Once that is complete, the next state begins for VGA drawing.

Taking the FWHT output and plotting it on a VGA display is relatively straightforward based on our previous labs. Here we take advantage of a VGA subsystem (QSYS IP) provided in code modified by Bruce Land and based on Intel/Altera University Program graphics example code. This allows us to write color data to pixel locations in an SRAM buffer, which the subsystem reads and sends to the VGA display.

Once the FWHT module indicates it is complete, we start iterating a Y position register so that the corresponding value of the FWHT is written to the appropriate location in VGA SRAM. To make this easier to see, we actually increment to a maximum Y of 255 and divide the FWHT index by 2 (shift right by 1), so each value is drawn to the screen over 2 consecutive pixels instead of one. This continues until the whole FWHT is drawn in a column. At this point the Y position is reset, the HPS is signalled to write new samples to SRAM (by writing a 0 to position 0 in SRAM), and the X position is incremented so a new column is drawn when the FWHT of the next 128-sample signal is complete. After the screen is completely filled, both X and Y reset so drawing restarts on the left side of the screen and overwrites the old data. This is similar to a spectrogram, where an FFT of a signal is plotted over time. (A sketch of this column drawing follows below.)
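
The index-to-pixel mapping for one column is simple; here is a sketch of the idea (write_pixel, to_color, and SCREEN_WIDTH are stand-ins for the VGA SRAM write and color mapping, not our actual Verilog):

    SCREEN_WIDTH = 640

    def to_color(value):
        # Hypothetical mapping from an FWHT value to an 8-bit pixel color
        return min(abs(value), 255)

    def draw_fwht_column(fwht, x, write_pixel):
        """Draw one 128-point FWHT as a single spectrogram column at screen column x."""
        for y in range(256):                    # 256 screen rows
            value = fwht[y >> 1]                # y/2: each FWHT bin covers 2 pixels
            write_pixel(x, y, to_color(value))
        return (x + 1) % SCREEN_WIDTH           # next column; wrap back to the left edge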

The final state is entered after VGA drawing of a column is complete; it waits for the CNN to finish before moving back to the first state, which reads from SRAM to see whether new samples have been loaded.


Exploring and Training the Neural Network on Keras

This is a summary of the model that we generated using TensorFlow; the last two steps (activation and reshape) are not done on the FPGA to save resources and simplify the overall flow. The last activation layer is a "softmax" function, which ensures that the final outputs for all of the classes (3 in our case) sum to 1. We don't need this softmax function on the FPGA because we can simply check which of the three prediction values is greatest and pick that class as the final prediction.
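
A two-line check of why dropping softmax is safe (softmax is monotonic, so the largest raw score is always also the largest probability):

    import numpy as np

    scores = np.array([1.2, -0.4, 0.3])              # hypothetical raw dense-layer outputs
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax: non-negative, sums to 1
    assert np.argmax(scores) == np.argmax(probs)     # the winning class is the same either way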

Here is another visualization to understand what's really going on with our CNN:



Our CNN model did not look like this initially. Many example models online had 5+ layers, but this was definitely not feasible with the amount of resources we had on the FPGA. So, after experimenting with the model, we were able to shrink it down to 2 layers (Convolution + Dense).

Another modification we had to make while experimenting with the model was shrinking the number of kernels in the Convolution layer from 10 to 3. This was motivated by the fact that we were running out of logic blocks on the FPGA to implement more than 5 kernels for the Convolution layer.


Weight Generation From Tensorflow

Once the model was finalized and weights exported as a .h5 file, generating the weights for the ROM table in the proper format was not so difficult. We utilized a float-to-fixed function provided by resources online to make sure the weights had the right sign and right value.

It’s important to note that the reason why we have to export the weights offline is that the training process of the CNN (forward propagation + backward propagation) requires a lot of memory and computation on the FPGA. So, in order to save time, memory, and computation, we trained the model offline on a CPU and then fed in the weights to the FPGA to do a simple forward propagation + classification.


Implementing the trained Neural Network on the FPGA

  • Convolution Layer
  • The convolution layer includes the per-neuron convolution calculation, the serial calculation across samples, and a top-level convolution module that parallelizes the kernels and adjusts the data structure.

  • Dense Layer
  • The dense layer calculation is described in the previous section. First, we took the dot product of the output sample data from the convolution layer with the corresponding weights, then added the bias term for the class neuron in the dense layer. The three classes and the I/Q channel data were calculated in parallel.

  • ReLu Activation Function
  • Since the ReLu activation function is max{0, x}, we realized that this just filters out the negative values and replaces them with zeros. We checked whether the sign bit is 1; if it is, we replaced the output value with zero.

  • Overall Structure
  • The overall structure is shown below (nero = neuron):

  • Difficulties in CNN implementation
  • The trickiest part was manipulating the high-dimensional calculations and data arrays. At the beginning, we used an unpacked array representation for the high-dimensional data, with only the data width (18 bits per data point) packed. However, when we used more than a 2D unpacked array, Verilog/SystemVerilog did not support the array slicing we needed, and Verilog did not support passing more than one unpacked array through a module port. We therefore moved to a packed array form, which puts the dimension declarations before the variable name, to make the array slicing work properly. Another tricky part was that we extended the output results from the dense layer from 18-bit 6.12 fixed point to 32 bits. Since the dense layer sums 3*129*2 calculated neurons to produce the final result, the output would otherwise overflow. To extend the signed 18-bit values to 32 bits, we had to do the sign extension manually (a small sketch of the idea follows below).
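
The manual sign extension is the usual replicate-the-top-bit pattern; here is a tiny Python check of the idea (illustrative only, the real code is SystemVerilog):

    def sign_extend_18_to_32(value18):
        """Interpret an 18-bit two's-complement pattern and return the equivalent
        32-bit pattern by replicating the sign bit."""
        if value18 & (1 << 17):                          # sign bit set -> negative value
            return (value18 | 0xFFFC0000) & 0xFFFFFFFF   # fill bits 18-31 with ones
        return value18

    print(hex(sign_extend_18_to_32(0x3FFFF)))  # 0x3ffff is -1 in 18 bits -> 0xffffffff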

Results


Testing the FWHT and VGA Display

The FWHT module was first fully tested through simulation in ModelSim with test signals and compared against the same FWHT performed in Python. Once this was integrated into the VGA state machine for display, the same test signal was used to verify that it plotted values to the screen properly, and SignalTap was used to confirm that the output matched what was previously found with ModelSim/Python.


Testing the HPS C program and interaction with the FPGA

The software running on the HPS was tested incrementally, so each part could be written and tested separately. First, a PIO reset was set up as usual to be able to reset the FPGA, which was straightforward. Next, TCP code was added to open the socket with rtl_tcp and retrieve samples from the SDR. These were printed out to verify operation before attempting to write to FPGA SRAM. At this point, the writes to SRAM were added so that the data could be streamed to the FPGA, and SignalTap was used to confirm that the correct values were received in the right locations and read out correctly into FPGA registers. Finally, code for user input and for writing to the SDR over TCP to change settings was added and debugged. This was rather difficult to debug, as there were some strange nuances in what data types were required for the TCP writes, and even when they worked, the SDR received incorrect data (seen through terminal error output). After researching, this was determined to be due to the C code and the network protocol using opposite byte ordering, which required using htonl() to flip the bytes being sent.
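
In Python terms, the byte-order mismatch looks like this (the 4-byte parameter below is just an illustration of host vs. network ordering, not the exact rtl_tcp packet format):

    import struct

    freq_hz = 903_000_000
    host_order = struct.pack('<I', freq_hz)   # little-endian, what the ARM naturally writes
    net_order = struct.pack('>I', freq_hz)    # big-endian "network order", what htonl() produces
    print(host_order.hex(), net_order.hex())  # same value, bytes in opposite order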


Testing the Keras CNN Model

This confusion matrix shows how well our final model does on predicting a test split. Although this graph varies slightly depending on how the data was split each time, for the most part the matrix seems to classify each label in the proper way. It’s important to note that this matrix does not represent how well our FPGA system predicts; the input data for the FPGA system has different SNRs (Signal-to-Noise Ratio) than the training/testing data set, thus it’s hard to say whether this matrix is close to what the real prediction error was.


This graph shows the change in training loss and error as the model goes through each epoch of training (up to 100). This graph is quite helpful in understanding how many epochs are necessary to properly train the model without running into overfitting. You can see from the graph that after 40 epochs, the training loss and error begin to decrease significantly. Thus, for our final model we ran the training process for between 50 and 100 epochs to get the best final weights possible.


Testing the CNN Model in ModelSim

We used test sample IQ data from the Deepsig dataset as the input to the CNN model. The Verilog CNN then used the trained parameters from Keras to make predictions on the sample test data. For a while, we couldn't get the prediction to match what was expected, and realized that the ordering of the weights exported from Keras was incorrect. Finally, we compared the prediction result with the Keras result to make sure that the CNN model on the FPGA worked as we expected. To confirm that the predictions matched the model we trained in Keras, we tested with class 1, class 2, and class 3 datasets respectively. The difference between the FPGA and Keras prediction values was less than 0.02. The convolution layer took around 129 clock cycles, and the dense layer took around 387 clock cycles. In total, the CNN classification model made a prediction every 520 clock cycles.


Final System Integration

After the CNN module was fully debugged, integrating it into the system with the FWHT and VGA display was relatively straightforward, aside from needing to make sure reset pulses were timed correctly and the main state machine waited for the CNN computation to complete.

At this point we tested the system by broadcasting the various modulation schemes we wanted to classify and observed the VGA display to see the changes in the FWHT spectrogram and the predicted modulation. Specifically, we used an ADALM-PLUTO SDR controlled by a MATLAB example program "helperModClassSDRTest.m" (https://www.mathworks.com/help/deeplearning/ug/modulation-classification-with-deep-learning.html) to generate and broadcast these signals.

Below is a screenshot of our C program starting up for this test. The lines up until "client accepted!" are output from rtl_tcp, while the lines "Enter ..." and "To change/stop" are from our program. In this example, we enter a center frequency of 903 MHz, a sample rate of 1 MS/s, and a gain of 50 dB. We then see from rtl_tcp that they were set correctly. The lines "ll+, now #" are outputs from rtl_tcp indicating that its buffer was initially filling, but this stopped after entering the desired SDR settings. This combined output is a little strange to look at, but it was the simplest way to see whether the receiver accepted the settings (as not all values are valid).



The following images are the various displays seen depending on which signal is being received during this test:

  • No Signal: White Noise
  • AM-SSB Correctly Predicted
  • FM Correctly Predicted
  • GFSK Correctly Predicted
  • FM Incorrectly Predicted (prediction changing often)

While typical FFT spectrograms of these modulation schemes are well known, we had to broadcast a strong signal locally to determine which FWHT spectrograms corresponded to which modulation scheme.

When broadcasting a strong signal near the RTL-SDR receiver with another SDR (ADALM-PLUTO), we get reasonably reliable prediction of AM-SSB and GFSK. Prediction of wideband FM is a bit more unpredictable and requires careful placement of the transmitter and receiver to get accurate predictions. A bad/low-confidence prediction can be observed in the last photo, where the prediction text is changing rapidly.

Conclusion


Results vs Expectations

While we initially hoped to predict more modulation schemes and capture various broadcast RF signals, we managed to successfully implement almost all of what we originally intended. In the completed project, we were able to differentiate between and predict the AM-SSB, GFSK, and WB-FM modulation schemes based on locally broadcast test signals from a nearby SDR. In retrospect, we realized that getting high accuracy for predicting a dozen modulation schemes wouldn't be possible with this hardware (in real time at least), as most researchers use much more complicated neural networks to achieve those kinds of results.

At first we wanted to differentiate between 5-6 schemes, but after realizing that this would require storing tens of thousands of parameters for the neural network, we downsized the system so we could store all of the parameters on the FPGA without taking a latency hit from reading parameters out of some form of RAM. We also likely gave up some prediction accuracy by reducing the number of convolution kernels for the same reason (too many parameters). In addition, while we originally wanted to predict the modulation of local FM radio and air traffic control AM signals, we found that our location wasn't ideal for this and the signals were too noisy. This could be improved with a higher-end SDR or better hardware filtering.

Overall this was a success, but with more time and these results we would implement this system on a larger, more powerful FPGA which stored parameters in low latency RAM and collected data from a low noise SDR with tunable bandpass filtering. Also, if you want to do this yourself, don’t write the neural net in Verilog from scratch, use a high level synthesis tool to generate the hardware. It’s way more efficient and straightforward despite the long compile times.


Lessons Learned

Here are some other lessons we learned over the course of the project (some of which we hadn't taken seriously enough despite hearing them before):

  • Simulate, simulate, simulate!
  • There's always "just one more test" - Peter Oh
  • Test modules at the lowest level possible before integrating into large blocks of code. Incremental and unit testing is key!
  • Verilog multidimensional arrays are scary (SystemVerilog supports them but they can be troublesome)
  • Define a proper file naming convention at the start of a project
  • If relying on other technology/black box software, make sure it can do everything you need before getting too far into a project
  • Keep track of FPGA resources early on during design or you will need to rewrite your hardware later
  • Don't implement a neural net in hardware from scratch, use high level synthesis

IP Considerations

This project was compiled using Quartus II tools (Intel/Altera) including the QSys bus tool and Intel/Altera QSys IP modules (SRAM, AV_Config, Pixel_DMA, VGA_Subsystem, Clock Bridge, and PIO). Our main Quartus project was based off of example code provided by Intel/Altera’s University Program and modified by Bruce Land. The original can be found in the below reference titled "Fast VGA Graphics".

We also used the Deepsig Inc. dataset RadioML 2016.10A and example Jupyter notebook as a basis for our CNN Python programs. These are provided under the CC BY-NC-SA 4.0 license and can be found in the references below.

Appendices


Appendix A: Permissions

  • The group approves this report for inclusion on the course website.
  • The group approves the video for inclusion on the course youtube channel.


Appendix B: Schematics of External Hardware and Test Setup


Appendix C: Work Distribution

  • Parker
    • Developed C code on HPS to interface with RTL-SDR and communicate with the FPGA
    • Implemented FWHT on FPGA
    • Designed state machine on FPGA to read streaming IQ data from the SDR into registers, draw FWHT output to VGA, and communicate CNN results to HPS for drawing to VGA
    • Setup data transmission of modulation signals with MATLAB and PlutoSDR
  • Yunyun
    • Helped interpret the weights, layer structure, and internal calculations of the Keras CNN model
    • Worked with Peter on implementation of trained CNN and the testbench for the FPGA
    • Debugged the FPGA/Systemverilog testing for the CNN prediction model
  • Peter
    • Worked on creating a CPU-based CNN via TensorFlow + Keras
    • Helped with training, modifying, and testing the offline model
    • Wrote python code to format offline weights into proper ROM table entries for the FPGA
    • Worked with Yunyun on implementing convolution and dense layer in verilog
    • Helped debug the FPGA/Verilog code


Appendix D: References


Appendix E: Commented Code


Appendix F: Acknowledgements

“Our team would like to thank Hunter Adams and Bruce Land for all of their help and support with this project. We’d also like to mention Katie Bradford for all her help as the TA for this class. Despite the fact that learning had to be remote, the class was nonetheless super fun, challenging, and exciting. Thank you for an awesome semester!”