FPGA-Based Simple CNN MNIST Digit Classifier

Project Introduction

We wanted to explore the DE1-SOC's custom parallel computing power for inference with a neural-network-based digit classifier.

Signal propagation through an artificial neural network is a highly parallel data flow, and is therefore not well suited to sequential programming languages running on von-Neumann-based computer architectures. From lab 2 and lab 3 we observed the power of the FPGA for building customized parallel computing hardware, so we wanted to see whether we could exploit the same strength for neural network inference. We decided to implement the entire inference structure in the DE1-SOC FPGA and use the HPS, running Linux on the DE1-SOC, to transfer the input data into the FPGA's memory through a Qsys embedded system. We used PyTorch to train the simplest CNN that is capable of classifying the MNIST dataset, and the parameters were saved after training to be used during inference.

High level design

Rationale and Sources

Machine learning is one of the fastest-growing research areas today. However, for some ML applications, hardware becomes the major bottleneck for performance and efficiency. For the final project, we wanted to explore hardware acceleration for the inference of a trained Convolutional Neural Network (CNN) in particular, as it “has become dominant in various vision tasks” [1].

A simple CNN can be used for basic image recognition, or for voice recognition using the spectrogram of the voice data. Even though a CNN can also be used to extract features from raw data, that capability is hard to demonstrate on its own. In this sense, we felt that digit recognition fit the scope of the final project best.

Mathematical Background

A picture consists of a large matrix of numbers. We use a greyscale picture as the system input, where each value ranges from 0 to 255; the higher the value, the whiter the pixel. As stated in [2], this range is a trade-off between the efficacy of storing information about the image (256 values fit perfectly in 1 byte) and the sensitivity of the human eye (we distinguish a limited number of shades of the same color).

To reduce the volume of the data, convolution is used in many computer vision algorithms. Convolution is a process in which we pass a kernel (or filter) over the input picture and generate new values. The mathematical expression is listed below [2]:
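The expression appears as a figure in the original write-up; a standard form of the discrete 2-D convolution (with input image f and kernel h), which is presumably what the figure showed, is

    G[m, n] = (f * h)[m, n] = \sum_{j} \sum_{k} h[j, k] \, f[m - j, n - k]

In practice, CNN frameworks such as PyTorch compute the closely related cross-correlation, which indexes f[m + j, n + k] without flipping the kernel; for learned kernels the two are interchangeable.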

The figure in [2] demonstrates one step of the convolution. By changing the values in the kernel, different features of the input image can be extracted.

Besides the convolutional layer, a CNN often uses pooling layers to further reduce the data size and accelerate the calculation. This layer divides the matrix into regions and selects a single number from each region according to some rule. Max pooling is used in our design, where the maximum value in each region is selected and sent to the output.
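As a concrete illustration (ours, not from the original report), a 2x2 max pool with stride 2 over a small feature map can be written in a few lines of NumPy:

import numpy as np

# toy 4x4 feature map
fmap = np.array([[1, 3, 2, 0],
                 [5, 6, 1, 2],
                 [7, 2, 9, 4],
                 [1, 0, 3, 8]])

# 2x2 max pooling with stride 2: each 2x2 region collapses to its maximum
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 2]
                #  [7 9]]

Each output entry keeps only the strongest response in its region, which is why pooling reduces data volume while preserving the dominant features.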

Logical Structure

Overall Network

The overall structure of the network is shown in the image above. The image goes through two stages, each consisting of one convolution layer and one MaxPool/ReLU layer. The result then goes through an MLP layer that produces the classification. For a given 28x28 image, the first convolution layer maps it to 2 channels of 24x24 images using two 5x5 kernels. It then goes through a max-pooling layer and a ReLU activation layer and becomes 2 channels of 12x12 images. Next, the second convolution layer maps it to 3 channels of 10x10 images using three 2x3x3 kernels. It then goes through a max-pooling layer and a ReLU activation layer again and becomes 3 channels of 5x5 images. The result is flattened into a 1x75 array and processed by the fully connected layer. The output is a 1x10 array, and the index of the largest number in that array is the output of the system.
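A PyTorch sketch of this architecture (our reconstruction from the description above; layer names are illustrative, not the exact training script):

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 2, kernel_size=5)  # 1x28x28 -> 2x24x24
        self.conv2 = nn.Conv2d(2, 3, kernel_size=3)  # 2x12x12 -> 3x10x10
        self.pool = nn.MaxPool2d(2)                  # halves each spatial dimension
        self.fc = nn.Linear(3 * 5 * 5, 10)           # 75 -> 10 class scores

    def forward(self, x):
        x = torch.relu(self.pool(self.conv1(x)))     # 2x12x12
        x = torch.relu(self.pool(self.conv2(x)))     # 3x5x5
        x = x.flatten(1)                             # 1x75
        return self.fc(x)                            # 1x10

# sanity check of the shapes described above
out = SimpleCNN()(torch.zeros(1, 1, 28, 28))
print(out.shape)   # torch.Size([1, 10])

The predicted digit is the index of the largest of the 10 scores.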

Hardware/Software Tradeoffs

One hardware tradeoff we made is robustness over speed. Theoretically, the max-pool and ReLU layers could start before the convolution layer ahead of them finishes, since they only need part of that layer's final result to compute their outcome. However, doing so would make the design much harder to build and debug. For this reason, we decided to add a memory unit between every layer's calculation and only allow the next layer to start when the previous layer has completed all of its computations. Doing so definitely slowed our design down, but we think this tradeoff is reasonable within the scope of our design. Optimization for speed can be a future project itself once the correctness of our modules is proven.

Another hardware tradeoff we made is robustness and development speed over size. As mentioned before, the address generation for the sliding windows in our design requires three types of counters. It would be possible instead to generate all the addresses needed for the sliding windows with complicated if statements. By introducing extra counters into our design, we made it larger. However, this also means we only need to make minor changes to the modules to achieve different styles of sliding window instead of re-writing the entire control logic for the address generator, which saved us a lot of code. Also, the different counter functionalities can be checked separately, which made our design easier to debug. To ensure correct address generation, we simply need to make sure the start-address generator is counting at the correct interval, that the horizontal and vertical step-tracking counters are resetting correctly, and that the actual address generator is changing according to the horizontal and vertical step-tracking counters. If all counters function correctly, we can ensure the overall address generation for the sliding windows is correct.

Program/Hardware Design

Program Details

Pytorch

We used PyTorch to design the simple CNN because we are most familiar with it; other libraries could be used for this step as well. After playing around with the parameters, we came up with a simple CNN with 2 convolution layers and 1 fully connected layer. The first convolution layer has a 1-channel input, since the MNIST images are grayscale, and a 2-channel output; the kernel size is 5, the stride is 1, and the padding is 0. The second convolution layer has a 2-channel input and a 3-channel output. A max-pooling layer and a ReLU layer follow every convolution layer. We picked these numbers so that no padding is needed at any layer, which makes the FPGA's data flow slightly easier to manage. The model is trained for 10 epochs with a batch size of 600 and achieves 97% accuracy on the MNIST validation set. We also added hooks to log the outputs at each layer for hardware validation and debugging.
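A minimal sketch of that logging using forward hooks (assuming a model object such as the SimpleCNN sketched earlier; the actual logging code may have differed):

import torch

model = SimpleCNN()   # assumes the trained weights have been loaded
model.eval()

activations = {}

def save_output(name):
    def hook(module, inputs, output):
        # detach so the logged tensors can be dumped and compared against the FPGA
        activations[name] = output.detach()
    return hook

# note: a layer reused in forward() (like self.pool) fires its hook more than once
for name, layer in model.named_children():
    layer.register_forward_hook(save_output(name))

with torch.no_grad():
    model(torch.zeros(1, 1, 28, 28))

for name, act in activations.items():
    print(name, tuple(act.shape))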

Intel Hex file

To initialize the M10K blocks generated from the IP catalog, we use Intel HEX files. A simple example of a hex file is as follows:
:040001000000004AB1
:00000001FF
In the first line, from left to right: the ':' is the start code of a record; 04 means the record length is 4 data bytes; 0001 means the address is 1; the following 00 means this is a data record; 0000004A means the data 0x0000004A will be stored; and the final B1 is the checksum of the record, the two's complement of the sum of all preceding bytes (0x04 + 0x00 + 0x01 + 0x00 + 0x00 + 0x00 + 0x00 + 0x4A = 0x4F, and 0x100 - 0x4F = 0xB1). The second line, :00000001FF, is the end-of-file record.
We use the following program to convert a Python list into a hex file. The list may contain floats, and the M10K data width is 32 bits per word. The number of fractional bits in the fixed-point notation is parameterized by DEC_BITS.

import numpy as np

# number of fractional bits in the fixed-point representation
DEC_BITS = 5

lines = [-1.0, 2.3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# signed float to unsigned fixed point (4 bytes per word)
for k, line in enumerate(lines):
    lines[k] = int(np.round(line * (2 ** DEC_BITS)))
    if lines[k] < 0:
        lines[k] = lines[k] + 4294967296  # + 2**32, signed to unsigned

# unsigned integer to 8-character hex string (e.g. 1 -> '00000001')
for k, line in enumerate(lines):
    lines[k] = hex(line).upper()[2:].zfill(8)

# build each record: byte count (04), address, record type (00), data, checksum
for i, line in enumerate(lines):
    lines[i] = '04' + hex(i).upper()[2:].zfill(4) + '00' + line
    if i >= 0xffff:
        print("Too Many Data!")
    data = bytearray.fromhex(lines[i])
    c_sum = (0x100 - sum(data)) & 0x0ff  # two's-complement checksum of the record bytes
    lines[i] = lines[i] + hex(c_sum).upper()[2:].zfill(2)

# write all records, then the end-of-file record
with open('m10k.hex', 'w') as f:
    for line in lines:
        f.write(':' + line)
        f.write('\n')
    f.write(":00000001FF")
    f.write('\n')
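As a quick sanity check (our addition, not part of the original toolchain), each generated record can be parsed back and its checksum re-verified, since the bytes of a valid record, checksum included, must sum to 0 modulo 256:

# verify one Intel HEX record, e.g. ':040001000000004AB1'
def check_record(record):
    data = bytearray.fromhex(record.lstrip(':'))
    return sum(data) & 0xFF == 0

with open('m10k.hex') as f:
    assert all(check_record(line.strip()) for line in f if line.strip())
print("all records valid")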

Image conversion Python file

This program converts the image into a greyscale array format and saves it as a .dat file. To better verify that the formatted result is correct, the array is also converted back to a PNG and displayed. The final .dat file of the image is then copied to the FPGA using the SCP command, invoked directly by the Python script.

import torch
import torchvision
import torchvision.transforms as transforms
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import os
# 10k test data in total, please select data here
DATA_NUM = 3999 # from 0 to 9999


# Parameters
DEC_BITS = 9

img = Image.open('C:/Users/ys566/Documents/GitHub/ECE5760/final project/demo_pack/image_data/image_raw.png')
transform = transforms.Grayscale()
img = transform(img)
# img.show()

transform_ts = transforms.ToTensor()
tensor = transform_ts(img)
print(tensor)
listTensor = tensor.numpy()
listTensor = listTensor.squeeze()
plt.imshow(listTensor, cmap='gray')
plt.show()

lines = list(listTensor.flatten())
# print(lines)

def list_to_unsigned(lines):
    k=0
    for line in lines:
        lines[k] =  int(np.round(line * (2**DEC_BITS)))
        if(lines[k] < 0):
            lines[k] = lines[k] + 4294967296 # +2**32, signed to unsigned
        k = k + 1
    return lines

lines = list_to_unsigned(lines)
print(lines)
str1 = "".join((str(line)+",") for line in lines)
str1 = str1[:-1]
str1 = "int mnist_in[784] = {" + str1 + "};"
with open("C:/Users/ys566/Documents/GitHub/ECE5760/final project/demo_pack/image_data/mnist_in.dat", 'w') as f:
    f.write(str1)

print("success")

os.system('scp "C:/Users/ys566/Documents/GitHub/ECE5760/final project/demo_pack/image_data/mnist_in.dat"  root@10.253.17.15:/home/root/final_project/demo/')


FPGA Linux part: After the .dat file is sent via SCP, a C program is run on the HPS side of the FPGA. The program is in charge of two parts:

* Converting the .dat entries into fixed point. Since the original greyscale values are floating point in the range 0-255, we need to normalize them to 0-1 and then convert them to the fixed-point format used by our modules. 
* Sending the converted result to the FPGA fabric via PIO. The image data are sent and saved into an M10K block, which can then be read by our modules. 


Hardware Details

Processing Elements

Design of sliding window

The logic structure follows the description of the sliding window above. At a high level, we implement two FSMs. The FSM at the lowest level controls how each sliding window moves through one iteration over the image data memory. The top-level FSM controls when to load the next weight data into the sliding window memory. In each iteration, each sliding window of image data is multiplied with the weight data in the sliding windows, which is how the convolution is performed. The result is saved directly into the result M10K and can either be used for the next level of convolution or read out for verification purposes. To minimize the use of inter-module wires, we adopted a streaming method that streams out each element of the weight window and the image data window. A counter in the multiplier FSM records the number of streamed elements it has received and then proceeds to output the result. The main control of the sliding window is built around two counters: a column counter and a row counter. These two counters track how many steps the window has taken and control the start and end of each iteration of the sliding window.

always@(posedge clk) begin
    if (reset) begin
      column_count <= 'b0;
      row_count <= 'b0;
    end
    else if (update) begin
      column_count <= (column_count < LAYER_WIDTH - PARTITION_WIDTH) ? column_count + 'b1 : 'b0;
      row_count <= done ? 'b0 : (column_count == LAYER_WIDTH - PARTITION_WIDTH) ? row_count + 'b1 : row_count;
    end
end
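As an aside, a small Python reference model of this counter behavior (our sketch, using the first layer's sizes; parameter names mirror the Verilog) is convenient for checking the generated window positions and read addresses:

LAYER_WIDTH, LAYER_HEIGHT, PARTITION_WIDTH = 28, 28, 5

def window_origins():
    # every (row, col) the sliding window visits, stride 1
    for row in range(LAYER_HEIGHT - PARTITION_WIDTH + 1):
        for col in range(LAYER_WIDTH - PARTITION_WIDTH + 1):
            yield row, col

def window_addresses(row, col):
    # flat M10K read addresses covered by the window anchored at (row, col)
    return [(row + r) * LAYER_WIDTH + (col + c)
            for r in range(PARTITION_WIDTH)
            for c in range(PARTITION_WIDTH)]

print(len(list(window_origins())))   # 576 window positions (24 x 24) for the first layer
print(window_addresses(0, 0)[:6])    # [0, 1, 2, 3, 4, 28]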
Design of Fixed Point Multipliers

The logical implementation of our multiplier module is similar to the ones from previous labs. The multiplier uses combinational logic and outputs its result directly when both the weight and the image data are presented. The full double-width product is computed as "partial" in the code snippet below, and only the bits corresponding to the chosen fixed-point format are output as the result.

logic signed [2* (INT_DIGIT + DECIMAL_DIGIT) - 1 : 0] partial;
assign partial =  a * b;
assign result = partial [  INT_DIGIT + 2 * DECIMAL_DIGIT - 1: DECIMAL_DIGIT] ;
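In software terms, the slice above is an arithmetic right shift of the double-width product by the number of fractional bits. A quick fixed-point reference model (our sketch; the 7/20 split is only an example, not necessarily the widths used in the design):

INT_DIGIT, DECIMAL_DIGIT = 7, 20          # example split of a 27-bit signed word

def to_fixed(x):
    return int(round(x * (1 << DECIMAL_DIGIT)))

def fixed_mul(a, b):
    # equivalent to taking partial[INT + 2*DEC - 1 : DEC] in the Verilog
    # (the Verilog additionally drops the top INT_DIGIT bits, which only matters on overflow)
    return (a * b) >> DECIMAL_DIGIT

a, b = to_fixed(1.5), to_fixed(-2.25)
print(fixed_mul(a, b) / (1 << DECIMAL_DIGIT))   # -3.375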
Design of Convolution Layers

With the sliding windows and fixed-point multipliers designed, we can implement a convolution layer. The convolution requires two sliding windows, one for the weights and one for the image, plus one multiplier. For modularity, we instantiate each convolution layer with two sliding window loaders, one multiplier, and one M10K for the results.

Design of Max Pooling Layers

For max pooling, we designed a module that accepts a stream of inputs and outputs the largest value. This module is also designed to be incorporated into the convolution layer module, which directly outputs the largest result after the convolution results are streamed in. The main part of this module is an if statement that updates the output whenever the input data is larger than the previous largest value. From a hardware perspective, this is the most direct and efficient way of designing the MaxPool/ReLU layer.

always@(*) begin
    result_in = result;
    address_in = address;
    if (run)
        result_in = (data_in > result) ? data_in : result;
    else if (clear) begin
        result_in = 'b0;
        address_in = address + 'b1;
    end
end
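Behaviorally, because the running result is cleared to zero between windows, a negative value can never win the comparison, so the module returns max(0, window) and effectively fuses max pooling with ReLU. A Python equivalent of the streamed comparison (our sketch):

def maxpool_relu_stream(window_values):
    # mirrors the comparator above: result starts at 0, so negatives are clipped
    result = 0
    for value in window_values:
        if value > result:
            result = value
    return result

print(maxpool_relu_stream([-3, -1, -7, -2]))   # 0 (an all-negative window is clipped by the implicit ReLU)
print(maxpool_relu_stream([1, 6, 3, 2]))       # 6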
Design of Mergers

When performing convolution on multi-channel data, the convolution needs to be done for each channel, and to obtain the actual result we need to add the results from all channels together. To ensure correct data flow inside our design, we implemented a merger after the second convolution layer, before the max pool and ReLU. The merger accumulates the result of each channel into the same M10K memory for later use. For debugging purposes, we also made the merger layer's result accessible through the PIO in the top layer. The merger has one input port that waits for input data. After it receives input data and run is high, the merger outputs the read_address and write_address of the input data as well as result_out. The main module inside the merger is an FSM that loads the initial address and accumulates the output address during each iteration.

(   input logic clk,
    input logic reset,
    input logic run,
    input logic signed [DATA_WIDTH-1:0] data_in,
    output logic [ADDR_WIDTH-1:0] read_address_out,

    output logic signed [DATA_WIDTH-1:0] result_out,
    output logic [ADDR_WIDTH-1:0] write_address_out,
    output logic we_out,
    output logic merge_done
    );
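Functionally, the merger just sums the per-channel convolution results that land at the same address. A reference model of that accumulation (our sketch, with channel data shown as plain lists):

def merge_channels(channel_results):
    # channel_results: one list of convolution outputs per channel, all the same length
    merged = [0] * len(channel_results[0])
    for channel in channel_results:
        for addr, value in enumerate(channel):
            merged[addr] += value   # accumulate into the shared result memory
    return merged

print(merge_channels([[1, 2, 3], [10, 20, 30]]))   # [11, 22, 33]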
Multi-Layer Perceptron (MLP) and comparators

The Multi-Layer Perceptron (MLP) module is the final layer of the CNN structure. It behaves like a matrix-vector multiplier. We have a 75x1 input vector and a 10x75 weight matrix, and all of this data is stored in M10Ks. We made full use of the beauty of hardware, parallelism: each row of the weight matrix has its own iterator.

There is an interface between the data path and the input M10Ks. Since the input vector consists of 3 channels, it is stored in 3 different M10Ks, each of which is read with addresses from 0 to 24. To make it easier to read from these M10Ks, we made an interface module between the data path and the 3 M10Ks, creating the illusion for the data path that there is only one memory with 75 entries (read addresses 0 to 74).
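The mapping itself is just a divide and a modulo on the flat address; a sketch of what the interface computes (sizes follow the 3 x 25-entry layout described above):

ENTRIES_PER_M10K = 25   # each channel's flattened 5x5 result fills one M10K

def split_address(flat_addr):
    # flat_addr in 0..74 -> (which M10K to read, address inside that M10K)
    return flat_addr // ENTRIES_PER_M10K, flat_addr % ENTRIES_PER_M10K

print(split_address(0))    # (0, 0)
print(split_address(26))   # (1, 1)
print(split_address(74))   # (2, 24)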

The MLP module is controlled by a Finite State Machine (FSM). In the IDLE state, it waits for the run signal from the previous layer; the run signal is set once all input M10Ks are ready to be read. Then there is a data loading state, where the read address generators are enabled and continuously fetch data from the M10Ks into registers. Only a portion of the data is fetched and calculated at a time, as shown in the figure below. For example, if the partition size is 5, then in each loading state 50 numbers are loaded from the weight matrix (5 from each of the 10 rows) and 5 numbers are loaded from the vector M10K. The partition size, row size, and column size are parameterized, which makes it easier to reuse our module.

After the data loading state, the fetched data are multiplied and accumulated (a partial dot product), and the partial result is stored in the result registers. The data loading state and calculation state are executed alternately until all of the data has been fetched and processed. After that, the multiplier enters a done state, where the 10 results are added to their biases and compared by a combinational module.
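A software reference for this partitioned matrix-vector multiply (our sketch; shapes and partition size follow the description above, with plain integers standing in for the fixed-point values):

import numpy as np

ROWS, COLS, PARTITION = 10, 75, 5

rng = np.random.default_rng(0)
weights = rng.integers(-4, 4, size=(ROWS, COLS))
bias = rng.integers(-4, 4, size=ROWS)
vector = rng.integers(0, 4, size=COLS)

# accumulate partial dot products PARTITION columns at a time,
# the same way the FSM alternates between loading and calculating
result = np.zeros(ROWS, dtype=np.int64)
for start in range(0, COLS, PARTITION):
    result += weights[:, start:start + PARTITION] @ vector[start:start + PARTITION]

scores = result + bias
print(int(np.argmax(scores)))                           # index of the winning class
assert np.array_equal(scores, weights @ vector + bias)  # matches the unpartitioned product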

Based on the comparison result, the recognized digit is sent to the Hex module and displayed on the hex display on the board.

Storage Units

Neural network inference is complicated, and it would be unwise to simply connect the different PEs and assume the data flow will be flawless. For this reason, we added a storage unit (an M10K in our case) after each PE to store its results. Every PE reads from the M10K memory of the layer before it and writes to the M10K memory after it. This way we can instantiate the M10K memory with initialized values, test each PE individually, and ensure all the parts work correctly when integrated. As for small constants such as the bias for each PE, we simply used a register and connected it directly to the PE.

One thing to notice about the M10K blocks in Altera FPGAs is that they register their input ports internally, so there is always at least one cycle of delay in the reading process. Altera M10K blocks have two read modes: synchronous read and asynchronous read (that is, with or without a registered output). With a synchronous read, the data is ready at the q port on the next positive clock edge after the input is captured by the internal input register, so there are 2 cycles of delay between the address appearing on the M10K's address port and the data being available. With an asynchronous read, the data is ready at the q port as soon as the input is captured, after a small combinational delay. Therefore, unless the clock is too fast and the combinational delay causes a setup timing violation, the data is valid 1 cycle after the address appears on the M10K's address port.

By default, M10K modules instantiated in Qsys are in synchronous read mode. This is because Altera needs a way to analyze the timing of the M10Ks to ensure they function with the proposed design. The same is true when instantiating M10K memory using the Altera IP core in Quartus' IP catalog. However, in the MegaWizard Plug-in Manager the read mode can be changed through the "Read output port(s)" selection under the "Regs/Clkens/Adrs" section if needed. The images below show the selection for the two modes:

The schematic in the top left corner of the wizard gives a nice picture of which inputs and outputs are registered.

Things become trickier when writing your own Verilog for an M10K module and relying on the compiler to infer it. The following is an example of Verilog code that infers an M10K module:

module M10K#(
  parameter ITE_NUM = 100,
  parameter DATA_WIDTH = 10,
  parameter ADDR_WIDTH = 10
)
(
  output reg [DATA_WIDTH - 1:0] q,
  input [DATA_WIDTH - 1:0] d,
  input [ADDR_WIDTH - 1:0] write_address, read_address,
  input we, clk
);
  // force M10K ram style
  // ITE_NUM words of DATA_WIDTH bits each
  reg [DATA_WIDTH-1:0] mem [ITE_NUM-1:0]  /* synthesis ramstyle = "no_rw_check, M10K" */;
  always @ (posedge clk) begin
    if (we) begin
      mem[write_address] <= d;
    end
    q <= mem[read_address]; // q doesn't get d in this clock cycle
  end
endmodule

Since the hardware must reflect the functionality described by the Verilog, Quartus will infer the M10K in whichever read mode matches the code. Because it is not standard to add an extra pipeline stage when describing a memory in HDL, and the code above registers q only once, Quartus will infer this M10K in asynchronous read mode so that its behavior follows the Verilog. To make the M10K be inferred in synchronous read mode, an extra pipeline register at the output needs to be added, as follows:

module M10K#(
  parameter ITE_NUM = 100,
  parameter DATA_WIDTH = 10,
  parameter ADDR_WIDTH = 10
)
(
  output reg [DATA_WIDTH - 1:0] q,
  input [DATA_WIDTH - 1:0] d,
  input [ADDR_WIDTH - 1:0] write_address, read_address,
  input we, clk
);
  // force M10K ram style
  // ITE_NUM words of DATA_WIDTH bits each
  reg [DATA_WIDTH-1:0] mem [ITE_NUM-1:0]  /* synthesis ramstyle = "no_rw_check, M10K" */;
  reg [DATA_WIDTH - 1:0] out_q;
  always @ (posedge clk) begin
    if (we) begin
      mem[write_address] <= d;
    end
    out_q <= mem[read_address]; // internal register: read data lands here first
    q <= out_q;                 // extra output register, making the read synchronous (2-cycle latency)
  end
endmodule

One drawback of our design is that Quartus can only generate memory units whose depth is a power of two. This means that if we want to separate all the results and parameters into individual memory units, we waste a lot of memory. For example, there are 750 weights for the fully connected layer, and since 750 is larger than 512, we had to instantiate an M10K memory of depth 1024 to hold all the weights.

Additionally, we chose our data type to be 27 bits wide to fully utilize the DSP blocks. However, Quartus can only generate memory units with a width of 20 or 32. This means we waste about 16% of our memory (5 of every 32 bits) no matter what.

Nonetheless, even with all this wasted memory, our design still occupies less than 3% of the memory on the DE1-SOC board. This suggests that a much more complicated neural network could be implemented on this board without ever needing to worry about running out of memory.

Design Results

Demo

Handwritten input and classification result

link

Execution Speed Comparison

The Signal Tap waveform shown above demonstrates the execution of the inference. From bottom to top, it shows the work durations of the first-layer convolution PE, the first-layer max-pool & ReLU PE, the second-layer convolution PE, the merger PE, and the second-layer max-pool & ReLU PE. The interval between mlp_done and the final edge of max_relu_result_in_mem is the MLP PE's work duration. The overall execution time is 403.06 us, and the first layer of convolution takes most of that time, with a latency of 311.08 us. This suggests that in the future we could add more parallel DSP blocks to this layer of convolution for even better performance. As for the MLP layer, thanks to the parallel DSP blocks we designed for it, its latency is only 91.98 us, about 4600 cycles, for a 10x75 by 75x1 fixed-point matrix-vector multiplication.

To measure the same workload in software, we wrote a Python program that classifies the same data. On a laptop with an i7-1165G7 running at 2.8 GHz, the program execution time is 36.49 ms, so the speedup is around 90x. Following the convention of the previous labs, the baseline for the speedup should really be the ARM processor on the SoC, so the speedup would be even greater if we built a baseline model in C. The speedup could also be improved by increasing the clock speed of our design: we are currently using the 50 MHz on-board system clock, and by using a PLL we could make the entire design run faster. We think we could speed up our design by at least 2 to 4 times this way. With proper pipelining, the clock speed limit would most likely be set by the M10K.

Usability

HPS System

To enhance usability and provide a better experience for users interacting with the system, we split the image loading and result display control into two parts.

  1. User input part

    • A drawing pad using a Windows painting app with a 28x28-pixel canvas. The canvas has a black background and the user draws in white.
    • A Python script that converts the painted image to .PNG format and loads it into PyTorch for image formatting. The script directly calls the SCP command in Python and sends the .dat file to Linux.
  2. FPGA Linux part

    • In this part, the user runs the C program to initialize the M10K blocks with all the image data. This is all done by running the C code with no extra commands, which improves the user experience.

Control interface

After the data has been sent into the initial M10K blocks, the user can use two switches to control when to start the inference process. Before the information is passed into the M10K via PIO, the user needs to set the first switch (SW[0]) high to reset the system. After the initialization process is complete, the user turns on the second switch (SW[1]) to generate a "run" signal for the whole system. The result is then displayed on the hex display.

Conclusions

In this project, we systematically developed a simple convolutional neural network in Verilog. The major challenges came from designing the modules with a top-down, hardware-based approach and from coordinating the different parts of the design flow. Most of the issues came from debugging on the FPGA, since the design is large and the timing is strict for our inter-layer combinational modules. This is also part of the reason we shifted to streaming the data instead of sending the whole sliding window to the multiplier module in parallel. Our design, despite using multiple M10K blocks and logic modules for the arithmetic, still has room to hold larger networks or input data thanks to the nature of the convolution layers. In addition, there is also room for additional pipelining and more parallel designs if we had more time by the end of the semester.

Reference

[1] https://insightsimaging.springeropen.com/articles/10.1007/s13244-018-0639-9

[2] https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9

Appendix A: Permissions

The group approves this report for inclusion on the course website.

The group approves the video for inclusion on the course YouTube channel.

Appendix B:

Task distribution:

Our open-source Github Repo for this project