Quartus 18.1

DE1-SoC: Examples verified
for Quartus version 18.1
Cornell ece5760

The following examples have been ported and verified in Quartus Prime version 18.1. So far, the only error messages were in Qsys Video modules. Two unused signals in the VGA_PLL module (which were dimmed) had invalid device names. The video_in_clock and LCD_clock have to be set to different device names, then unchecked again to disable them. This is clearly a GUI error and is easily fixed.

The projects that have been tested and ported to 18.1 are from material used in the assigned labs
and projects that exercise features of the FPGA and Qsys bus.

Mandelbrot Set
Computes a full Mandelbrot set on the HPS as fast as possible.
Used as a lab exercise example.
GPU with FAST display from SRAM.
Builds a simple graphics co-processor for the HPS. Uses Qsys VGA subsystem, and
dual-ported SRAM for the VGA frame buffer. Data transfer from the HPS to FPGA is
via a separate dual-port SRAM.
Audio output bus_master.
Builds a Qsys bus_master for sound generation. The example implements DDS
with a frequency determined by switch settings. Relevant for a couple of labs.
VGA video at 640x480 displayed from SDRAM, in 16-bit color.
Uses SDRAM external to the FPGA to build a frame buffer.
This code is a good model for a couple of labs.
Bidirectional DMA to/from HPS.
Uses FPGA bus master DMA controllers to move data as quickly as possible to/from the HPS.
The actual DMA operation is quite fast. Getting the data on the HPS is slowed by conversion
from physical to virtual memory.
Bidirectional Qsys PIO to/from HPS
This uses two pairs of PIO ports to send/receive data between HPS and FPGA. The program asks for a number and sends it to the AXI master and LW master. The FPGA adds one to each and sends them back. On the FPGA side, switches 1 and 2 need to be in the up position for the value to be echoed.
Direct I/O port from HPS to FPGA (not Qsys)
There is one 32-bit input and one 32-bit output port from the HPS which can be directly connected to the FPGA fabric, with no Qsys address.
HOLA Homebrew Logic Analyser.
This was built as a simple alternative to SignalTap. It can monitor only 32 bits and uses just one
32-bit trigger word. But it is small and easy to hook to a design. It was included here because it
used lots of FPGA features.
Video Input with VGA output
This example merges NTSC video input with VGA content generated by the HPS.
Video input is stored in SRAM, then copied by a bus-master into VGA display SDRAM.
Implementing on-chip memory
There are several ways to implement on-chip memory. Each has different characteristics and uses. The list includes (at least) Qsys RAM, HDL-inferred M10k blocks, IP_manager M10k blocks, HDL_inferred MLAB blocks, HDL_inferred direct registers.

Mandelbrot Set
This example is a baseline implementation of a Mandelbrot solver which displays using the DE1-SoC HPS computer system. It computes a 640x480 approximation with a maximum of 1000 iterations in about 933 milliseconds, using level -Os compiler optimization. The code computes about 80 million complex iterations/sec. To get this speed, we first converted the code to 4:28 fixed-point arithmetic, which lowers the time to 2.02 seconds, or about 39 million iterations/sec. Detecting circular regions of the slowest areas (in blue) and just setting the count to maximum in those regions then lowers the drawing time to 0.93 seconds. The colors are approximately logarithmic in the number of iterations at each point. Image. The total number of iterations for all points on the screen and the total execution time are displayed. Also included is a routine to erase all text on the screen. Use the sof file from the zipped Quartus project without recompiling the project.
C code. ZIP project.
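
The 4:28 fixed-point arithmetic mentioned above can be sketched as follows. This is a minimal illustration and not the linked C code: the type name fix28, the helper names, and the escape test |z|^2 > 4 are assumptions, and the sketch assumes the real and imaginary parts of c stay within about +/-2 so intermediate values fit in the 4 integer bits.

```c
#include <stdint.h>

/* 4:28 fixed point: 4 integer bits (including sign), 28 fraction bits.
   Names here are hypothetical, chosen for illustration only. */
typedef int32_t fix28;

#define FIX28_ONE        (1 << 28)
#define float2fix28(a)   ((fix28)((a) * 268435456.0))   /* a * 2^28 */

/* Multiply in 64 bits, then drop the extra 28 fraction bits.
   Right-shifting a negative value is arithmetic on gcc/ARM. */
static inline fix28 multfix28(fix28 a, fix28 b)
{
    return (fix28)(((int64_t)a * (int64_t)b) >> 28);
}

/* Count iterations of z = z*z + c until |z|^2 > 4 or max_iter is reached. */
int mandel_iter(fix28 cr, fix28 ci, int max_iter)
{
    fix28 zr = 0, zi = 0;
    int n = 0;
    while (n < max_iter) {
        /* Squares are kept in 64 bits: they can exceed the 4:28 range
           on the last step before the escape test fires. */
        int64_t zr2 = ((int64_t)zr * zr) >> 28;
        int64_t zi2 = ((int64_t)zi * zi) >> 28;
        if (zr2 + zi2 > 4 * (int64_t)FIX28_ONE)
            break;
        zi = 2 * multfix28(zr, zi) + ci;   /* imaginary part: 2*zr*zi + ci */
        zr = (fix28)(zr2 - zi2) + cr;      /* real part: zr^2 - zi^2 + cr  */
        n++;
    }
    return n;
}
```

A point inside the set, such as c = 0, runs to max_iter, which is why skipping known interior regions saves so much time.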

GPU with FAST display from SRAM.
The write rate of the above SDRAM-buffered VGA is low, so I rewrote the system (Qsys layout) to use dual-port SRAM for the VGA buffer. One port (s2) is connected through Qsys to the VGA controller and HPS, as usual. The other port (s1) is exported to the FPGA fabric, and connected directly to the GPU state machine in Verilog. The clock bridge shown syncs the SRAM slave port to the GPU state machine. The logic to control the GPU state machine from the HPS is unchanged from above. Direct connection of display memory to the GPU state machine results in a write rate of 48 pixels/microsecond. To get the high rate, the GPU state machine was rewritten to pipeline writes to the VGA display SRAM. To minimize on-chip memory use, the display mode was set to 8-bit color and changed from x/y addressing to sequential addressing (video core section 2.1, code snippet), saving 30% of SRAM. To make the mode change, the VGA_pixel_DMA module dialog box in the VGA subsystem needs to be modified. The HPS code is also changed to reflect the modified display mode. The left and right sides of the screen are written from hardware and from the HPS respectively, and should match. The times at the top of the screen are the writing times of the last polygon for hardware and software respectively. (HPS code, top-level, ZIP). A slightly improved version of the HPS code does parameter validation before setting up the GPU draw operation. The SRAM display memory is bigger than the original, so use this address include file.

Audio output bus_master
This bus_master state machine reads the FIFO status of the University Program audio interface, and if there is sufficient space in the FIFO, computes a new DDS sinewave sample and inserts it into the left and right audio channel FIFOs. The Qsys layout shows the relatively simple connections. The audio bus_master avalon_master is connected to the audio subsystem avalon_slave input. The design leaves the HPS interface in place, but contention between the two bus-masters for audio channels means that you can use one or the other (but see below for sharing the audio left/right channels). The state machine sets up the FIFO status read, then waits for the ACK. If there is space in the FIFO, a new DDS sample is computed and written to the left channel, and the state machine again waits for the ACK. The right channel is then written. Both channels must be written for the audio interface to work. Waiting for space in the FIFO effectively phase-locks the state machine to the audio-rate clock for sound synthesis (top_level_module, project ZIP).
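
The DDS calculation itself is done in Verilog in this example, but the arithmetic is easy to model in C. The sketch below is an illustration only: the 32-bit accumulator width, the 256-entry table indexed by the top 8 bits, the 48 kHz sample rate, and the function names are all assumptions, not values taken from the project.

```c
#include <stdint.h>

/* Assumed audio sample rate; check the audio configuration of your project. */
#define AUDIO_RATE_HZ 48000u

/* Phase increment for a 32-bit accumulator: inc = f_out * 2^32 / Fs. */
uint32_t dds_increment(double f_out_hz)
{
    return (uint32_t)(f_out_hz * 4294967296.0 / AUDIO_RATE_HZ);
}

/* Advance the accumulator by one audio sample and return the index
   into a (hypothetical) 256-entry sine table: the top 8 phase bits. */
uint8_t dds_step(uint32_t *phase_accum, uint32_t increment)
{
    *phase_accum += increment;          /* unsigned add wraps modulo 2^32 */
    return (uint8_t)(*phase_accum >> 24);
}
```

One call to dds_step per FIFO write corresponds to the state machine computing one new sample each time the FIFO has space.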
-- If the audio bus-master hardware only checks the status of the left channel FIFO and only loads the left channel FIFO, and the HPS only checks the status of the right channel FIFO and only loads the right channel FIFO, then both can write to the audio at the same time. Since nothing is played by the audio interface unless there is data for each channel, the shorter duration channel determines play time. In this example, a WAV file is read by Matlab and samples are sent by UDP to the HPS, which runs one thread to watch the UDP connection and another thread to load the right channel. The hardware audio bus-master loads the left channel FIFO, then stalls until the HPS thread starts filling the right channel FIFO. (matlab program, HPS program, top-level module). The result is that the hardware plays a tone on the left channel during the time that the HPS program loads the right channel. Note that the hardware bus-master checks the top eight bits of the FIFO status word, while the HPS program checks the next eight bits (see section 4.1 of the Audio Core manual). The LEDR display is connected to the left channel FIFO status. When both sources are filling the FIFOs, you can see the contention by the variability of the FIFO depth, but actual audio play is not affected.
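
The split of the FIFO status word between the two writers can be sketched on the HPS side as below. The field positions follow the text (top eight bits for the left channel, next eight for the right). The register word offsets (status at word 1, left data at word 2, right data at word 3) and all function names are assumptions to be checked against section 4.1 of the Audio Core manual and the linked HPS program.

```c
#include <stdint.h>

/* Assumed word offsets from the audio core base address. */
#define AUDIO_FIFOSPACE 1
#define AUDIO_LEFTDATA  2
#define AUDIO_RIGHTDATA 3

/* Write space remaining in each output FIFO, from the status word. */
static inline unsigned left_write_space(uint32_t fifospace)
{
    return (fifospace >> 24) & 0xff;    /* top eight bits */
}

static inline unsigned right_write_space(uint32_t fifospace)
{
    return (fifospace >> 16) & 0xff;    /* next eight bits */
}

/* HPS side: load one sample into the right channel only, and only if
   at least min_space entries are free. Returns 1 if the sample was written. */
int try_write_right(volatile uint32_t *audio_base, int32_t sample,
                    unsigned min_space)
{
    if (right_write_space(audio_base[AUDIO_FIFOSPACE]) < min_space)
        return 0;
    audio_base[AUDIO_RIGHTDATA] = (uint32_t)sample;
    return 1;
}
```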

VGA video at 640x480 displayed from SDRAM, in 16-bit color.
A stripped down display system uses SDRAM as a frame buffer.
The top level Verilog only connects the Qsys exported signals to the i/o pins and has no other logic.
The Qsys layout is modified to support 16 bit color. The Qsys modifications:

Inside the VGA subsystem
- The vga_pixel_dma module:
  - has the address modified to 0x00000000, the base of SDRAM.
  - address mode changed to consecutive.
  - color space 16-bit.
- The dual-clock fifo module has color bits changed to 16-bits.
- The RGB resampler is changed to 16-bit input.
Output from the VGA DMA controller in the top-level Qsys is disconnected from on-chip-sram
and connected only to SDRAM.
The AXI-bus, HPS master remains connected to SDRAM so that the HPS can read/write VGA screen.
The AXI-bus master base address is C000_0000. This address is used in the HPS C-program to produce
high-speed i/o to the FPGA.
The 64 Mbyte of SDRAM is at AXI-bus master address C000_0000 to C3ff_ffff.
The light-weight AXI-bus base address is FF20_0000. This address is used in the HPS C-program to produce
low-speed control i/o to the FPGA.
The light-weight AXI-bus base address of the AVConfig module is FF20_3000 to FF20_300f.
Note that the exported signals in Qsys become i/o ports in the Qsys-generated computer-system module.
For example, the exported VGA conduit becomes i/o ports in the computer-system module instantiation to control the monitor.
The Qsys-generated computer-system module can be found in a sub-directory of your project directory named something like Computer_System. But usually you are going to just add or delete a few lines from the existing module instantiation i/o interface.
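
The bus addresses listed above are physical addresses, so a Linux program on the HPS has to map them into its own address space before use. The following is a minimal sketch of one common way to do that, assuming access through /dev/mem with root privileges. The macro and function names are hypothetical, and the span of the light-weight bus region is left to the caller because it is not stated here.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Addresses taken from the Qsys address map described in the text. */
#define SDRAM_BASE       0xC0000000u   /* AXI bus master base          */
#define SDRAM_SPAN       0x04000000u   /* 64 Mbyte                     */
#define LW_BASE          0xFF200000u   /* light-weight AXI bus base    */
#define AVCONFIG_OFFSET  0x00003000u   /* AVConfig module on LW bus    */

/* Map a physical address range into this process.
   Returns NULL on failure. Assumes /dev/mem is available (run as root). */
void *map_physical(uint32_t base, size_t span)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *v = mmap(NULL, span, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, (off_t)base);
    close(fd);   /* the mapping remains valid after the descriptor is closed */
    return (v == MAP_FAILED) ? NULL : v;
}

/* Typical use on the HPS:
 *   void *vga_pixel_ptr = map_physical(SDRAM_BASE, SDRAM_SPAN);
 *   if (vga_pixel_ptr == NULL) { handle the error }
 */
```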

The HPS pixel writing macro is modified to allow 16-bit writes to the bus, and uses the consecutive format:
// pixel macro -- the shift-left in the pixel address is specified in the Video Core Manual,
// probably because the DMA addressing is all in bytes
#define VGA_PIXEL(x,y,color) do{\
short *pixel_ptr ;\
pixel_ptr = (short *)((char *)vga_pixel_ptr + (((y)*640+(x))<<1)) ; \
*pixel_ptr = (color);\
} while(0)

Defined graphics routines are

void VGA_text (int, int, char *); // (x_position 0-79, line_position 0-59, pointer_to_string)
void VGA_text_clear(); // clears whole text buffer, but not graphics
void VGA_box (int, int, int, int, short); // (corner1_x, corner1_y, corner2_x, corner2_y, color)
void VGA_line(int, int, int, int, short) ; // (point1_x, point1_y, point2_x, point2_y, color)
void VGA_disc (int, int, int, short); // (center_x, center_y, radius, color)

Color coding is 16-bit RGB. This format uses 5 bits for red, 6 bits for green, and 5 bits for blue.
If R and B are 5-bit integers and G is a 6-bit integer then color = B+(G<<5)+(R<<11);
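
As a small sketch of the packing formula above, the helper below builds a 16-bit color word. The function name rgb565 is hypothetical and the masking of out-of-range inputs is an added safety measure, not something the original routines are known to do.

```c
#include <stdint.h>

/* Pack 5-bit red, 6-bit green, 5-bit blue into one 16-bit color:
   color = B + (G<<5) + (R<<11). Inputs are masked to their field width. */
static inline uint16_t rgb565(unsigned r, unsigned g, unsigned b)
{
    return (uint16_t)((b & 0x1f) | ((g & 0x3f) << 5) | ((r & 0x1f) << 11));
}
```

Since the graphics routines take the color as a short, the result can be passed with a cast, for example VGA_disc(320, 240, 50, (short)rgb565(31, 0, 0)) for a pure red disc.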

A color-picker program allows you to specify R, G, B values, displays the color in the lower right, and shows 2D axis-aligned slices through the 3D RGB space which include the specified (R,G,B) point. The top slice is the red-green plane, the middle is the blue-green plane, and the bottom is the blue-red plane. Three examples are shown below through points black (0,0,0), medium gray (15,31,15), and white (31,63,31). The HPS performance program linked below prompts for color mask values to set ranges for RGB, then draws 1000 discs with random colors constrained by the RGB masks.

(HPS color picker, HPS performance measure, ZIP)

The graphics primitives were converted to 16-bit color (HPS program).
This program assumes the 16-bit hardware used above.

Bidirectional DMA to/from HPS
The Qsys was modified to include two DMA controllers connected so that data can be copied from HPS-to-FPGA and/or FPGA-to-HPS. The only connection differences are reversing the read and write bus-masters for the second DMA. (ZIP)

The HPS code was expanded to define the two DMA transfer controls, and to print the data transfer rates in each direction. The transfer rates are symmetric and both around 270 MBytes/sec. The rate-limiting step is loading and reading the on-chip RAM on the HPS. For 10000 32-bit transfers, the FPGA DMA read/writes each took 150 microseconds, but loading/reading the on-chip memory took 730 and 550 microseconds respectively.
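
The quoted DMA rate follows directly from the measured times: 10000 words of 4 bytes moved in 150 microseconds is about 267 MBytes/sec. A small helper for this conversion, with a hypothetical name, might look like this:

```c
/* Transfer rate in MBytes/sec for n_words 32-bit words moved in elapsed_us
   microseconds. Bytes per microsecond equals 10^6 bytes per second. */
double rate_mbytes_per_sec(unsigned long n_words, double elapsed_us)
{
    return (double)(n_words * 4ul) / elapsed_us;
}
```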

The next step in optimizing the HPS code is to replace the load/read loops with memcpy. Interestingly, memcpy is faster with no optimization turned on. The first image is from the program using memcpy compiled with the -O3 option, the second with -O0. The direct read/write to SDRAM slows down. Also notice in the second screen dump that the direct FPGA SRAM write takes 1330 microseconds. The DMA read/write takes 300 microseconds, but the overhead of loading the buffer makes the total about 1000 microseconds, not really much faster (for an array of size 10000). The DMA transfer rate is about 270 MBytes/sec, but the net transfer rate (including data copy to the buffer in on-chip RAM) is about 77 MBytes/sec.

PIOs on AXI bus and on light-weight AXI bus
Parallel ports (PIO ports) instantiated in Qsys are defined as output if they communicate data from the HPS to the FPGA, and as input if they communicate data from the FPGA to the HPS. PIO ports can be instantiated on the light-weight AXI bus, or on the full AXI bus. This example instantiates four PIO ports in Qsys, one input/output pair on the light-weight bus and one i/o pair on the full AXI bus. As usual, there will be Qsys, Verilog, and C involved in the setup (C code, project ZIP):

The Qsys layout shows that the address assigned to each output module is zero for the bus it is on, while each input address is offset by 0x10.
-- Note that each University Program PIO module needs to be configured by double-clicking the module name, then checking the create custom parallel port box, and selecting a data width and data direction (ignore the board type menu).
A Verilog snippet shows that the HPS output to HPS input loopback is combinatorial, and uses a few of the switches for debugging the PIO interfaces on the HEX display. It also shows the Verilog interface generated by Qsys in the computer system module.
The C program running on the HPS
1. defines some addresses,
2. then mmaps the real addresses to virtual addresses,
3. then falls into the usual loop waiting for user input of a number to send to both output ports,
  receiving the two values back and printing them.
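
The send/receive step of that loop can be sketched as below, using the offsets from the Qsys layout described above (output at offset zero, input at offset 0x10 on the same bus). The function name and the idea of passing an already mmap'ed bus base pointer are assumptions for illustration and may differ from the linked C code.

```c
#include <stdint.h>

/* Byte offsets within each bus, from the Qsys layout described in the text. */
#define PIO_OUT_OFFSET 0x00u   /* HPS-to-FPGA port */
#define PIO_IN_OFFSET  0x10u   /* FPGA-to-HPS port */

/* Write value to the output PIO, then read the input PIO on the same bus.
   bus_base is the virtual address obtained by mmap for that bus. */
uint32_t pio_echo(volatile uint32_t *bus_base, uint32_t value)
{
    bus_base[PIO_OUT_OFFSET / 4] = value;     /* word index 0 */
    return bus_base[PIO_IN_OFFSET / 4];       /* word index 4 */
}
```

Calling pio_echo once with the light-weight bus pointer and once with the AXI bus pointer gives the two values that the program prints.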

Direct I/O port from HPS to FPGA (not Qsys)
There is one 32-bit input and one 32-bit output port from the HPS which can be directly connected to the FPGA fabric, with no Qsys address. To expose the i/o for connection, double-click on the ARM9_HPS component in Qsys, then in the dialog box click Enable general purpose signals. In the main Qsys window you should now see h2f_gp as a connection. Export it to form connections to FPGA. When you generate the new system, there will be signals:
// Direct gpio to FPGA
.arm_a9_hps_h2f_gp_gp_in (your_input_to_HPS),
.arm_a9_hps_h2f_gp_gp_out (your_output_from_HPS),
The example code displays:
AXI PIO input on Hex0
AXI PIO output on Hex1
The new gpio input on Hex2
The new gpio output on Hex3
LW_AXI input on Hex4
LW_AXI output on Hex5
For all three sets of i/o: input_value_to_HPS = output_value_from_HPS + 1
The serial interface allows you to enter a value to send to all three ports, and measures send/receive time for each of the three systems. The AXI bus is the fastest at about 6.5 million sends and receives per second. The LW_AXI is about half that speed, and the direct gpio port is about 5 million sends and receives per second. Since the gpio signals are not on Qsys, the port has its own base address: 0xFF706000.
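
A send/receive rate measurement of this kind can be sketched as follows. This is an illustration, not the linked program: the function name is hypothetical, the timing uses the standard C clock() call, and the output and input register offsets (0x10 and 0x14 from the 0xFF706000 base) are assumptions to be verified against the linked C code and the HPS register map.

```c
#include <stdint.h>
#include <time.h>

#define GP_BASE     0xFF706000u   /* base address given in the text      */
#define GPO_OFFSET  0x10u         /* assumed: output register, HPS->FPGA */
#define GPI_OFFSET  0x14u         /* assumed: input register, FPGA->HPS  */

/* Do n write-then-read round trips and return round trips per second.
   The last value read is stored in *last_read if it is not NULL.
   Returns 0.0 if the elapsed time is below the clock resolution. */
double roundtrips_per_sec(volatile uint32_t *out_reg,
                          volatile uint32_t *in_reg,
                          uint32_t n, uint32_t *last_read)
{
    uint32_t v = 0;
    clock_t t0 = clock();
    for (uint32_t i = 0; i < n; i++) {
        *out_reg = i;      /* send */
        v = *in_reg;       /* receive */
    }
    clock_t t1 = clock();
    if (last_read)
        *last_read = v;
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return (secs > 0.0) ? (double)n / secs : 0.0;
}
```

With the hardware described above, the last value read should be one more than the last value written, which gives a quick sanity check alongside the timing.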

C Code, Verilog, project ZIP

HOLA Homebrew Logic Analyser
Sometimes SignalTap feels like overkill. I built a simple logic analyser that connects to one 32-bit data word and one 32-bit trigger word. If you start with the pre-built project and add your own device-under-test (DUT), then the total programming overhead is to connect two 32-bit signals to your design. An example DUT is a three-phase DDS sinewave generator. The trigger mask is set to 0xff and the trigger word to zero by the HPS. The green line is the trigger point. The 8-bit phase is plotted at the top. The bits of the phase are plotted in red. The three sine phases are plotted in green. The C functions for data handling are below.

C code, project ZIP, address map

Much more info on HOLA is available.

To make the system easier to use for debugging, the low level functions on the HPS were abstracted to five C functions:

start_HOLA(trigger_mask, trigger_match_value) arms the capture system on the FPGA and waits for a trigger event to occur which matches the masked trigger value.
Trigger when (ext_trigger_source & trigger_mask) == (trigger_match_value & trigger_mask)
Data is returned in a 32-bit array called logic_data. The trigger position is array index 499.
Nothing is done with the data. It is up to the calling program to use the data, perhaps using one of the routines below.
end_HOLA signals the FPGA that the HPS is done using the current data, and re-enables constant data logging.
Call this after you have stored, plotted, or analysed the data returned.
print_binary_HOLA(begin, end, low_bit, bit_mask, *title) prints a vector of the current data in hexadecimal and binary. Inputs:
- Number of samples before the trigger, expressed as a negative integer
- Number of samples after the trigger, expressed as positive integer
- Base position (right-most bit) of the desired field in the 32-bit word from the FPGA
- Width of the desired field, expressed as a bit-mask, e.g. 8-bits is 0xff
  The vector is trimmed to the length of the bit_mask, rounded up to the next 4 bits.
- Title of the data vector column, e.g. char title_s[]="sine"
print_HOLA(begin, end, low_bit, bit_mask, un/sign, *format, *title) prints a vector on the console. Inputs:
- Number of samples before the trigger, expressed as a negative integer
- Number of samples after the trigger, expressed as positive integer
- Base position (right-most bit) of the desired field in the 32-bit word from the FPGA
- Width of the desired field, expressed as a bit-mask, e.g. 8-bits is 0xff
- Signed/unsigned 'u' implies unsigned, 's' implies signed
- A printf format string, e.g. char fmt_s[]="%03d %d " , where the %03d formats the clock tick number and %d is the desired format for the data vector
- Title of the data vector column, e.g. char title_s[]="sine"
draw_wave_HOLA(low_bit, bit_mask, sign, v_pos, v_scale, h_scale, color) draws a time-series vector on the VGA display.
- Base position (right-most bit) of the desired field in the 32-bit word from the FPGA
- Width of the desired field, expressed as a bit-mask, e.g. 8-bits is 0xff
- Signed/unsigned 'u' implies unsigned, 's' implies signed
- Vertical position on the screen
- Vertical scale represented as powers of two: A value of 2 means right-shift 2 bits; A value of -1 means left-shift 1 bit
- Pixels/sample clock (horizontal scale) can be 0,1,2,3,4: A value of 3 means 8 pixels/clock cycle
- Color of the waveform. A few 8-bit colors are defined near the top of the program.
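
The parameter conventions used by these functions can be sketched in a few lines of C. This is not the HOLA source: the helper names are hypothetical, the bit_mask is assumed to be a contiguous run of low bits such as 0xff, and the sign handling is one plausible reading of the 'u'/'s' option.

```c
#include <stdint.h>

#define HOLA_TRIGGER_INDEX 499   /* trigger position in logic_data */

/* Trigger condition as stated in the text. */
int hola_trigger_hit(uint32_t ext_trigger_source, uint32_t trigger_mask,
                     uint32_t trigger_match_value)
{
    return (ext_trigger_source & trigger_mask) ==
           (trigger_match_value & trigger_mask);
}

/* Array index for a sample offset relative to the trigger
   (negative = before the trigger, positive = after). */
int hola_sample_index(int offset)
{
    return HOLA_TRIGGER_INDEX + offset;
}

/* Extract the field selected by low_bit and bit_mask. With sign 's',
   the top bit of the field is treated as a two's complement sign bit. */
int32_t hola_field(uint32_t word, int low_bit, uint32_t bit_mask, char sign)
{
    uint32_t raw = (word >> low_bit) & bit_mask;
    if (sign == 's') {
        uint32_t sign_bit = (bit_mask >> 1) + 1;   /* 0x80 for mask 0xff */
        if (raw & sign_bit)
            return (int32_t)raw - (int32_t)bit_mask - 1;
    }
    return (int32_t)raw;
}

/* Vertical scale: positive means right-shift, negative means left-shift.
   Right-shifting a negative value is arithmetic on gcc/ARM. */
int32_t hola_vscale(int32_t value, int v_scale)
{
    return (v_scale >= 0) ? (value >> v_scale) : value * (1 << -v_scale);
}

/* Horizontal scale: h_scale of 3 means 8 pixels per sample clock. */
int hola_pixels_per_sample(int h_scale)
{
    return 1 << h_scale;
}
```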

Video Input with VGA output
This example merges NTSC video input with VGA content generated by the HPS. Video input is stored in SRAM, then copied by a bus-master into VGA display SDRAM. The HPS initializes one feature of the video input, then just draws discs to the screen. The Qsys layout (part 1, part 2) adds a bus_master which can read from the on-chip SRAM used to store the video input and write to the VGA buffer SDRAM which is also connected to the VGA controller. The pixel copy state machine runs at 50 MHz, so a clock bridge was added to drive the EBAB module. Video input is enabled by SW[1] up, copy from the input buffer to the VGA display is enabled by SW[0] up, and you may need to turn both switches off and press KEY[0] to reset. An HPS program must be run to set up the video input and demonstrate simultaneous access to VGA from the HPS and custom bus_master. The VGA buffer bus traffic, plus video-input to VGA bus traffic, plus HPS to VGA bus traffic can exceed the bus bandwidth, so the writing rates of the HPS program and of the video-input to VGA copy are throttled. This version of the code has the screen position of the input image hard-coded, but the buffer can be resized and moved to other screen locations.

C-code, address header, Quartus ZIP file.

Implementing on-chip memory
On-chip memory can be built in several ways.

Qsys RAM. This memory uses M10k blocks and is specified in the Qsys interface.
HDL-inferred M10k blocks. The Intel Recommended HDL Coding Styles section on Inferring Memory Functions from HDL Code gives several examples of how to infer memory. This is easy to use, but I cannot find a way to control some memory parameters, such as the data output register, which is turned on by default and delays output by one cycle. When inferring memory, you can force Quartus to give you M10k blocks using a synthesis directive. The directive (as part of a memory definition module):
reg [31:0] mem [255:0] /* synthesis ramstyle = "no_rw_check, M10K" */;
forces M10k blocks to be used to build 256 words of RAM.
IP_manager inferred M10k blocks. The IP_manager gives complete control over the memory block using a GUI with lots of options. Using this interface allows you to turn off the output register, but the input registers are always on.
HDL_inferred MLAB blocks.
HDL_inferred direct registers. This form of memory uses ALM resources quickly, and should be used only occasionally, when you need really small, fast register files.

The different kinds of memory are confusing, so we made an example that uses Qsys RAM, HDL-inferred M10k RAM, IP-manager RAM, and HDL-inferred MLAB blocks. The program reads two floats from the HPS, then sends them to Qsys RAM. There they are read, then added or multiplied, and the result is read into M10k HDL-inferred memory and IP-manager M10k memory, then read back to Qsys RAM, which is read by the HPS. The MLAB blocks just read/write memory to make a little counting state machine. In addition, for sanity's sake, the arithmetic results are also reported via 3 PIO ports.

There are some switch-based controls:

SW0 selects output of FPmult operation to read back
SW1 selects output of FPadd
SW6 UP means send back M10k inferred output, DOWN means use IP-inferred
SW7 UP means two cycle Qsys RAM read, DOWN means one cycle
SW8 UP means two cycle M10k (or IP-inferred) RAM read, DOWN means one cycle
SW9 UP means two cycle MLAB RAM read, DOWN means one cycle

Observations:

Dropping to a one-cycle read for dual-port Qsys memory always fails. Two cycles is required.
Dropping to a one-cycle read for HDL-inferred or IP-inferred M10k blocks mostly works, but fails after some re-compiles. My guess is that since the output is unregistered and combinational, the data may not be ready under some path conditions.
If you need speed, then test the one-cycle option carefully, otherwise use two cycles.
Dropping to one-cycle for MLAB blocks does not seem to work, but since the symptom is that the counting state machine runs half as fast with a one-cycle read, I am only guessing that two reads were necessary for the one-cycle case.

C-code, Verilog, ZIP

References:

DE1-SOC literature list

Using the DE1-SOC FPGA by Ahmed Kamel

Stereoscopic Depth on an FPGA via OpenCL by Ahmed Kamel and Aashish Agarwal

Running Linux on DE1-SOC by Manish Patel and Syed Tahmid Mahbub

OpenCL on DE1-SOC by Sahil P Potnis (spp66@cornell.edu), Aashish Agarwal (aa2264@cornell.edu), and Ahmed Kamel (ayk33@cornell.edu)

Audio Core (Qsys University Program 15.1) local copy

Video Core (Qsys University Program 15.1) local copy

Analog input Core (Qsys University Program 15.1) local copy

External to Avalon Bus Master (external here means in the FPGA, but not in the Qsys bus structure)

Avalon to External Bus Slave (external here means in the FPGA, but not in the Qsys bus structure)
