DE1-SoC: University Computer
IPC, and UDP
Cornell ece5760

 

University Program DE1-SoC_Computer_15_1

This computer system includes support for ARM, Nios, video, audio, and many other items. I converted some code from bare-metal to Linux to run on the UP-Linux distribution. The first test is to get the VGA display running and measure the writing speed. I did a minor reorganization of the address map file and converted one C example to just run the VGA and update 10,000 pixels as fast as possible. The update takes 1.8 ms, so the effective pixel writing rate is about 5.5 million pixels/sec. The example also defines a line-drawing routine, but it does NOT check pixel bounds. If you write outside the screen bounds, the program segfaults. The test image shows one update frame (at 320x240 resolution).
The code was modified to write random rectangles. The write rate is too fast to see, but the colors are nice (at 320x240 resolution). Colors are 16-bit: top 5 bits red, middle 6 green, lower 5 blue.

--Converting DE1-SoC_Computer_15_1 to 640x480, 8-bit color

The directions written by Shiva Rajagopal for the Qsys 640x480 conversion worked for this system (system ZIP). The span of the addresses in the virtual-to-real memory map had to be doubled and, of course, the addressing and colors of pixels had to be modified in the main program. The size of the character buffer was not changed. The color encoding is now 8-bit, with the top 3 bits red, the next 3 green, and the lower 2 bits blue.
VGA_line(0, 0, 320, 240, 0xe0) ; // red 3-bits
VGA_line(639, 0, 320, 240, 0x1c) ; // green 3-bits
VGA_line(639, 479, 320, 240, 0x03) ; // blue 2-bits
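The three color constants above follow directly from the 3-3-2 bit layout. As a minimal sketch, a packing helper could look like the following. The name pack_rgb332 is hypothetical and is not part of the linked code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the course code): pack a 3-bit red,
   3-bit green and 2-bit blue value into one 8-bit pixel.
   Layout: red in bits 7:5, green in bits 4:2, blue in bits 1:0. */
static uint8_t pack_rgb332(unsigned r, unsigned g, unsigned b)
{
    assert(r < 8 && g < 8 && b < 4);   /* reject out-of-range components */
    return (uint8_t)((r << 5) | (g << 2) | b);
}
```

Full-intensity red, green and blue then come out as 0xe0, 0x1c and 0x03, matching the VGA_line calls.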
The design was very slow to generate (Qsys) and compile (Quartus). It took around an hour (on my 5-year-old machine). The next step is to speed it up. Chopping out the Nios CPUs and some of the support, but leaving the video in/out and audio, reduces the generate time to 5 minutes and the compile time to about 22 minutes (archive). Stripping out the rest of the LED and switch i/o and removing the video-input function reduces the compile time to 18 minutes.

A better chopped down system keeps the LEDs, switches, 640x480 video out, and audio. The design is partitioned (Assignments>Design Partitions Window) so that the DE1-SoC computer is in its own partition. Two other partitions are top and the hex display modules. On my new computer (4 core, 32 GB memory, SSD, July 2016), this takes 12 minutes for a full compile. A small change to the hex display partition takes about 8.5 minutes to recompile. A small C program tests the hex display partition. (C code, address header, project ZIP).

-- Mandelbrot set on VGA/HPS , 8-bit color
This example is a baseline implementation of a Mandelbrot solver which displays using the DE1-SoC computer system explained above. It computes a 640x480 approximation with a maximum of 1000 iterations in about 3.4 seconds, using level -O2 compiler optimization. The code computes about 23 million complex iterations/sec (40 cycles/iteration). The color of each point is approximately logarithmic in the number of iterations at that point. Image. The total number of iterations for all points on the screen and the total execution time are displayed. Also included is a routine to erase all text on the screen. Use the sof file from the "better chopped down system" above. Converting the code to 4:28 fixed point arithmetic lowers the time to 2.02 seconds, or about 39 million iterations/sec. Detecting circular regions of the slowest areas (in blue) and just setting the count to maximum in those regions lowers the drawing time to 0.85 seconds. Code
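To make the 4:28 fixed point step concrete, here is a minimal sketch of one way to write the inner loop. It is not the linked code. The names fix28, mul_fix28 and mandel_iter are hypothetical, and the sketch assumes |cr| and |ci| are at most 2 so that intermediate values stay inside the 4:28 range.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fix28;   /* 4.28 fixed point: 4 integer bits (with sign), 28 fraction bits */
#define FIX28_ONE      ((int64_t)1 << 28)
#define FLOAT2FIX28(a) ((fix28)((a) * 268435456.0))   /* scale by 2^28 */

/* Product of two 4.28 values, kept in 64 bits so a large |z|^2 cannot
   wrap before the escape test. Relies on arithmetic right shift of
   negative values, which GCC and Clang provide. */
static int64_t mul_fix28(fix28 a, fix28 b)
{
    return ((int64_t)a * (int64_t)b) >> 28;
}

/* Count iterations of z = z*z + c, starting at z = 0, until |z|^2 > 4
   or max_iter is reached. Assumes |cr| <= 2 and |ci| <= 2. */
static int mandel_iter(fix28 cr, fix28 ci, int max_iter)
{
    assert(max_iter >= 0);
    fix28 zr = 0, zi = 0;
    int n = 0;
    while (n < max_iter) {
        int64_t zr2 = mul_fix28(zr, zr);
        int64_t zi2 = mul_fix28(zi, zi);
        if (zr2 + zi2 > 4 * FIX28_ONE) break;       /* escaped */
        zi = (fix28)(2 * mul_fix28(zr, zi) + ci);   /* imaginary part first: uses old zr */
        zr = (fix28)(zr2 - zi2 + cr);
        n++;
    }
    return n;
}
```

Points inside the set, such as c = 0 or c = -1, run to max_iter, which is why skipping the large interior regions saves so much time.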

-- Conway's game of life on VGA/HPS , 8-bit color
The game of life is a 2D, totalistic, cellular automaton which is compute-universal. The HPS program displays using the DE1-SoC computer system explained above. It computes a 640x480 cell automaton at approximately 14 frames/sec, using level -O2 compiler optimization. This corresponds to about 4.25 million cell updates/sec. The slow step here is writing the pixels to the frame buffer, which is limited by the bus rate to about 5 million/sec. If you modify the code to be smarter about writing pixels, the speed goes as high as 60 frames/sec, or about 18 million cells/sec, but will depend on the specific content of the screen. More cell state changes will slow down execution. Use the sof file from the "better chopped down system" above.
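One way to be "smarter about writing pixels" is to write to the frame buffer only when a cell changes state. The sketch below illustrates that idea on a small grid. It is an assumption about the approach, not the linked code: the grid size, the wrap-around edges, the life_screen array standing in for the VGA buffer, and the name life_step are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LIFE_W 8   /* small grid for illustration; the page uses 640x480 */
#define LIFE_H 8

/* Stand-in for the VGA frame buffer. On the real system each store
   here would be a slow bus write. */
static uint8_t life_screen[LIFE_H][LIFE_W];

/* Compute one generation from cur into next, with wrap-around edges
   (an assumption). Only cells that change state are written to the
   screen. Returns the number of pixel writes performed. */
static int life_step(uint8_t cur[LIFE_H][LIFE_W], uint8_t next[LIFE_H][LIFE_W])
{
    assert(cur != next);   /* the update must not be done in place */
    int writes = 0;
    for (int y = 0; y < LIFE_H; y++) {
        for (int x = 0; x < LIFE_W; x++) {
            int n = 0;   /* live neighbor count */
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    if (dx != 0 || dy != 0)
                        n += cur[(y + dy + LIFE_H) % LIFE_H][(x + dx + LIFE_W) % LIFE_W];
            uint8_t alive = (uint8_t)((n == 3) || (cur[y][x] && n == 2));
            next[y][x] = alive;
            if (alive != cur[y][x]) {   /* skip the bus write if unchanged */
                life_screen[y][x] = alive ? 0xff : 0x00;
                writes++;
            }
        }
    }
    return writes;
}
```

For a three-cell blinker on an otherwise empty grid, each generation costs four pixel writes instead of one write per cell, which is the content-dependent speedup described above.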

-- Graphics primitives on VGA/HPS, 8-bit color
A few more 2D drawing primitives were added to draw:
points, lines (general, fast vertical, fast horizontal), filled circles, circles, filled rectangles and rectangle edges.
As above, color is 8-bit, resolution is 640x480. Text is drawn in white only. Also, there is a routine to clear text from the frame buffer. Clearing the image plane is done by writing a large, black, filled rectangle. Use the sof file from the "better chopped down system" above.
Code
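As an illustration of the primitives listed above, here is a minimal sketch of a bounds-checked pixel write, a fast horizontal line, and a filled rectangle built from it. It is not the linked code. The frame array is a stand-in for the memory-mapped pixel buffer, a linear one-byte-per-pixel layout is assumed, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_W 640
#define SCREEN_H 480

/* Stand-in for the memory-mapped 8-bit pixel buffer (assumed linear,
   one byte per pixel, SCREEN_W bytes per row). */
static uint8_t frame[SCREEN_W * SCREEN_H];

/* Write one pixel, silently ignoring out-of-range coordinates. */
static void draw_pixel(int x, int y, uint8_t color)
{
    if (x < 0 || x >= SCREEN_W || y < 0 || y >= SCREEN_H) return;
    frame[y * SCREEN_W + x] = color;
}

/* Fast horizontal line: clip once, then fill the run.
   Returns the number of pixels actually written. */
static int draw_hline(int x1, int x2, int y, uint8_t color)
{
    if (x1 > x2) { int t = x1; x1 = x2; x2 = t; }
    if (y < 0 || y >= SCREEN_H || x2 < 0 || x1 >= SCREEN_W) return 0;
    if (x1 < 0) x1 = 0;
    if (x2 >= SCREEN_W) x2 = SCREEN_W - 1;
    assert(x1 <= x2);   /* holds after clipping */
    memset(&frame[y * SCREEN_W + x1], color, (size_t)(x2 - x1 + 1));
    return x2 - x1 + 1;
}

/* Filled rectangle as a stack of horizontal lines. */
static void fill_rect(int x1, int y1, int x2, int y2, uint8_t color)
{
    if (y1 > y2) { int t = y1; y1 = y2; y2 = t; }
    for (int y = y1; y <= y2; y++)
        draw_hline(x1, x2, y, color);
}
```

With these helpers, clearing the image plane is fill_rect(0, 0, 639, 479, 0x00).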

-- Color chooser on VGA/HPS, 8-bit color
I wrote a color chooser that lays out a grid of all the possible 8-bit colors in hexadecimal order, then prompts you for the location and index of up to four different colors for comparison in a larger patch. You get the index by adding the column and row numbers for a given color. Ordering is [red 7:5, green 4:2, blue 1:0]. You set the location of the larger patch as an integer 0-3. Code.
Reordering the color patches to make four 64-patch red-green planes, with increasing blue content, makes a nicer display. The hex equivalent is displayed on each patch. Code.

-- Video input from NTSC to VGA, 8-bit color
The board supports NTSC/PAL input through a Video Input subsystem in Qsys. A camera is attached to the yellow composite video jack. Several modifications need to be done to the Qsys layout and top-level module to make this work at 640x480 resolution. In the Video Input Subsystem:

In the top-level module the video input signals need to be defined as given in the reference design, but the TD_RESET_N signal is not correctly generated by the supplied IP, so a line was added to the top-level module: assign TD_RESET_N = SW[1]; The switch may need to be cycled at power-up to enable the system. Turning off the switch freezes the video capture. Also, the system would not start unless the edge-detect option was transiently turned on by an HPS program. The HPS program also reads and displays the 8-bit color of the pixel at video input location 160x120, the middle of the input image. I do not know yet why there are two copies of the camera image displayed.
(top-level, project ZIP).


Converting video to SDRAM -- 640x480 8-bit and 16-bit color

-- Video input from NTSC to on-chip-memory, then to SDRAM VGA using HPS, in 8-bit color
The Qsys layout can be modified so that video input goes to on-chip, dual-port SRAM, while the VGA display is refreshed from SDRAM. It is then possible to use the HPS to copy pixels from the video-in SRAM to the display buffer SDRAM, or just use the pixels for computation on the HPS. Changes to the Qsys layout:

An HPS program can read/write the video_in RAM and the VGA display SDRAM to copy the pixels from video in to display. There are new functions to support the read/write. As before, switch SW[1] must be UP for the video input to run. Using the HPS (instead of an Avalon bus-master) is an inefficient use of the bus, but is useful for testing. A slight timing error results in a one-pixel ripple in the video input display. (HPS code, project ZIP). Down in the lower-left corner, the time readout gives the copy-time of about 30 fps. The color indicator reads white, which is the one-pixel dot on my neck, inserted by the program at (160,120) in the video-in buffer.

-- Video VGA 640x480 displayed from SDRAM, in 16-bit color.
A stripped down display system uses SDRAM as a frame buffer. The top level Verilog only connects the Qsys exports to the i/o pins and has no other logic. The Qsys layout is modified to support 16 bit color. The Qsys modifications:

The HPS pixel writing macro is modified to allow 16-bit writes to the bus, and uses the consecutive format:
// pixel macro: 16-bit pixels stored consecutively, 640 per row, 2 bytes each
#define VGA_PIXEL(x,y,color) do{\
short *pixel_ptr ;\
pixel_ptr = (short *)((char *)vga_pixel_ptr + (((y)*640+(x))<<1)) ; \
*pixel_ptr = (color);\
} while(0)

Color coding is 16-bit RGB. This format uses 5 bits for red, 6 bits for green, and 5 bits for blue.
If R and B are 5-bit integers and G is a 6-bit integer then color = B+(G<<5)+(R<<11);
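The color formula and the consecutive addressing used by the pixel macro can be written as two small helpers. This is a minimal sketch; the names pack_rgb565 and pixel_offset565 are hypothetical and do not appear in the linked programs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: color = B + (G<<5) + (R<<11),
   with R and B in 0..31 and G in 0..63. */
static uint16_t pack_rgb565(unsigned r, unsigned g, unsigned b)
{
    assert(r < 32 && g < 64 && b < 32);
    return (uint16_t)(b + (g << 5) + (r << 11));
}

/* Hypothetical helper: byte offset of pixel (x,y) in the consecutive
   format, 640 pixels per row and 2 bytes per pixel, as in VGA_PIXEL. */
static size_t pixel_offset565(unsigned x, unsigned y)
{
    assert(x < 640 && y < 480);
    return ((size_t)y * 640u + x) << 1;
}
```

For example, medium gray (15,31,15) packs to 0x7bef and white (31,63,31) packs to 0xffff.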

A color-picker program allows you to specify R, G, B values, displays the color in the lower right, and shows 2D slices through the 3D RGB space, axis aligned, which include the specified (R,G,B) point. The top slice is the red-green plane, the middle is the blue-green plane, and the bottom is the blue-red plane. Three examples are shown below through points black (0,0,0), medium gray (15,31,15), and white (31,63,31). The HPS performance program linked below prompts for color mask values to set ranges for RGB, then draws 1000 discs with random colors constrained by the RGB masks.

(HPS color picker, HPS performance measure, ZIP)

The graphics primitives were converted to 16-bit color (HPS program).
This program assumes the 16-bit hardware used above.


HPS Interprocess Communication for video and audio

--Using two ARM processors to write video and play a tone.
Using a single process both to write to the video buffer as fast as possible and to keep the audio FIFO filled failed above about 8000 pixels per loop, where the FIFO could be filled in each loop if there was space. The easy solution is to start two processes, which are migrated by Linux onto the two processors with both running at full speed. (Quartus archive, combined audio/video code which failed at high write-rates). The audio code required the math library for sine wave synthesis, which requires compiling with the -lm option. The video code is unchanged. And the address header.

-- Using two ARM processors with IPC to display time while writing video and playing a tone.
Starting two processes to maximize bandwidth requires communication between the processes. This example uses the fixed audio synthesis frequency (48 kHz) to drive a timer/counter which then uses shared memory IPC (interprocess communication) to display the time on the VGA. Both the audio and video code were attached to the same shared memory segment using shmget and shmat. As before, the audio code required the math library for sine wave synthesis, which requires compiling with the -lm option. (Quartus archive, address header).
Use the sof file from the "better chopped down system" above.
-- A minor modification of both the audio and video code plays a one-octave scale on the audio side and displays the time and frequency on the video side.
-- Cleaning up both the audio and video code puts pixel limit error checking in the video draw routines and better naming in the audio program.
-- Adding a disk function (video) to the video code makes particle systems nicer.


HPS to/from PC UDP communication

--UDP communication from ARM to outside world
Sending information to/from the FPGA via the ARM ethernet would be useful for a number of projects. The first code modifies the audio generation code to send the current note being played across a UDP connection to Matlab running on a desk machine. The ARM code opens a socket on port 9090 (do not use ports below 1024) and sends data once/sec to the port. The Matlab code running on the PC opens a UDP object, then just listens and echoes the string to the command window. The code is based on the useful UDP material at linuxhowtos.org, particularly server_udp.c.
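The sending side reduces to a few socket calls. The following is a minimal sketch, not the linked code: the function name udp_send_string is hypothetical, and it opens and closes a socket per message for simplicity, where a real program would keep the socket open.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: send one text datagram to ip:port.
   Returns the number of bytes sent, or -1 on error. */
static int udp_send_string(const char *ip, unsigned short port, const char *msg)
{
    assert(ip != NULL && msg != NULL);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);   /* network byte order */
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1) {
        close(fd);
        return -1;                /* not a valid dotted-quad address */
    }

    ssize_t n = sendto(fd, msg, strlen(msg), 0,
                       (struct sockaddr *)&dst, sizeof dst);
    close(fd);
    return (int)n;
}
```

Calling this once per second with the current note string and the PC's address on port 9090 gives the behavior described above. UDP has no delivery guarantee, so the send succeeds even if nothing is listening.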

--UDP communication from outside world to ARM
Sending slow (human rate) commands to a program can be done by setting up a non-blocking UDP receive function. Each time through the main event loop, the program checks for a valid packet. The Matlab code asks the human for a sine wave frequency and sends the number to the ARM. The ARM audio code computes the DDS increment for the frequency and sends the samples to the audio codec FIFO and the video process. Use the sof file from the "better chopped down system" above.
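A non-blocking receive and the DDS increment calculation can be sketched as follows. This is an illustration, not the linked code: the function names are hypothetical, MSG_DONTWAIT is used as one of several ways to avoid blocking, and a 32-bit phase accumulator is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a UDP socket bound to the given port on all interfaces
   (port 0 lets the kernel pick a free one). Returns fd or -1. */
static int udp_open_rx(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in me;
    memset(&me, 0, sizeof me);
    me.sin_family = AF_INET;
    me.sin_addr.s_addr = htonl(INADDR_ANY);
    me.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&me, sizeof me) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Check once for a packet without blocking. Returns bytes received,
   0 if nothing is waiting, -1 on a real error. (A zero-length datagram
   also returns 0, which this sketch does not distinguish.) */
static int udp_poll(int fd, char *buf, size_t len)
{
    ssize_t n = recvfrom(fd, buf, len, MSG_DONTWAIT, NULL, NULL);
    if (n < 0)
        return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
    return (int)n;
}

/* Phase increment for a 32-bit DDS accumulator (assumed width):
   increment = f * 2^32 / Fs. */
static uint32_t dds_increment(double freq_hz, double sample_rate_hz)
{
    assert(freq_hz >= 0.0 && freq_hz < sample_rate_hz / 2.0);
    return (uint32_t)(freq_hz * 4294967296.0 / sample_rate_hz);
}
```

The main event loop would call udp_poll once per pass and, when a frequency arrives, recompute the increment with dds_increment(f, 48000.0).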

--UDP audio from Matlab >udp> ARM >bus> FPGA audio FIFO
Sending audio rate packets from Matlab code is fairly easy, but setting up an ACK function for sync is not, because the Matlab UDP receive function is too slow. The result is an unsynced system that works most of the time, but has to be tuned with a spin-wait loop in Matlab. To further reduce overhead, eight audio samples were sent in each packet by Matlab. At the ARM code end, the eight samples were duplicated 6 times each to expand the sample rate to 48 kHz, the default audio rate of the Qsys audio core. The main loop makes sure there is enough space in the FIFO for 48 samples, reads a packet, and loads the FIFO. Audio example file. The video process records elapsed time of audio using memory shared with the audio process. Use the sof file from the "better chopped down system" above.
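The 8-to-48 sample expansion is a simple repeat of each input sample. A minimal sketch, with the hypothetical name upsample_hold and 32-bit samples assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLES_PER_PACKET 8
#define UPSAMPLE 6   /* 8 kHz in, 48 kHz out */

/* Repeat each input sample UPSAMPLE times (zero-order hold).
   out must have room for n_in * UPSAMPLE samples.
   Returns the number of samples written. */
static size_t upsample_hold(const int32_t *in, size_t n_in, int32_t *out)
{
    assert(in != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n_in; i++)
        for (int r = 0; r < UPSAMPLE; r++)
            out[k++] = in[i];
    return k;
}
```

Each packet of 8 samples therefore becomes 48 FIFO writes, which is why the main loop waits for 48 free FIFO slots before reading a packet.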

--UDP audio from Matlab >udp> ARM receive process >ipc> ARM fpga process >bus> FPGA audio FIFO
Decoupling the packet receive from the audio-rate FIFO operation results in more robust timing. The ARM receive process listens for packets and fills a buffer much faster than Matlab can send the packets. The buffer is shared with a second process, which reads the buffer and fills the audio FIFO at 48 Ksamples/sec. An int buffer of length 2^16 (65536 samples) can hold 8 seconds of sound, assuming the incoming audio is at 8 kHz as in the previous example. The FIFO filling process waits for samples to appear from the receive process. The receive process waits for a start/reset command from Matlab. Matlab is sending about 2700 packets/sec, each with eight 32-bit audio samples. Use the sof file from the "better chopped down system" above.
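A shared buffer of length 2^16 is naturally handled as a ring with a power-of-two index mask. The sketch below shows one possible structure and is not the linked code: ring_t, ring_put and ring_get are hypothetical names. In the two-process version the struct would live in the shared memory segment, and a production version would likely need memory barriers or atomics in place of plain volatile counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_LEN  (1u << 16)        /* 65536 samples */
#define RING_MASK (RING_LEN - 1u)

typedef struct {
    volatile uint32_t head;   /* total samples written by the receive process */
    volatile uint32_t tail;   /* total samples read by the FIFO-fill process */
    int32_t data[RING_LEN];
} ring_t;

/* Producer side: store one sample. Returns 1 on success, 0 if full. */
static int ring_put(ring_t *r, int32_t s)
{
    assert(r != NULL);
    if (r->head - r->tail == RING_LEN) return 0;   /* full */
    r->data[r->head & RING_MASK] = s;
    r->head++;                                     /* publish after the store */
    return 1;
}

/* Consumer side: fetch one sample. Returns 1 on success, 0 if empty,
   in which case the FIFO-fill process simply waits and retries. */
static int ring_get(ring_t *r, int32_t *s)
{
    assert(r != NULL && s != NULL);
    if (r->head == r->tail) return 0;              /* empty */
    *s = r->data[r->tail & RING_MASK];
    r->tail++;
    return 1;
}
```

Because the counters are unsigned and the length is a power of two, head - tail stays correct even after the 32-bit counters wrap.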


References:

DE1-SOC literature list

Using the DE1-SOC FPGA by Ahmed Kamel

Stereoscopic Depth on an FPGA via OpenCL by Ahmed Kamel and Aashish Agarwal

Running Linux on DE1-SOC by Manish Patel and Syed Tahmid Mahbub

OpenCL on DE1-SOC by Sahil P Potnis (spp66@cornell.edu), Aashish Agarwal (aa2264@cornell.edu), and Ahmed Kamel (ayk33@cornell.edu)

Audio Core (Qsys University Program 15.1) local copy

Video Core (Qsys University Program 15.1) local copy

Analog input Core (Qsys University Program 15.1) local copy

External to Avalon Bus Master (external here means in the FPGA, but not in the Qsys bus structure)

Avalon to External Bus Slave (external here means in the FPGA, but not in the Qsys bus structure)


Copyright Cornell University May 4, 2017