ECE 576, Final Project: Programmable Discrete Graphics Hardware

Brian Chu
Email username: bc226
Email domain: cornell.edu

A pdf version of this report is available here: 576_report05.pdf.

The source code and Altera Quartus II project files are available in this zip file: 576_lab5.zip.

The supporting development files used to debug and verify the Verilog are included in this zip file: 576_lab5_dev.zip.



Introduction

The goal of this final project was to create a programmable graphics processing unit with as many aspects as possible implemented in hardware, including object and edge generation. The main feature of the graphics unit's organization is its ability to represent transformation operations parametrically, which yields a graphics co-processor capable of rendering procedural motion. The graphics unit takes operations in a very long instruction word (VLIW) format that maps one-to-one onto a high-level scripting language, which provides a means of moving objects and features in a scene dynamically at run time. The original inspiration was to create a physics simulator on an FPGA that followed Lagrangian constraints; implementing more of it in hardware than in software would have required the ability to manipulate a set of objects parametrically. The high-level design shares many similarities with multi-cycle pipelines, such as intermediate memories and registers. However, unlike a regular processor, the co-processor has one pipeline that operates on multiple pieces of data in parallel, much as a vector processor does in single-instruction, multiple-data (SIMD) fashion.

There are three components of the circuit: an object generation pipeline that generates the edges of the target shape; a transformation pipeline that performs transformations on the unit objects[1]; and a rastering pipeline that generates the points for the VGA controller to display. The design makes certain tradeoffs due to the constraints imposed by the FPGA used to synthesize the circuit. First, the transformation pipeline does not employ a generalized 4x4 matrix multiply because of the limited number of multipliers on the FPGA. Instead, the transformation pipeline is currently designed as an operate-and-accumulate module, with intermediate data values stored in registers. This impacts the performance of the overall system, since $n$ transformations take $n$ operations per data set. Alternatively, with more multipliers available, the transformations could first be reduced to a single matrix, so that each data set is transformed in one cycle. Second, the available memory on the FPGA is limited to 8.5 megabytes at most, of which only about 512 kilobytes can be read within a single cycle of asserting the desired address. If a different memory type that provides more storage were used, several key points of the circuit would have to be redesigned to observe that memory's valid signal lines.
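The tradeoff between applying $n$ transformations one at a time and first reducing them to a single matrix can be sketched in software. This is only an illustrative model, not the project's Verilog: every function name is hypothetical, and each step is written as a 4x4 homogeneous matrix purely for uniformity, whereas the actual operate-and-accumulate datapath avoids a general matrix multiply.

```python
# Illustrative model only; all names here are hypothetical, not from the project.

def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    """Apply a 4x4 matrix to a homogeneous 4-vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def translate(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scale(sx, sy, sz):
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]

def rotate_z_90():
    # Exact 90-degree rotation about z, so integer arithmetic stays exact.
    return [[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def operate_and_accumulate(transforms, point):
    """Apply each transform in turn: n passes over every point."""
    acc = point
    for t in transforms:
        acc = mat_vec(t, acc)
    return acc

def compose(transforms):
    """Reduce the transform list to one matrix (later transforms go on the left)."""
    m = scale(1, 1, 1)  # identity matrix
    for t in transforms:
        m = mat_mul(t, m)
    return m

transforms = [scale(2, 2, 2), rotate_z_90(), translate(5, 0, 0)]
point = [1, 0, 0, 1]

sequential = operate_and_accumulate(transforms, point)  # n passes per point
reduced = mat_vec(compose(transforms), point)           # one pass per point
```

Both paths produce the same transformed point. The difference is where the work falls: the sequential path costs $n$ operations for every data set, while the reduced path pays for the composition once and then needs a single (but multiplier-heavy) matrix operation per data set.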

Since this lab was the final project of a one-semester course with limited time available, many other decisions were made to favor rapid development and testing. We present the decisions that, had they been made differently, have the most potential to improve the efficiency of the circuit. One such decision was to represent the inter-pipeline data using an exhaustive edge list. This simplified the memory organization at the cost of increased memory usage, simplified the object generation modules used in the first pipeline[2], and reduced the complexity of the data fetching mechanisms for each pipeline.
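The memory cost of an exhaustive edge list can be illustrated with a unit cube. The sketch below assumes that "exhaustive" means every edge stores the full coordinates of both of its endpoints, and it compares that against a hypothetical indexed layout with a shared vertex table. The names and the one-word-per-value accounting are assumptions for illustration, not the report's actual memory format.

```python
from itertools import product

# Illustrative only: layouts and names are assumptions, not the project's format.

def cube_vertices():
    """The 8 corners of the unit cube as (x, y, z) tuples."""
    return [tuple(v) for v in product((0, 1), repeat=3)]

def cube_edges_indexed(vertices):
    """Edges as index pairs: corners that differ in exactly one coordinate."""
    edges = []
    for i, a in enumerate(vertices):
        for j in range(i + 1, len(vertices)):
            b = vertices[j]
            if sum(x != y for x, y in zip(a, b)) == 1:
                edges.append((i, j))
    return edges

def exhaustive_edge_list(vertices, edges):
    """Each edge carries both endpoint coordinates; no vertex table lookup."""
    return [(vertices[i], vertices[j]) for i, j in edges]

vertices = cube_vertices()
indexed = cube_edges_indexed(vertices)
exhaustive = exhaustive_edge_list(vertices, indexed)

# Count stored values, treating each coordinate or index as one word.
indexed_words = 3 * len(vertices) + 2 * len(indexed)  # vertex table + index pairs
exhaustive_words = 6 * len(exhaustive)                # two (x, y, z) triples per edge
```

Under these assumptions the cube needs 72 words in the exhaustive layout versus 48 in the indexed one. In exchange, a pipeline stage reading the exhaustive list can fetch an edge with one sequential read and never has to dereference a vertex index, which matches the simpler fetching mechanism described above.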