Adrian Johnson

CS 490/EE 690 Final Project

Spring 1998





Digital signal processing is used in many areas of industry. Example applications include speech synthesis, speech recognition, and high-speed modems. The main advantage of digital processing over analog processing is its ability both to process data and to control data based on earlier results. In addition, DSP systems have four advantages over analog systems. First, they are insensitive to the environment, since a DSP's output is not temperature dependent. Second, they are insensitive to component tolerances: two functioning digital components will behave exactly the same, while two similar analog systems will never produce exactly the same output when given the same inputs. Third, DSPs are reprogrammable, so each DSP can perform numerous different tasks. Finally, DSPs are small in comparison to bulky analog filters.



The most important feature of a DSP is its ability to support repetitive and numerically intensive tasks. This ability is used in the calculation of Fourier transforms, multi-filter systems, and correlations. The key to it is the ability to perform a multiply-accumulate operation in a single clock cycle, with the multiply-accumulate integrated into the data path. Specialized program control is another important aspect of a DSP. Control of loops for highly iterative algorithms is accomplished with a hardware counter and repeat buffer. Additional DSP-specific features include multiple-access memory architecture, specialized addressing modes for handling memory data arrays and FIFO buffers, and on-chip A/D or D/A converters.
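To make the role of the multiply-accumulate concrete, the following is a minimal Python sketch of one FIR filter output computed as a chain of multiply-accumulate steps. It is an illustration only and not part of the project; the function name fir_mac and the sample values are hypothetical.

```python
def fir_mac(samples, coeffs):
    """Compute one FIR output sample as a sum of products.

    Each loop iteration is one multiply-accumulate (MAC) step:
    the product of a sample and a coefficient is added to the
    running accumulator. A DSP with a single-cycle MAC performs
    each iteration in one clock cycle.
    """
    acc = 0
    for x, h in zip(samples, coeffs):
        acc += x * h  # one MAC step per filter tap
    return acc


if __name__ == "__main__":
    # Hypothetical three-tap filter applied to three samples.
    print(fir_mac([1, 2, 3], [2, 0, -1]))
```

Correlation and the inner loops of a Fourier transform have the same sum-of-products shape, which is why the MAC dominates DSP data path design.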



This project attempts to design and implement a small 4-bit digital signal processor on a Xilinx FPGA. Limiting factors of the design included a single data line and a single address line, the limited number of on-chip CLBs, and clock limitations. Signed fixed-point numerical representation was chosen over floating-point representation because its arithmetic algorithms are smaller. Two's-complement representation was used to represent positive and negative numbers.
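As a reminder of what 4-bit two's-complement representation allows, here is a small Python sketch that converts between signed integers and 4-bit patterns. The helper names to_twos and from_twos are hypothetical and are not taken from the project.

```python
WIDTH = 4  # data width of the project DSP


def to_twos(value, width=WIDTH):
    """Encode a signed integer as a width-bit two's-complement pattern."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {width} signed bits")
    return value & ((1 << width) - 1)


def from_twos(bits, width=WIDTH):
    """Decode a width-bit two's-complement pattern to a signed integer."""
    bits &= (1 << width) - 1
    if bits & (1 << (width - 1)):  # sign bit set: value is negative
        return bits - (1 << width)
    return bits


if __name__ == "__main__":
    for v in (-8, -3, 0, 7):
        print(v, format(to_twos(v), "04b"))
```

With four bits the representable range is -8 to 7, which is why the wider ten-bit accumulators described later are needed to hold products and sums.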



The instruction set was designed using the best features of the DSPs that were researched, in combination with general CISC instructions. The combination of these instructions produces a versatile processor. Instructions were broken into six categories: ALU functions, special operation functions, branch functions, multiplication functions, shift functions, and load functions. Instructions are further broken down by number of operands, source type, and special functionality. The instruction set is shown in the figure.



Three instruction lengths were chosen: eight-bit, sixteen-bit, and thirty-two-bit. The large number of instruction bits is useful in creating the data path, though most instructions do not take advantage of all of them. The first instruction byte is loaded into instruction register 1 (IR1) and broken into two parts. The three-bit opcode (OP = IR1[7..5]) determines which of the six categories the instruction belongs to. The five-bit function (func = IR1[4..0]) selects the specific instruction. Function coding is category dependent, and certain bits of the function represent either the operand length or some other detail specific to the instruction. The second instruction byte is loaded into instruction register 2 (IR2) and accessed through a number of aliases. These aliases include register and accumulator controls, immediate accessing, and the shift operation function (shOp = IR2[3..0]). Instruction bytes three and four form a sixteen-bit displacement used by certain instructions.
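The field layout above can be sketched in Python as follows. The bit positions and the category opcodes are the ones given in this report; the function names decode_ir1 and decode_shop and the CATEGORIES table are hypothetical helpers written for illustration, not the project's VHDL.

```python
# Category opcodes as stated in this report (OP = IR1[7..5]).
CATEGORIES = {
    0b000: "ALU",
    0b001: "multiply",
    0b010: "shift",
    0b011: "branch",
    0b100: "load",
    0b111: "special",
}


def decode_ir1(ir1):
    """Split the first instruction byte into (op, func, category)."""
    op = (ir1 >> 5) & 0b111   # IR1[7..5]
    func = ir1 & 0b11111      # IR1[4..0]
    return op, func, CATEGORIES.get(op, "unassigned")


def decode_shop(ir2):
    """Extract the shift operation field shOp = IR2[3..0]."""
    return ir2 & 0b1111


if __name__ == "__main__":
    # OP = "001" with func = "00000" is the MAC instruction.
    print(decode_ir1(0b00100000))
```

The other IR2 aliases (register and accumulator selects, immediates) are not decoded here because their bit positions are not given in the text.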



ALU functions are represented by an opcode of "000." They are broken into three sub-categories. Two-operand instructions (specified when func(4) = ‘0’) contain a source register and a destination register. One-operand instructions (specified when func(4) = ‘1’) contain only a destination register. Immediate instructions perform the same functions as two-operand instructions, substituting a 6-bit immediate value for the source register.



Special functions (OP = "111") are zero-operand instructions, including NOP, STOP, and RET. These three instructions do not affect data path registers. Six other special functions are used to set or clear the ALU flags. These instructions can be used to force certain branches. The functions of the special operations are Gray-coded. Bit zero is used for the carry flag, bit one for the negative flag, and bit two for the zero flag. Bit four determines whether the flag is set or cleared.



Branch functions (OP = "011") contain a 16-bit displacement to determine the location of a jump. Normal branch instructions determine whether to branch or not depending upon the values in the flags. The CALL instructions are used for jumping to a subroutine. Special branch instructions including branch-bit-set (BBS), branch-bit-clear (BBC), and hardware loop (RPT) are represented by func(4) = ‘1’.



Four multiplication instructions (OP = "001") are included: two-operand multiplication, immediate multiplication, square, and multiply-accumulate (MAC). Gray coding is used for the multiplication functions. The MAC function code is "00000", which is the same function code used for addition.



Shift functions (OP = "010") are the most complicated. Standard one-operand shifting occurs within the data path. Shift instructions include arithmetic (ASL and ASR), logical (LSR), and rotation (ROL and ROR). In addition, rounding (RND) and truncating (TNK) are implemented. Rounding and truncating do not work in the normal mathematical sense. Rather, round keeps the least significant bits and truncation keeps the most significant bits. In the case of single-operand instructions, eight of the ten bits are kept while the remaining bits become zero.

Because of the location of the shifter in the data path, it is possible to perform a shift operation and an ALU operation in a single clock cycle. This allows for single-clock-cycle squares and cubes of accumulator values, or single-clock-cycle divide by two and divide by three. The type of ALU function is determined by func, while shOp determines the shift function. For normal shifting, func = "11111" is used, since this passes input-a through the ALU.

Accumulator-to-memory transfers and accumulator-to-register transfers are included as shift instructions, since the length of the accumulator needs to be limited in both cases. Accumulator-to-memory transfers include limit, round, and truncate (LIMA, RNDA, TNKA). The limit instruction limits the accumulator according to a limitation algorithm that determines where the MSB is. Accumulator-to-memory transfers turn a ten-bit value into an eight-bit value. The same operations are supported as accumulator-to-register transfers (LIMF, RNDF, TNKF), which create a four-bit value from the ten-bit value.
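The round and truncate transfers can be sketched in Python under the definitions given above: round keeps the least significant bits and truncate keeps the most significant bits. This is a hedged illustration, not the project's limiter. It assumes the kept bits are right-aligned in the narrower output, which the text does not state, and it leaves out the limit operation because its algorithm is not described here. The names rnd and tnk are hypothetical.

```python
ACC_BITS = 10  # accumulator width in the project DSP
ACC_MASK = (1 << ACC_BITS) - 1


def rnd(acc, out_bits):
    """RND-style transfer: keep the least significant out_bits of acc."""
    return acc & ((1 << out_bits) - 1)


def tnk(acc, out_bits):
    """TNK-style transfer: keep the most significant out_bits of acc.

    Assumption: the kept bits are moved down so that they are
    right-aligned in the out_bits-wide result.
    """
    return (acc & ACC_MASK) >> (ACC_BITS - out_bits)


if __name__ == "__main__":
    acc = 0b1011001110
    # out_bits = 8 models the memory transfers (RNDA, TNKA);
    # out_bits = 4 models the register transfers (RNDF, TNKF).
    for name, fn in (("rnd", rnd), ("tnk", tnk)):
        print(name, format(fn(acc, 8), "08b"), format(fn(acc, 4), "04b"))
```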



Four load instructions (OP = "100") exist. Load from memory (LD) loads a register with a number specified by a displacement. Load immediate (LDI) loads a register with a number specified in the instruction. Load direct (LDD) loads a register from a second register while load accumulator (LDA) loads an accumulator with a sign-extended register. Functions for load instructions are Gray coded.
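The sign extension performed by load accumulator (LDA) can be illustrated with a short Python sketch that widens a 4-bit two's-complement register value to the 10-bit accumulator width. The function name sign_extend is hypothetical; only the widths come from this report.

```python
def sign_extend(value, from_bits=4, to_bits=10):
    """Widen a from_bits two's-complement pattern to to_bits.

    The sign bit of the narrow value is copied into every new
    upper bit, so the signed value is unchanged.
    """
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):  # negative: fill upper bits with 1s
        upper = ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
        value |= upper
    return value


if __name__ == "__main__":
    print(format(sign_extend(0b1101), "010b"))  # -3 in ten bits
    print(format(sign_extend(0b0101), "010b"))  # +5 in ten bits
```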



The data path is loosely based on the Motorola DSP5600x series. While the Motorola DSP contains two data and address lines, the project DSP uses a single data and a single address line. The data path can be broken into a number of components. Four-bit values from memory are stored in a register file, which feeds into the multiplier. The ten-bit output of the multiplier connects with the ten-bit ALU / Accumulator file. A limiter is used for store instructions and accumulator-to-register operations. A shifter is along the ALU input-a feedback line. An addressing unit supplies the current address.



The multiplier is a simple unsigned multiplier that shifts and adds the inputs to produce a result. The inputs come from four 4-bit general-purpose registers (R0-R3). The input to the register file is rin, which can be a 4-bit data value, an immediate, or a current register value. The register enable input, rgEn, controls storage to the register file: if rgEn is high, values can be stored. The rgEn signal is controlled by the current instruction and the state table output, start. Inputs to the multiplier are multiplexed according to ra and rb.

The multiplier contains an internal state machine that counts the number of bits shifted and added. It is controlled by a start signal and outputs a finish signal when the product has been calculated. Multiplying two four-bit numbers takes six clock cycles, so unlike the Motorola DSP, the project DSP cannot execute single-cycle multiplies.
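The shift-and-add scheme can be modeled with the following Python sketch, which also counts clock cycles. The report gives only the total of six cycles for a 4-bit by 4-bit multiply; the split into one start cycle, four shift/add cycles, and one finish cycle is an assumption made for this illustration, and the function name shift_add_multiply is hypothetical.

```python
def shift_add_multiply(a, b, width=4):
    """Unsigned shift-and-add multiply; returns (product, cycles).

    One bit of b is examined per iteration. When the bit is set,
    a shifted copy of a is added to the running product.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    cycles = 1  # assumed: one cycle to latch operands on start
    for i in range(width):
        if (b >> i) & 1:
            product += a << i
        cycles += 1  # one cycle per bit shifted and added
    cycles += 1  # assumed: one cycle to raise finish
    return product, cycles


if __name__ == "__main__":
    print(shift_add_multiply(13, 11))
```

The largest unsigned product, 15 x 15 = 225, fits in eight bits, so the ten-bit multiplier output described above leaves headroom for accumulation.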

For the square instruction, rb is set equal to ra; for the immediate multiply, an immediate value is multiplexed in instead.



Inputs to the accumulator include ALUI1 and ALUI2. ALUI1 is the output of the shifter unless the instruction register holds the MAC instruction, in which case the multiplier output is fed onto ALUI1. ALUI2 is multiplexed according to aca or acb, or it carries an immediate value into the ALU. The ALU operation is chosen by fn, and the ALU outputs four flags that depend on the current instruction: carry, zero, negative, and overflow. The output of the ALU is multiplexed with the multiplier output, depending on the instruction, and then loaded into the accumulator file.
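As an illustration of the four flags, here is a Python sketch of a 10-bit addition that reports carry, zero, negative, and overflow. It uses the conventional two's-complement definitions of these flags, which is an assumption; the report does not spell out how the project's ALU derives them. The name alu_add is hypothetical.

```python
MASK = 0x3FF  # ten-bit ALU width
SIGN = 0x200  # bit 9 is the sign bit


def alu_add(a, b):
    """Add two 10-bit values; return (result, flags)."""
    a &= MASK
    b &= MASK
    total = a + b
    result = total & MASK
    flags = {
        "carry": total > MASK,             # carry out of bit 9
        "zero": result == 0,
        "negative": bool(result & SIGN),   # sign bit of the result
        # Signed overflow: operands share a sign that the result lacks.
        "overflow": bool(~(a ^ b) & (a ^ result) & SIGN),
    }
    return result, flags


if __name__ == "__main__":
    print(alu_add(0x1FF, 1))  # largest positive value plus one
```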

The four 10-bit accumulators are loaded based on the register enable, regEn, and aca. RegEn depends upon start and the current instruction.



The limiter limits the aca accumulator’s output dependent upon the function. The function is loaded from shOp. The limiter outputs either an eight-bit value to the data bus or a four-bit value to a register. The limiter also outputs a four-bit flag that contains the bits trimmed from the input. Future instructions can use this flag for control purposes.



The shifter accepts a 10-bit input and shifts it according to shOp. The shifter outputs a ten-bit value to the ALU and a single-bit flag, which contains the shifted bit. Similar to the limiter, the flag may be used in future instructions.
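A behavioral Python sketch of such a shifter is shown below. It assumes each operation shifts by exactly one bit position and selects the operation by mnemonic, because the shOp bit encodings are not listed in the text. The function name shift is hypothetical.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1


def shift(value, op):
    """Shift a 10-bit value by one position.

    Returns (result, flag), where flag is the bit shifted out.
    """
    value &= MASK
    msb = (value >> (WIDTH - 1)) & 1
    lsb = value & 1
    if op == "ASL":  # arithmetic shift left: zero enters at the bottom
        return (value << 1) & MASK, msb
    if op == "ASR":  # arithmetic shift right: sign bit is preserved
        return (value >> 1) | (msb << (WIDTH - 1)), lsb
    if op == "LSR":  # logical shift right: zero enters at the top
        return value >> 1, lsb
    if op == "ROL":  # rotate left: top bit wraps to the bottom
        return ((value << 1) & MASK) | msb, msb
    if op == "ROR":  # rotate right: bottom bit wraps to the top
        return (value >> 1) | (lsb << (WIDTH - 1)), lsb
    raise ValueError(f"unknown shift operation: {op}")


if __name__ == "__main__":
    for op in ("ASL", "ASR", "LSR", "ROL", "ROR"):
        result, flag = shift(0b1000000001, op)
        print(op, format(result, "010b"), flag)
```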



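The addressing unit contains a 16-bit program counter, a 16-bit stack pointer, an 8-bit memory address high register, and an 8-bit memory address low register. Depending upon the state and the instruction, the address line receives one of three addresses: the program counter, the stack pointer, or the memory address formed by the high and low registers.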



The state machine is reset by the reset input. During reset, the state is set to 0 and the PC is set to FFFF so that the next clock tick brings it to 0000. At each clock tick the state is loaded with the next state and the PC is incremented. The first state loads the first instruction byte and determines whether a second byte is needed; the second state loads the second byte and determines whether a displacement is present. States three and four load the displacement. State five sets the start signal and waits for a finish signal. Finish is set either by the multiplier when the product is ready or, in the case of an ALU instruction, on the following clock cycle. The machine then jumps to state one and the process repeats.
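The state sequence described above can be sketched in Python as follows. This is a simplified model under stated assumptions: instructions that need no second byte or no displacement are assumed to skip directly to state five, and the number of cycles spent waiting in state five is passed in as a hypothetical parameter, execute_cycles, rather than derived from a finish signal. The function names are illustrative only.

```python
def next_pc(pc):
    """Increment the 16-bit program counter with wraparound."""
    return (pc + 1) & 0xFFFF


def fetch_states(needs_second_byte, has_displacement, execute_cycles=1):
    """Return the list of states visited for one instruction."""
    states = [1]                      # state 1: load first byte (IR1)
    if needs_second_byte:
        states.append(2)              # state 2: load second byte (IR2)
        if has_displacement:
            states += [3, 4]          # states 3 and 4: load displacement
    states += [5] * execute_cycles    # state 5: start, wait for finish
    return states


if __name__ == "__main__":
    # Reset leaves the PC at FFFF so the first tick wraps it to 0000.
    print(format(next_pc(0xFFFF), "04X"))
    print(fetch_states(True, True))
```

A multiply would correspond to a larger execute_cycles value than an ALU instruction, since state five holds until the multiplier raises finish.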



The individual components were tested separately using the simulator. The ALU, multiplier, shifter, and limiter were exhaustively tested over all possible inputs and checked for correct outputs. All of these components worked correctly. The fetch state machine was tested in the same manner and worked properly for all implemented instructions. The DSP was then assembled into one unit and simulated again. Currently, all implemented functions except the MAC instruction work correctly; the MAC instruction does not correctly load the accumulator. Branch functions have not been implemented, although a working addressing unit is in place for the DSP. The normal branch instructions (BR, BEQ …) should be easy to add to the fetch state machine. CALL, BBS, BBC, and RPT are more difficult.

The DSP has not been loaded into an FPGA, so it is unknown whether it will fit, though the combined individual components took up only half of the CLBs.


An assembler was created to accept the DSP's assembly language and output a string of hexadecimal numbers. The assembler was written as a sequence of macros in the ASM96 assembler. Operation of the assembler is simple and self-explanatory.



The project DSP is a good stepping stone toward more complex and optimized DSPs. Advanced DSP features such as hardware looping, dual memory buses, and single-clock multiply-accumulate can be added with a minimal amount of change to the project DSP's structure. With a set of branch instructions, the project DSP will operate as a CISC processor with strong math capabilities. With the addition of a single-cycle MAC instruction, the project could be considered a fully functional DSP. Currently, the biggest strengths of the DSP are its shifting speed and its internal accumulator operations. With the current instructions, Fourier transforms, correlations, and signal filtering can realistically be computed using this DSP.


VHDL Code for DSP