Cyclone5 FPGA Structure
ECE 5760 Cornell University
Overall structure of the FPGA
The FPGA floor plan shows the overall layout of the generic Cyclone5.
Our FPGA has:
- Logic modules organized into Logic Array Blocks (LABs) and/or MLAB block memory (using LABs)
- Block memory as M10k blocks (10 kbits each)
- DSP blocks which can perform a variety of fast multiplies and adds
- HPS Hard Processor System with dual ARM Cortex-A9 cores and associated i/o controllers (memory, ethernet, etc)
- Clock input/generation/distribution networks including PLLs (phase-locked loops)
- Digital I/O lines
A floor plan shows that the computational fabric is arranged as a column structure, with the HPS in one corner, and I/O along the edges of the FPGA. The column structure mixes LABs, DSP, and M10k memory for fast, hopefully efficient, routing. Another view of the fabric is a screen dump from the Quartus chip planner which shows the column structure color coded for block type: pale blue for unused ALMs, dark blue for ALMs in use, tan for DSP, and green for memory. Zooming in to one column of the LAB structure shows some of the interconnect structure which connects ALMs within a LAB, and the connections between LABs.
Wiring and routing
Each LAB can drive 30 ALMs through fast-local and direct-link interconnects. Ten ALMs are in any given LAB and ten ALMs are in each of the adjacent LABs. The local interconnect can drive ALMs in the same LAB using column and row interconnects and ALM outputs in the same LAB. Neighboring LABs, MLABs, M10K blocks, or digital signal processing (DSP) blocks from the left or right can also drive the LAB’s local interconnect using the direct link connection. Longer distance connections are handled by row/column connects which trade off speed and distance.
ALM -- Adaptive Logic Module
Each ALM can be configured in several ways.
- Normal mode
Normal mode allows two 4-input logic functions to be implemented in one ALM, or a single function of up to six inputs.
Up to eight data inputs from the LAB local interconnect are inputs to the combinational logic. The ALM can support certain combinations of completely independent functions and various combinations of functions that have common inputs.
There are also 4 bits of (optionally) registered output.
- Extended LUT mode
In extended mode, if the 7-input function is unregistered, the unused eighth input is available for register packing. Functions that fit into the template, as shown in the figure, often appear in designs as “if-else” statements in Verilog HDL code.
- Arithmetic mode
The ALM in arithmetic mode uses two sets of two 4-input LUTs along with two dedicated full adders. The dedicated adders allow the LUTs to perform pre-adder logic; therefore, each adder can add the output of two 4-input functions. The carry chain provides a fast carry function between the dedicated adders in arithmetic or shared arithmetic mode. The two-bit carry select feature in Cyclone V devices halves the propagation delay of carry chains within the ALM. Carry chains can begin in either the first ALM or the fifth ALM in a LAB. The final carry-out signal is routed to an ALM, where it is fed to local, row, or column interconnects.
- Shared arithmetic mode
The ALM in shared arithmetic mode can implement a 3-input add in the ALM. This mode configures the ALM with four 4-input LUTs. Each LUT either computes the sum of three inputs or the carry of three inputs. The output of the carry computation is fed to the next adder using a dedicated connection called the shared arithmetic chain.
- MLAB mode
You can configure each ALM in an MLAB as a 32 x 2 memory block, so the ten ALMs of one MLAB together form a 32 x 20 simple dual-port SRAM block. To do this, each ALM LUT is re-purposed as RAM. Each MLAB supports a maximum of 640 bits of simple dual-port SRAM.
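The 3-input add of shared arithmetic mode can be sketched in software. This is a minimal Python model, not a description of the real LUT masks: it assumes the "sum" LUTs compute a per-bit XOR of the three inputs and the "carry" LUTs compute a per-bit majority, with each carry bit handed to the next bit position before the dedicated adders finish the job. The function name and the `width` parameter are made up for illustration.

```python
def three_input_add(a: int, b: int, c: int, width: int = 8) -> int:
    """Add three unsigned `width`-bit values in two steps, loosely
    mirroring how shared arithmetic mode splits the work."""
    mask = (1 << width) - 1
    a, b, c = a & mask, b & mask, c & mask

    # "Sum" LUTs: per-bit sum of the three inputs with carries ignored.
    sum_bits = a ^ b ^ c

    # "Carry" LUTs: per-bit carry of the three inputs (majority function).
    carry_bits = (a & b) | (a & c) | (b & c)

    # The carry computed at bit i belongs to bit i+1 (the role played by
    # the shared arithmetic chain in this model). An ordinary two-input
    # add, standing in for the dedicated adders, then produces the result.
    return sum_bits + (carry_bits << 1)
```

The point of the split is that the first two steps need no carry propagation at all, so only one carry chain is needed to add three operands.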
A diagram summarizing the ALM, and more ALM Detail.
A specific ALM configuration dumped from the Quartus Chip planner interface: Detail example.
The Cyclone V devices contain the following clock networks that are organized into a hierarchical structure:
- Global clock (GCLK) networks
Cyclone V devices provide GCLKs that can drive throughout the device. The GCLKs serve as low-skew clock sources for functional blocks, such as adaptive logic modules (ALMs), digital signal processing (DSP), embedded memory, and PLLs. Cyclone V I/O elements (IOEs) and internal logic can also drive GCLKs to create internally-generated global clocks and other high fan-out control signals, such as synchronous or asynchronous clear and clock enable signals. This clock region has the maximum insertion delay when compared with other clock regions, but allows the signal to reach every destination in the device. It is a good option for routing global reset and clear signals or routing clocks throughout the device.
- Regional clock (RCLK) networks
Regional clocks cover single quadrants of the FPGA. The internal logic within a given quadrant can also drive RCLKs to create internally generated regional clocks and other high fan-out control signals.
These have the lowest skew for a single quadrant. To form a dual-regional clock region, a single source (a clock pin or PLL output) generates a dual-regional clock by driving two RCLK networks (one from each quadrant). This technique allows destinations across two adjacent device quadrants to use the same low-skew clock. The routing of this signal on an entire side has approximately the same delay as a RCLK region. Internal logic can also drive a dual-regional clock network.
- Periphery clock (PCLK) networks
Used mostly for i/o.
Every GCLK, RCLK, and PCLK network has its own clock control block. The control block provides the following features:
• Clock source selection (dynamic selection available only for GCLKs)
• Global clock multiplexing
• Clock power down (static or dynamic clock enable or disable available only for GCLKs and RCLKs)
In Cyclone V devices, clock input pins, PLL outputs, high-speed serial interface (HSSI) outputs, and internal logic can drive the GCLK, RCLK, and PCLK networks. The clock networks attempt to deliver minimum clock-skew signals by using a distribution tree.
- Dedicated Clock Input Pins
You can use the dedicated clock input pins for high fan-out control signals, such as asynchronous clears, presets, and clock enables. CLK pins can be either differential clocks or single-ended clocks. On our FPGA there are four CLK inputs, all at 50 MHz.
The Cyclone V PLL clock outputs can drive both GCLK and RCLK networks. The specifications show that the counter sizes are 9 bits. See the Cyclone5 handbook for more info. When you use the CLK pins as single-ended clock inputs, the clock pins have dedicated connections to the PLL. Driving a PLL over a global or regional clock can lead to higher jitter at the PLL input, and the PLL will not be able to fully compensate for the global or regional clock.
- Internal Logic
You can drive each GCLK, RCLK, and horizontal PCLK network using LAB-routing to enable internal logic to drive a high fan-out, low-skew signal. Note: Internally-generated GCLKs, RCLKs, or PCLKs cannot drive the Cyclone V PLLs. The input clock to the PLL has to come from dedicated clock input pins.
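To get a feel for what the 9-bit PLL counters allow when starting from a 50 MHz CLK pin, here is a small Python sketch of the usual integer-mode PLL relation, f_out = f_in * M / (N * C). It is a simplification: it assumes each 9-bit counter covers divide values 1 to 512, and it ignores fractional mode and the VCO frequency limits, which the handbook specifies and Quartus enforces. The names `pll_output_mhz` and `COUNTER_MAX` are illustrative only.

```python
from fractions import Fraction

# Assumption: a 9-bit counter covers divide values 1..512.
COUNTER_MAX = 512


def pll_output_mhz(f_in_mhz, m, n, c):
    """Integer-mode PLL output frequency in MHz.

    m: feedback multiplier, n: input pre-divider, c: output post-divider.
    VCO range limits are NOT checked here.
    """
    for name, value in (("m", m), ("n", n), ("c", c)):
        if not 1 <= value <= COUNTER_MAX:
            raise ValueError(f"{name}={value} is outside 1..{COUNTER_MAX}")
    f_vco = Fraction(f_in_mhz) * m / n   # VCO frequency
    return f_vco / c                     # frequency after the output counter
```

For example, `pll_output_mhz(50, 12, 1, 6)` models a 600 MHz VCO divided down to 100 MHz. Whether a given M/N/C choice is actually legal depends on the VCO limits for the speed grade, so treat the result as arithmetic only.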
The Cyclone V variable precision DSP blocks offer the following features:
• High-performance, power-optimized, and fully registered multiplication operations
• 9-bit, 18-bit, and 27-bit word lengths
• Two 18 x 19 complex multiplications at a rate of 250 MHz.
• Built-in addition, subtraction, and dual 64-bit accumulation unit to combine multiplication results
• Cascading 19-bit or 27-bit to form the tap-delay line for filtering applications
• Hard pre-adder supported in 19-bit, and 27-bit mode for symmetric filters
• Internal coefficient register bank for filter implementation
• 18-bit and 27-bit systolic finite impulse response (FIR) filters with distributed output adder
The DSP blocks seem quite complex to set up, once you get past simple multiplication. Some functions can be inferred from Verilog (see page 13-5 of HDL styles), but you should look at the DSP summary document, and consider using the Quartus IP modules LPM_MULT, ALTERA_MULT_ADD, ALTMULT_COMPLEX to infer DSP blocks.
The Cyclone V variable precision DSP block consists of the following elements:
• Input register bank
The input register bank consists of data, dynamic control signals, and two sets of delay registers. All the registers in the DSP blocks are positive-edge triggered and cleared on power up. Each multiplier operand can feed an input register or a multiplier directly, bypassing the input registers.
• Pre-adder
Each variable precision DSP block has two 19-bit pre-adders. You can configure these pre-adders in the following configurations:
-- Two independent 19-bit pre-adders
-- One 27-bit pre-adder
The pre-adder supports both addition and subtraction in the following input configurations:
-- 18-bit (signed) addition or subtraction for 18 x 19 mode
-- 17-bit (unsigned) addition or subtraction for 18 x 19 mode
-- 26-bit addition or subtraction for 27 x 27 mode
• Internal coefficients (two banks of 8 coefficients)
The DSP block has the flexibility of selecting the multiplicand from either the dynamic input or the internal coefficient. The internal coefficient can support up to eight constant coefficients for the multiplicands in 18-bit and 27-bit modes. When you enable the internal coefficient feature, COEFSELA/COEFSELB are used to control the selection of the coefficient multiplexer.
• Multipliers
One 27 x 27 multiplier, or two 18 (signed or unsigned) x 19 (signed) multipliers, or three 9 x 9 multipliers.
• Adder
You can use the adder in various sizes, depending on the operational mode:
-- One 64-bit adder with the 64-bit accumulator
-- Two 18 x 19 modes: the adder is divided into two 37-bit adders to produce the full 37-bit result of each independent 18 x 19 multiplication
-- Three 9 x 9 modes: you can use the adder as three 18-bit adders to produce three 9 x 9 multiplication results independently
• Accumulator and chainout adder
• Systolic registers
If the variable precision DSP block is not configured in systolic FIR mode, both systolic registers are bypassed.
• Double accumulation register
• Output register bank
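The bit widths quoted above fit together in a way that is easy to check numerically. The Python sketch below models one symmetric-filter tap in 18 x 19 mode: two 18-bit signed samples go through the pre-adder, the 19-bit sum is multiplied by an 18-bit signed coefficient, and the 37-bit product is added into a 64-bit accumulator. It is a behavioral model under those assumptions only; it does not model registers, cascading, or the real block's control signals, and all function names are invented for the example.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a bits-wide two's complement number."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def wrap_signed(value: int, bits: int) -> int:
    """Wrap value into a bits-wide two's complement range."""
    half = 1 << (bits - 1)
    return (value + half) % (1 << bits) - half


def symmetric_tap(x0: int, x1: int, coeff: int) -> int:
    """(x0 + x1) * coeff with 18-bit signed operands."""
    for v in (x0, x1, coeff):
        if not fits_signed(v, 18):
            raise ValueError("operands must fit in 18 signed bits")
    pre = x0 + x1            # pre-adder: 18 + 18 bits needs 19 bits
    product = pre * coeff    # 19 x 18 multiply needs 37 bits
    assert fits_signed(pre, 19) and fits_signed(product, 37)
    return product


def accumulate(acc: int, product: int) -> int:
    """Add a product into a 64-bit accumulator, wrapping on overflow."""
    return wrap_signed(acc + product, 64)
```

The worst case, `symmetric_tap(-2**17, -2**17, -2**17)`, gives 2**35, which still fits in 37 signed bits. This is why the pre-adder output is one bit wider than its inputs and the multiplier is 18 x 19 rather than 18 x 18.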
The Cyclone V I/Os support the following features:
• Single-ended, non-voltage-referenced, and voltage-referenced I/O standards
• Low-voltage differential signaling (LVDS), RSDS, mini-LVDS, HSTL, HSUL, and SSTL I/O standards
• Serializer/deserializer (SERDES)
• Programmable output current strength
• Programmable slew-rate
• Programmable bus-hold
• Programmable pull-up resistor
• Programmable pre-emphasis
• Programmable I/O delay
• Programmable voltage output differential (VOD)
• Open-drain output
• On-chip series termination (RS OCT) with and without calibration
• On-chip parallel termination (RT OCT)
• On-chip differential termination (RD OCT)
• High-speed differential I/O support
i/o pin as ADC.
Cyclone 5 FPGA memory
The memory systems of Altera Cyclone5 FPGAs have various features and limitations.
I will not talk about the HPS side here, only the FPGA side.
Memory systems include:
- M10K blocks on Cyclone5 SE A5
There are about 390 blocks (~3900 Kbits), each capable of being configured as one of:
1-bit x 8K, 2-bit x 4K, 4-bit x 2K, 5-bit x 2K, 8-bit x 1K, 10-bit x 1K, 16-bit x 512, 20-bit x 512, 32-bit x 256, 40-bit x 256.
If you instantiate bigger memories in Verilog, blocks will be automatically concatenated to build the bigger memory.
There are optional pipeline registers on data, address, and write-enable, so an M10K block read can take one or two cycles, but can be pipelined.
Dual port read/write is supported.
- MLAB blocks
Up to about 480 blocks, each holding 32 words of 16-, 18-, or 20-bit data.
MLAB does not support true dual-port RAM.
MLAB supports continuous reads. For example, when you write data at the write clock rising edge, then after the write operation is complete you see the written data at the output port without the need for a read clock rising edge.
- Logic Element Registers
Up to 128,000 bits of memory, but this uses general logic elements very quickly.
- Qsys-attached static RAM (M10K blocks)
Easy to use, bus attached memory, which can be accessed from FPGA and HPS.
Size is configured in Qsys and uses the pool of available M10K blocks.
- Qsys-attached external SDRAM
Easy to use, bus attached memory, which can be accessed from FPGA and HPS.
There is ONE actual external SDRAM available on this board.
Configured as 32 Mwords of 16-bit memory
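As a rough sizing aid for the M10K configurations listed above, the following Python sketch estimates how many M10K blocks a memory of a given width and depth would consume. It tries each width x depth configuration, tiles the requested memory with blocks of that shape, and keeps the smallest count. This is only an estimate under that simple tiling assumption; Quartus may mix configurations or choose differently, so check the fitter report for the real number. The names `M10K_CONFIGS` and `m10k_blocks_needed` are illustrative.

```python
# (data width in bits, depth in words) for one M10K block, from the list above.
M10K_CONFIGS = [
    (1, 8192), (2, 4096), (4, 2048), (5, 2048), (8, 1024),
    (10, 1024), (16, 512), (20, 512), (32, 256), (40, 256),
]


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def m10k_blocks_needed(width_bits: int, depth_words: int) -> int:
    """Smallest block count over all single-configuration tilings."""
    if width_bits < 1 or depth_words < 1:
        raise ValueError("width and depth must be positive")
    return min(
        ceil_div(width_bits, w) * ceil_div(depth_words, d)
        for w, d in M10K_CONFIGS
    )
```

For instance, an 18-bit x 512 memory fits in a single block using the 20-bit x 512 configuration, while a 32-bit x 1K memory needs four blocks in this model. Comparing the result against the roughly 390 available blocks gives a quick feasibility check before synthesis.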