CasperTutorial00

An Introduction to Field-Programmable Gate Arrays for Digital Signal Processing

1 What’s an FPGA?

A Field-Programmable Gate Array (FPGA) is a chip for performing digital logic operations. At its most basic level, an FPGA is a bunch of simple logic elements and a huge network of wires, and the connections between logic elements and wires are activated by an array of bits that is loaded onto the chip. This makes an FPGA something of a chameleon; a user can rewire the chip to do many different things, depending on the application.

The only kind of chip that competes with the FPGA for flexibility is the Central Processing Unit (CPU) that sits at the heart of every modern computer. CPUs are flexible for a completely different reason than FPGAs. CPUs are not configurable chips. Instead, they are hard-wired with a number of registers for holding numbers, and a number of supported operations for how to modify or combine the numbers in those registers. What makes a CPU flexible is its capability to execute these operations in arbitrary order, according to the instructions contained in a program in memory that was written by a user. Long ago, CPUs had only a single core and so every instruction in a program was executed sequentially, one at a time. Nowadays, CPUs have multiple cores, and several instructions can be executed in parallel. This development notwithstanding, CPUs are basically serial processors, with some parallel capability allowed by multicore technology.

In contrast, FPGAs are massively parallel. Every logic element in an FPGA is like a parallel processor. The rub, of course, is that these “processors” do very simple things, and once you configure a logic element to perform an operation, that is the only operation it can perform until the FPGA is reprogrammed. The parallelism of an FPGA makes it incredibly powerful for processing high-speed streaming data, but it also makes FPGAs hard to program. The difference in programming paradigms between CPUs and FPGAs is probably the biggest roadblock to the widespread adoption of FPGA co-processors in computers.

2 Clocks and Timing

FPGAs are spatially parallel processors. Employing spatially separated logic elements to process data in coordination with one another can greatly improve performance, but this comes at a cost. The cost is that we must look behind the curtain at how digital processors pass information around.

In digital electronics, wires carry a high voltage to signal 1 and a low voltage to signal 0. But what happens in between? Electronic signals, though they travel fast, do take time to arrive at their destination. It can also take a while for a wire carrying a 0 to ramp up its voltage high enough to signal a 1. When reading the state of a wire, a receiver needs to know when it is in a stable state. Likewise, a transmitter needs to know when it can change the state of the wire to send the next piece of information.

To coordinate the timing of transmission and reception, almost all digital electronics rely on a special signal called a clock to synchronize decision-making. A clock is a single-wire signal that toggles between high and low states at a fixed rate. Typically, every logic element in a synchronized system will make a decision about its next state on the rising edge of a clock (that is, when it goes from 0 to 1). Shortly after the rising edge of a clock, logic elements have the opportunity to look at their next inputs and decide what their next output will be. When the next rising clock edge occurs, a final decision is made, and all logic elements jump to their next states.

The foundational element of synchronization is called a register (another name for a register is flip-flop or flop for short). The simplest register (called a D flip-flop) has two inputs: data (D) and clock (clk). Its single output (Q) is assigned the state of D, read on the rising edge of clk.
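
A register's behavior is simple enough to capture in a few lines of software. The sketch below is a minimal, hypothetical Python model (nothing here runs on an FPGA itself; names are made up for illustration) of a D flip-flop: Q changes only when a rising edge is detected on clk.

    class DFlipFlop:
        """Software model of a D flip-flop: Q takes the value of D
        only on the rising edge of clk (a 0 -> 1 transition)."""

        def __init__(self):
            self.q = 0          # current output
            self._prev_clk = 0  # last clock value seen, used to detect edges

        def tick(self, d, clk):
            # A rising edge occurs when the clock goes from 0 to 1.
            if self._prev_clk == 0 and clk == 1:
                self.q = d      # capture D on the rising edge
            self._prev_clk = clk
            return self.q

    # Drive the model with a toggling clock: Q follows D, but only
    # updates at rising clock edges, not in between.
    ff = DFlipFlop()
    for clk, d in [(0, 1), (1, 1), (0, 0), (1, 0), (0, 1), (1, 1)]:
        print(f"clk={clk} d={d} -> q={ff.tick(d, clk)}")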

Logic elements do not always need to be clocked. For example, simple logic gates like AND, OR, NOT, XOR, and NAND can continuously read their inputs and create the appropriate outputs after a nominal delay. It is possible for signals to propagate several logic levels in the time between rising clock edges, but at some stage the output of one of these logic elements will have to be registered. Any further logic operations will have to be applied to the output of the register on the next clock.

2.1 Timing Diagrams: Setup and Hold

Registers cannot make decisions instantaneously. Flip-flops typically have a setup time during which the D input to the register must be stable before the rising edge of a clock, and a hold time after the rising edge of a clock during which the D input must remain fixed. Furthermore, there is a clock-to-Q time, which is the time it takes the value to propagate from D to Q following the rising edge of a clock. The details of these mechanisms are not important for this discussion, but understanding that signals have a limited amount of time to propagate from one register to another in order to satisfy these timing demands is essential to designing FPGA circuits.

A typical FPGA signal might begin as a 0/1 on the D input of a flip-flop when a rising clock edge occurs. A nanosecond later, that state will appear on the Q output of the flip-flop, whereupon it will travel down a wire (incurring a couple more nanoseconds of delay), through a few logic circuits (a few more nanoseconds), to arrive at the D input of another flip-flop in time to satisfy the setup time required by that register before the next rising clock edge.
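
To get a feel for the bookkeeping involved, the short Python sketch below adds up the delays along such a register-to-register path and checks them against the clock period, much as a compiler's timing analysis does. The delay values are invented for illustration; real numbers depend on the device and the routing.

    # Hypothetical delays, in nanoseconds; real values depend on the FPGA.
    t_clk_to_q = 1.0   # time for the value to appear on Q after the clock edge
    t_routing  = 2.0   # wire delay between the two registers
    t_logic    = 1.5   # delay through the combinational logic on the path
    t_setup    = 0.5   # setup time required by the receiving register

    clock_period = 5.0  # ns, i.e. a 200 MHz clock

    total_path_delay = t_clk_to_q + t_routing + t_logic + t_setup
    if total_path_delay <= clock_period:
        print(f"Path meets timing with {clock_period - total_path_delay:.1f} ns of slack")
    else:
        print(f"Path fails timing by {total_path_delay - clock_period:.1f} ns")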

For visualizing how signals are sent and recaptured through synchronous digital circuits, timing diagrams are a useful tool.

2.2 Pipelining: Latency vs. Throughput vs. Resource Utilization

Suppose you have just designed a circuit that takes two signals from the outside world arriving on FPGA pins, ORs them together, and one clock later, ANDs the result with a signal arriving on a third pin, then outputs the result to a fourth pin. Suppose that you want a clock period of 5 ns so that the AND operation involving the third pin happens at just the right time. After telling the compiler to target a 200 MHz clock rate, you start compiling your design, only to find that it returns with an error: your design has not met timing. What does this mean, and how can you fix it?

When a design fails to meet timing, this means that there is a signal path between two registers whose total delay through layers of logic and routing down wires exceeds the 5 ns clock period you were shooting for. Though the compiler may finish compiling your design, you will not be able to run it at 200 MHz. If you do, the behavior of your design will be indeterminate–it will output junk.

To solve this problem, you have two options. If you don’t mind running your design more slowly, you can lower your clock frequency. Unfortunately, this is not commonly an option. The usual option is to add registers to your design to break up the long signal paths (this is called pipelining). In our example, registers could be added at the inputs of the AND and OR. However, when pipelining, one must be careful to keep signals time-aligned. For example, if we place registers at both inputs of the OR, we must also place one between the third pin and the AND. Failure to do so would mean that the “OR” side of the AND block arrives one clock later than expected, so the circuit would be checking for a signal on pin 3 that is true 10 ns after pin 1 or pin 2 were true, instead of 5 ns.
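
A cycle-by-cycle software simulation makes the alignment issue concrete. The Python sketch below (hypothetical stimulus and naming; one loop iteration per clock) compares the original circuit, an "unbalanced" pipelined version that registers only the OR inputs, and a "balanced" version that also delays pin 3.

    # Hypothetical stimulus, one entry per clock cycle.  pin3 goes high one
    # clock after pin1, which is the pattern the circuit is meant to detect.
    pin1 = [1, 0, 0, 0, 0, 0]
    pin2 = [0, 0, 0, 0, 0, 0]
    pin3 = [0, 1, 0, 0, 0, 0]

    def simulate(mode):
        """Cycle-by-cycle model of the OR -> register -> AND circuit.
        mode = "original":   no extra pipeline registers
        mode = "unbalanced": pins 1/2 registered before the OR, pin 3 not delayed
        mode = "balanced":   pins 1/2 registered, and pin 3 given a matching register
        """
        reg_or = reg_p1 = reg_p2 = reg_p3 = 0   # all registers start at 0
        outputs = []
        for p1, p2, p3 in zip(pin1, pin2, pin3):
            if mode == "original":
                or_result = p1 | p2
                and_result = reg_or & p3
            elif mode == "unbalanced":
                or_result = reg_p1 | reg_p2
                and_result = reg_or & p3        # pin 3 now arrives one clock too early
            else:  # "balanced"
                or_result = reg_p1 | reg_p2
                and_result = reg_or & reg_p3    # pin 3 delayed to stay aligned
            outputs.append(and_result)
            # Rising clock edge: every register captures its next value at once.
            reg_or, reg_p1, reg_p2, reg_p3 = or_result, p1, p2, p3
        return outputs

    for mode in ("original", "unbalanced", "balanced"):
        print(f"{mode:10s}: {simulate(mode)}")

Running this, the original version detects the event on the second clock, the balanced version detects it one clock later (the extra clock of latency introduced by pipelining), and the unbalanced version misses it entirely because its two paths are no longer time-aligned.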

Pipelining a design can help you reach higher clock rates, but it comes at a cost. Registers are plentiful on FPGAs, but they do eventually run out. Pipelining a large design raises its resource utilization, which can result in a design demanding more physical resources than are available on the chip (and an associated compiler error). Pipelining also raises the latency of a design–the number of clocks it takes data to flow from input to output. For applications that require fast response to a stimulus, too much latency can pose a problem. However, many applications only require throughput, which is the total sustained rate at which data flows through a design. Throughput is determined by clock rate, and so can be improved by pipelining a design.

When designing for an FPGA, a programmer must balance resource utilization and throughput by carefully pipelining a design. This involves adding latency to long signal paths, but avoiding unnecessary pipelining that consumes FPGA resources. A bizarre paradox of FPGA design is that consuming too many physical resources forces a design to be spread out across the entire chip. The wide physical spacing between circuit components incurs routing delays as signals are forced to travel farther. Higher routing delays lower the clock rate at which a design can be compiled, and hence, lower throughput. Thus, it is actually possible for excessive pipelining to decrease throughput.

3 Bits, Bytes, and Beyond

3.1 A Bit is a Wire

3.2 Value vs. Representation

3.3 The Binary Point

In decimal, a number is formed by multiplying each digit by a power of 10. We know which digit is multiplied by 10^0 = 1 because it lies immediately to the left of the decimal point, for example:

    123.4 = 1×10^2 + 2×10^1 + 3×10^0 + 4×10^-1

In binary, the digit multiplied by 2^0 = 1 lies immediately to the left of the binary point:

    101.01 = 1×2^2 + 0×2^1 + 1×2^0 + 0×2^-1 + 1×2^-2 = 5.25

When programming for FPGAs, we often need to know the specifics of a number, namely how many bits it is made up of and where the binary point lies. Simulink displays this information using the format n_m, where n is the bit width of the number and m is how many bits lie to the right of the binary point. For example, 101.01 would be a 5_2 number. A data type is called fixed-point when the binary point is fixed in one location.
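As a quick sanity check on the n_m notation, the short Python sketch below (illustrative only; the function name is made up) interprets a string of bits as an unsigned fixed-point number with m bits to the right of the binary point.

    def fixed_point_value(bits, m):
        """Interpret a bit string as an unsigned fixed-point number
        with m bits to the right of the binary point."""
        return int(bits, 2) / (2 ** m)

    # "101.01" is a 5_2 number: 5 bits total, 2 to the right of the binary point.
    print(fixed_point_value("10101", 2))   # prints 5.25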

3.4 Unsigned vs. Signed Representations

In mathematics, we represent a negative number by adding a ’-’ character to the front of it. A computer deals only with 0’s and 1’s, so it must use another method. One solution is to simply use the most-significant bit to represent the sign, a method known as sign-magnitude. This method has problems: the number zero is represented twice (+0 and -0 have different representations), and adders and subtracters require extra circuitry to handle negative numbers.

A better solution is a method known as two’s complement. This is what most computers use. To negate a number in this system, you invert the number bit-wise (change 1’s to 0’s and vice versa) and then add 1 to result. The result of these rules is that the "higher half" of the unsigned numbers wrap around and become the negative numbers. With 3 bits, we have:

Bits | unsigned | sign-magnitude | two's complement
000  |    0     |       0        |        0
001  |    1     |       1        |        1
010  |    2     |       2        |        2
011  |    3     |       3        |        3
100  |    4     |      -0        |       -4
101  |    5     |      -1        |       -3
110  |    6     |      -2        |       -2
111  |    7     |      -3        |       -1

Note that the two’s complement system only represents zero once. What’s more, adding and subtracting two’s complement numbers the normal way (as if they were unsigned) produces the correct result. In Simulink, only unsigned and two’s complement numbers are supported.
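
The invert-and-add-one rule, and the fact that ordinary unsigned addition still gives the correct signed result, can be checked with a short Python sketch (illustrative only, using the 3-bit numbers from the table above):

    BITS = 3
    MASK = (1 << BITS) - 1   # 0b111 for 3 bits

    def negate(x):
        """Two's complement negation: invert all bits, then add 1."""
        return ((x ^ MASK) + 1) & MASK

    def to_signed(x):
        """Interpret a 3-bit pattern as a two's complement value."""
        return x - (1 << BITS) if x & (1 << (BITS - 1)) else x

    # Negating 3 (011) gives 101, which is -3 in two's complement.
    print(bin(negate(0b011)), to_signed(negate(0b011)))   # 0b101 -3

    # Adding the bit patterns "as if unsigned" still gives the right signed answer:
    # 2 + (-3) = -1, i.e. 010 + 101 = 111.
    total = (0b010 + 0b101) & MASK
    print(bin(total), to_signed(total))                    # 0b111 -1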

4 What Else?

4.1 Slices

4.2 Block RAMs

A Block RAM (BRAM for short) is random-access memory on an FPGA. RAM allows you to write data of a fixed size to addresses that you can then read later. Addresses are unsigned integers that simply count through locations. For example, suppose we had a RAM with 8 locations, each of which can hold a signed 32-bit number:

Address (decimal) | Address (binary) | Value
0                 | 000              | ?
1                 | 001              | ?
2                 | 010              | ?
3                 | 011              | ?
4                 | 100              | ?
5                 | 101              | ?
6                 | 110              | ?
7                 | 111              | ?

Initially, the contents of this memory are unknown. However, if we wrote a "-3" to address location "6", our memory would then read:

Address (decimal) | Address (binary) | Value
0                 | 000              | ?
1                 | 001              | ?
2                 | 010              | ?
3                 | 011              | ?
4                 | 100              | ?
5                 | 101              | ?
6                 | 110              | -3
7                 | 111              | ?

Thereafter, any time we read from address "6", we would receive a value "-3", until we wrote a new value to address "6". BRAMs often have a 1-bit port called write-enable (we for short) that controls whether you want to write to a location or read from it. If you choose to write (by passing a "1" to we), the current value of the data_in port is written to the address specified by the current value of the addr port. If you choose to read, the current value of the data_in port is ignored, and instead the value at the location specified by the addr port is copied to the data_out port.

Some BRAMs allow you to perform a read and a write at the same time. In this case, we should be "1", and the data_out port will retrieve the value that was at the specified address just before the new value was written there. Alternatively, you can configure BRAMs to "write-before-read", in which case the memory location is overwritten before the read occurs and data_out returns the newly written value.
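
The port behavior described above can be sketched in software. The Python model below is a hypothetical illustration (the class and method names are made up): a single-port RAM with addr, data_in, and we inputs, operating in the default read-before-write mode.

    class BlockRAM:
        """Software model of a simple single-port BRAM in
        read-before-write mode: data_out always shows the value stored
        at addr *before* any write in the same cycle takes effect."""

        def __init__(self, depth):
            self.mem = [None] * depth   # contents are unknown at power-up

        def cycle(self, addr, data_in, we):
            data_out = self.mem[addr]   # read the old value first
            if we:                      # write-enable high: store data_in
                self.mem[addr] = data_in
            return data_out

    ram = BlockRAM(depth=8)
    print(ram.cycle(addr=6, data_in=-3, we=1))   # None: old contents unknown; -3 is now stored
    print(ram.cycle(addr=6, data_in=0,  we=0))   # -3: reads back the value written above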

4.3 Multipliers/DSPs

4.4 CPUs