11. Performance Measures
Much of the attention of real-world engineers is devoted to the exploration of alternative approaches to solving a problem at hand, each of which involves different tradeoffs between various cost and performance parameters. To approach these engineering tradeoffs systematically, it is important to have quantitative metrics of each.

Consider a digital system that performs some useful function: perhaps it plays digitally encoded music, shows a movie on a screen, or processes the payroll for a large corporation. Each of these applications has some set of performance requirements: the music player needs to generate output at some sampling rate in real time, and the payroll program must get the checks cut in time for payday. How these application-level requirements translate to choice of specific system structures and algorithms is the stuff of engineering.
11.1. Latency versus Throughput
To approach such questions, we abstract requirements on the performance of our digital components a bit. Consider a digital building block that computes some function $P(x)$ for a given input value $x$. Given several implementations of devices computing $P(x)$, how do we compare their performance?

One obvious criterion is the answer to the simple question: given $x$, how long does it take to compute $P(x)$? In the case of combinational circuits, the answer is captured in the propagation delay specification $t_{PD}$. But we need to generalize the measure to arbitrary circuits containing clocked components and running in the discrete time that results from dividing time into clock cycles. By this metric, the device that can produce the result $P(x)$ from an input $x$ in the shortest time is clearly the performance winner.
Depending on our application, however, this may not be the performance parameter that interests us. If we have a large number of independent inputs $x_1, x_2, ...$ that we need to process into results $P(x_1), P(x_2), ...$ we might care more about the rate at which a given device can compute $P(x)$ values from consecutive values of $x$.
Definition: The latency of a digital system is an upper bound on the interval between valid inputs to that system and its production of valid outputs.
The latency of a system is thus the time it takes to perform a single computation, from start to finish. The latency of a combinational device is simply its propagation delay $t_{PD}$; for clocked devices, the latency will be a multiple of the clock period.
Definition: The throughput of a digital system is the rate at which it computes outputs from new inputs.
The throughput of a system is the rate at which it accepts inputs, or, equivalently, the rate at which it produces outputs.
Since a combinational circuit can only perform a single computation at a time, it follows that the maximum throughput of a combinational circuit whose latency is $L$ is simply $1/L$, corresponding to the processing of a sequence of inputs by starting each new computation as soon as the prior computation has finished. The throughput measure becomes more interesting when we consider sequential circuits which may be performing portions of several independent computations simultaneously, a topic we visit shortly in Section 11.5.
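To make the relationship concrete, here is a tiny Python sketch of the $1/L$ bound, using a hypothetical $45ns$ combinational delay (an assumed figure; it matches the example that reappears in Section 11.5):

```python
# Hypothetical combinational device with a 45 ns propagation delay.
latency_ns = 45.0

# A combinational circuit performs one computation at a time, so a new
# input can be accepted only after the prior result is complete: the
# best achievable throughput is the reciprocal of the latency.
throughput_mhz = 1e3 / latency_ns   # 1/ns = 1000 MHz, so scale by 1e3

print(round(throughput_mhz, 2))     # roughly 22 million results/second
```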
11.2. Ripple-carry Binary Adder
Consider the problem of adding two $N$-bit binary numbers, for some constant width $N$. Although our focus here is on simple unsigned $N$-bit binary operands, any adder that works for $N$-bit unsigned binary operands will also correctly add signed $N$-bit operands represented in the two's complement form described in Section 3.2.1.2, a major attraction of the latter representation scheme.

An $N$-bit binary adder has at least $2 \cdot N$ bits of input and $N+1$ output bits, considering the two $N$-bit operands and a single output whose maximum value is twice the value of either operand. For reasonable input widths -- 16, 32, or 64-bit operands are common -- a combinational $N$-bit adder is a complex design problem. Fortunately, we can exploit symmetries in the structure of the problem to build viable $N$-bit adders by replicating a simple building block for each bit position of the sum.
11.2.1. Single-bit Adders
As a step toward adding $N$-bit binary numbers, consider the problem of adding a single bit position of two $N$-bit binary numbers $A$ and $B$.

Half Adder
Although a useful building block, the half-adder isn't quite the single-bit building block we need to replicate to make an $N$-bit adder. The half adder adds two single-bit inputs, producing a 2-bit sum: it produces a carry output for the next higher bit position, but lacks the additional input to accept a carry produced from an identical module in the next lower bit position. A bit of reflection convinces us that for a single-bit adder module to be cascadeable to add $N$-bit operands, carry bits must propagate between adjacent single-bit modules from low to high order bit positions.
Full Adder
Full Adder Implementation
In fact, this carry propagation path is the performance-limiting feature of the ripple-carry adder. Suppose the propagation delay specification for our full adder is $t_{FA}$. Recall that this specification is an upper bound on the delay propagating any valid input to a valid output. Then the latency of our combinational $N$-bit ripple-carry adder -- its propagation delay -- is $N \cdot t_{FA}$, reflecting a worst-case propagation path from (say) $B_0$ to $S_{N-1}$. If we are interested in minimizing the latency of our $N$-bit adder -- which typically we are -- it is useful to specify the timing of our full adder in more detail. In particular, we might specify separate propagation delays for each input/output path, and ask how each affects the latency of the ripple-carry adder. If we enumerate all input/output paths through our $N$-bit adder, we find paths that include $N$ $C_{in} \rightarrow C_{out}$ delays, but at most one delay to a sum output bit. We conclude that, at least from this standpoint, the $C_{in} \rightarrow C_{out}$ propagation delay is the critical specification of our full adder implementation.
N-bit Adder
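The structure just described is easy to capture in executable form. The following Python sketch is a behavioral model (not a gate-level design): it builds an $N$-bit ripple-carry adder out of a single-bit full adder, with the carry rippling from the low-order to the high-order position:

```python
def full_adder(a, b, c_in):
    """Single-bit full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def ripple_carry_add(a_bits, b_bits):
    """Add two equal-width bit lists (LSB first), returning the
    (N+1)-bit sum. The carry chain is the worst-case path: each
    stage waits on the carry from the stage below it."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)          # final carry is the high-order sum bit
    return out

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

# 4-bit example: 13 + 11 = 24, which needs the fifth (carry) output bit
print(from_bits(ripple_carry_add(to_bits(13, 4), to_bits(11, 4))))  # 24
```

An exhaustive check over all pairs of 4-bit operands confirms the model agrees with integer addition, including the carry out of the top bit.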
11.3. Asymptotic cost/performance bounds
It is often useful to characterize the behavior of cost or performance dimensions of a design as some parameter of the design, such as the size in bits of its inputs, becomes large. We might, for example, characterize the latency of a 2-input ripple-carry adder as growing in proportion to the number of bits being added. It is useful to abstract this growth rate, which reflects the architectural limits of the ripple-carry approach, from the actual propagation delay numbers of a particular adder design; the latter reflects particular technology choices as well as the architecture.

A common way to capture the abstracted growth rate of a cost or performance specification as some parameter (say $n$) becomes large uses one of several variations of the "order of" notation defined below:
Definition: We say $g(n) = \Theta(f(n))$, or "$g(n)$ is of order $f(n)$", if there exist constants $C_2 \geq C_1 \gt 0$ such that for all but finitely many integral $n \geq 0$, $$C_1 \cdot f(n) \leq g(n) \leq C_2 \cdot f(n)$$ We use the notation $g(n) = O(f(n))$ to specify that $g(n)$ satisfies the second of these inequalities for some $C_2$, but not necessarily the first.
The $\Theta(...)$ notation specifies both upper and lower bounds on the growth it describes, while the weaker $O(...)$ constraint specifies only an upper bound. Thus the $O(...)$ notation is useful primarily for specifying upper bounds on costs (such as the time or space taken by an algorithm), and is commonly used in such contexts. The $\Theta(...)$ notation is useful for specifying tight bounds on both costs (which we'd like to minimize) and performance parameters like throughput (which we'd like to maximize).
Note that the constant factors $C_1$ and $C_2$ in these definitions allow multiplicative constants to be ignored, while the finitely many exceptions clause effectively ignores additive constants in the growth characterization. One is left with a formula that captures the algorithmic form of the growth as a function of the parameter -- for example, whether it is linear, quadratic, polynomial, or exponential -- while ignoring complicating detail.
Some examples:
- $n^2+2\cdot n+3 = \Theta(n^2)$, since $$n^2 \le n^2+2\cdot n+3 \le 2\cdot n^2$$ "almost always" (i.e., with finitely many exceptions);
- $\log_2(n+7) = \Theta(\log n)$ and $\log_{10}(n+88) = \Theta(\log n)$, since the base of the logarithm and the added constants amount to constant factors;
- ${37589 ^ {34 \cdot 577} \over {\sqrt {17}}} = \Theta(1)$, since its value, however large, is a constant independent of $n$;
- $n^2 = O(n^3)$, but not $\Theta(n^3)$, as the asymptotic growth of $n^3$ is an upper bound on that of $n^2$ but not a lower bound.
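The first example above can be spot-checked numerically. A short Python sketch confirms that the bounds with $C_1 = 1$ and $C_2 = 2$ hold for every $n \geq 3$, the values $n = 0, 1, 2$ being the "finitely many exceptions":

```python
def g(n):
    return n * n + 2 * n + 3

# Check C1 * n^2 <= g(n) <= C2 * n^2 with C1 = 1, C2 = 2 over a
# large range, collecting the values of n where the bounds fail.
exceptions = [n for n in range(10_000)
              if not (n * n <= g(n) <= 2 * n * n)]

print(exceptions)   # [0, 1, 2] -- only finitely many violations
```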
11.4. Asymptotic latency of N-bit Ripple Carry Adders
Returning to our N-bit ripple-carry adder, we quickly recognize that its asymptotic latency is $\Theta(N)$ owing to the worst-case propagation path along the carry chain from low- to high-order full adder modules. We have noted that the constant of proportionality is dictated by the propagation delay to the carry output of the full adder; minimization of this delay will improve the performance of the adder, but only by a constant factor. Improving the asymptotic performance, say to $\Theta(\log N)$, requires a more serious improvement to the architecture of the adder. We will explore such improvements shortly.

Before doing so, however, it is instructive to generalize our N-bit ripple-carry adder to more than two operands. Given the $\Theta(N)$ latency of our $N$-bit ripple-carry adder, it might seem sensible to add a large number $M$ of $N$-bit operands using a tree of $N$-bit adders, much like the tree of gates described in Section 7.7.2. If the latency of a two-operand $N$-bit adder is $O(N)$, we could reason that a tree of such adders of depth $d$ would add $M = 2^d$ $N$-bit operands in $O(d \cdot N)$ time, yielding an $O(N \cdot \log M)$ latency bound for our $M$-operand, $N$-bit adder.
Consider a simple cascade of $N$-bit adders to compute $S = A+(B+(C ... +Z) ... )$ for some number of $N$-bit operands $A, ..., Z$, as shown on the left. Again, given a cascade of $M$ $N$-bit adders each having $O(N)$ latency, we might expect the latency of an $M$-operand adder constructed this way to exhibit $O(N \cdot M)$ latency. While $O(N \cdot M)$ is a valid upper bound, a closer look reveals that there is no actual input-output path through this circuit whose cumulative propagation delay grows as $O(N \cdot M)$. This suggests that a more detailed analysis might lead to a tighter bound on the latency of the circuit.
NxM Ripple Carry Adder
We conclude, from this observation, that the worst-case input-output propagation delay grows in proportion to the sum of the width and height of the array. Since the width grows in proportion to $N$ and the height in proportion to $M$, we conclude that the longest propagation path -- the critical path that bounds the latency -- grows as $O(N + M)$ rather than $O(N \cdot M)$. Since we can actually demonstrate a path whose cumulative propagation grows as $N+M$, this bound is tight; thus we are entitled to characterize the latency of this adder as $\Theta(N+M)$, using $\Theta$ rather than $O$ to indicate both upper and lower bounds on the latency.
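One way to check this kind of critical-path reasoning is with a small arrival-time computation. The Python sketch below models the adder array abstractly as an $M \times N$ grid of unit-delay cells, each of which must wait for the cell to its left (its carry-in) and the cell above it (the sum bit from the preceding adder); the unit delay and the grid abstraction are illustrative assumptions, not a specific implementation:

```python
def critical_path(m_rows, n_cols, t_cell=1):
    """Worst-case arrival time at the last cell of an M x N array of
    full-adder cells. Each cell's output settles one cell delay after
    the later of its left neighbor (carry) and upper neighbor (sum)."""
    t = [[0] * n_cols for _ in range(m_rows)]
    for r in range(m_rows):
        for c in range(n_cols):
            left = t[r][c - 1] if c > 0 else 0
            above = t[r - 1][c] if r > 0 else 0
            t[r][c] = max(left, above) + t_cell
    return t[m_rows - 1][n_cols - 1]

# The longest path grows as the sum, not the product, of the dimensions:
print(critical_path(4, 8))     # 11 = 4 + 8 - 1
print(critical_path(16, 16))   # 31 = 16 + 16 - 1
```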
11.5. Pipelined Circuits
Consider a combinational circuit to compute some function $P(X) = H(F(X), G(X))$ given the value of the input $X$.

Combinational P(X)
One observation we might make about our combinational circuit is that each of its components spends much of the $45ns$ propagation delay idle.
P(X) Signal Timing
We might be tempted to supply a new input $X_{i+1}$ during the second phase of the computation of $P(X_i)$, getting the jump on a new computation by the $F$ and $G$ modules while $H$ is completing the previous computation. However, a new input $X_{i+1}$ will contaminate the outputs $F(X_i)$ and $G(X_i)$ that are being used by $H$, which depends on valid inputs for the duration of the second phase. In order to guarantee a valid $P(X_i)$ value, we need to maintain valid, stable values at the inputs of $H$ during its entire propagation delay.
However, we can use relatively cheap registers to maintain stable values rather than combinational logic with stable inputs.
Pipelined P(X)
If the timing parameters of the registers are negligible compared to the delay of other modules, we can do an approximate analysis assuming "ideal" (zero-delay) registers. Our first observation is that the minimum clock period that can be used for this pipelined circuit must be long enough to accommodate the longest combinational path in the circuit: in this case, the two paths through the $25ns$ $H$ module. Thus the fastest we can clock this circuit corresponds to a clock period of $25ns$, or a clock frequency of $40$ MHz. Since it takes two clock cycles to perform the computation of $P(X_i)$, the latency of our pipelined circuit is now $50ns$, somewhat longer than the $45ns$ latency of the original combinational circuit.
However, this pipelined circuit has the virtue that we can begin a new computation by supplying a new input on every clock cycle.
Pipelined P(X) Timing
If we keep our pipeline full by supplying a new $X_i$ at each $i^{th}$ clock cycle (and reading the corresponding $P(X_i)$ at clock cycle $i+2$), the rate at which we are performing computations -- the throughput of our pipeline -- is just the clock frequency of $40$ MHz (the reciprocal of our $25ns$ clock period). Thus adding registers to convert our combinational circuit to a clocked pipelined version degraded the latency but improved the throughput, a tradeoff typical of pipelined circuits.
Any combinational circuit can be converted into a pipelined circuit performing the same computation by adding registers at carefully selected points. The addition of the registers lengthens input-output paths, so pipelining actually increases latency; however, judicious selection of register locations can improve throughput significantly. The magic by which pipelining improves throughput involves (a) enabling a form of parallel computation by allowing the pipelined circuit to work on several independent computations simultaneously; (b) utilizing circuit components for a larger fraction of the time; and (c) breaking long combinational paths, allowing a high clock frequency.
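The cycle-by-cycle behavior is easy to demonstrate with a small simulation. The Python sketch below assumes ideal (zero-delay) registers and uses hypothetical stand-ins for $F$, $G$, and $H$; the point is only the timing pattern: one new input accepted per clock, each result emerging two cycles later.

```python
# Hypothetical stand-ins for the combinational modules.
F = lambda x: x + 1
G = lambda x: 2 * x
H = lambda u, v: u + v

def run_pipeline(inputs):
    """Simulate the 2-stage pipeline: stage 1 latches F(X) and G(X);
    stage 2 latches H of those registers. Each loop iteration is one
    clock cycle; the register updates model the active clock edge."""
    reg_f = reg_g = reg_out = None
    observed = []
    for x in list(inputs) + [None, None]:      # two cycles to drain
        observed.append(reg_out)               # output visible this cycle
        new_out = H(reg_f, reg_g) if reg_f is not None else None
        new_f, new_g = (F(x), G(x)) if x is not None else (None, None)
        reg_f, reg_g, reg_out = new_f, new_g, new_out
    return [o for o in observed if o is not None]

# P(x) = H(F(x), G(x)) = 3x + 1; one result per cycle, two cycles late
print(run_pipeline([1, 2, 3]))   # [4, 7, 10]
```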
11.5.1. Well-formed Pipelines
Simply adding registers at arbitrary locations within a combinational circuit does not generally increase throughput; in fact, indiscriminate insertion of registers will likely change the computation performed by the circuit in undesirable ways.
Ill-formed pipeline
The problem illustrated by this ill-formed pipeline is that it fails to cleanly separate values associated with one set of inputs from those of another: values from successive computations interfere with each other as they progress through the circuit. Happily, a simple discipline allows us to avoid this problem: a necessary and sufficient condition for a well-formed pipeline is that it have an identical number of registers on each input-output path. By this standard, our failed example is not a well-formed pipeline.
We formalize this fact in the following definition:
Definition: A K-Stage Pipeline ("$K$-pipeline", for short) is an acyclic circuit having exactly $K$ registers on every path from an input to an output.
Note that by this definition, a combinational circuit is a $0$-pipeline.
It is useful to view a $K$-stage pipeline for $K>0$ as $K$ cascaded pipeline stages, each consisting of combinational logic with a single register on each output. This view leads to our convention that every pipelined circuit ($K$-pipeline for $K>0$) has one or more registers on every output. Such a convention is useful for a number of reasons; it allows, for example, a $k$-pipeline and a $j$-pipeline to be cascaded to create a $(k+j)$-pipeline.
For our synchronous pipelines (adhering to the single-clock synchronous discipline), the period $t_{CLK}$ of the shared clock must be sufficient to cover the longest combinational delay (plus register propagation and setup times). In this case,
- the latency of the $K$-pipeline is $K \cdot t_{CLK}$;
- the throughput of the $K$-pipeline is the frequency of the shared clock, or $1 / t_{CLK}$; and
- inputs for the $i^{th}$ computation are presented during clock cycle $i$, and corresponding outputs are available during clock cycle $i+K$.
11.6. Pipeline Methodology
Given the constraint that a well-formed pipeline has the same number of registers on every input-output path, it is useful to develop a systematic method to convert an arbitrary combinational circuit to a well-formed pipeline offering improved throughput. While there are many possible approaches to this task, we sketch here an intuitive, informal technique that serves simple pipelining problems well.
Combinational Version
11.6.1. Ensuring well-formedness
Note that each component of the combinational example is marked with its propagation delay, and that the diagram is laid out so that all signal propagation is left-to-right. While the latter constraint is not essential to our technique, it makes the pipelining task easier and less error-prone.
3-Stage Pipelined Version
Each contour we draw identifies the output edge of a pipeline stage; thus, the first contour we establish is one that intersects all outputs of the combinational circuit. Subsequent contours are strategically chosen to break long combinational paths, improving throughput. We then make a pipelined version of our combinational circuit by inserting a register at every point where a signal line crosses a pipeline contour. Every input-output path thus crosses each contour exactly once; given $K$ contours, the result is necessarily a well-formed $K$-pipeline.
Notice that, with our choice of pipeline boundaries leading to the 3-stage pipeline shown, the longest combinational path (and consequently the minimum clock period) is $8ns$. Given $3$ stages at an $8ns$ clock period, our pipeline has a latency of $3 \cdot 8ns = 24ns$ and a throughput of $1 / 8ns = 125 MHz$.
To summarize our technique for generating a $K$-stage pipeline from a combinational circuit:
- Draw a first pipeline contour that intersects every output of the combinational circuit. This provides a template for implementation of a $1$-stage pipeline, usually an uninteresting implementation choice.
- Draw additional contours determining boundaries between pipeline stages. Each contour must (a) partition the circuit by intersecting signal lines, and (b) intersect lines so that every signal crosses the contour in the same direction. Each additional contour increases the number of pipeline stages by one.
- Implement the pipelined circuit by inserting a register at every point in the combinational circuit where a signal line crosses a pipeline contour.
- Choose a minimal clock period $t_{CLK}$ (equivalently, a maximal clock frequency) sufficient to cover the longest combinational path within the circuit. Unless ideal registers are assumed, such paths must include the propagation delay of source registers as well as setup times of destination registers.
- The result is a $K$-stage pipeline (where $K$ is the number of contours) whose throughput is $1 / t_{CLK}$ and whose latency is $K \cdot t_{CLK}$.
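The arithmetic in the last two steps can be packaged as a small helper. This Python sketch takes the longest combinational delay within each stage; the stage delays below are assumed values matching the $8ns$ bottleneck of the 3-stage example, with ideal registers by default:

```python
def pipeline_metrics(stage_delays_ns, t_reg_ns=0.0, t_setup_ns=0.0):
    """Return (t_clk, latency, throughput_MHz) for a K-stage pipeline,
    where K = len(stage_delays_ns). The clock period must cover the
    slowest stage plus register propagation delay and setup time."""
    t_clk = max(stage_delays_ns) + t_reg_ns + t_setup_ns
    k = len(stage_delays_ns)
    return t_clk, k * t_clk, 1e3 / t_clk   # 1/ns scaled to MHz

# Three stages whose longest internal paths are 8, 7, and 6 ns:
print(pipeline_metrics([8, 7, 6]))   # (8, 24, 125.0)
```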
11.6.2. Optimizing Throughput
Not every well-formed pipeline offers a performance improvement over its combinational counterpart, or over alternative pipelines with fewer stages (and hence less hardware and performance cost). In general, it is worth adding a pipeline stage only when that stage reduces the critical (longest delay) combinational path, and consequently allows a faster clock and correspondingly higher throughput.
11.7. Hierarchical Pipelines
The slowest-component bottleneck illustrated in the previous section can be mitigated by replacing slow combinational components with pipelined equivalents. Suppose, in the example of the previous section, it were possible to replace the combinational $A$ component, whose latency of $2$ constrains the clock period, with a $2$-stage pipelined $A'$ module capable of operating at a clock period of $1$.

In general, it is straightforward to incorporate pipelined components into pipelined systems. Given a pipelined system whose performance is limited by the propagation delay of a specific combinational component, pipelining that bottleneck component is often the preferred solution.
11.7.1. Component Interleaving
It is not always practical to pipeline a bottleneck combinational component in a design, for technical or practical reasons relating, for example, to the inaccessibility of the internals of a critical component. The bottleneck $A$ module of the prior example might be a black box library module whose supplier offers no pipelined equivalent nor grants engineering access to its implementation details. Such a constraint would seem to restrict us to a clock period of 2 and correspondingly low throughput of our system.

Fortunately, there are alternative strategies to get around this performance bottleneck. We begin with the simple observation that if one instance of some component $X$ offers throughput $T_X$, we can use $N$ instances of $X$ running in parallel to achieve an aggregate throughput of $N \cdot T_X$. We may need some surrounding logic to route inputs to and outputs from our farm of parallel $X$ modules, but clearly the throughput of any module can be scaled up by simple replication.
Armed with this insight, we might ask how to double the throughput of the bottleneck $A$ module in our example by using two $A$ modules rather than a single one. It would be convenient, for example, if we could use two $A$ modules to build a plug-in replacement for $A$ whose external behavior is precisely that of a 2-stage pipelined $A$ module. Such a replacement could simply be used as a component in the 4-stage pipelined approach shown above.
2-way Interleaved Module
The two latches connected to the $X_i$ input latch input data on even and odd cycles, respectively, and hold the data as stable output for the nearly two clock cycles it takes their $A$ module to compute a valid result. During an odd clock cycle, $Q=1$ and the current $X_i$ input is routed through the lower latch and $A_1$, stopping at the output mux (purple arrow); meanwhile the $X_{i-1}$ input latched during the previous clock cycle remains in the upper latch, where it propagates through $A_0$ and the output mux (yellow arrow) to be loaded into the output register at the next active clock edge. The roles of the latches alternate every cycle, giving each $A$ component the required two cycles to complete its computation. Consistent with the specifications of a two-stage pipeline, however, the 2-way interleaved module takes a new input on every clock cycle and produces the corresponding output two cycles later.
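The alternating behavior can be sketched behaviorally in Python. Here $f$ stands for the slow $A$ module, which needs (nearly) two clock cycles of stable input; the two latches are modeled as a two-element array indexed by cycle parity, with latch transparency and mux details abstracted away:

```python
def interleaved(inputs, f):
    """2-way interleaving: two copies of a slow module f alternate on
    even and odd cycles. Externally this behaves like a 2-stage
    pipeline: one input accepted per cycle, f(x) available two later."""
    latch = [None, None]        # input held stable for each copy of f
    out_reg = None
    observed = []
    for i, x in enumerate(list(inputs) + [None, None]):
        observed.append(out_reg)        # output visible this cycle
        # the copy loaded on the previous cycle has had its settling
        # time; the mux selects it and the output register captures it
        done = latch[(i - 1) % 2]
        out_reg = f(done) if done is not None else None
        latch[i % 2] = x                # this copy now holds x
    return [o for o in observed if o is not None]

print(interleaved([1, 2, 3], lambda x: x * x))   # [1, 4, 9]
```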
N-way Interleaving
11.7.2. Pipeline Issues
Any combinational circuit can be partitioned into well-formed pipeline stages, with potential gains in throughput. Since the throughput of a pipelined circuit is constrained by its clock frequency, we generally maximize throughput by minimizing the longest combinational path within any pipeline stage. The clock period chosen for such a pipeline must be sufficient to cover this propagation delay plus the propagation delay and setup time of the registers used for the pipelining.

As we increase the number of pipeline stages to improve throughput, the cost and delay of the pipeline registers themselves may dominate other design parameters. Consider a complex combinational circuit constructed entirely from 2-input NAND gates having a $t_{PD}$ of 10ps. Pipelining this circuit for maximal throughput would place at least one pipeline stage boundary between every pair of connected gates, allowing the resulting circuit to be clocked with a period of 10ps plus the register $t_{PD}$ and setup time. If these were 35ps and 5ps, respectively, the minimum clock period would be 50ps, for a throughput of 20 GHz. This design would be very costly, owing to the number of flip flops required. Allowing for a 2-gate delay in each stage would cut the number of stages in half, yet allow a 60ps clock cycle for a somewhat lower throughput at dramatically reduced cost.
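The arithmetic behind this tradeoff is worth a quick sketch. Using the numbers above (10ps gates, 35ps register $t_{PD}$, 5ps setup time), this Python fragment computes the clock period and throughput as a function of how many gate delays each stage contains:

```python
def stage_timing(gates_per_stage, t_gate_ps=10, t_reg_ps=35, t_setup_ps=5):
    """Return (clock period in ps, throughput in GHz) for a pipeline
    whose stages each contain the given number of series gate delays
    plus register propagation and setup overhead."""
    t_clk = gates_per_stage * t_gate_ps + t_reg_ps + t_setup_ps
    return t_clk, 1e3 / t_clk       # 1/ps scaled to GHz

print(stage_timing(1))   # (50, 20.0): maximal pipelining, most registers
print(stage_timing(2))   # (60, ...): half the registers, ~17 GHz
```

Note that because register overhead (40ps here) dominates the single 10ps gate delay, doubling the logic per stage costs only a modest fraction of the throughput while halving the register count.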
Combinational P(X,Y)
11.8. Alternative Timing Disciplines
As we combine sequential (clocked) components to make bigger sequential systems, we need an engineering discipline to control the timing of operations by each component, and to integrate their effects into a coherent system behavior. While our examples will focus on simple, widely used approaches, it is worth noting that there are reasonable alternatives whose usefulness and popularity vary among application arenas, engineering cohorts, and chapters of the history of digital systems engineering. In this section, we briefly sketch the universe of timing disciplines and several interesting variants on our simple synchronous approaches.
11.8.1. Synchrony vs. Localization
In our taxonomy of control disciplines, we distinguish between two separable engineering choices:
- synchronous versus asynchronous systems, the distinction being whether signal changes are synchronized with respect to a common clock (synchronous) or may happen at arbitrary times (asynchronous);
- globally versus locally (or self) timed systems, an architectural choice between centralized logic for controlling the timing of component operations and distributing the logic among the components themselves.
11.8.2. Synchronous Globally-timed Systems
SGT System
Individual components may have control signals, whose changes are synchronized with the common clock. A simple example is the Load Enable input that dictates whether a register loads new contents at the next clock edge. The globally-timed aspect of this approach reflects a centralized source of these control signals, typically an FSM.
11.8.3. Synchronous Locally-timed Systems
Example SLT Protocol
SLT Signal Timing
Locally-timed systems generally involve additional logic that combines component timing signals to stimulate appropriate system-level timing. A glimpse of such logic appears in Section 11.8.4.1.
The elegant idea that underlies locally-timed systems is that the time taken for each operation is determined by the component that performs that operation, rather than by some externally specified plan. Each component will begin a new operation when modules supplying input data signal that the required inputs are ready, and will signal modules depending on its output when the computation has completed. Potential advantages of locally-timed system architectures include
- The timing of an operation, along with the implementation details of that operation, is enclosed in a single component. That component can be viewed externally as a "black box": knowledge of its internal details and timing is unnecessary for its effective use as a component. Thus, a locally-timed component provides a more effective abstraction than one whose timing details must be part of its specification.
- Upgrading a component of a locally-timed system with an improved, faster version may improve the system performance with no additional changes. The new module will signal readiness of its output sooner, adjusting the timing of subsequent operations automatically.
- A locally-timed component can exploit a more sophisticated timing model than a conventional alternative. The model may be data dependent: for example, a self-timed multiplier might complete quickly when one of its operands is zero, allowing system timing to exploit this shortcut without any external provision in the system design.
11.8.4. Asynchronous Locally-timed Systems
Example ALT Protocol
Typically the "have/got" signals follow a protocol similar to that described for the synchronous locally-timed systems, where each of the control signals makes both a positive and a negative transition for every data transfer in a sequence. This convention requires two transitions on each control signal line per transfer, imposing an engineering challenge since control signals change at twice the frequency of the data values. An elegant alternative is to use a single transition signalling protocol: the flow of data $X$ is controlled by a "have X"/"got X" control signal pair where any transition on either control line -- positive or negative -- signals the corresponding "have X" or "got X" event.
Single-transition Signalling
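A behavioral sketch of transition signalling in Python: an event is any change of level on the control line, so the receiver detects events by comparing the line's current level against the level it last saw (an XOR), and each transfer costs only one transition per control line. The encoding below is an illustrative assumption, not a specific hardware design:

```python
def toggle(level):
    """Signal one event by flipping the control line's level."""
    return level ^ 1

def event_seen(last_seen, current):
    """Either transition direction (0->1 or 1->0) is one event."""
    return last_seen ^ current == 1

# The source signals three "here's X" events; the line levels run
# 0 -> 1 -> 0 -> 1, one transition (not two) per transfer.
line, levels = 0, [0]
for _ in range(3):
    line = toggle(line)
    levels.append(line)

events = sum(event_seen(a, b) for a, b in zip(levels, levels[1:]))
print(levels, events)   # [0, 1, 0, 1] 3
```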
11.8.4.1. Locally-timed Example
Self-timed Example
The system itself behaves as a locally-timed component following the same handshake protocol; its input $X$ has "here's X" and "got X" control lines to control the flow of data in, and its single output is likewise associated with control signals in either direction to handshake data to its destination. Within the system itself, every data path has a pair of associated control signals allowing the data source to signal readiness of the data, the destination to signal readiness to accept the data, and so on.
Since the system components are locally-timed and follow the same protocol, each internal data path is associated with a pair of "here's" and "got" control signals between the source and destination of the data. A complication arises from the output of the $A$ component, which is routed to two destinations, $B$ and $C$. The pair of control signals associated with $A$'s output must be routed to control signal terminals of both the $B$ and $C$ modules, each a destination whose timing must be coordinated with $A$'s output timing. We show in the diagram a yellow box, representing a library fan-out circuit module associated with our timing discipline for routing a locally-timed data path to multiple destinations.
Fan-out Module
A computation of $C(A(X), B(A(X)))$ by our example locally-timed system would typically proceed as follows:
- Initially the system is quiescent, and signals its readiness to accept a new input by a logical 0 on the "got X" control signal from $A$ to the component supplying $X$ data.
- When new $X$ data becomes available, its source (connected to the $X$ data and control lines) drives valid input data on the $X$ lines, and signals its readiness to $A$ via the "here's X" control signal.
- The $A$ component recognizes that it has valid data, and begins the computation of $A(X)$.
- At some point after seeing the "here's X" signal, $A$ will no longer require valid input data, and will signal that fact via the "got X" line back to its source. This may happen early or late in the computation of $A(X)$, depending on the design of the $A$ module (which may, for example, load the $X$ value into an internal register to allow a faster "got X" response).
- After the source of the $X$ data sees the "got X" signal, it will return "here's X" to 0 and stop driving the $X$ data lines.
- Once the $A$ component has completed its computation, it drives the $A(X)$ result on its output and signals its validity to the $B$ and $C$ modules via the corresponding "here's" control signal.
- Seeing the signal that its $A(X)$ input data is valid, the $B$ module begins its computation. Once again, $B$ may respond with a "got A(X)" signal at any point during its operation.
- When $B$ has computed $B(A(X))$, it drives its output line and signals readiness to the $C$ module.
- The $C$ module begins its computation after it has been signaled that its inputs from both $A$ and $B$ are ready. Again, it may respond with the appropriate "got" control signals at any point during its operation.
- When both $B$ and $C$ signal they no longer need their $A(X)$ input, the fan-out logic signals "got A(X)" to $A$, which restores the "here's A(X)" signal to 0 and stops driving the $A(X)$ data lines.
- On completion of $C$'s computation, it drives its output data lines and signals "here's C(A(X),B(A(X)))" to external components taking this value as input.
11.9. Chapter Summary
This chapter introduces two related but separate measures of the performance of a system: Latency, the end-to-end time taken for each individual operation, and Throughput, the rate at which operations are performed. These metrics can be used to characterize the potential usefulness of a wide range of systems, including combinational and sequential digital information processing circuitry.

Latency and throughput of parameterized families of circuits, such as $N$-bit adders, may be characterized in asymptotic terms as parameters grow large using the $O(...)$ and $\Theta(...)$ notations. The latency of an $M$-word by $N$-bit ripple-carry adder, for example, grows as $\Theta(N+M)$.
Pipelining is introduced as a way of improving system throughput at some cost in latency. A $k$-pipeline is a circuit having:
- $k$ distinct pipeline stages, separated by registers on outputs of each stage;
- $k$ registers on every input-output path;
- potential to begin a new operation on each clock cycle, producing corresponding outputs $k$ cycles later;
- Latency of $k \cdot t_{CLK}$ and throughput of $1 / t_{CLK}$, where $t_{CLK}$ is the clock period.
Combinational circuits may be viewed as $0$-pipelines, a degenerate case whose latency is $t_{PD}$ and whose throughput is $1 / t_{PD}$.