10. Synchronization and Arbitration
The engineering discipline introduced in Chapter 8 allows us to design sequential circuits that are provably reliable so long as setup and hold time constraints -- our dynamic discipline -- is adhered to in the timing of input changes. This powerful engineering tool allows us to build arbitrarily large clocked digital systems from interconnected clocked components, but requires that the clocks of the component systems be synchronized for the components to communicate reliably. Where we can do so conveniently, we can satisfy the dynamic discipline by using a single clock throughout the system.However, it is not always convenient -- or even possible -- to guarantee that inputs to a system are well behaved with respect to the system's clock; Section 9.2.6 discusses one such common application. These cases motivate us to explore the costs of violating the dynamic discipline, costs which are deceptively easy to underestimate and have left several decades of scars on the history of digital systems engineering.
One lesson from this chapter is the parallel roles of the static and dynamic disciplines in our digital abstraction.
10.1. An Input Synchronizer?
Synchronizer
A workable synchronizer need only postpone troublesome transitions in its input until they are safely distant from active edges of the local clock. We can allow it to be conservative about choosing to delay transitions, allowing it to delay a few transitions that are just outside of the setup/hold time window. We can also be tolerant about the details of exactly when the postponed transitions appear in the $U(t)$ output, so long as they are synchronized with our clock; if our button press is neglected for an extra cycle of our gigahertz clock, the user is unlikely to notice. We might only require, for example, that each output transition occur within $N$ clock cycles of the corresponding input transition for some small constant $N$.
Given our tolerance for considerable sloppiness of this sort in the specifications of a synchronizer, it seems entirely plausible that such devices could be made to work reliably. Indeed, there was a time when major companies included synchronizers in their product lines. The remarkable and painful lesson learned during recent decades, however, is that synchronization is a much harder problem than it seems. It is, in fact, impossible to build a perfectly reliable synchronizer even given perfectly reliable components of the sort that have been developed here.
The difficulty is that, given a transition $T_B$ in the synchronizer input that is close to an active transition $T_C$ in the local clock, our synchronizer must decide whether $T_B$ comes before $T_C$. While our specifications allow the synchronizer's decisions to be imperfect, we require them to be made in a bounded amount of time -- a constraint that is problematic for fundamental reasons.
10.2. The Asynchronous Arbiter
Asynchronous Arbiter
However reasonable our specification, it unfortunately cannot escape the surprising truth:
10.2.1. Naive Arbiter Implementation
To the uninitiated, it might appear that an arbiter could be implemented simply using a D flip flop.Isn't this an Arbiter?
The only aspect of the arbiter spec that remains to be met is that, when $t_B$ and $T_C$ are sufficiently close (within $t_E$ of each other), the arbiter must produce either a valid $0$ or a valid $1$. The value is completely arbitrary, but it must be a valid logic level. Although the allure of the digital abstraction tempts us to assume that the flip flop output is valid outside of its bounded-time transitions, that assumption turns out to be a dangerous delusion. In fact, we forfeit the promise of output validity following a violation of the dynamic discipline by clocking an unsynchronized input.
We find three such intersections: two are the stable equilibria that represent lached valid $0$ and $1$ logic values. However, the geometry of our curves requires that there be a third intersection between these two, the unstable equilibrium often colloquially referred to as a meta-stable state. It represents a fixed point of the positive feedback path through the multiplexor -- an input voltage $V_M$ that causes that path to generate an output that is also $V_M$. Due to the continuity of the voltage transfer curve of the multiplexor, such a $V_M$ always exists. Typically $V_M$ is in the grey area between voltages which represent valid logic levels, representing neither a valid $0$ nor a valid $1$.
The bothersome metastable state is an inevitable consequence of bistable behavior, and is exhibited by familiar examples from our everyday lives. We recognise as an unlikely but real outcome of coin flips (landing on edge), horse races (dead heat), hockey games (overtime), and other real-world decision-making processes. The U.S. Presidential election of 2000 demonstrated that the fundamental difficulty of bounded-time decision making even extends to very large systems of arbitrary complexity.
10.3. The Metastable State
We can observe several properties of the metastatble state:- If the dynamic discipline is obeyed (by observing flop setup/hold time constraints), the metastable state will be reliably avoided. It becomes a problem only when dealing with unsynchronized inputs.
- The metastable voltage $V_M$ represents the switching threshold of our transistors, necessarily in the high-gain region of their operation. For this reason, it is normally not a valid logic level.
- The metastable state represents an unstable equilibrium: a small perturbation in either direction will cause it to accellerate in that direction, quickly settling to a valid $0$ or $1$.
- The rate at which the output voltage $V_{out}$ of a flop progresses toward a stable $0$ or $1$ value is proportional to the distance between $V_{out}$ and $V_M$.
- Since $V_{out}$ is never exactly $V_M$, $V_{out}$ will always settle to a valid $0$ or $1$ eventually.
- However, if $V_{out}$ may be arbitararily close to $V_M$, it may take arbitrarily long for $V_{out}$ to become valid.
- The probabiliy of a flip flop remaining in a metastable state for $T$ seconds falls off exponentially with $T$; thus, simply waiting for a modest interval -- a clock cycle or two -- dramatically improves reliability.
10.3.1. Mechanical Analog
Mechanical Metastability
The device is bistable. We could in principle use the position of the ball -- left or right -- as a means to store a single bit, and arrange to change the value of the stored bit by kicking the ball to the left or right with enough force to reliably get it to the opposite side of the hill. Although we won't develop details of such a mechanical storage device here, it should be quite plausible to the reader that we could devise a way to make it work reliably. Of course, there would be some rules that must be followed for reliable operation; for example, one could not kick the ball to the left and to the right at about the same time. Such confusing instructions would compromise the momentum imparted to the ball, and might result in the ball landing near the unstable equilibrium point at the top of the hill -- the metastable state of our mechanical system.
Of course, given the vagueries of inaccuracies and noise, the ball is never precisely at the metastable point; it will always feel some slight force toward the left or the right. It will always roll down the hill to one of the stable equilibria eventually; but, depending on its distance from the metastable point, this trip may take an arbitrarily long time.
10.3.2. Observable Metastable Behavior
One reason for the reluctance of the digital community to accept the inevitability of arbitration failures is that they are inherently difficult to reproduce reliably in the laboratory. Even when violating the dynamic discipline by clocking unsynchronized data into a flip flop at high frequencies, observable metastable behavior is quite rare: it depends on pathological accidents of timing or other variables with precision beyond our ability to control them reliably.However, it is reasonably straightforward to devise an experimental setup in which a flip flop is clocked repeatedly at high frequency, with a data input whose transitions are timed so as to maximize the probability of metastable failure.
Typical observed metastable behavior
It should be noted that many other observable symptoms of arbitration failure are possible. Once we violate the dynamic discipline, there are no guarantees about the value circulating around our combinational feedback loop. The multiplexor input is invalid, so it need not produce a valid output. Rather than settling to the metastable voltage, for example, the value circulating might exhibit high-frequency oscillation such as that shown as $Q'$. Depending on parasitic reactive components in the feedback loop, such dynamic invalidity is sometimes observed.
10.4. Delay and Validity Probability
We have already noted that problems with metastability can be avoided simply by assuring that the dynamic discipline is followed by our clocked circuits -- specifically, that input signals to each clocked component are valid and stable during that component's specified setup/hold time window. Although the problem is fundamental when an unsyncronized input must be passed to a clocked system, metastable behavior and its consequent system failures are still quite rare.
Synchronizer test setup
10.4.1. Probability of nearness to metastable point
Using this model, we can approximate the probability that $V_0$ -- the circulating voltage immediately following the clocking of an unsynchronized input -- falls within $\epsilon$ of the metastable voltage as \begin{equation} Pr[ | V_0 - V_M | \leq \epsilon ] \leq { ( t_S + t_H ) \over t_{CLK} } * { {2 \cdot \epsilon} \over ( V_H - V_L ) } \label{eq:prob_within_epsilon} \end{equation}
10.4.2. Delay cures metastability
Given this model for the rate at which the circulating voltage $V$ progresses away from the troublesome metastable value, we can approximate the amount of time it will take for $V$ to progress from a value near $V_M$ to a valid $0$ or $1$. We do this by a formula that specifies, for a given time interval $T$, how close $V_0$ must be to $V_M$ to require $T$ seconds for $V$ to become valid: \begin{equation*} \epsilon ( T ) \approxeq (V_H - V_M) \cdot e ^ {-T / \tau} \end{equation*} Together with equation \eqref{eq:prob_within_epsilon}, we can approximate the probability that the output of our synchronizer flop remains invalid $T$ seconds after clocking an unsynchronized input as \begin{equation*} Pr_M(T) \approxeq Pr[ |V_0 - V_M| < \epsilon ( T ) ] \approxeq K \cdot e ^ {- T / \tau } \end{equation*} where $K$ and $\tau$ are constants derived from implementation parameters.
While we have glossed over a variety of issues in arriving at this formula, it underscores an important observation about the metastable behavior that causes synchronization failures: the probability that a flipflop output will remain invalid decreases exponentially with elapsed time. This key fact allows us an engineering workaround, which reduces the unsolvability of the arbitration problem from a showstopper to an annoyance: we can, in fact, clock unsynchronized inputs to an arbitrarily high level of reliability by simply introducing modest delay for any metastable behavior to settle.
Average time | ||
---|---|---|
T | $Pr_M(T)$ | between failures |
$31$ ns | $3 \cdot 10^{-16}$ | $1$ year |
$33.2$ ns | $3 \cdot 10^{-17}$ | $10$ years |
$100$ ns | $3 \cdot 10^{-45}$ | $10 ^ {30}$ years |
10.5. Bounded-time Discrete Mapping
The problem of making "which came first?" decisions in bounded time confronted by the arbiter is, in fact, an instance of a more general difficulty: the problem of mapping a discrete variable into a continuous one. If the mapping is nontrivial (in the sense that its range includes multiple discrete values), there will be input values $i_1$ and $i_2$ that map to distinct discrete outputs $V_1$ and $V_2$, respectively. If the mapping is to be performed by a mechanism that computes $V = f(i)$ for some continuous function $f$, the continuity of $f$ assures that there will be input values between $i_1$ and $i_2$ that produce output values between $V_1$ and $V_2$. Indeed, even if $f$ has a discontinuity for some input $i_1 < i_M < i_2$, the mapping of $i_M$ is ambiguous and hence problemmatic. We have seen this problem in the mapping of continuously-variable voltages to discrete $1$s and $0$s in Chapter 5, where we conspired to avoid the difficult decision problems by an engineering discipline that excludes a range of voltages from our mapping and avoids requiring our circuits to interpret these values as valid logic levels. When we moved to discrete time in Chapter 8, we introduced an analogous discipline to duck hard problems deciding the ordering between active clock edges and input data transitions. In this case, our excluded "forbidden zone" is the range of continuously variable timing relationships ruled out by the setup/hold time window of the dynamic discipline.10.5.1. What we cannot do reliably
The difficult problems identified in this chapter share the attributes that they involve a nontrivial continuous-to-discrete mapping that must be made in bounded time. If we relax either of these constraints, solutions become plausible. The astute reader might recognise that this is is essentially the same problem, translated to the time domain, that was addressed in the voltage domain in Chapter 5 by the forbidden zone and noise margins. The static discipline avoids unbounded-time decisions about whether a continuously variable voltage represents a $1$ or a $0$; the dynamic discipline avoids unbounded-time decisions about whether a clock transition at $t_C$ comes before a data transition at $t_D$. Each of the avoided problems involves a mapping of a continuous variable into two discrete cases: in the former case the variable is a voltage, in the latter a time interval $t_C - t_D$.
10.5.2. What we CAN do reliably
Although nontrivial bounded-time mapping of continuous to discrete variables confronts fundamental problems, slight relaxation of critical problem constraints makes seemingly close relatives tractible.Perhaps the most obvious solvable variant is our clear ability to make a reliable device that performs a trivial mapping of continuous to real values in bounded time. A device implemented by a connection to ground, for example, could reliably report that $T_B < T_C$ for any relative timing of two transitions, or that $V < 3.14 volts$ for any input voltage $V$.
10.5.3. Folk Cures
The fundamentals of synchronization and arbitration failures have taken decades to be accepted by the engineering community. This history is laced with interesting proposed "solutions" to the problem of building a reliable synchronizer, most of which can be viewed with the same skepticism we might apply to a new proposal for a perpetual motion machine.
While there are many things wrong with such proposals, one that is obvious in the restrospect of decades of recent history is that the problem of deciding whether a given voltage is valid cannot itself be solved in bounded time. Once again, this decision represents a nontrivial mapping of a continuous variable (input voltage) to a discrete one (validity); there will always be an input value that will produce an output halfway between true and false.
The downside of this approach? After clocking an unsyncronized input, a flip flop whose output is metastable (producing what is now interpreted as a valid $0$) will eventually progress to a valid logic level. About half the time, it will choose to emerge from the metastable state to become a valid $1$. In our new logic family, this will be seen by surrounding logic as a spontaneous transition of the flop output from $0$ to $1$ -- a behavior not likely to be appreciated by the circuit designer.
10.6. Dealing with Asynchronous Inputs
The fundamental difficulty associated with clocking asynchronous signals is an accepted fact among the modern engineering community. Failure to obey the dynamic discipline creates a finite probability of an invalid logic level in our digital system. One invalid signal may stimulate others: the static discipline guarantees valid outputs from valid inputs, but guarantees nothing for invalid inputs. In principle, a requirement to accept asynchronous inputs threatens the integrity of our digital system engineering discipline.Once understood, however, the practical impact of the synchronization problem is modest. Keys to our coping with asynchrony include the following:
- Synchronize clocks: Avoid the problem where possible, by using a single clock discipline or synchronizing the clocks of separate subsystems that must communicate.
- Delay: Where asynchronous inputs cannot be avoided, allow sufficient delay for the output of the synchronizing flip flop to settle that the probability of failure is acceptably low. The exponential decay of failure probability with delay interval reduces the problem of failure to a fairly modest cost in latency.
- Hide delay: Under certain circumstances, enginnering tricks can be used to eliminate some or all of the delay by predicting arbitration problems and solving them in advance of the actual input event.
The second guideline reflects our recognition that we can reduce the probability of failure to an arbitrarily low level by adding modest delay.
A common informal analysis of this circuit reasons that if the probability that the output of the leftmost flop is invalid is some small number $p$, the probability of invalidity at the output of the flop to its right is $p ^ 2$, $p^3$ at the third flop, etc. While there may be some merit to this reasoning, the more general principle is the exponential falloff of failure probability with delay outlined in Section 10.4.2. Each flop the signal passes through delays its entering subsequent circuitry by a clock cycle, and consequently enhances its reliability by a large factor. Under most circumstances, the delay of an asynchronous input by a few clock cycles reduces failure probability to a ridiculously low level with negligible impact on the function or performance of the system.
In latency-sensitive applications, however, the delay required for reliable communication may compromise performance. In these cases, it may be worth exploring some special-case optimizations suggested by the third of our guidelines. For example, communication between digital subsystems that require different frequency clocks might be made reliable if one clock period is a multiple of the other that maintain a fixed synchronous relationship.
In general, any known constraints on the timing of the incoming asynchronous input can be exploited -- at some engineering cost -- to reduce or eliminate the delay needed to reliably communicate.
10.7. Further Reading
- ^{1} Sarmenta, L.F.G. et al, "Rational clocking", ICCD '95 Proceedings, 1995. Use of rationally-related clock frequencies to avoid synchronization delay penalty.
10.8. Chapter Summary
Synchronization failures have played an unexpectedly significant role in the history of digital engineering, likely due in large part to the conceptual appeal of the digital abstraction. Digital engineers, seduced by the comfortable world of $1$s and $0$s, have demonstrated a remarkable resistance to the notion that the cost of violating the dynamic discipline is abdicating the guarantee of validity. And once any invalid signal creeps into our system, it can propagate elsewhere: the "valid in $\rightarrow$ valid out" construction rule no longer prevents logically invalid levels.Important facts to understand and remember:
- Synchronization failure is a problem only when the dynamic discipline is violated, e.g. by an input that changes asynchronously.
- Every bistable storage device has, in addition to the two stable equilbria used to represent its two possible values, a third unstable equilibrium called the metastable state. Without constraints on the timing of input changes (the dynamic discipline), the bistable device may enter this state, resulting in invalid output values;
- Metastable behavior eventually resolves to a valid $1$ or $0$, but the time required is unbounded.
- The probability of metastable behavior persisting for $t$ seconds falls off exponentially with $t$, so a modest delay after an unsynchronized input can reduce the probability of invalid outputs to be negligible.
- Synchronization failures are best avoided by
- Avoiding asynchronous inputs where possible; and
- Allowing sufficient delay after clocking unavoidably asynchronous inputs.