Stacks and Procedures

# 16. Stacks and Procedures

Although the prior two chapters have detailed the hardware/software interface provided by the Beta ISA and its low-level programming tools, these specifications fall far short of a complete description of the programming model presented by the Beta. Much of the latter is provided by a set of programming conventions for organizing software and data on the machine. While prior examples have provided glimpses of several basic conventions (e.g. for representation of arrays and structures), there are many others. Certain conventions are associated with a particular language or sofware discipline, such as object-oriented programming or transaction processing; others, like the procedure linkage conventions of this chapter, are shared by most languages and disciplines.
Procedures as an Abstraction

## 16.1. Procedures as an Abstraction

Procedures -- parameterized code modules that can be defined once and instantiated many times with different parameter values -- are a common mechanism for structuring code. The ability to define a procedure is a powerful abstraction technique: a well-designed procedure reduces the space of possible computations to a tiny, parameterized subset devoted to some related set of problems identifyable by a few parameters. Lika other engineering abstractions, a procedure offers to perform an understandable service without requiring its user to understand the details involved: it isolates specification from implementation.

Our simple factorial example from Section 15.3.3.3, for example, is a natural candidate for encapsulation as a procedure.

A plausible implementation strategy is to store the code that computes $N!$ at some location, tagged fact: in memory. Code elsewhere that needs to compute $n!$ for some $n$ might load $n$ into R1 and transfer to fact:, leaving a return address in some agreed-upon register. Arbitrarily selecting R28 for the return address, the proposed convention involves:
• Associating the name of a procedure (e.g. fact) with an entry point address for that procedure;
• Calling the procedure by a BR(entry,R28) where entry is entry point of the procedure, and R28 is a register reserved (by convention) to hold the return address from procedure calls;
• Passing the argument parameter to the procedure in R1;
• ; and
• Returning the result in R0.
To the left is an example of our prior factorial code converted to a procedure using this convention. Again, click it to run it.

Unfortunately, this primitive convention is flawed in several respects. First, it needs to be generalized: suppose, for example, there are multiple arguments? What if there are more arguments than available registers?

But the approach suffers from a more serious defect: it allocates a single storage location, R28, to hold the return address from the procedure. Suppose the called procedure itself contains a procedure call?

int fact(n) { if (n > 0) return n * fact(n-1); else return 1; } We might have defined our factorial procedure recursively, for example, as shown to the right. A recursive definition involves, directly or indirectly, the object being defined; operationally, this means that executing a call to a recursively defined procedure may involve additional invocations of that procedure.

The BSim example to the left shows a failed attempt to use the flawed convention with a translation of the recursive fact procedure.

If you run this code, you will find that it gets into a strangely endless loop: the JMP at the line tagged rtn: transfers control to the MUL on line 20, after which control is passed back to rtn:.

The problem arises on the call to fact(5), which calls fact(4) in the course of its computation. As the fact(5) call is depending on R28 to maintain its return address, the resetting of R28 to a different address by the recursive call at line 19 causes loss of critical information: the return address of the original call fact(5) is overwritten by the return address for the recursive call. The net result: the procedure can never return to its original caller.
Nesting Depth and Storage Requirements

### 16.2.1. Nesting Depth and Storage Requirements

Our primitive convention fails because it reserves exactly one location to hold return addresses from procedure calls. This is inadequate if procedure calls are nested -- for example, procedure proc1 contains a call to proc2. Nesting depth of procedure calls may be greater than one, from which we conclude that we may need multiple locations to store return addresses of nested calls. Early runtime conventions supported this model via schemes which provided a constant number of return locations, a number that can be computed from the lexical analysis of simple programs. Such schemes were short-lived, however, when recursive definitions of functions became popular programming constructs.

Simple nesting of procedure calls requires a bounded amount of storage to hold return addresses -- bounded by the nesting depth of the code. We can compute the nesting depth of a block of code as follows:
• If there are no calls within the code, its nesting depth is zero.
• If the code contains calls, its nesting depth is $n+1$ where $n$ is the maximum nesting depth of a procedure called within the code block.
Note that this definition puts a finite bound on nesting depth except where its value is undefined due to a (direct or indirect) recursive procedure call. In such cases, the nesting depth is unbounded; the actual depth of the recursion may be data dependent, and indeed may be infinite. In the general case, the latter two cases may be distinguished only by running the code.

Generalizing a bit, the amount of storage required for dynamic procedure linkage in a program is $O(d)$, where $d$ is the nesting depth of the program. Absent recursion, $d$ is finite and may be computed by static analysis of the program. Given recursion, however, $d$ may be data dependent; in the case of our recursive fact procedure, $O(n)$ storage is required to compute $fact(n)$.
Activation Records

### 16.2.2. Activation Records

Generally there is a certain amount of storage required for each activation of a procedure: data that must persist for the time interval between the call and the corresponding return. This storage includes the return address, argument values, and perhaps other activation-related information. This collection of data is often gathered as a data structure called an activation record.

The failed convention of the previous section allocated a single activation record to be shared among all procedure activations -- an approach that fails in the case of nested procedures. A slightly less foolish approach, used before recursion became popular ca. 1960, involved static allocation of storage for a single activation record for each procedure. Such a scheme supports nested calls but not recursive procedures; it was used in some early FORTRAN implementations that did not support recursion. This approach has the advantage that the total amount of storage needed for activation records is bounded by the nesting depth of procedure definitions in the program, a bound that can easily be determined as the program is translated. Thus the storage can be allocated statically, prior to execution.

The recognition that the execution of a sequential program may involve multiple simultaneous activations of a single procedure leads to the conclusion that a workable discipline for procedure calls -- one that supports recursive procedures -- must be capable of allocating an unbounded number of activation records; the number may be data dependent, and in the general case may only be determined by running the program. This implies the need for a dynamic system for allocating activation records: they must be allocated whenever a procedure call is executed, and deallocated (returned to a storage pool) when the procedure returns, ending that activation.

Consider the execution of the recursive fact procedure defined above, given (say) an initial argument of 3. A recursive call to fact(2) occurs during the activation of fact(3), at which point two activation records are needed for the two simultaneous activations of fact. But the activation of fact(2) invokes fact(1), in turn invoking fact(0); at this point, there are four simultaneous activations, each with their own activation record (including argument value, return address, etc.). The fact(0) returns, releasing its activation record; then each of the other activation records are deallocated as their corresponding activations return to their caller. These deallocations are in the reverse order from that of their allocation.

A diagram of the allocation/deallocation of activation records over the time during this execution is sketched to the right. At each time (indicated by horizontal position) a "stack" of $N$ activation records must simultaneously exist, where $N$ is the number active calls to fact at that instant. At any time, however, only one of the $N$ activation records is a candidate for deallocation: the record that was most recently allocated.

It is important to note this last-in-first-out constraint on the allocation and deallocation of activation records. It stems from the proper nesting of procedure calls and returns, and is related to the proper nesting (for example) of parentheses strings, or the order of stacking/unstacking books in a single pile. Allocation/deallocation schemes consistent with this constraint are said to follow stack discipline.
Stacks

### 16.2.3. Stacks

The fact that storage requirements for nested procedure calls and many other useful programming patterns conforms to the constraints of stack discipline stimulated a revolution in programming languages and runtime disciplines in the early 1960s. This revolution was fueled by the observation that stack discipline enables a simple, fast, effective strategy for implementing a pool of dynamic storage. The discovery that single pool of this sort, called a stack, could support all procedures in a complex program (as well as other structures like interrupts) caused stacks to be institutionalized in both ISAs and programming disciplines, as they remain today.
Stack Implementation

#### 16.2.3.1. Stack Implementation

The stack discipline constraint on allocation/deallocation order admits a very simple implementation strategy:
1. Allocate a contiguous region of memory as the total available stack space;
2. Pack stacked data -- the allocated portion of the reserved stack space -- at one end of the region, leaving the rest available to be allocated;
3. Use a single register (called the stack pointer or SP within the CPU to identify the boundary between allocated and available stack space;
4. Allocate and deallocate space simply by incrementing and decrementing the stack pointer.
The critical insight into the efficacy of this implementation strategy is that any stack discipline assures that the next data to be allocated or deallocated will be located adjacent to the address in the stack pointer, allowing the operation of step 4 to suffice for either purpose.

Consider the snapshot at the right, depicting a region of memory allocated to a stack. In this implementation, stacked data is packed into lower addresses, and higher-addressed locations are available to be allocated to new data. The SP register contains the address of the lowest unallocated location, i.e. of the next location to be allocated to stacked data.

In its simplest form, a stack supports two abstract operations:
push(datum): add the value datum to the stack; and
datum = pop(): remove the most recently added datum from the stack, and return its value.
The push/pop interface assumes a stack of some fixed datum size, typically a 32-bit word. The most recently stacked datum is refered to as the top value on the stack; it is the value that will be returned from a pop() operation. Immediately following the operation push(x), x becomes the top value on the stack. Thus the sequence push(x); pop() adds then removes x, leaving the stacked data unchanged.

In addition to the abstract interface provided by the push/pop operations, it is common to directly access data near the top of the stack. Such data is typically at an address that is a constant offset from the value in the SP.
Processor Stack

#### 16.2.3.2. Processor Stack

As an abstract data structure, stacks have many applications. Often applications will create stacks for dedicated purposes -- examples include the undo/redo stack of an editor, and the history stack of a browswer.

However, the rules of stack discipline make a single stack eminently sharable among various different uses that conform to the same proper nesting constraint. As proper nesting is a property of most common program control flow patterns -- including entering/exiting blocks, procedure call/return, and interrupts -- modern processors and run-time software conventions commonly assume a single stack that is shared among these purposes. We refer to this distinguished stack as the processor stack, or simply the stack.

The evolution of the processor stack assumption is complex and interesting; indeed, the design of modern block-structured languages and consequent programming disciplines reflect the availability of a stack to at least the extent that support for a stack is demanded by these languages and disciplines. It has become popular for those committed to alternative programming structures -- involving, say, parallel processing and other control flow less amenable to stack discipline -- to complain about the extent to which stacks and consequent nesting are institutionalized by contemporary processor design. But the efficiency with which a stack handles a wide range of contemporary programming needs assures its continued support in future ISA generations.
The Beta Stack

### 16.2.4. The Beta Stack

The use of a stack is sufficiently prevalent that it is supported directly by the instruction set of most contemporary computers. Such machines typically dedicate a special CPU register as the stack pointer, and offer specific instructions (e.g., a call instruction for procedure calls) that manipulate this register to effect push and pop operations.

In the minimalist Beta ISA, however, we choose to implement stacks, and our procedure call mechanism, strictly by software convention. We implement stacks and stack operations as follows:
• As described in the previous section, we statically allocate a contiguous region of the Beta's main memory for stack storage. We associate a particular tag (say, stack: with the base (lowest address) of this memory block.
• We dedicate a particular general register, R29, for use as the stack pointer SP. start: CMOVE(stack, SP) ... stack: . = .+STACKSIZE By convention, SP points to the lowest unused word address in the stack region. On startup, we initialialize SP to hold the base address stack, indicating an initally empty stack.
• The top word of the stack is always at an offset of -4 from the SP; thus it can be read by LD(SP, -4, Rx) or written by ST(Rx, -4, SP). Larger constant offsets from SP may be used to access data deeper in the stack.

ADDC(SP, 4, SP) ST(Rx, -4, SP)
• To push the word in register Rx onto the stack, we increment SP by four, making it point to the second unused word address, and then write the contents of Rx into the location that is now the top of the stack.
• LD(SP, -4, Rx) ADDC(SP, -4, SP)
• To pop the top word from the stack, leaving its value in register Rx, we read the top of the stack into Rx and then decrement SP by four.
• To simply allocate K 32-bit words of uninitialized space from the stack, we simply increment SP by 4*K.
• SUBC(SP, K*4, SP)
• To deallocate K 32-bit words of space from the stack, discarding their values, we decrement SP by 4*K.
Each of the last four operations -- PUSH, POP, ALLOCATE, and DEALLOCATE is defined as a macro in beta.uasm.
SP Allocation Semantics

#### 16.2.4.1. SP Allocation Semantics

The careful reader may be surprised by the order of the ADDC and ST instructions in our definition of the PUSH operation: it would seem more straightforward to store the new datum in the address in SP and then increment SP, rather than doing the increment first and requiring address arithmetic in the ST.

There is a subtle reason for this peculiar order of operations. In later chapters of this course, we will find that the exception processing mechanism of Section 14.9 opens the possibility of interrupting program execution between any pair of consecutive instructions. When this happens, code in an interrupt handler will run, eventually returning control to the interrupted program. This whole sequence of events is designed to be transparent to the interrupted program, allowing it to run as no interrupt had taken place.

It is convenient, however, to allow the interrupt handling process to "borrow" use of the stack temporarily: so long as it maintains stack discipline, popping exactly whatever data it pushes, it can in theory do so transparently. To make this work reliably, we must guarantee that no stacked data (data that the interrupted program cares about) is disturbed by the "borrower" of the stack. We do so by a precise distinction between stacked data, which must be preserved, and unallocated stack space. The precise rule is this:
All data stored at addresses below the value in SP is stacked, and must be preserved; all locations at the address SP and above are unallocated and may be written during interrupt processing.
Strict adherance to this rule means that SP must be incremented prior to the store operation in a PUSH; otherwise, the datum (stored above the SP address) will be unprotected in case of an interrupt between the two instructions.
Stack Usage Patterns

## 16.3. Stack Usage Patterns

Although our introduction of the stack as a storage allocation mechanism was motivated by our need to store activation records, its availability to be "borrowed" transparently lends itself to a number of other common uses. We explore several common patterns of stack usage before returning to their support for procedures.
Register Spills

### 16.3.1. Register Spills

One commom use of the stack mitigates the problem of register allocation: all computing is done in registers, but there are too few registers to hold all variables of a complex program.

Consider, for example, the program shown to the left. After initializing the stack, it executes some arbitrary code -- perhaps allocating all general registers to hold critical program data. At some particular point in the program, the need arises to insert a loop to copy the contents of one 100-word array to another array; however, there are no unallocated registers available for the copy.

A simple solution, shown in this example, is to push the contents of several registers onto the stack, use them temporarily, and then restore their initial values before proceding. Note that the order in which the registers are popped must be the reverse of that in which they are pushed, owing to the last-in-first-out structure of stack discipline.
Local Variables

### 16.3.2. Local Variables

As noted in Section 14.6, our programming convention generally allocates locations in main memory to hold values of program variables, and their values are loaded temporarily into registers to perform computations. This plan sidesteps the limited number of registers available: we are unwilling to limit the number of program variables to fit into registers.

// Static variables: x: WORD(3) // int x=3; ... LD(x, R0) // read x ... ST(R0, x) // store x ... } Thus far, our examples have allocated space for variables statically: the address of each is a constant built into the program. Typically we associate a symbolic name with the address of each variable, and access variables directly via LD and ST operations as shown in the example to the right. Such variables are allocated and initialized at the startup of a program and their extent -- the lifetime of their allocation -- is the entire running time of the program. { int a=3., b=5; ... a = a*b; ... } In modern block structured languages, variables whose extent is more limited can be declared. The fragment of C code to the left, for example, declares two variables, a and b, that are local to an inner block of code (delimited by the curly brackets). The scope of these variables -- the range of code over which they are accessable -- is limited to that block: they cannot be accessed outside of the curly brackets. Likewise, their extent is limited to the period of execution of that block of code: new instances of a and b are allocated and initialized whenever the block is entered, and deallocated at exit from the block.

We may translate this pattern to assembly code as shown to the right. As the block is entered, we may use the ALLOCATE macro to advance SP, effectively allocating two words of uninitialized space near the top of the stack. These two variables, now allocated from the stack, may be accessed using negative offsets from SP: -8 for a, -4 for b. Within the block, the variables can be copied to and from registers using LD and ST instructions specifying these offsets and SP; otherwise, they are used similarly to statically allocated variables.

We now return to the problem of procedure linkage: the protocol and coding conventions that link code that invokes a procedure -- the caller -- with the code that implements the procedure (the callee). As with all engineering abstractions, the key issue is a clear specification of the respective responsibilities of the caller and the callee.

Our strategy is to define a general format for activation records, and arrange that an appropriate activation record be allocated from the stack when a procedure is called, and deallocated at the corresponding return. Activation records allocated from the stack are commonly called stack frames, a term we will use for the remainder of this discussion.

Candidates for inclusion in the stack frame are details particular to the call:
• Argument values for that call;
• Return address for that call;
• Temporary storage needed by the procedure (e.g., saved values from registers at the time of the call that must be preserved for the caller);
• and
• Local variables declared within that procedure.
Given that some of these values (e.g., the arguments) must be supplied by the caller and some by the callee (e.g. local variables), building a stack frame involves caller/callee collaboration.
Basic Call/Return Protocol

## 16.5. Basic Call/Return Protocol

Our convention will use several registers, which (like SP and the XP reserved for interrupts as mentioned in ) are reserved for this purpose by our software run-time conventions. Our growing list of reserved regisers is now:
XP=R30: eXception Pointer, used to hold return addresses for during exception (interrupt) processing;
SP=R29: Stack Pointer, used to indicate the first unused address of the processor stack;
LP=R28: Linkage Pointer, used to pass the return address during a procedure call; and
BP=R27: Base Pointer, used to hold a base address for the current activation record.
R0: Used, by convention, to pass a return value back to the caller.
This is the final list: general registers R26 and below may be used for application-specific purposes (with the understanding that the contents of R0 will be discarded on a procedure call).

Our protocol for a procedure call begins with the caller, which pushes values of arguments on the stack. The caller then transfers control to the callee by an instruction like BR(callee, LP) where callee designates the entry point of the procedure -- the address of the first instruction to be executed on a call. Note the specification of LP as the destination register in the BR instruction. Recall that this causes the address of the instruction following the BR -- the appropriate return address -- to be written into LP by the call.

When the callee is entered it expects the top $N$ words on the stack to be the $N$ argument values of a new procedure call, and an appropriate return address to be in LP. The first instructions of the callee push the contents of BP and LP onto the stack, to be restored just prior to the eventual return to the caller.

Immediately following these two PUSH operations, SP is copied into BP (via MOVE(SP,BP)), allowing BP to serve as a fixed base address within the newly created stack frame. Although the value in SP may vary during execution of the procedures code, the address in BP remains constant (and, consquently, a constant offset from arguments and other values in the frame). Note, in particular, that the most recently stacked argument appears at offset -12 from BP, and other arguments at increasingly negative constant offsets.

Finally, the callee allocates whatever local and temporary variables it needs, incrementing SP at each allocation. Our convention allows a return value to be passed back to the caller in R0, but requires that all other registers be restored to the values at the time of the call; thus, any registers used by the callee must be carefully saved and restored prior to return.

The called procedure then performs its advertised computation, accessing arguments via negative offsets and locals by positive offsets from BP. During the computation the stack may be used for various purposes, including nested and recursive procedure calls. Of course, it is critical that all such uses follow stack discipline, returning the stack to its earlier state after each use.

When the promised computation is complete, the callee restores any saved register values and does a bulk deallocation of all locals and temporaries via a MOVE(BP, SP) which leaves the stack containing only the saved BP, LP, and arguments. The saved values are popped back into BP and LP, and control is returned to the caller via a JMP(LP) instruction.

When control returns to the caller (specifically, to the instruction following the BR(callee, LP)) R0 contains the return value, all other registers have their values at the time of the call, and the arguments remain on the stack. In our convention, it is the responsibility of the callee to remove any stacked arguments following the call; typically, it does so by a single DEALLOCATE(N) where N is the number of arguments passed.
Resulting Stack Structure

## 16.6. Resulting Stack Structure

Given our linkage conventions, at any instant the stack may be viewed as a concatenation of successive stack frames that represent currently active procedure calls. The current stack frame, whose base address is identified in BP, represents the procedure call currently occupying the CPU's attention; stack frames at successively lower addresses represent unfinished procedure invocations that are awaiting completion of nested calls (including the current one).

A region of the stack containing typical stack frames is diagrammed to the right; the three frames shown might correspond to the invocation of h(...) during the computation of g(...) which was, in turn, invoked by h(...). Although most detail is omitted from the diagram, the diagram depicts the chain of pointers that characterizes the stack as a sequence of frames. The address in BP points to the base address of the current frame. That frame contains, in its old BP slot, the base address of the immediately preceding frame on the stack. The latter frame contains the base address for the frame of its caller, and similarly through the entire history of procedures that have been called but have not yet returned.

These pointers, reflecting a dynamic history of BP values, are often referred to as dynamic links since they reflect the program's dynamics rather than its static structure (such as textual nesting). The dynamic link in a stack frame is simply the immediately previous value that was in BP; it always points to an adjacent stack frame at a lower address.
Stack Frame Details

### 16.6.1. Stack Frame Details

To the left is shown a typical frame at the top of the stack, in somewhat more detail. As always, BP points to the base address of the frame; per our convention, the base address identifies the location immediately following the location containing the saved BP value -- the dynamic link to the caller's stack frame. The address in BP is the first address of the set of local variables stored in the frame. From this picture, we conclude that the first local variable has offset 0 from BP, the second has offset 4, and so on. The stacked arguments are similarly accessible using negative offsets from BP: the first argument has offset -12 (owing to the stored BP and LP values.

As a consequence, the code within the procedure can access locals and arguments systematically: to access
The $k^{th}$ local variable:
use LD(BP, k*4, Rx) or ST(Rx, k*4, BP);
The $j^{th}$ argument:
use LD(BP, -4*(j+3), Rx) or ST(Rx, -4*(j+3), BP).
These patterns apply throughout the code that represents the body of the called procedure; once BP is set up, its value remains constant.
Why BP?

#### 16.6.1.1. Why BP?

One might question why it is necessary to allocate a register to the function performed by BP. In fact, the use of a base register is not necessary; its possible, with additional bookkeeping, to access contents of the local stack frame via constant offsets from SP. The extra complexity of this route comes from the fact that we allow the value of SP to vary from instruction to instruction, due to various uses of the stack (to deal with register spill and other such purposes). As a consquence, offsets from SP (say, to the first argument) may vary from instruction to instruction; keeping track of proper offsets is tedious for a human programmer.

Such arithmetic is straightforward for a compiler, however, so the no-BP option is realistic in practice (where nearly all code is generated by a compiler). We prefer the more humane BP scheme for this course in large part because it makes example code easier to understand.
Order of Stacked Arguments

#### 16.6.1.2. Order of Stacked Arguments

It is important to note that the above pattern for accessing the $j{th}$ argument would not work if the arguments were stacked in the opposite order. By our convention, the first argument always appears in the stack frame at an offset of -12 from its base address. If the arguments were stacked in the inverse order, the offset between the base address and the $j{th}$ argument would depend on the number of arguments, complicating argument access considerably.

A consequence of this ordering, however, is that the caller must push arguments onto the stack in inverse order from their left-to-right sequence in common notation. CMOVE(3, R0) PUSH(R0) CMOVE(2, R0) PUSH(R0) CMOVE(1, R0) PUSH(R0) BR(proc, LP) DEALLOCATE(3) Thus the call f(1, 2, 3) to the 3-argument procedure f must be translated as shown to the right. Note that there is no real cost to this convention: it is as easy for a compiler to generate the above code as it would be to push the arguments in the more straightforward order.

This choice has other advantages as well. Since early arguments can be accessed by the callee without knowing the total number of arguments, it is possible to define procedures taking a variable number of arguments -- say, a procedure $sum(k, a_1, a_2, ..., a_k)$ that computes the sum of $a_1 ... a_k$ where $k$ is specified as its first argument.
Caller/Callee Contract

## 16.7. Caller/Callee Contract

We now have the elements of a standard "contractual" relationship beteen the callers and callees of procedures. Generally, the caller deals with arguments: putting them onto the stack prior to the call, and removing them following the call. The caller is responsible for transferring control to the callee, and ensuring that LP has a proper return address on entry. The callee is responsible for computing the promised result, leaving it in R0, and returning to the caller with its state (stack and registers) preserved, save R0 which contains the return value (if any).

There are many common variations in the details of this protocol. Some versions require that only a subset of the registers be preserved, e.g. designating R0-R3 as "temporary" registers whose contents may be trashed by a procedure call. Other schemes allow the first several arguments to be passed in registers rather than on the stack. Each of these issues has advantages and disadvantages.
Code Sequences

### 16.7.1. Code Sequences

It is common to establish standard instruction sequences that fulfill various steps of a procedure call.

PUSH($arg_N$) ... PUSH($arg_1$) BR(f, LP) DEALLOCATE(N) } The first such instruction sequence is the call itself: the calling sequence executed by the caller to invoke the procedure. Note that arguments are shown stacked by consecutive PUSH operations; in practice, there may be interspersed instructions computing the value of each argument as it is stacked.

PUSH(LP) PUSH(BP) MOVE(SP, BP) ALLOCATE(nlocals) (push other regs) } The first few instructions executed by the callee are termed the entry sequence, and establish the new stack frame and BP value. Again variations are possible; for example, the ALLOCATE may be eliminated if the callee requires no local variables.

(pop regs, reverse order) (put return value in R0) MOVE(BP, SP) POP(BP) POP(LP) JMP(LP) } The final instructions executed by the callee are the return sequence, responsible for restoring the callee's state and returning control. It performs roughly the reverse of the steps performed by the entry sequence; a notable asymmetry, however, is its lack of a DEALLOCATE to undo the ALLOCATE of the entry sequence. While a DEALLOCATE might be harmlessly included in the return sequence, it is unnecessary since the MOVE(BP,SP) does a "brute force" reset of SP to the proper value.
Example: GCD Procedure

### 16.7.2. Example: GCD Procedure

Consider a procedure gcd(a, b) which computes the greatest common divisor of two positive integers. int gcd(a, b) { if (a == b) return a; if (a > b) return gcd(a-b, b); else return gcd(a, b-a); } A time-honored algorithm to for gcd is shown, in C, to the left. Note that the algorithm itself contains two recursive calls to gcd. The reader might verify that it computes gcd(6,15), for example via the steps gcd(6,15) → gcd(6,9) → gcd(6,3) → gcd(3,3) → 3.

A runnable translation of gcd to Beta assembly language is shown to the right. Note that the first few instructions correspond to our boiler-plate entry sequence, and the instructions following the tag rtn: constitute a return sequence that is shared among the three returns in the C code. Each of the two recursive calls exemplifies the calling sequence.

Clicking this diagram will present a runnable simulation of this code, together with a tiny main program that sets up the stack and calls gcd with sample argument values. It is worth experimenting with this setup to familiarize yourself with the Beta simulator: try running, single-stepping, changing argument values, and setting breakpoints. Uncommenting the breakpoint at line 34, for example, causes the simulator to pause with arguments in R0 and R1 as each gcd call is entered.

If you feel ambitious, you might look for ways to improve the code by making it shorter and/or faster.

We might use gcd to determine whether a pair of positive integers is relatively prime or coprime by defining the simple procedure shown to the left. int coprime(a, b) { return gcd(a, b) == 1; } Note that this procedure simply calls gcd on the supplied arguments and compares the result to one; it returns a Boolean (true/false) value, reflecting the fact that a and b are relatively prime if and only if their greatest common divisor is one.

The straightforward translation of coprime is shown to the right; clicking it yields a runnable example, including gcd and a test case.

Note that the Boolean (1 or 0) value produced by the CMPEEQC instruction, comparing the value returned by gcd with 1, is itself returned as the result of the coprime call. Again, we recommend experimenting with this example: verify that while 33 and 28 are relatively prime, 35 and 28 are not.
Exploring BSim

### 16.7.3. Exploring BSim

At this point, the reader is encouraged to run this sample code and gain familiarity with BSim. Clicking the image to the right will invoke BSim in a new window, running the example code along with bsim.uasm and required setup code. After making any desired edits to the source code, click assemble to translate and load the result into the simulated Beta memory. If no errors are detected in the translation, you'll see a summary of CPU state (registers) along with scrollable windows showing several regions of memory. Buttons at the top left will single-step, reload, or run the program to completion (or until it hits a breakpoint).
Stack Detective Skills

### 16.7.4. Stack Detective Skills

Skilled assembly language programmers develop masterful techniques for debugging that involve deducing details of a program's state and history from binary snapshots of its memory. One such technique, often termed stack crawling, deals with using the critical information in the stack to reconstruct the history of nested calls that led to its current state.

For a simple but concrete example, suppose coprime(4,6) is called with the above coprime and gcd implementations.

Prior to diving into the details of stacked hex values, it is worth envisioning the general structure we expect our litte example to induce on the stack. From the C sources, we expect coprime(4,6) to invoke gcd(4,6), invoking in turn the recursive calls to gcd diagramed to the left. The recursion bottoms out with a call to gcd(2,2), which returns 2; this value is passed up through the stacked calls, and is returned as the value of coprime(6,4). Each of these nested calls generates a stack frame; at the point of deepest nesting (during the call to gcd(2,2)) we expect the laout of frames depicted above.

Lets corroborate this understanding by starting our coprime(4,6) example and, after running a while, stopping execution at the (commented) breakpoint shown at line 34 of the above gcd listing, i.e. when it is about to execute the CMPEQ instruction. Note that you can duplicate this experiment by clicking on the coprime code, editing the source file to specify arguments and enable the .breakpoint, assembling, and clicking run several times.

We find that the stopped program has BP=0x174, SP=0x17c, and a stack containing the values shown on the right. What can we determine about the history and state of the computation?

In puzzling over this programmer's analog of an archaeological dig, it is useful to annotate the stack dump with some deductions. We start by marking the locations pointed to by the values in SP and BP, identifying the top of the stack and the base of the current stack frame.

Our first deductions come from locations surrounding BP: since we know the program is stopped within gcd, we know the format of the local stack frame. We annotate the stack dump to identify the values below BP=0x174 as saved LP, saved BP and the arguments a and b. The value pointed to by BP is the saved R1, followed by the saved R2. We might note that 0x174 is the base of the stack frame corresponding to a call to gcd(2,2).

Next, we can follow the saved BP pointer in 0x16c to the base of the caller's frame at 0x15c. 124 00000006 arg a 128 00000004 arg b 12c 00000000 old BP 130 8000002c old LP 134 00000006 arg b coprime(4,6) 138 00000004 arg a 13c 00000134 old BP 140 80000060 old LP 144 00000006 old R1 gcd(4,6) 148 00000004 old R2 14c 00000002 arg b 150 00000004 arg a 154 00000144 old BP 158 8000011c old LP 15c 00000002 old R1 gcd(4,2) 160 00000001 old R2 164 00000002 arg b 168 00000002 arg a 16c 0000015c old BP 170 800000fc old LP BP->00000002 old R1 gcd(2,2) 178 00000000 old R2 SP->... Using this base address, we annotate values stacked by the gcd(4,2), and continue up the stack until we reach the initial call coprime(4,6). The result might be the annotated stack dump shown to the left.

Note that its structure is consistent with the layout we expected; of course, the annotated dump contains much greater detail. We can deduce, for example, that coprime(4,6) was called from location 0x28 since the corresponding return address points to the immediately following instruction.

Observe that all stacked PC values in our example have the Kernel bit (discussed in Section 14.9.2) set, reflecting that the Beta starts running in Kernel mode.
Chapter Summary

## 16.8. Chapter Summary

This Chapter focuses on a set of protocols for implementing procedures, an important abstraction in the construction of software systems. The principal technology relied upon is the use of a single stack, a simple and effective means for allocating and de-allocating blocks of storage in situations whose storage needs conform to the last-in-first-out constraints of stack discipline.

Happily, many common storage usage patterns fit naturally with the constraints of stack discipline. In addition to nested procedure call/return support, these include the entry/exit of code blocks in common block-structured compiled language, and the nesting of interrupt handler invocations mentioned in Section 14.9 and explored in detail in Chapters 22 and 23.

However, not all allocation/deallocation needs conform neatly to stack discipline. These include several situations that arise naturally in common languages, such as the nesting of procedure definitions mentioned in Chapter 17; support for these patterns requires the use of a storage heap, a more sophisticated allocation/deallocation pool that liberates the user from the requirements of stack discipline.

It is noteworthy that the combined allocation/deallocation pattern arising from the parallel execution of several asynchronous sequential processes does not in general obey stack discipline, even if each process obeys stack discipline. As we will see in Chapter 22, we deal with such systems by allocating a separate stack for each sequential process.