This homework is due by 1pm on Wednesday 5/5. To turn in your homeworks you may either scan and upload to courseworks, or hand-deliver it to Prof. Kim’s office in 469 CSB.

1. To date we have not explicitly stated the size (in number of bits) of the pipeline registers in our 5-stage MIPS pipeline. In the pipelined processor shown below, calculate the bitwidth of the four pipeline registers: IF/ID, ID/EX, EX/MEM, and MEM/WB. Be sure to include the necessary control signals in your final totals. NB: This version of the processor does not have support for operand forwarding or stalls.

2. Structural, data and control hazards typically require a processor pipeline to stall. Listed below are a series of optimization techniques implemented in a compiler or a processor pipeline that are designed to reduce or eliminate stalls due to these hazards. For each of the following optimizations, state which pipeline hazards it addresses and how it addresses it. Some optimization techniques may address more than one hazard, so be sure to include explanations for all addressed hazards.

   (a) Branch Prediction
   (b) Bypass/forwarding Logic
   (c) Instruction Scheduling
   (d) Increasing the number of functional units (ALUs, adders, etc.)
   (e) Caches

3. Consider the following code sequence:

   ```
   add $t0, $a0, $s0
   lw $t1, 0($t0)
   add $t1, $t1, $t1
   sw $t1, 0($t0)
   ```
(a) Assuming a MIPS processor that has the standard 5-stage pipeline with the ability to stall (i.e., insert bubbles in the execute stage) but has no forwarding paths whatsoever. Complete the execution schedule on this processor for the above code snippet as begun below. You may assume all memory accesses complete in a single cycle.

<table>
<thead>
<tr>
<th>Instr</th>
<th>cc_1</th>
<th>cc_2</th>
<th>cc_3</th>
<th>cc_4</th>
<th>cc_5</th>
<th>cc_6</th>
<th>cc_7</th>
<th>cc_8</th>
<th>cc_9</th>
<th>cc_10</th>
<th>cc_11</th>
<th>cc_12</th>
<th>cc_13</th>
<th>cc_14</th>
<th>cc_15</th>
<th>cc_16</th>
</tr>
</thead>
<tbody>
<tr>
<td>add</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>add</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

(b) Assuming now that the processor does have full forwarding hardware (i.e., can forward from MEM to EX and from WB to EX) what would the schedule look like?

<table>
<thead>
<tr>
<th>Instr</th>
<th>cc_1</th>
<th>cc_2</th>
<th>cc_3</th>
<th>cc_4</th>
<th>cc_5</th>
<th>cc_6</th>
<th>cc_7</th>
<th>cc_8</th>
<th>cc_9</th>
<th>cc_10</th>
<th>cc_11</th>
<th>cc_12</th>
<th>cc_13</th>
<th>cc_14</th>
<th>cc_15</th>
<th>cc_16</th>
</tr>
</thead>
<tbody>
<tr>
<td>add</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>add</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>sw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

(c) Would there be any performance benefit if the programmer substituted `sll $t1, $t1, 2` for the second `add` instruction above? Why or why not?

4. Assume a stack of 32-bit integers, where the base address of the stack is 0x00006000.

(a) Give a list of the six addresses that are accessed by the following sequence of operations:

```scala
stack.push(1)
stack.push(2)
stack.push(3)
stack.pop()
stack.pop()
stack.pop()
```

(b) Illustrate the state of a 64-byte, direct-mapped cache after these 6 memory operations. The cache can hold 8 8-byte lines (i.e., 1 line = 2 words). You may assume the cache is empty at the start. Be sure to show the state of the valid and tag bits in addition to the data values.

(c) What is the miss rate? The hit rate?