## CENG3420 Homework 2

## Due: Mar. 29, 2017

- Q1 (15%) You are required to develop some simple measures of pipeline performance and relative speedup.
  - 1. Let  $T_{k,n}$  be the total time required for a pipeline with k stages to execute n instructions. Speedup of k stage pipeline is given by,

$$S_k = \frac{T_{k,1}}{T_{k,n}}.\tag{1}$$

Determine  $S_k$  in terms of k and n.

- 2. Consider an instruction sequence of length n that is streaming through the instruction pipeline. Let p be the probability of encountering a conditional or unconditional branch instruction, and let q be the probability that execution of a branch instruction I causes a jump to a nonconsecutive address. Assume that each such jump requires the pipeline to be cleared, destroying all ongoing instruction processing, when I emerges from the last stage. Determine  $S_k$  in terms of k, n, p and q.
- Q2 (15%) Considering the single-cycle datapath shown in Fig. 1, you are required to redesign it into a multicycle path.



Figure 1: Q2

- 1. How many state registers are required? Where should those registers be placed? What are the functions of them?
- 2. In which stage control signals are generated for the above multi-cycle processor.
- Q3 (15%) Answer the following questions about 4-bit muliplication implementation of ALU design:
  - 1. A straightforward way to do multiplication is shifting and adding. Write down the details of each iteration on calculating  $5 \times 6$ .
  - 2. If we want to calculate  $5 \times -6$ , the above method will fail beacuse -6 is represented as 2's complement. An efficient algorithm to solve this problem is shown in Fig. 2. Calculate  $5 \times -6$  step by step with the alogrithm. An example of the following algorithm is shown in Table 1, whre final result 0001,0101 (21<sub>10</sub>) is stored in A,Q after the last operation.



Figure 2: Q3

| Cycle   | Operation     | Α    | Q    | <b>Q</b> <sub>-1</sub> | Μ    |
|---------|---------------|------|------|------------------------|------|
|         | Initial Value | 0000 | 0011 | 0                      | 0111 |
| Cycle 1 | A←A-M         | 1001 | 0011 | 0                      | 0111 |
|         | Shift         | 1100 | 1001 | 1                      | 0111 |
| Cycle 2 | Shift         | 1110 | 0100 | 1                      | 0111 |
| Cycle 3 | A←A+M         | 0101 | 0100 | 1                      | 0111 |
|         | Shift         | 0010 | 1010 | 0                      | 0111 |
| Cycle 4 | Shift         | 0001 | 0101 | 0                      | 0111 |

Table 1: Example of Calculating  $7 \times 3$ 

- Q4 (10%) Consider a single-level cache with an access time of 2.5 ns, a line size of 64 bytes, and a hit ratio of H = 0.95. Main memory uses a block transfer capability that has a firstword (4 bytes) access time of 50 ns and an access time of 5 ns for each word thereafter.
  - 1. What is the access time when there is a cache miss?
  - 2. Suppose that increasing the line size to 128 bytes increases the H to 0.97. Does this reduce the average memory access time?
- Q5 (10%) Consider a machine with a byte addressable main memory of 2<sup>16</sup> bytes and block size of 8 bytes. Assume that a direct mapped cache consisting of 32 lines is used with this machine.
  - 1. How is a 16-bit memory address divided into tag, line number, and byte number?
  - 2. Why is the tag also stored in the cache?
- Q6 (10%)

Consider a memory system that uses a 32-bit address to address at the byte level, plus a cache that uses a 64-byte line size.

- 1. Assume a direct mapped cache with a tag field in the address of 20 bits. Show the address format and determine the following parameters: number of addressable units, number of blocks in main memory, number of lines in cache, size of tag.
- 2. Assume a four-way set-associative cache with a tag field in the address of 9 bits. Show the address format and determine the following parameters: number of addressable units, number of blocks in main memory, number of lines in set, number of sets in cache, number of lines in cache, size of tag.
- Q7 (10%) In this exercise we look at memory locality properties of matrix computation. The following code is written in C, where elements within the same row are stored contiguously.

```
for (J=0; J<8000; J++)
for (I=0; I<8; I++)
    A[I][J]=B[J][0]+A[J][I];</pre>
```

- 1. How many 32-bit integers can be stored in a 16-byte cache line?
- 2. References to which variables exhibit temporal locality?

- 3. References to which variables exhibit spatial locality?
- Q8 (15%) Consider two processors a & b. Each processor has five stages with latencies shown in the following table. Assume that when pipelining, each pipeline stage costs 20ps extra for the registers between pipeline stages.

|             | IF    | ID    | ΕX    | MEM   | WB    |
|-------------|-------|-------|-------|-------|-------|
| processor a | 300ps | 400ps | 350ps | 550ps | 100ps |
| processor b | 200ps | 150ps | 100ps | 190ps | 140ps |

- 1. If **no** pipeline is considered, for each processor please calculate the following metrics: cycle time; latency of an instruction; the throughput. (Note that throughput is instruction number per second)
- 2. If both a & b are pipelined processors, for each one calculate the following metrics: cycle time; latency of an instruction; the throughput.
- 3. If you could split one of the pipeline stages into 2 equal halves, which one would you choose for processors a & b, respectively? For the two processors: what is the new cycle time, the new latency of an instruction, and the new throughput?