EE380 Assignment 2 Solution
A particular program expressed in a particular ISA executes 200 ALU instructions, 10 Loads, 16 Stores, and 4 Branches. A simple, non-pipelined, implementation of that ISA takes 8 CPI for each ALU instruction, 20 CPI for each load, 10 CPI for each Store, and 10 CPI for each Branch. The original clock frequency is 2GHz. How many clock cycles would the program take to execute? How many microseconds would the program take to execute?
Instruction type   Count   CPI   Cycles
ALU                200     8     1600
Load               10      20    200
Store              16      10    160
Branch             4       10    40
(200*8)+(10*20)+(16*10)+(4*10) = 2000 cycles total
2000 cycles @ 2GHz = 1us
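The cycle count and runtime can be checked with a short script; the instruction counts, CPIs, and 2GHz clock are taken directly from the problem statement:

```python
# Instruction mix from the problem: {type: (count, CPI)}
mix = {
    "ALU":    (200, 8),
    "Load":   (10, 20),
    "Store":  (16, 10),
    "Branch": (4, 10),
}

cycles = sum(count * cpi for count, cpi in mix.values())
freq_hz = 2e9                       # 2 GHz clock
time_us = cycles / freq_hz * 1e6    # seconds -> microseconds

print(cycles)   # 2000
print(time_us)  # 1.0
```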
For this question, check all that apply.
Given the circumstances described in question 1 above, which of the following changes by itself would yield at least 2X speedup?
A clever compiler is able to eliminate all the Branch instructions
An improved ALU design reduces ALU instruction CPI from 8 to 2
This yields 200*2+10*20+16*10+4*10=800 cycles, i.e. 2000/800 = 2.5X speedup
Rewriting the program reduces the number of ALU instructions to 100
Adding a cache reduces Load CPI from 20 to 5 and Store CPI from 10 to 5
New VLSI fabrication technology halves the clock period, but doesn't change memory speed so Load takes 40 CPI
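One way to check each option against the 2X bar is to recompute the execution time for each change; a sketch (the option labels are mine, and the baseline counts/CPIs come from question 1):

```python
def exec_time_ns(counts, cpis, freq_ghz=2.0):
    # counts/cpis are ordered [ALU, Load, Store, Branch].
    cycles = sum(c * p for c, p in zip(counts, cpis))
    return cycles / freq_ghz  # cycles / (cycles per ns) = ns

base = exec_time_ns([200, 10, 16, 4], [8, 20, 10, 10])  # 1000 ns

options = {
    "eliminate branches":     exec_time_ns([200, 10, 16, 0], [8, 20, 10, 10]),
    "ALU CPI 8 -> 2":         exec_time_ns([200, 10, 16, 4], [2, 20, 10, 10]),
    "100 ALU instructions":   exec_time_ns([100, 10, 16, 4], [8, 20, 10, 10]),
    "cache: Load/Store CPI 5": exec_time_ns([200, 10, 16, 4], [8, 5, 5, 10]),
    # Half the clock period = 4 GHz, but Load now takes 40 CPI.
    "2x clock, Load CPI 40":  exec_time_ns([200, 10, 16, 4], [8, 40, 10, 10],
                                           freq_ghz=4.0),
}

for name, t in options.items():
    print(f"{name}: speedup {base / t:.2f}x")
```

Only the improved ALU reaches 2X (2.5X); the others land between roughly 1.02X and 1.82X.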
For this question, check all that apply.
Which of the following statements about performance is/are true?
In computing, more FLOPS is a good thing
Shortest-job-first is a scheduling method to improve throughput
For any given application, the SPEC benchmarks are the best predictor of performance
The time the processor spends executing your program's instructions is called the user time
For two processors implementing the same ISA, the one with the faster clock rate will complete a given program in less time
Sometimes, a synthetic benchmark will have significantly different performance from the program it models. Which do you think is more common: the synthetic benchmark performs better or worse than the real program? Why?
The synthetic benchmark is probably simpler code, and thus more effectively optimized by the compiler... so it will often run faster than the real application.
You have written a program that currently takes 1 hour (60 minutes) to run on a high-end desktop computer: reading the data from the input file takes just one minute, the computation takes 57 minutes, and writing the output file takes two minutes. On a large enough parallel supercomputer, you can speed-up the computation a lot... but not the file I/O. According to Amdahl's law, what is the maximum speedup you could hope to get by using a parallel supercomputer?
If the 57 minutes is reduced to 0, we have 60/3 speedup (i.e., 20X)
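The Amdahl's-law bound is just the total time divided by the time that cannot be parallelized, here the 3 minutes of file I/O:

```python
def max_speedup(total_min, serial_min):
    # Best case: the parallelizable part shrinks to zero,
    # leaving only the serial (file I/O) time.
    return total_min / serial_min

# 60-minute run: 1 min read + 2 min write are serial; 57 min compute is parallel.
print(max_speedup(60, 1 + 2))  # 20.0
```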
With all those self-driving taxis scooting around, it would be nice if you could hail one just by giving the usual hand signal as it approaches -- rather than having to use a cell phone to make the pick-up request. You would like to build a system using a camera and quite a lot of image processing computation to recognize when a human is hailing. The currently best processor you can buy isn't fast enough to do this computation before the taxi has driven past the person who wanted a ride. Fortunately, your business plan says you don't need to ship your first product for two years. How would you go about figuring-out if fast enough versions of that processor will be available in two years when your product needs to ship? (Note: there are lots of valid answers to this question.)
Lots of possible answers here.... If it's just 4X too slow, Top500 trends suggest that computers might just be that much faster in 2 years time. Alternatively, look at the 2-year product projections from the companies making GPUs (e.g., NVIDIA and AMD) that might be good platforms for that image processing....
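As a rough sanity check on the "just wait" option, using the hypothetical 4X gap mentioned above: closing 4X in two years requires performance to double every year, which you can then compare against published trend data such as the Top500 lists or vendor roadmaps.

```python
gap = 4.0    # hypothetical: processor is 4X too slow today
years = 2    # time until the product must ship

required_annual_factor = gap ** (1 / years)
print(required_annual_factor)  # 2.0 -- performance must double each year
```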