See John P. Hayes, Computer Architecture and Organization, McGraw-Hill, 1978.
Pictures courtesy of Andrea Di Blas, from various sources.
Computer History
Early Computing — Mechanical
1642: Pascal builds an adder/subtractor
• Two 6-decimal-digit registers (dials with mechanical linkage)
• Carry between digits
• Complements for negative numbers
1671: Leibniz adds multiplication and division
• Multiplier and multiplicand registers added
• Uses repeated addition/subtraction for multiplication/division
CE202: Computer Architecture — R. Hughey
Chapter 1: History and Performance
Computer History
1823: Babbage’s Difference Engine
• For generating, e.g., log tables by repeated differences:

      y_{i+1} = y_i + Δ^1 y_i
      Δ^1 y_{i+1} = Δ^1 y_i + Δ^2 y_i
      . . .
      Δ^n y_{i+1} = Δ^n y_i

• 1823–42: builds a 6th-degree, 20-digit difference engine (not completed)
• 1837–53: Swede Georg Scheutz builds a 3rd-degree, 15-digit engine
• 1990s: London museum builds a 40-digit, 7th-degree engine to 19th-century tolerances
  – $2.5 million over 2 years, 2000 parts, hand-cranked

1834: Babbage’s Analytical Engine
• General-purpose machine programmed with Jacquard punch cards
• Organization: the Mill (+ − × ÷), the Store, Printer/Card Punch, Operation Cards, Variable Cards
• Separation of opcode and address for fast decode
• Scroll forward and backward through the cards (goto!)
• Conditional branch on sign

1930–1940 Electromechanical: Konrad Zuse Z1, Howard Aiken Harvard Mark I (first operational general-purpose computers).
Computer History
History 0: mechanical computers (1623 – 1943)
Electronic Computers
1822 Charles Babbage (1791 – 1871) builds the “Difference Engine”
1940s: John Atanasoff, Iowa State
1943–46: ENIAC, Mauchly and Eckert, U. Penn.
• Organization: Card Reader, Printer, Card Punch, Multiplier, Divider and Square Root Extractor, Table Lookup, Control (switches), twenty 10-digit Accumulators
• 30 tons, 18,000 tubes, stored program, CPU
• Simple iteration for ballistic tables
• Quite an early legal battle over patenting the computer
Computer History
1955–64 Second Generation
• Transistor, core memory, index registers, floating point, I/O processors
• Algol, Cobol, Fortran
1943 Eckert and Mauchly build the ENIAC (Electronic Numerical Integrator And Calculator) — the first stored-program “electronic computer”
• Lincoln Labs TX-0, IBM 704 (FP, OS), IBM 709 (IOPS), IBM 7090, IBM 7094 (below), IBM Stretch (pipelining)
Computer History
1965–74 Third Generation
• ICs (SSI and MSI)
• Microprogramming (Wilkes, 1951), pipelining, cache
1945–55: Princeton IAS (below), MIT Whirlwind (core memory), Manchester Atlas (virtual memory, index registers), EDVAC (binary, mercury delay-line registers), Eckert-Mauchly UNIVAC
• Multiprogramming, OS, VM, timesharing (MIT’s CTSS)
• IBM S/360 (1964, below), CDC 6600 (1964), ILLIAC IV parallel computer, Burroughs B5000, DEC PDP-8
Computer History
Computer History — IBM S/360
1975–1985 Fourth Generation
• LSI
• Semiconductor memory, minicomputers, microprocessors
• C, Unix
• DEC PDP-11, IBM S/370, Intel 8748, 68000, 80186
Computer History
1985– RISC (whatever that means!)
• VLSI
• Fast memory
• Compiler and language development
• Sparc, MIPS, HP-PA, Alpha, 80486, . . .
Computer History
RISC Genealogy
Current Technology: Pentium IV
• Vdd 1.4 V
• 2+ GHz
• 52.4 W
• 217 mm² in 0.18 µm CMOS
• 42M transistors
• 478-pin PGA, over 100 power and ground pins
Semiconductor Market
Chip Picture: AMD Athlon
• 60 million transistors/person/year
• $150 billion in 2002
• Processors are 2% of semiconductors.
(Jim Turley, “The Two Percent Solution,” Embedded Systems Programming, Dec. 18, 2002, http://www.embedded.com/showArticle.jhtml?articleID=9900861)
Performance is like a car
Processor Market
• PC processors are 2% of processors.
• 0.04% of units are 15% of semiconductor revenue.
(Jim Turley, “The Two Percent Solution,” Embedded Systems Programming, Dec. 18, 2002, http://www.embedded.com/showArticle.jhtml?articleID=9900861)
Computer Architecture Includes . . .
Performance
Principles to Make Things Fast
• Make the common case fast
• Use locality of reference
  – Temporal locality: if it was used, it may be used again
  – Spatial locality: if it was used, its neighbor may be used as well
[Figure 1.14 (H&P): percentage of a program’s instructions responsible for 80% and for 90% of the instruction executions, plotted per SPEC benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor); y-axis: fraction of the program, 0–60%.]
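Spatial locality can be illustrated with a toy single-line cache model (the line size and array dimensions here are arbitrary choices, not from the slides): traversing a row-major array in row order reuses each fetched line, while column order misses on every access.

```python
# Toy model: a one-line cache over a row-major n x n array, 8 elements per line.
def line_fetches(addresses, line_size=8):
    cached, fetches = None, 0
    for addr in addresses:
        line = addr // line_size
        if line != cached:      # miss: fetch the whole line
            fetches += 1
            cached = line
    return fetches

n = 16
row_order = [i * n + j for i in range(n) for j in range(n)]  # good spatial locality
col_order = [i * n + j for j in range(n) for i in range(n)]  # stride-n accesses

print(line_fetches(row_order))  # 32  (256 accesses, 8 hits per fetched line)
print(line_fetches(col_order))  # 256 (every access lands on a different line)
```

The 8x difference is exactly the line size: row order amortizes each fetch over a full line, column order does not.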
Performance
To measure performance, use speedup. A is S times faster than B on problem P:

    S = T_{B,P} / T_{A,P}

Do not use % faster or % slower (why?)
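Concretely (with made-up run times), a two-line check shows why percentages mislead: with A at 10 s and B at 15 s the speedup is a clean 1.5, yet A is “50% faster” while B is only “33% slower” — the two percentages are not inverses, but speedups compose cleanly.

```python
# Speedup: A is S times faster than B on problem P when S = T_BP / T_AP.
def speedup(t_b, t_a):
    return t_b / t_a

# Hypothetical run times (seconds) for machines A and B on problem P.
t_a, t_b = 10.0, 15.0

s = speedup(t_b, t_a)            # A is 1.5x faster than B
pct_faster = (t_b - t_a) / t_a   # "A is 50% faster than B" ...
pct_slower = (t_b - t_a) / t_b   # ... but "B is only 33% slower than A"
print(s, pct_faster, pct_slower)
```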
Performance
What is performance? Time.
• Elapsed time: computation, memory, disk access, I/O, OS overhead, other jobs, networks, . . .
• CPU time: excludes I/O wait and swapped-out time.
  – User CPU time: running the program
  – System CPU time: CPU time spent on OS requests, exclusive of waits.
Unix ‘time’ provides all three.
Amdahl’s Law: quantification of “make the common case fast.”
Suppose fraction F_e of the original time can be sped up by speedup S_e. Then the new time T_n in terms of the old time T_o is

    T_n = T_o (1 − F_e) + T_o F_e / S_e

so the overall speedup is

    S_n = T_o / T_n = 1 / ((1 − F_e) + F_e / S_e)
Which is best?
Performance
Performance
CPU Time = CPU clock cycles in program × Clock cycle time
Amdahl’s Law example: add a vector processor that runs 20 times faster on 20% of the code. So F_e = 0.2 and S_e = 20, and

    S_n = 1 / (0.8 + 0.2/20) = 1.23
Per program, cpi (clocks per instruction) = CPU clock cycles / Instruction count
What would happen if the VP were only 10 times faster?

    S_n = 1 / (0.8 + 0.2/10) = 1.22
CPU Time = Instruction Count × cpi × Clock Cycle Time
A 5× VP produces 1.19; a 2× VP produces 1.11. What does this mean for the vector processor designer?
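The numbers above drop out of a one-line Amdahl function; a minimal sketch:

```python
# Amdahl's Law: overall speedup when fraction f of the time is sped up by s.
def amdahl(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Vector processor covering 20% of the code, at various VP speedups.
for s in (20, 10, 5, 2):
    print(s, round(amdahl(0.2, s), 2))  # 1.23, 1.22, 1.19, 1.11
```

Note how little the VP speedup matters once the serial 80% dominates — the point of the designer question above.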
These are affected by. . .

    Instruction Count:  instruction set architecture, compiler technology, algorithm choice, . . .
    CPI:                instruction set, hardware organization, compiler optimization, . . .
    Clock Cycle Time:   hardware organization, fabrication technology, system design, . . .

    cpi > 1: CISC        cpi = 1: RISC        cpi < 1: Superscalar
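The CPU-time identity is easy to exercise numerically; a minimal sketch with hypothetical parameters (10⁹ instructions, cpi 1.57, a 500 MHz clock, i.e., a 2 ns cycle — none of these values come from a real machine):

```python
# CPU time = instruction count x CPI x clock cycle time.
def cpu_time(instr_count, cpi, cycle_time_s):
    return instr_count * cpi * cycle_time_s

t = cpu_time(1e9, 1.57, 2e-9)
print(t)  # ~3.14 seconds
```

Any of the three factors can buy the same reduction in time, which is why the table above spreads responsibility across ISA, organization, compiler, and fabrication.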
Instruction mixes

Example: gcc on a load/store machine (meaning. . . )

    Operation   Frequency   CPI
    ALU         0.43        1
    Load        0.21        2
    Store       0.12        2
    Branch      0.24        2

    cpi = Σ_i freq_i · cpi_i = 0.43 + 0.42 + 0.24 + 0.48 = 1.57

With no change in cycle time,

    speedup = T_old / T_new = (I_old · 1.57 · C) / ((0.893 · I_old) · 1.908 · C) = 1.57 / 1.70 = 0.92

Suppose. . .
Metrics
Instruction mixes
Performance Metrics (good and bad)
• Clock speed
  – Useful for a fixed architecture, such as the tradeoff between Si and GaAs Sparc implementations.
  – Not entirely useful within an architecture family, as a newer machine may have more cache, more functional units, etc.
• MIPS = Clock Rate (Hz) / (cpi × 10^6) = 1 / (Clock Period (s) · cpi × 10^6)
  – Instruction set dependent
Register-memory instructions are added, usable for 25% of ALU operations (replacing Load/ALU pairs) with cpi = 2, and branches increase to cpi = 3.

    cpi_new = [ (.43 − .25·.43)·1    (old ALU)
              + (.21 − .25·.43)·2    (old Load)
              + (.25·.43)·2          (new reg-mem)
              + .12·2                (Store)
              + .24·3 ]              (Branch)
              / (1 − .25·.43)        (new instruction count)
            = 1.703 / 0.893 = 1.908
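The whole derivation can be checked mechanically; a sketch using the gcc mix above (frequencies, the 25% fusion rate, and the new CPIs are the slide’s numbers):

```python
# Weighted CPI from an instruction mix: cpi = sum_i freq_i * cpi_i.
def weighted_cpi(mix):
    return sum(freq * cpi for freq, cpi in mix.values())

# Base gcc mix on the load/store machine: class -> (frequency, CPI).
base = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}
cpi_old = weighted_cpi(base)              # 1.57

# Register-memory variant: 25% of ALU ops fuse with their Load (cpi 2),
# branches rise to cpi 3; instruction count shrinks by the fused loads.
fused = 0.25 * 0.43
new_count = 1 - fused                     # 0.893 of the old count
cycles = ((0.43 - fused) * 1 + (0.21 - fused) * 2 + fused * 2
          + 0.12 * 2 + 0.24 * 3)          # cycles per *old* instruction: 1.703
cpi_new = cycles / new_count              # ~1.908

# Same cycle time: speedup = (1 * cpi_old) / (new_count * cpi_new) -- below 1!
speedup = cpi_old / (new_count * cpi_new)
print(round(cpi_old, 2), round(cpi_new, 3), round(speedup, 2))
```

The “improvement” is a slowdown (0.92): the fused instructions save count but raise CPI, and the branch penalty eats the rest.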
  – Program dependent
  – Little relation to performance
  – Relative MIPS is a little better: MIPS = (T_VAX / T) · MIPS_VAX (∗ relative to the VAX-11/780)
Metrics

• MFLOPS

      mflops = (FP operations in program) / (T_program × 10^6)

  – Only for FP-intensive code.
  – Problem with inconsistent FP operations (depends on mix).
  – Normalized FLOP counts for a standard program (e.g., the Livermore Loops):

        Operation                   FLOP
        +, −, ×                     1
        √, ÷                        4
        exp(), sin(), ln(), . . .   8

  – See D. Bailey, “Twelve ways to fool the masses when giving performance results on parallel computers,” RNR Technical Report RNR-91-020.
  Question: Which will be higher, MIPS or MFLOPS?

• Benchmark programs — real programs and suites
  – Specific test program (e.g., SABRE if you’re American Airlines, or TeX, gcc, spice if you’re H&P).
  – Test suite
    ∗ SPEC — System Performance Evaluation Cooperative
      · SPEC1.0 — 10 programs.
      · SPEC92 — split into 6 integer and 14 FP.
      · SPEC95 — further revision: larger footprint, longer runtimes, more applications; 8 integer, 10 FP.
      · SPEC2000 — many revisions, larger datasets, currently in a 6-month transition period; 12 integer, 14 FP.
      · Other benchmarks for WWW servers, transaction processing, . . .
      · http://open.specbench.org
    ∗ NAS Parallel Benchmarks — a suite of programs for evaluating parallel processors, including sorting, “embarrassingly parallel,” and several fluid dynamics codes.

Fooling the Masses — D. Bailey

Many of us in the field of highly parallel scientific computing recognize that it is often quite difficult to match the run time performance of the best conventional supercomputers. But since lay persons usually don’t appreciate these difficulties and therefore don’t understand when we quote mediocre performance results, it is often necessary for us to adopt some advanced techniques in order to deflect attention from possibly unfavorable facts. Here are some of the most effective methods, as observed from recent scientific papers and technical presentations:

1. Quote only 32-bit performance results, not 64-bit results. We all know that it is hard to obtain impressive performance using 64-bit floating point arithmetic. Some research systems do not even have 64-bit hardware. Thus always quote 32-bit results, and avoid mentioning this fact if at all possible. Better still, compare your 32-bit results with 64-bit results on other systems. 32-bit arithmetic may or may not be appropriate for your application, but the audience doesn’t need to be bothered with such details.

2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application. It is quite difficult to obtain high performance on a complete large-scale scientific application, timed from beginning of execution through completion. There is often a great deal of data movement and initialization that depresses overall performance rates. A good solution to this dilemma is to present results for an inner kernel of an application, which can be souped up with artificial tricks. Then imply in your presentation that these rates are equivalent to the overall performance of the entire application.

3. Quietly employ assembly code and other low-level language constructs. It is often hard to obtain good performance from straightforward Fortran or C code that employs the usual parallel programming constructs, due to compiler weaknesses on many highly parallel computer systems. Thus you should feel free to employ assembly-coded computation kernels, customized communication routines and other low-level code in your parallel implementation. Don’t mention such usage, though, since it might alarm the audience to learn that assembly-level coding is necessary to obtain respectable performance.

4. Scale up the problem size with the number of processors, but omit any mention of this fact. Graphs of performance rates versus the number of processors have a nasty habit of trailing off. This problem can easily be remedied by plotting the performance rates for problems whose sizes scale up with the number of processors. The important point is to omit any mention of this scaling in your plots and tables. Clearly disclosing this fact might raise questions about the efficiency of your implementation.

5. Quote performance results projected to a full system. Few labs can afford a full-scale parallel computer — such systems cost millions of dollars. Unfortunately, the performance of a code on a scaled down system is often not very impressive. There is a straightforward solution to this dilemma — project your performance results linearly to a full system, and quote the projected results, without justifying the linear scaling. Be very careful not to mention this projection, however, since it could seriously undermine your performance claims for the audience to realize that you did not actually obtain your results on real full-scale hardware.

6. Compare your results against scalar, unoptimized code on Crays. It really impresses the audience when you can state that your code runs several times faster than a Cray, currently the world’s dominant supercomputer. Unfortunately, with a little tuning many applications run quite fast on Crays. Therefore you must be careful not to do any tuning on the Cray code. Do not insert vectorization directives, and if you find any, remove them. In extreme cases it may be necessary to disable all vectorization with a command line flag. Also, Crays often run much slower with bank conflicts, so be sure that your Cray code accesses data with large, power-of-two strides whenever possible. It is also important to avoid multitasking and autotasking on Crays — imply in your paper that the one processor Cray performance rates you are comparing against represent the full potential of a $25 million Cray system.

7. When direct run time comparisons are required, compare with an old code on an obsolete system. Direct run time comparisons can be quite embarrassing, especially if your parallel code runs significantly slower than an implementation on a conventional system. If you are challenged to provide such figures, compare your results with the performance of an obsolete code running on obsolete hardware with an obsolete compiler. For example, you can state that your parallel performance is “100 times faster than a VAX 11/780”. A related technique is to compare your results with results on another less capable parallel system or minisupercomputer. Keep in mind the bumper sticker “We may be slow, but we’re ahead of you.”

8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation. We know that MFLOPS rates of parallel codes are often not very impressive. Fortunately, there are some tricks that can make these figures more respectable. The most effective scheme is to compute the operation count based on an inflated parallel implementation. Parallel implementations often perform far more floating point operations than the best sequential implementation. Often millions of operations are masked out or merely repeated in each processor. Millions more can be included simply by inserting a few dummy loops that do nothing. Including these operations in the count will greatly increase the resulting MFLOPS rate and make your code look like a real winner.

9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar. As mentioned above, run time or even MFLOPS comparisons of codes on parallel systems with equivalent codes on conventional supercomputers are often not favorable. Thus whenever possible, use other performance measures. One of the best is “processor utilization” figures. It sounds great when you can claim that all processors are busy nearly 100% of the time, even if what they are actually busy with is synchronization and communication overhead. Another useful statistic is “parallel speedup” — you can claim “fully linear” speedup simply by making sure that the single processor version runs sufficiently slowly. For example, make sure that the single processor version includes synchronization and communication overhead, even though this code is not necessary when running on only one processor. A third statistic that many in the field have found useful is “MFLOPS per dollar”. Be sure not to use “sustained MFLOPS per dollar”, i.e. actual delivered computational throughput per dollar, since these figures are often not favorable to new computer systems.

10. Mutilate the algorithm used in the parallel implementation to match the architecture. Everyone is aware that algorithmic changes are often necessary when we port applications to parallel computers. Thus in your parallel implementation, it is essential that you select algorithms which exhibit high MFLOPS performance rates, without regard to fundamental efficiency. Unfortunately, such algorithmic changes often result in a code that requires far more time to complete the solution. For example, explicit linear system solvers for partial differential equation applications typically run at rather high MFLOPS rates on parallel computers, although they in many cases converge much slower than implicit or multigrid methods. For this reason you must be careful to downplay your changes to the algorithm, because otherwise the audience might wonder why you employed such an inappropriate solution technique.

11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment. There are a number of ways to further boost the performance of your parallel code relative to the conventional code. One way is to make many runs on both systems, and then publish the best time for the parallel system and the worst time for the conventional system. Another is to time your parallel computer code on a dedicated system and time your conventional code in a normal loaded environment. After all, your conventional supercomputer is very busy, and it is hard to arrange dedicated time. If anyone in the audience asks why the parallel system is freely available for dedicated runs, but the conventional system isn’t, change the subject.

12. If all else fails, show pretty pictures and animated videos, and don’t talk about performance. It sometimes happens that the audience starts to ask all sorts of embarrassing questions. These people simply have no respect for the authorities of our field. If you are so unfortunate as to be the object of such disrespect, there is always a way out — simply conclude your technical presentation and roll the videotape. Audiences love razzle-dazzle color graphics, and this material often helps deflect attention from the substantive technical issues.

Acknowledgments

The author wishes to acknowledge helpful contributions and comments by the following persons: R. Bailey, E. Barszcz, R. Fatoohi, P. Frederickson, J. McGraw, J. Riganati, R. Schreiber, H. Simon, V. Venkatakrishnan, S. Weeratunga, J. Winget and M. Zosel.
Metrics

• Benchmark programs — synthetic
  Attempt to provide typical performance and program mix, but some compilers optimize away the benchmark, such as reducing

      X = SQRT (EXP (ALOG (X) / T1))

  to

      X = EXP (ALOG (X) / (2 * T1))

  in Whetstone, the classic FP synthetic. Dhrystone is the integer one.
  Copies in /cse/classes/cmpe202/benchmarks/standard
• Benchmark programs — kernels
  Isolated routines of prime importance: Linpack (linear algebra), Livermore Loops (FP).

Metrics — SPEC CPU95

SPEC CFP95 — drop 8, add 4

    101.tomcatv   Vectorized mesh generation
    102.swim      Shallow water equations
    103.su2cor    Monte-Carlo method
    104.hydro2d   Navier-Stokes equations
    107.mgrid     3D potential field
    110.applu     Partial differential equations
    125.turb3d    Turbulence modeling
    141.apsi      Weather prediction
    145.fpppp     From the Gaussian series of quantum chemistry benchmarks
    146.wave5     Maxwell’s equations
Metrics — SPEC CPU95

SPEC INT95 — drop 3, add 5

    099.go        An internationally ranked go-playing program
    124.m88ksim   A chip simulator for the Motorola 88100 microprocessor
    126.gcc       Based on the GNU C compiler version 2.5.3
    129.compress  An in-memory version of the common UNIX utility
    130.li        Xlisp interpreter
    132.ijpeg     Image compression/decompression on in-memory images
    134.perl      An interpreter for the Perl language
    147.vortex    An object-oriented database

Metrics — SPEC CPU2000

SPEC INT2000 — modify 2, add 10, drop 6

    164.gzip      Data compression utility
    175.vpr       FPGA circuit placement and routing
    176.gcc       C compiler
    181.mcf       Minimum cost network flow solver
    186.crafty    Chess program
    197.parser    Natural language processing
    252.eon       Ray tracing
    253.perlbmk   Perl
    254.gap       Computational group theory
    255.vortex    Object-oriented database
    256.bzip2     Data compression utility
    300.twolf     Place and route simulator
Metrics
Combining the program times of P programs:

    Sum:             Σ_{i=1}^{P} T_i
    Mean:            (1/P) Σ_{i=1}^{P} T_i
    Weighted mean:   Σ_{i=1}^{P} w_i T_i
    Geometric mean:  (Π_{i=1}^{P} T_i)^{1/P} = e^{Σ_{i=1}^{P} ln(T_i)/P}

• SPEC uses the geometric mean normalized to a given machine.
  – SPEC1.0 used a VAX-11/780.
  – SPEC95 uses a SPARCstation 10 Model 40.
• SpecRatio is the time on the reference machine divided by the time on the test machine.
• SpecMark is the geometric mean of the SpecRatios.
• This is the same as the ratio of the geometric means.
• The geometric mean does not correspond to a real instruction mix.

Metrics — SPEC CPU2000

SPEC CFP2000 — modify 3, add 11, drop 7

    168.wupwise   Quantum chromodynamics
    171.swim      Shallow water modeling
    172.mgrid     Multi-grid solver in a 3D potential field
    173.applu     Parabolic/elliptic partial differential equations
    177.mesa      3D graphics library
    178.galgel    Fluid dynamics: analysis of oscillatory instability
    179.art       Neural network simulation: adaptive resonance theory
    183.equake    Finite element simulation: earthquake modeling
    187.facerec   Computer vision: recognizes faces
    188.ammp      Computational chemistry
    189.lucas     Number theory: primality testing
    191.fma3d     Finite-element crash simulation
    200.sixtrack  Particle accelerator model
    301.apsi      Weather prediction/pollutants
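The SpecMark procedure reduces to a geometric mean of ratios; a sketch with hypothetical reference and test-machine times (three programs, test machine uniformly 2x faster), which also checks the claim that SpecMark equals the ratio of the geometric means:

```python
import math

# Geometric mean of P values: (prod T_i)^(1/P) = exp(mean of ln T_i).
def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical times (seconds) for 3 programs.
ref  = [100.0, 200.0, 400.0]   # reference-machine times
test = [50.0, 100.0, 200.0]    # test-machine times (2x faster across the board)

ratios = [r / t for r, t in zip(ref, test)]  # SpecRatios
specmark = geomean(ratios)
print(specmark)  # ~2.0

# Ratio of the geometric means equals the geometric mean of the ratios.
print(geomean(ref) / geomean(test))  # ~2.0
```

This equality is why normalizing before or after taking the geometric mean gives the same SpecMark — a property the arithmetic mean does not have.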