See John P. Hayes, Computer Architecture and Organization, McGraw-Hill, 1978.
Pictures courtesy of Andrea Di Blas, from various sources.
Computer History
Early Computing — Mechanical
1642: Pascal builds an adder/subtractor
• Two 6-decimal-digit registers (dials with mechanical linkage)
• Carry between digits
• Complements for negative numbers
1671: Leibniz adds multiplication and division
• Multiplier and multiplicand registers added
• Uses repeated addition/subtraction for multiplication/division
CE202: Computer Architecture — R. Hughey
Chapter 1: History and Performance
Computer History
1823: Babbage’s Difference Engine
• For generating, e.g., log tables by repeated differences:

      y_{i+1} = y_i + Δ^1 y_i
      Δ^1 y_{i+1} = Δ^1 y_i + Δ^2 y_i
      . . .
      Δ^n y_{i+1} = Δ^n y_i

• 1823–42: builds a 6th-degree, 20-digit difference engine (not completed)
• 1837–53: Swede Georg Scheutz builds a 3rd-degree, 15-digit engine
• 1990s: London museum builds a 40-digit, 7th-degree engine to 19th-century tolerances
  – $2.5 million over 2 years, 2000 parts, hand-cranked

1834: Babbage’s Analytical Engine
• General-purpose machine programmed with Jacquard punch cards
• Organization: the Mill (+ − × ÷), the Store, Printer/Card Punch, Operation Cards, Variable Cards
• Separation of opcode and address for fast decode
• Scroll forward and backward through the cards (goto!)
• Conditional branch on sign

1930–1940 Electromechanical: Konrad Zuse Z1, Howard Aiken Harvard Mark I (first operational general-purpose computers).
Computer History
History 0: mechanical computers (1623 – 1943)
Electronic Computers
1822 Charles Babbage (1791 – 1871) builds the “Difference Engine”
1940s: John Atanasoff, Iowa State
1943–46: ENIAC, Mauchly and Eckert, U. Penn.
• Organization: Card Reader, Printer, Card Punch, Multiplier, Divider and Square Root Extractor, Table Lookup, Control (switches), twenty 10-digit Accumulators
• 30 tons, 18,000 tubes, stored program, CPU
• Simple iteration for ballistic tables
• Quite an early legal battle over patenting the computer
Computer History
1955–64 Second Generation
• Transistor, core memory, index registers, floating point, I/O processors
• Algol, Cobol, Fortran
1943 Eckert and Mauchly build the ENIAC (Electronic Numerical Integrator And Calculator) — the first stored-program “electronic computer”
• Lincoln Labs TX-0, IBM 704 (FP, OS), IBM 709 (IOPS), IBM 7090, IBM 7094 (below), IBM Stretch (pipelining)
Computer History
1965–74 Third Generation
• ICs (SSI and MSI)
• Microprogramming (Wilkes, 1951), pipelining, cache
1945–55: Princeton IAS (below), MIT Whirlwind (core memory), Manchester Atlas (virtual memory, index registers), EDVAC (binary, mercury delay-line registers), Eckert-Mauchly UNIVAC
• Multiprogramming, OS, VM, timesharing (MIT’s CTSS)
• IBM S/360 (1964, below), CDC 6600 (1964), ILLIAC IV parallel computer, Burroughs B5000, DEC PDP-8
Computer History
Computer History — IBM S/360
1975–1985 Fourth Generation
• LSI
• Semiconductor memory, minicomputers, microprocessors
• C, Unix
• DEC PDP-11, IBM S/370, Intel 8748, 68000, 80186
Computer History
1985– RISC (whatever that means!)
• VLSI
• Fast memory
• Compiler and language development
• Sparc, MIPS, HP-PA, Alpha, 80486, . . .
Computer History
RISC Genealogy
Current Technology: Pentium IV
• Vdd 1.4 V
• 2+ GHz
• 52.4 W
• 217 mm² in 0.18 µm CMOS
• 42M transistors
• 478-pin PGA, over 100 power and ground pins
Semiconductor Market
Chip Picture: AMD Athlon
• 60 million transistors/person/year
• $150 billion in 2002
• Processors are 2% of semiconductors.
(Jim Turley, “The Two Percent Solution,” Embedded Systems Programming, Dec. 18, 2002, http://www.embedded.com/showArticle.jhtml?articleID=9900861)
Performance is like a car
Processor Market
• PC processors are 2% of processors.
• 0.04% of units are 15% of semiconductor revenue.
(Jim Turley, “The Two Percent Solution,” Embedded Systems Programming, Dec. 18, 2002, http://www.embedded.com/showArticle.jhtml?articleID=9900861)
Computer Architecture Includes . . .
Performance
Principles to Make Things Fast
• Make the common case fast
• Use locality of reference
  – Temporal locality: if it was used, it may be used again
  – Spatial locality: if it was used, its neighbor may be used as well
[Figure 1.14 (H&P): percentage of a program’s instructions responsible for 80% and for 90% of the instruction executions, plotted per SPEC benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor); y-axis: fraction of the program, 0–60%.]
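Spatial locality can be illustrated with a toy single-line cache model (the line size and array dimensions here are arbitrary choices, not from the slides): traversing a row-major array in row order reuses each fetched line, while column order misses on every access.

```python
# Toy model: a one-line cache over a row-major n x n array, 8 elements per line.
def line_fetches(addresses, line_size=8):
    cached, fetches = None, 0
    for addr in addresses:
        line = addr // line_size
        if line != cached:      # miss: fetch the whole line
            fetches += 1
            cached = line
    return fetches

n = 16
row_order = [i * n + j for i in range(n) for j in range(n)]  # good spatial locality
col_order = [i * n + j for j in range(n) for i in range(n)]  # stride-n accesses

print(line_fetches(row_order))  # 32  (256 accesses, 8 hits per fetched line)
print(line_fetches(col_order))  # 256 (every access lands on a different line)
```

The 8x difference is exactly the line size: row order amortizes each fetch over a full line, column order does not.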
Performance
To measure performance, use speedup. A is S times faster than B on problem P:

    S = T_{B,P} / T_{A,P}

Do not use % faster or % slower (why?)
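Concretely (with made-up run times), a two-line check shows why percentages mislead: with A at 10 s and B at 15 s the speedup is a clean 1.5, yet A is “50% faster” while B is only “33% slower” — the two percentages are not inverses, but speedups compose cleanly.

```python
# Speedup: A is S times faster than B on problem P when S = T_BP / T_AP.
def speedup(t_b, t_a):
    return t_b / t_a

# Hypothetical run times (seconds) for machines A and B on problem P.
t_a, t_b = 10.0, 15.0

s = speedup(t_b, t_a)            # A is 1.5x faster than B
pct_faster = (t_b - t_a) / t_a   # "A is 50% faster than B" ...
pct_slower = (t_b - t_a) / t_b   # ... but "B is only 33% slower than A"
print(s, pct_faster, pct_slower)
```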
Performance
What is performance? Time.
• Elapsed time: computation, memory, disk access, I/O, OS overhead, other jobs, networks, . . .
• CPU time: excludes I/O wait and swapped-out time.
  – User CPU time: running the program
  – System CPU time: CPU time spent on OS requests, exclusive of waits.
Unix ‘time’ provides all three.
Amdahl’s Law: quantification of “make the common case fast.”
Suppose fraction F_e of the original time can be sped up by speedup S_e. Then the new time T_n in terms of the old time T_o is

    T_n = T_o (1 − F_e) + T_o F_e / S_e

so the overall speedup is

    S_n = T_o / T_n = 1 / ((1 − F_e) + F_e / S_e)
Which is best?
Performance
Performance
CPU Time = CPU clock cycles in program × Clock cycle time
Amdahl’s Law example: add a vector processor that runs 20 times faster on 20% of the code. So F_e = 0.2 and S_e = 20, and

    S_n = 1 / (0.8 + 0.2/20) = 1.23
Per program, cpi (clocks per instruction) = CPU clock cycles / Instruction count
What would happen if the VP were only 10 times faster?

    S_n = 1 / (0.8 + 0.2/10) = 1.22
CPU Time = Instruction Count × cpi × Clock Cycle Time
A 5× VP produces 1.19; a 2× VP produces 1.11. What does this mean for the vector processor designer?
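The numbers above drop out of a one-line Amdahl function; a minimal sketch:

```python
# Amdahl's Law: overall speedup when fraction f of the time is sped up by s.
def amdahl(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Vector processor covering 20% of the code, at various VP speedups.
for s in (20, 10, 5, 2):
    print(s, round(amdahl(0.2, s), 2))  # 1.23, 1.22, 1.19, 1.11
```

Note how little the VP speedup matters once the serial 80% dominates — the point of the designer question above.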
These are affected by. . .

    Instruction Count:  instruction set architecture, compiler technology, algorithm choice, . . .
    CPI:                instruction set, hardware organization, compiler optimization, . . .
    Clock Cycle Time:   hardware organization, fabrication technology, system design, . . .

    cpi > 1: CISC        cpi = 1: RISC        cpi < 1: Superscalar
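The CPU-time identity is easy to exercise numerically; a minimal sketch with hypothetical parameters (10⁹ instructions, cpi 1.57, a 500 MHz clock, i.e., a 2 ns cycle — none of these values come from a real machine):

```python
# CPU time = instruction count x CPI x clock cycle time.
def cpu_time(instr_count, cpi, cycle_time_s):
    return instr_count * cpi * cycle_time_s

t = cpu_time(1e9, 1.57, 2e-9)
print(t)  # ~3.14 seconds
```

Any of the three factors can buy the same reduction in time, which is why the table above spreads responsibility across ISA, organization, compiler, and fabrication.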
Instruction mixes

Example: gcc on a load/store machine (meaning. . . )

    Operation   Frequency   CPI
    ALU         0.43        1
    Load        0.21        2
    Store       0.12        2
    Branch      0.24        2

    cpi = Σ_i freq_i · cpi_i = 0.43 + 0.42 + 0.24 + 0.48 = 1.57

With no change in cycle time,

    speedup = T_old / T_new = (I_old · 1.57 · C) / ((0.893 · I_old) · 1.908 · C) = 1.57 / 1.70 = 0.92

Suppose. . .
Metrics
Instruction mixes
Performance Metrics (good and bad)
• Clock speed
  – Useful for a fixed architecture, such as the tradeoff between Si and GaAs Sparc implementations.
  – Not entirely useful within an architecture family, as a newer machine may have more cache, more functional units, etc.
• MIPS = Clock Rate (Hz) / (cpi × 10^6) = 1 / (Clock Period (s) · cpi × 10^6)
  – Instruction set dependent
Register-memory instructions are added, usable for 25% of ALU operations (replacing Load/ALU pairs) with cpi = 2, and branches increase to cpi = 3.

    cpi_new = [ (.43 − .25·.43)·1    (old ALU)
              + (.21 − .25·.43)·2    (old Load)
              + (.25·.43)·2          (new reg-mem)
              + .12·2                (Store)
              + .24·3 ]              (Branch)
              / (1 − .25·.43)        (new instruction count)
            = 1.703 / 0.893 = 1.908
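The whole derivation can be checked mechanically; a sketch using the gcc mix above (frequencies, the 25% fusion rate, and the new CPIs are the slide’s numbers):

```python
# Weighted CPI from an instruction mix: cpi = sum_i freq_i * cpi_i.
def weighted_cpi(mix):
    return sum(freq * cpi for freq, cpi in mix.values())

# Base gcc mix on the load/store machine: class -> (frequency, CPI).
base = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}
cpi_old = weighted_cpi(base)              # 1.57

# Register-memory variant: 25% of ALU ops fuse with their Load (cpi 2),
# branches rise to cpi 3; instruction count shrinks by the fused loads.
fused = 0.25 * 0.43
new_count = 1 - fused                     # 0.893 of the old count
cycles = ((0.43 - fused) * 1 + (0.21 - fused) * 2 + fused * 2
          + 0.12 * 2 + 0.24 * 3)          # cycles per *old* instruction: 1.703
cpi_new = cycles / new_count              # ~1.908

# Same cycle time: speedup = (1 * cpi_old) / (new_count * cpi_new) -- below 1!
speedup = cpi_old / (new_count * cpi_new)
print(round(cpi_old, 2), round(cpi_new, 3), round(speedup, 2))
```

The “improvement” is a slowdown (0.92): the fused instructions save count but raise CPI, and the branch penalty eats the rest.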
  – Program dependent
  – Little relation to performance
  – Relative MIPS is a little better: MIPS = (T_VAX / T) · MIPS_VAX (∗ relative to the VAX-11/780)
Metrics

• MFLOPS

      mflops = (FP operations in program) / (T_program × 10^6)

  – Only for FP-intensive code.
  – Problem with inconsistent FP operations (depends on mix).
  – Normalized FLOP counts for a standard program (e.g., the Livermore Loops):

        Operation                   FLOP
        +, −, ×                     1
        √, ÷                        4
        exp(), sin(), ln(), . . .   8

  – See D. Bailey, “Twelve ways to fool the masses when giving performance results on parallel computers,” RNR Technical Report RNR-91-020.
  Question: Which will be higher, MIPS or MFLOPS?

• Benchmark programs — real programs and suites
  – Specific test program (e.g., SABRE if you’re American Airlines, or TeX, gcc, spice if you’re H&P).
  – Test suite
    ∗ SPEC — System Performance Evaluation Cooperative
      · SPEC1.0 — 10 programs.
      · SPEC92 — split into 6 integer and 14 FP.
      · SPEC95 — further revision: larger footprint, longer runtimes, more applications; 8 integer, 10 FP.
      · SPEC2000 — many revisions, larger datasets, currently in a 6-month transition period; 12 integer, 14 FP.
      · Other benchmarks for WWW servers, transaction processing, . . .
      · http://open.specbench.org
    ∗ NAS Parallel Benchmarks — a suite of programs for evaluating parallel processors, including sorting, “embarrassingly parallel,” and several fluid dynamics codes.

Fooling the Masses — D. Bailey

Many of us in the field of highly parallel scientific computing recognize that it is often quite difficult to match the run time performance of the best conventional supercomputers. But since lay persons usually don’t appreciate these difficulties and therefore don’t understand when we quote mediocre performance results, it is often necessary for us to adopt some advanced techniques in order to deflect attention from possibly unfavorable facts. Here are some of the most effective methods, as observed from recent scientific papers and technical presentations:

1. Quote only 32-bit performance results, not 64-bit results. We all know that it is hard to obtain impressive performance using 64-bit floating point arithmetic. Some research systems do not even have 64-bit hardware. Thus always quote 32-bit results, and avoid mentioning this fact if at all possible. Better still, compare your 32-bit results with 64-bit results on other systems. 32-bit arithmetic may or may not be appropriate for your application, but the audience doesn’t need to be bothered with such details.

2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application. It is quite difficult to obtain high performance on a complete large-scale scientific application, timed from beginning of execution through completion. There is often a great deal of data movement and initialization that depresses overall performance rates. A good solution to this dilemma is to present results for an inner kernel of an application, which can be souped up with artificial tricks. Then imply in your presentation that these rates are equivalent to the overall performance of the entire application.

3. Quietly employ assembly code and other low-level language constructs. It is often hard to obtain good performance from straightforward Fortran or C code that employs the usual parallel programming constructs, due to compiler weaknesses on many highly parallel computer systems. Thus you should feel free to employ assembly-coded computation kernels, customized communication routines and other low-level code in your parallel implementation. Don’t mention such usage, though, since it might alarm the audience to learn that assembly-level coding is necessary to obtain respectable performance.

4. Scale up the problem size with the number of processors, but omit any mention of this fact. Graphs of performance rates versus the number of processors have a nasty habit of trailing off. This problem can easily be remedied by plotting the performance rates for problems whose sizes scale up with the number of processors. The important point is to omit any mention of this scaling in your plots and tables. Clearly disclosing this fact might raise questions about the efficiency of your implementation.

5. Quote performance results projected to a full system. Few labs can afford a full-scale parallel computer — such systems cost millions of dollars. Unfortunately, the performance of a code on a scaled down system is often not very impressive. There is a straightforward solution to this dilemma — project your performance results linearly to a full system, and quote the projected results, without justifying the linear scaling. Be very careful not to mention this projection, however, since it could seriously undermine your performance claims for the audience to realize that you did not actually obtain your results on real full-scale hardware.

6. Compare your results against scalar, unoptimized code on Crays. It really impresses the audience when you can state that your code runs several times faster than a Cray, currently the world’s dominant supercomputer. Unfortunately, with a little tuning many applications run quite fast on Crays. Therefore you must be careful not to do any tuning on the Cray code. Do not insert vectorization directives, and if you find any, remove them. In extreme cases it may be necessary to disable all vectorization with a command line flag. Also, Crays often run much slower with bank conflicts, so be sure that your Cray code accesses data with large, power-of-two strides whenever possible. It is also important to avoid multitasking and autotasking on Crays — imply in your paper that the one processor Cray performance rates you are comparing against represent the full potential of a $25 million Cray system.

7. When direct run time comparisons are required, compare with an old code on an obsolete system. Direct run time comparisons can be quite embarrassing, especially if your parallel code runs significantly slower than an implementation on a conventional system. If you are challenged to provide such figures, compare your results with the performance of an obsolete code running on obsolete hardware with an obsolete compiler. For example, you can state that your parallel performance is “100 times faster than a VAX 11/780”. A related technique is to compare your results with results on another less capable parallel system or minisupercomputer. Keep in mind the bumper sticker “We may be slow, but we’re ahead of you.”

8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation. We know that MFLOPS rates of parallel codes are often not very impressive. Fortunately, there are some tricks that can make these figures more respectable. The most effective scheme is to compute the operation count based on an inflated parallel implementation. Parallel implementations often perform far more floating point operations than the best sequential implementation. Often millions of operations are masked out or merely repeated in each processor. Millions more can be included simply by inserting a few dummy loops that do nothing. Including these operations in the count will greatly increase the resulting MFLOPS rate and make your code look like a real winner.

9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar. As mentioned above, run time or even MFLOPS comparisons of codes on parallel systems with equivalent codes on conventional supercomputers are often not favorable. Thus whenever possible, use other performance measures. One of the best is “processor utilization” figures. It sounds great when you can claim that all processors are busy nearly 100% of the time, even if what they are actually busy with is synchronization and communication overhead. Another useful statistic is “parallel speedup” — you can claim “fully linear” speedup simply by making sure that the single processor version runs sufficiently slowly. For example, make sure that the single processor version includes synchronization and communication overhead, even though this code is not necessary when running on only one processor. A third statistic that many in the field have found useful is “MFLOPS per dollar”. Be sure not to use “sustained MFLOPS per dollar”, i.e. actual delivered computational throughput per dollar, since these figures are often not favorable to new computer systems.

10. Mutilate the algorithm used in the parallel implementation to match the architecture. Everyone is aware that algorithmic changes are often necessary when we port applications to parallel computers. Thus in your parallel implementation, it is essential that you select algorithms which exhibit high MFLOPS performance rates, without regard to fundamental efficiency. Unfortunately, such algorithmic changes often result in a code that requires far more time to complete the solution. For example, explicit linear system solvers for partial differential equation applications typically run at rather high MFLOPS rates on parallel computers, although they in many cases converge much slower than implicit or multigrid methods. For this reason you must be careful to downplay your changes to the algorithm, because otherwise the audience might wonder why you employed such an inappropriate solution technique.

11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment. There are a number of ways to further boost the performance of your parallel code relative to the conventional code. One way is to make many runs on both systems, and then publish the best time for the parallel system and the worst time for the conventional system. Another is to time your parallel computer code on a dedicated system and time your conventional code in a normal loaded environment. After all, your conventional supercomputer is very busy, and it is hard to arrange dedicated time. If anyone in the audience asks why the parallel system is freely available for dedicated runs, but the conventional system isn’t, change the subject.

12. If all else fails, show pretty pictures and animated videos, and don’t talk about performance. It sometimes happens that the audience starts to ask all sorts of embarrassing questions. These people simply have no respect for the authorities of our field. If you are so unfortunate as to be the object of such disrespect, there is always a way out — simply conclude your technical presentation and roll the videotape. Audiences love razzle-dazzle color graphics, and this material often helps deflect attention from the substantive technical issues.

Acknowledgments

The author wishes to acknowledge helpful contributions and comments by the following persons: R. Bailey, E. Barszcz, R. Fatoohi, P. Frederickson, J. McGraw, J. Riganati, R. Schreiber, H. Simon, V. Venkatakrishnan, S. Weeratunga, J. Winget and M. Zosel.
Metrics

• Benchmark programs — synthetic
  Attempt to provide typical performance and program mix, but some compilers optimize away the benchmark, such as reducing

      X = SQRT (EXP (ALOG (X) / T1))

  to

      X = EXP (ALOG (X) / (2 * T1))

  in Whetstone, the classic FP synthetic. Dhrystone is the integer one.
  Copies in /cse/classes/cmpe202/benchmarks/standard
• Benchmark programs — kernels
  Isolated routines of prime importance: Linpack (linear algebra), Livermore Loops (FP).

Metrics — SPEC CPU95

SPEC CFP95 — drop 8, add 4

    101.tomcatv   Vectorized mesh generation
    102.swim      Shallow water equations
    103.su2cor    Monte-Carlo method
    104.hydro2d   Navier-Stokes equations
    107.mgrid     3D potential field
    110.applu     Partial differential equations
    125.turb3d    Turbulence modeling
    141.apsi      Weather prediction
    145.fpppp     From the Gaussian series of quantum chemistry benchmarks
    146.wave5     Maxwell’s equations
Metrics — SPEC CPU95

SPEC INT95 — drop 3, add 5

    099.go        An internationally ranked go-playing program
    124.m88ksim   A chip simulator for the Motorola 88100 microprocessor
    126.gcc       Based on the GNU C compiler version 2.5.3
    129.compress  An in-memory version of the common UNIX utility
    130.li        Xlisp interpreter
    132.ijpeg     Image compression/decompression on in-memory images
    134.perl      An interpreter for the Perl language
    147.vortex    An object-oriented database

Metrics — SPEC CPU2000

SPEC INT2000 — modify 2, add 10, drop 6

    164.gzip      Data compression utility
    175.vpr       FPGA circuit placement and routing
    176.gcc       C compiler
    181.mcf       Minimum cost network flow solver
    186.crafty    Chess program
    197.parser    Natural language processing
    252.eon       Ray tracing
    253.perlbmk   Perl
    254.gap       Computational group theory
    255.vortex    Object-oriented database
    256.bzip2     Data compression utility
    300.twolf     Place and route simulator
Metrics
Combining the program times of P programs:

    Sum:             Σ_{i=1}^{P} T_i
    Mean:            (1/P) Σ_{i=1}^{P} T_i
    Weighted mean:   Σ_{i=1}^{P} w_i T_i
    Geometric mean:  (Π_{i=1}^{P} T_i)^{1/P} = e^{Σ_{i=1}^{P} ln(T_i)/P}

• SPEC uses the geometric mean normalized to a given machine.
  – SPEC1.0 used a VAX-11/780.
  – SPEC95 uses a SPARCstation 10 Model 40.
• SpecRatio is the time on the reference machine divided by the time on the test machine.
• SpecMark is the geometric mean of the SpecRatios.
• This is the same as the ratio of the geometric means.
• The geometric mean does not correspond to a real instruction mix.

Metrics — SPEC CPU2000

SPEC CFP2000 — modify 3, add 11, drop 7

    168.wupwise   Quantum chromodynamics
    171.swim      Shallow water modeling
    172.mgrid     Multi-grid solver in a 3D potential field
    173.applu     Parabolic/elliptic partial differential equations
    177.mesa      3D graphics library
    178.galgel    Fluid dynamics: analysis of oscillatory instability
    179.art       Neural network simulation: adaptive resonance theory
    183.equake    Finite element simulation: earthquake modeling
    187.facerec   Computer vision: recognizes faces
    188.ammp      Computational chemistry
    189.lucas     Number theory: primality testing
    191.fma3d     Finite-element crash simulation
    200.sixtrack  Particle accelerator model
    301.apsi      Weather prediction/pollutants
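The SpecMark procedure reduces to a geometric mean of ratios; a sketch with hypothetical reference and test-machine times (three programs, test machine uniformly 2x faster), which also checks the claim that SpecMark equals the ratio of the geometric means:

```python
import math

# Geometric mean of P values: (prod T_i)^(1/P) = exp(mean of ln T_i).
def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical times (seconds) for 3 programs.
ref  = [100.0, 200.0, 400.0]   # reference-machine times
test = [50.0, 100.0, 200.0]    # test-machine times (2x faster across the board)

ratios = [r / t for r, t in zip(ref, test)]  # SpecRatios
specmark = geomean(ratios)
print(specmark)  # ~2.0

# Ratio of the geometric means equals the geometric mean of the ratios.
print(geomean(ref) / geomean(test))  # ~2.0
```

This equality is why normalizing before or after taking the geometric mean gives the same SpecMark — a property the arithmetic mean does not have.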