ARCHITECTURE AND ARITHMETIC FOR MULTIMEDIA ENHANCED PROCESSORS

A dissertation submitted to the Department of Electrical Engineering and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Daniel Frederick Zucker
June 1997

© Copyright 1997 by Daniel Frederick Zucker
All Rights Reserved


I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Michael J. Flynn (Principal Adviser)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. Ruby B. Lee

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and in quality, as a dissertation for the degree of Doctor of Philosophy. James S. Harris

Approved for the University Committee on Graduate Studies:


Abstract

In the past, displaying video on desktop systems has required high cost special purpose hardware to handle the computationally intensive task of video compression. Recently, special purpose multimedia instruction sets have allowed software-only real time video decompression without extra hardware. This means video functionality can now be handled by a general purpose CPU. With this low cost video capability, the video data type is becoming truly ubiquitous. Most major CPU vendors have adopted this strategy of enhancing a general purpose processor for multimedia applications. Examples include Hewlett Packard's MAX instruction set, Sun's VIS (Visual Instruction Set), Digital's MVI instructions, MIPS' MDMX instructions, and Intel's MMX instructions.

This work investigates similar techniques for applying cost-effective enhancements to a general purpose processor. Using public domain MPEG implementations as benchmarks and trace based simulation, we investigate performance for typical MPEG video decompression applications. Beginning with a system level breakdown of execution time, we propose techniques to improve execution time in three separate architectural components: I/O, arithmetic, and cache memory. For I/O, we show how applying traditional techniques of I/O cache prefetching can reduce the time for reading compressed video data. For arithmetic, we propose software-only techniques to pack multiple data words into a single floating point operand to achieve SIMD type parallelism. A theoretical framework is presented to define the technique's capabilities and limitations. We also propose hardware extensions to increase the robustness of this technique. For cache memory accesses, we compare several hardware based prefetching techniques and then define a stream cache that eliminates up to 80% of cache misses across a range of cache sizes. Finally, we show how the stream cache motivates a technique for inserting software prefetch instructions that improves performance with little or no extra hardware cost. Combining these techniques, we can achieve a speedup of 1.5x for a typical MPEG application.


Acknowledgments

This work marks the completion of three degrees earned at Stanford University. I am immensely grateful for the opportunity of having received what I consider the finest engineering education available. I would like to thank some of the people that have made this possible.

I begin by thanking my financial benefactors. Your money has been well spent. I would like to thank the United States Navy for funding my first undergraduate year, and the generous support of my parents, Donald and Dorothy Zucker, and my Uncle Sheldon and Aunt Marianne Lubar for the rest of my undergraduate education. I would like to thank the National Science Foundation for providing the NSF fellowship that funded my first three years of graduate education. Contracts from NASA and the United States Army directed by Professor Michael J. Flynn and a generous fellowship from Hitachi America, Ltd. funded the rest of my time at Stanford. Finally, I would like to thank the Hewlett Packard Corporation for donating the equipment with which much of this work was performed.

Next I would like to thank my advisors, without whom none of this work would have been possible. I would like to thank Professor Robert Helliwell for serving as my undergraduate advisor. My graduate advisor, Professor Michael J. Flynn, in addition to excellent technical advice, showed me much about how to work with people. I am also grateful for the opportunity Mike gave me to travel to Japan in 1994 as a Teaching Fellow at the Stanford Japan Center. I consider the week we spent traveling and lecturing in Japan as one of the best weeks of my life. I would also like to thank my co-advisor Ruby Lee for her intimate involvement in this work. Her early help in defining a research problem was especially critical. Additionally, she lent industrial insight to the work and gave much help with publications. Also, I would like to thank the current and alumni members of the Stanford Architecture and Arithmetic Research Group.

For moral support, much of which was needed in the years completing this work, I would like to again thank my parents for unfailing support. I would also like to thank my brother and sister and their families, who also contributed greatly to my general state of well being, as did all my roommates on Barbara Drive. Finally, I would like to thank my best friend, Mary Elizabeth Shannon, for showing me the single minded determination to graduate that led to the completion of this work.

There's gold, and it's haunting and haunting;
It's luring me on as of old;
Yet it isn't the gold that I'm wanting
So much as just finding the gold.
It's the great, big, broad land 'way up yonder,
It's the forests where silence has lease;
It's the beauty that thrills me with wonder,
It's the stillness that fills me with peace.

Robert Service, The Spell of the Yukon


Contents

Abstract  v
Acknowledgments  vii

1 Introduction  1
  1.1 Strategy  1
  1.2 Why this strategy  2
  1.3 Organization  3

2 Methodology  5
  2.1 Introduction  5
  2.2 Applications  5
  2.3 Instruction Instrumentation  7
    2.3.1 RYO  8
    2.3.2 ATOM  9
  2.4 Simulators  9

3 System Level Overview  11
  3.1 Introduction  11
  3.2 Methodology  11
  3.3 Overview  13
  3.4 Windowing Environment  19
    3.4.1 Shared Memory  20
    3.4.2 Remote Display  20
  3.5 Dithering  22
  3.6 Memory System  25
  3.7 Summary  27

4 IO  29
  4.1 Introduction  29
  4.2 I/O Breakdown  30
  4.3 Simulated I/O  31
  4.4 Prefetch Strategy  32
  4.5 Summary  34

5 Doubly Enhanced Arithmetic  35
  5.1 Introduction  35
  5.2 Related Work  36
  5.3 Algorithm  38
    5.3.1 Overview  38
    5.3.2 Packing  39
    5.3.3 Unpacking  41
  5.4 Theory  44
    5.4.1 Informal Discussion of Length, Magnitude, and Relative Magnitude  44
    5.4.2 Length and Magnitude  45
    5.4.3 Requirements for DE  53
    5.4.4 Conditions for Implementing DE in DCT  55
    5.4.5 Error in DCT  60
    5.4.6 Comparison with State of the Art  64
  5.5 Implementation  72
    5.5.1 Packed Formats  72
    5.5.2 Rounding Implementation  72
    5.5.3 Method for Extracting Parallelism  74
  5.6 Study 1: JPEG  74
  5.7 Study 2: MPEG  75
    5.7.1 Experimental Methodology  75
    5.7.2 Speedup Model  76
    5.7.3 Base  80
    5.7.4 Faster DCT  81
    5.7.5 Sparse DCT  83
    5.7.6 Quality  85
    5.7.7 Analysis  87
    5.7.8 Increased Parallelism  90
  5.8 Summary  91

6 Hardware Modifications for DE Arithmetic  93
  6.1 The Reasons for Hardware DE Modifications  93
    6.1.1 Problems with Software DE Arithmetic  94
    6.1.2 Three Specific Implementations  95
  6.2 Reset Instruction  95
  6.3 4 way DE Arithmetic  97
    6.3.1 The Bit Alignment Problem  97
    6.3.2 Dividing the Multiplier  102
    6.3.3 Implementation  111
  6.4 Double Floating Point  115
  6.5 Summary  123

7 Cache Prefetching  125
  7.1 Introduction  125
  7.2 Related Work  125
  7.3 Methodology  127
    7.3.1 Simulation Methods  127
    7.3.2 Performance Metrics  127
    7.3.3 Memory Bandwidth  128
  7.4 Stream Buffers  129
  7.5 Stride Prediction Table  132
  7.6 Stream Cache  136
    7.6.1 Series Stream Cache  136
    7.6.2 Parallel Stream Cache  139
  7.7 Time/Area Tradeoffs  142
    7.7.1 Additional Area Requirement  142
    7.7.2 Execution Time  142
  7.8 Effectiveness of Stride Prediction  144
    7.8.1 Effect of Application Movie  145
    7.8.2 Effect of Cache Size  146
    7.8.3 Effect of Cache Type  146
    7.8.4 Effect of Associativity  149
  7.9 Summary  150

8 Software Directed Stride Based Prefetching  153
  8.1 Introduction  153
  8.2 Effect of Prediction File  157
    8.2.1 Effect of Prediction File Movie  157
    8.2.2 Effect of Prediction File Cache Type  158
    8.2.3 Effect of Prediction File Cache Size  162
  8.3 Effect of Cache  163
    8.3.1 Effect of Cache Type  163
    8.3.2 Effect of Cache Associativity  164
  8.4 Software Prefetching with No Stream Cache  166
    8.4.1 Choosing Hint File Cache Type  167
    8.4.2 Choosing Fraction of Prefetch Instructions to Add  171
    8.4.3 Confirmation of Hint File Parameter Selection  173
    8.4.4 Effect of Modifying Execution Model Assumptions  173
  8.5 Summary  180

9 Conclusions  181

A RYO  183
  A.1 Advantages and Disadvantages  183
  A.2 Implementation  186
    A.2.1 Branch Delay Slot  186
    A.2.2 Saving and Restoring Machine State and Parameter Passing  189
  A.3 Overcoming Compiler Errors  191
    A.3.1 Branch Targets  191
    A.3.2 Indirect Procedure Calls  192
  A.4 Example Data  192

B MPEG1 and MPEG2 Miss Rate Data  195
  B.1 mpeg play-hula  195
  B.2 mpeg play-easter  200
  B.3 mpeg2play-tennis  204

C Hardware Design Truth Tables  211

D Software Prefetch Miss Rate Figures  215
  D.1 Series vs. Parallel Stream Cache Hint Generation  215
    D.1.1 Graphs  216
    D.1.2 Table  223
  D.2 Adding Different Numbers of Prefetch Instructions  228
    D.2.1 Graphs  228
    D.2.2 Table  235
  D.3 Varying Instruction Mix and Memory Parameters  240
    D.3.1 Graphs  240
    D.3.2 Table  245

Bibliography  255

List of Tables

2.1 Benchmark movie characteristics.  6
3.1 Breakdown by functional block.  18
4.1 Measured I/O time breakdown where compressed movie data is read from disk.  30
4.2 Measured I/O time breakdown where compressed movie data is read from main memory I/O cache.  30
4.3 Measured vs. simulated times.  32
4.4 Simulated IO cache prefetching.  33
5.1 Benchmark movie characteristics for DE experiments.  36
5.2 Error compared to Loeffler integer DCT.  64
5.3 Kernel calculation comparison for DE, MAX, and MMX code kernel.  67
5.4 Code scheduling for 1 iteration of DE kernel on PA-8000.  69
5.5 Code scheduling for 2 iterations of DE kernel on PA-8000.  69
5.6 Code scheduling for 4 iterations of DE kernel on PA-8000.  70
6.1 Truth table for typical Booth decoding where X represents the multiplicand.  108
6.2 Truth table for modified Booth decoding where X represents the multiplicand. This logic is for group 3 of figure 6.9 when the multiplication is divided into two subwords between bits X4 and X3. Y(i-1) is treated as if it is always zero.  109
6.3 Truth table for modified Booth decoding where X represents the multiplicand. This logic is for group 3 of figure 6.9 when the multiplication is divided into two subwords between bits X3 and X2. Y(i+1) is treated as if it is identical to Yi.  109
8.1 Number of prefetch instructions that capture the indicated percentage of useful prefetches for movie easter in a 32KB direct mapped main cache.  168
8.2 Relative execution time for inserting different numbers of prefetch instructions for movie hula in a direct mapped cache.  172
8.3 Summary of memory access models simulated.  175
C.1 Truth table for 4 way DE mantissa shifter.  212
C.2 Truth table for 4 way DE sticky bit logic (continued in table C.3).  213
C.3 Truth table for 4 way DE sticky bit logic (continued from table C.2).  214

List of Figures

2.1 MPEG overview.  6
2.2 Baseline miss rates. (a) shows data for a direct mapped cache and (b) for a 4 way associative cache.  8
3.1 Time breakdown for mpeg play hula -dither ordered.  12
3.2 Time breakdown for mpeg play easter -dither ordered.  14
3.3 Time breakdown for mpeg play mjackson -dither ordered.  15
3.4 Time breakdown for mpeg2play hula.  16
3.5 Time breakdown for mpeg2play tennis.  17
3.6 Time breakdown for mpeg play hula without using shared memory.  19
3.7 Time breakdown for mpeg play hula displayed on a remote display.  21
3.8 Time breakdown for mpeg play hula -dither fs4.  22
3.9 Time breakdown for mpeg play hula -dither color.  23
3.10 Time breakdown for mpeg play hula -dither ordered2.  24
3.11 Functional breakdown compared to the same functional breakdown with time spent in the memory system subtracted out.  26
5.1 Partial product terms for a 6 bit x 5 bit multiply.  51
5.2 DCT calculation kernel.  55
5.3 DCT calculation tree. Each box represents the kernel shown in figure 5.2.  56
5.4 Code kernel for HP's MAX instructions.  65
5.5 Code kernel for software DE arithmetic.  66
5.6 Code kernel for Intel's MMX instructions.  66
5.7 Percent difference in PSNR for Y, U, and V frame components versus base PSNR. Circles are used to mark components compressed using typical quantization values.  86
5.8 Speedup for ordered dither.  88
5.9 Speedup for V=0.2 and F=1.  89
5.10 Speedup for different numbers of subwords.  90
6.1 A multiply reset instruction.  96
6.2 Adder modifications for variable breakpoint.  101
6.3 A vertically divided multiplier.  103
6.4 A modified multiplier.  103
6.5 A horizontally divided multiplier.  105
6.6 A horizontally divided multiplier with zeroed partial product terms.  105
6.7 A multiplier modified to perform two independent multiplies.  106
6.8 Modification of rounding to divide a multiplier in both directions. Prime notation is used to indicate where rounding is performed.  107
6.9 Booth encoding example (with complement formation not shown).  108
6.10 A Booth multiplier modified to perform two independent multiplies.  110
6.11 Modification for 4 way DE arithmetic. Dots represent unmodified partial product terms, zeros are zeroed terms, and stars are sign extended terms from the partial product term on the right.  112
6.12 Error due to truncation for first subword.  116
6.13 Base adder (after Waser and Flynn).  118
6.14 Modified mantissa portion of adder.  121
7.1 Fraction of misses eliminated for 4 and 8 way stream buffers: (a) is a 4 way stream buffer with a direct mapped main cache, (b) is a 4 way stream buffer with a 4 way associative main cache, (c) is an 8 way stream buffer with a direct mapped main cache, and (d) is an 8 way stream buffer with a 4 way associative main cache.  130
7.2 Fraction of misses eliminated for 16 way stream buffers: (a) is a 16 way stream buffer with a direct mapped main cache and (b) is a 16 way stream buffer with a 4 way associative main cache.  131
7.3 5 way stream buffer architecture.  131
7.4 Stride prediction table architecture.  132
7.5 Fraction of misses eliminated for 64 and 128 entry stride prediction tables: (a) is a 64 entry SPT with a direct mapped main cache, (b) is a 64 entry SPT with a 4 way associative main cache, (c) is a 128 entry SPT with a direct mapped main cache, and (d) is a 128 entry SPT with a 4 way associative main cache.  133
7.6 Fraction of misses eliminated for 512 and 1024 entry stride prediction tables: (a) is a 512 entry SPT with a direct mapped main cache, (b) is a 512 entry SPT with a 4 way associative main cache, (c) is a 1024 entry SPT with a direct mapped main cache, and (d) is a 1024 entry SPT with a 4 way associative main cache.  134
7.7 Series stream cache architecture.  136
7.8 Fraction of misses eliminated for 128 and 256 entry series stream cache using a 128 entry SPT: (a) is a 128 entry series stream cache with a direct mapped main cache, (b) is a 128 entry series stream cache with a 4 way associative main cache, (c) is a 256 entry series stream cache with a direct mapped main cache, and (d) is a 256 entry series stream cache with a 4 way associative main cache.  137
7.9 Fraction of misses eliminated for a 512 entry series stream cache using a 128 entry SPT: (a) is a 512 entry series stream cache with a direct mapped main cache and (b) is a 512 entry series stream cache with a 4 way associative main cache.  138
7.10 Parallel stream cache architecture.  140
7.11 Fraction of misses eliminated for 128 and 256 entry parallel stream cache using a 128 entry SPT: (a) is a 128 entry parallel stream cache with a direct mapped main cache, (b) is a 128 entry parallel stream cache with a 4 way associative main cache, (c) is a 256 entry parallel stream cache with a direct mapped main cache, and (d) is a 256 entry parallel stream cache with a 4 way associative main cache.  141
7.12 Execution times for mpeg play-hula with 128 entry parallel stream cache and 128 entry stride table adjusted for extra area required: (a) is a direct mapped main cache and (b) is a 4 way associative main cache.  143
7.13 Histogram of useful prefetches for movie flower using a 32KB direct mapped cache and parallel stream cache.  144
7.14 Effect of movie file on a 32KB direct mapped main cache and parallel stream cache. (a) is hula, (b) is flower, and (c) is easter.  147
7.15 Effect of main cache size. All graphs are for movie hula with a direct mapped main cache and a parallel stream cache. (a) is a 2KB main cache, (b) is a 4KB main cache, (c) is a 32KB main cache, and (d) is a 1MB main cache.  148
7.16 Effect of cache type for a direct mapped 32KB main cache and movie flowers. (a) is a parallel stream cache and (b) is a series stream cache.  149
7.17 Effect of main cache associativity on movie hula with a 2KB main cache and a parallel stream cache. (a) is a direct mapped main cache, (b) is a 2 way associative main cache, and (c) is a 4 way associative main cache.  151
8.1 Trace output from hardware simulation for movie hula.  155
8.2 Effect of varying movie used to make prediction file. Prediction file was generated using a 2KB direct mapped main cache and series stream cache. Only the 200 most effective instructions issue prefetches. Fraction of misses eliminated is reported for execution of hula performed with a direct mapped main cache and parallel stream cache.  157
8.3 Effect of varying cache type used for generating the prediction file. All graphs compare the effects of predictions generated using a parallel stream cache, a series stream cache, and hardware directed prefetching using the movie hula and a 2KB direct mapped main cache. Only the 200 most effective instructions issue prefetches in the software case. Fraction of misses eliminated is reported for execution performed with a direct mapped main cache and in (a) no stream cache, (b) a parallel stream cache, and (c) a series stream cache.  159
8.4 Effect of varying cache size used for generating the prediction file. All graphs compare the effects of predictions generated using a 2KB, 4KB, and 32KB direct mapped main cache and parallel stream cache using the movie hula. Only the 200 most effective instructions issue prefetches. Fraction of misses eliminated is reported for execution performed with a direct mapped main cache and in (a) no stream cache, (b) a parallel stream cache, and (c) a series stream cache.  161
8.5 Effect of main cache type. Both graphs use hint files generated using a direct mapped main cache and parallel stream cache using the movie hula. Only the 200 most effective instructions issue prefetches. (a) used a 2KB cache to generate the hint file, and (b) used a 32KB cache.  163
8.6 Effect of varying main cache associativity. All graphs use hint files generated using a direct mapped 2KB main cache and parallel stream cache using the movie hula. Only the 200 most effective instructions issue prefetches. (a) is executed using no stream cache, (b) a parallel stream cache, and (c) a series stream cache.  165
8.7 Performance comparison for series vs. parallel prefetch insertion executed with a 128KB direct mapped main cache and movie hula. (a) shows fraction of misses eliminated and (b) shows relative execution times.  170
8.8 Relative execution time for 90% of available prefetches inserted comparing prediction files generated from series and parallel stream caches. The movie displayed is hula. (a) is a direct mapped cache, and (b) is a 2 way associative cache.  170
8.9 Hypothetical example illustrating the difference between series and parallel stream cache data generation.  171
8.10 Relative execution time for different numbers of prefetch instructions inserted in a direct mapped cache. (a) shows execution of movie hula, and (b) shows three movies for a cache size of 128KB.  172
8.11 Relative execution time for 90% of prefetches inserted. (a) is executed with a direct mapped cache, and (b) with a 2 way associative cache.  173
8.12 Relative execution time for a direct mapped cache with 90% of prefetches inserted. (a) is a memory access cost of 25 cycles and an instruction mix of 100% loads and stores, (b) is a memory access cost of 25 cycles and 30% loads and stores, (c) is a memory access cost of 100 cycles and an instruction mix of 100% loads and stores, and (d) is a memory access cost of 100 cycles and 30% loads and stores.  176
8.13 Relative execution time for different memory models. (a) is for configuration A and a direct mapped cache, (b) is for configuration A and a 2-way associative cache, (c) is for configuration B and a direct mapped cache, and (d) is for configuration B and a 2-way associative cache.  177
8.14 Relative execution time for different memory models. (a) is for configuration C and a direct mapped cache, (b) is for configuration C and a 2-way associative cache, (c) is for configuration D and a direct mapped cache, and (d) is for configuration D and a 2-way associative cache.  178
9.1 Summary of speedup techniques.  181
A.1 Simple instruction substitution.  184
A.2 Mistake in control flow for branch delay slot.  187
A.3 Unconditional branch instrumentation.  188
A.4 Conditional branch instrumentation.  190
A.5 Operand length histogram.  193
D.1 Performance comparison for series vs. parallel prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a 2KB cache and the bottom is an 8KB cache.  217
D.2 Performance comparison for series vs. parallel prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache size is 32KB.  218
D.3 Performance comparison for series vs. parallel prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a cache size of 128KB and the bottom is 512KB.  219
D.4 Performance comparison for series vs. parallel prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a cache size of 2KB and the bottom is 8KB.  220
D.5 Performance comparison for series vs. parallel prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache size is 32KB.  221
D.6 Performance comparison for series vs. parallel prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a 128KB cache size and the bottom is 512KB.  222
D.7 Performance comparison for different numbers of prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a 2KB cache and the bottom is an 8KB cache.  229
D.8 Performance comparison for different numbers of prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache size is 32KB.  230
D.9 Performance comparison for different numbers of prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a 128KB cache and the bottom is a 512KB cache.  231
D.10 Performance comparison for different numbers of prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache sizes are 2KB on the top and 8KB on the bottom.  232
D.11 Performance comparison for different numbers of prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache size is 32KB.  233
D.12 Performance comparison for different numbers of prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache sizes are 128KB on the top and 512KB on the bottom.  234
D.13 Relative execution time for a direct mapped cache. The top has memory access of 10 cycles and the bottom is 25 cycles. The left is an instruction mix of 100% loads and stores and the right side is 30% loads and stores. 90% of prefetches inserted.  241
D.14 Relative execution time for a direct mapped cache. The top is memory access of 50 cycles and the bottom is 100 cycles. The left is an instruction mix of 100% loads and stores and the right side is 30% loads and stores. 90% of prefetches inserted.  242
D.15 Relative execution time for a 2-way associative cache. The top is memory access of 10 cycles and the bottom is 25 cycles. The left is an instruction mix of 100% loads and stores and the right side is 30% loads and stores. 90% of prefetches inserted.  243
D.16 Relative execution time for a 2-way associative cache. The top is a memory access of 50 cycles and the bottom is 100 cycles. The left is an instruction mix of 100% loads and stores and the right side is 30% loads and stores. 90% of prefetches inserted.  244

Chapter 1

Introduction

The goal of this work is the efficient support of video decompression applications for desktop computer systems. Although video compression and decompression is just a small part of multimedia, there are two reasons that such a narrow definition of multimedia has been chosen. The first is that video is a key enabling component for multimedia in general. Text, still images, sound, and even graphics are handled effectively in the current generation of desktop machines. Video is just now becoming ubiquitous. Some machines are now shipped standard with video cameras. When this work was begun, digital video was considered an exotic medium that required dedicated hardware. But now, software-only video decompression is a reality for most architectures. Video is a key piece in completing the multimedia puzzle. Secondly, video performance is meant to serve as a litmus test for multimedia applications in general. By using video benchmarks as an indication of system performance, it is hoped that techniques used to enhance video performance apply equally well to other multimedia tasks. MPEG1 and MPEG2, the standards for motion picture compression, are selected as the primary benchmarks.

1.1 Strategy

There is a spectrum of strategies that could be followed in pursuit of this goal. At one extreme, DSP-style hardware could be built for dedicated handling


of video data. A number of dedicated solutions exist in the marketplace; Texas Instruments and C-Cube make such decoders, for example. At the other extreme, minimal hardware modifications could be added to a general purpose processor. In the middle, an application specific processor can be built to incorporate the minimal functionality necessary to support multimedia in a programmable manner. Examples of this include Philips' TriMedia and Chromatic's Mpact processors. In this work, we choose to pursue the strategy of investigating minimalist techniques for enhancing a generic desktop processor. When this work was begun in 1991, this was considered a very controversial decision. Why waste processor resources on video, the thinking went, when it can be handled so efficiently in dedicated hardware? But as general purpose processors have become faster and faster, video processing on the CPU has become increasingly practical. In fact, as of this year nearly all major CPU vendors, including Hewlett Packard, Intel, Sun, SGI, and DEC, have announced additional instructions to more efficiently handle multimedia data in the general purpose processor.

1.2 Why this strategy

There are three reasons for choosing this strategy: cost, flexibility, and scalability. The argument for cost is obvious in that no extra components are required on the motherboard. By handling video in the CPU, even the lowest cost systems will support the video data type. The approach is flexible since decompression is supported in software. As standards evolve, new functionality is added simply with software modifications, since the low level architectural requirements for video have been accommodated. Flexibility is also gained in system architecture: a two processor system can be configured either as a symmetric MP system or as a main CPU with a media coprocessor. Finally, this approach is scalable. As CPUs become faster, from increases in clock speed, from exploitation of instruction level parallelism, and from other architectural advances, multimedia performance increases with them. The approach also scales in terms of the number of processors: enhancements applied to a single processor apply equally to a uniprocessor system or to a single processing


element in an MP system.

1.3 Organization

The remainder of this dissertation presents the data collected and analysis completed in pursuit of this objective. Chapter 2 presents the methodology employed in this study. Chapter 3 presents a system level overview of hardware resources for our benchmark applications. Chapter 4 examines the I/O component of this workload and suggests strategies for relieving I/O bottlenecks. Chapters 5 and 6 describe the arithmetic portion of this work. Chapter 5 presents a software-only technique to achieve SIMD type parallelism with no cost in hardware; a theoretical discussion and implementations are presented. Chapter 6 presents hardware modifications that allow this technique to achieve higher parallelism and be used in more general computations. Chapter 7 begins the cache prefetching portion of the work: hardware based cache prefetching techniques are compared and a new technique, the stream cache, is proposed. Chapter 8 presents a technique, based on results observed in hardware prefetching, for inserting software prefetch instructions; this achieves performance similar to hardware based prefetching at a greatly decreased hardware cost. The I/O, arithmetic, and cache prefetching sections of this dissertation are independent and need not be read in the order presented. Finally, the conclusions are presented in Chapter 9.

The dissertation also includes four appendices. Appendix A describes in detail RYO, a PA-RISC instruction instrumentation tool written in support of this work. Appendix B presents miss rate data for simulated execution of MPEG applications on a variety of cache configurations. Appendix C gives detailed truth tables for the hardware described in Chapter 6, and appendix D presents simulated miss rate data for MPEG applications executed using the software directed prefetching described in Chapter 8.


Chapter 2

Methodology

2.1 Introduction

In this chapter we present an overview of the general techniques employed. Slightly different benchmarks and simulation techniques have been used in each of the three major sections on I/O, arithmetic, and memory. Additional details are presented within each major section.

2.2 Applications

The applications we have chosen for performance measurement are mpeg play, mpeg2play, and mpeg encode. Because of the extraordinarily large storage volume necessary for image and video data, data compression is a necessity for any sort of multimedia application. MPEG [Gal91] extends JPEG-style compression to motion images by adding motion compensation for interframe compression.

JPEG [Wal91] is a lossy still image compression standard based on the discrete cosine transform (DCT). Image data is first broken down into 8x8 arrays and is then DCT transformed to convert it to the frequency domain. The resulting transform coefficients are then quantized; this is the lossy step in the algorithm. The coefficients are then zig-zag ordered and run-length encoded so that long strings of zeros can be efficiently represented. Finally, these

application    movie         frame size  number of frames  frame pattern  data memory references
mpeg           hula 2.mpg    352x240     40                IPPIPPI        6e+07
mpeg           flower.mpg    352x240     148               IBBPBBPBB      2e+08
mpeg           easter.mpg    240x176     49                IBBPBB         6e+07
mpeg           bicycle.mpg   352x240     148               IBBPBBPBB      2e+08
mpeg           mjackson.mpg  160x120     557               IIIIIIII       2e+08
mpeg2          tennis.m2v    576x704     7                 IBBPBBPP       8e+07
mpeg encode    tennis.yuv    352x240     10                IBBPBBPBB      3e+08

Table 2.1: Benchmark movie characteristics.

Figure 2.1: MPEG overview. The upper portion shows the still image path (source image, DCT, quantizer, entropy encoder); the lower portion shows a sequence of I, B, and P frames.


results are encoded using variable length (Huffman type) coding. The reverse process is used for decoding. The upper portion of figure 2.1 illustrates this data flow.

MPEG uses motion vector techniques in order to take advantage of repeated information between frames for compression. There are three types of frames: I, P, and B frames. I frames are essentially JPEG encoded frames that don't rely on information from any other frames. P frames are forward predicted frames that use motion estimation to extrapolate from previous I frames, while B frames result from a linear combination of predictions from both preceding and ensuing I or P frames. The MPEG implementations used in this work are mpeg play and mpeg encode from the Berkeley MPEG Group [PSR93]. The lower portion of figure 2.1 illustrates the three frame types discussed above.

MPEG2 is an enhancement of MPEG to allow for an extended range of applications. The primary application intended for MPEG2 was digital television, but it is considered an improvement over MPEG for a number of reasons. The implementation used in this work is mpeg2play from the MPEG Software Simulation Group's MPEG2 release [EF95].

Specific characteristics of the movies used in our benchmarks are shown in table 2.1. Memory references refer to both data loads and data stores; instruction accesses were not simulated. Although the number of frames for each application might seem small, the miss rate rapidly converges to a stable average after only a few frames, so the applications perform similarly to movies with many more frames. Miss rates for these applications run in a baseline cache with no enhancements are shown in figure 2.2. Additional memory characteristics of the applications, including an explicit description of their streaming behavior, are described in detail in [ZFL96b]. The lossy JPEG-style steps described above are sketched in code below.
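To make the lossy steps concrete, the following is a minimal C sketch of quantization and zig-zag ordering for one 8x8 block. It is illustrative only, not taken from the Berkeley or JPEG reference code; the quantization table is left as a parameter, and the zig-zag order is generated by walking the anti-diagonals rather than by the usual precomputed index table.

```c
/* Quantize one 8x8 block of DCT coefficients in place.  Dividing by
   the quantization table entries is the lossy step. */
void quantize(int coeff[8][8], const int quant[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            coeff[i][j] /= quant[i][j];
}

/* Zig-zag order the block so low frequency terms come first and the
   high frequency zeros cluster at the end, ready for run-length
   encoding.  The 15 anti-diagonals are walked in alternating
   directions. */
void zigzag(const int coeff[8][8], int out[64])
{
    int n = 0;
    for (int d = 0; d < 15; d++) {
        for (int i = 0; i < 8; i++) {
            int j = d - i;
            if (j < 0 || j > 7)
                continue;
            out[n++] = (d % 2 == 0) ? coeff[j][i] : coeff[i][j];
        }
    }
}
```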

2.3 Instruction Instrumentation

Instruction instrumentation was used extensively to conduct this research. This is a technique that allows substitution of arbitrary assembly instructions with user supplied subroutines. Different simulators were constructed and linked to the instrumented executable to simulate certain aspects of machine performance. File I/O instructions, for example, are instrumented and linked to a disk simulator to investigate I/O performance.

Figure 2.2: Baseline miss rates. (a) shows data for a direct mapped cache and (b) for a 4 way associative cache. (Miss rate versus cache size, 2KB to 1024KB, for mpeg2play, mpeg play-hula, mpeg play-easter, and mpeg encode.)

2.3.1 RYO

RYO (Roll Your Own) [ZK95] is an instruction instrumentation tool written for the PA-RISC [Lee89] family of processors. RYO replaces a specific set of PA-RISC assembly instructions with user supplied subroutines. This tool was developed from an HP Labs prototype in the pursuit of this research. Because the user provides custom instrumentation routines to RYO, the use of the tool is virtually unlimited. Like Pixie [Smi91], RYO generates program analysis output by running an instrumented executable, yet RYO has both a simpler implementation and greater flexibility through its use of arbitrary user supplied subroutines. A complete description of RYO is presented in Appendix A.


2.3.2 ATOM

ATOM [SE94] is a similar instruction instrumentation tool used for the DEC Alpha architecture. The main difference between ATOM and RYO is that ATOM takes the final executable as input, so the source code is not required. Furthermore, libraries linked into the executable are also instrumented, a feature RYO did not implement. ATOM is used to collect data from DEC Alpha machines.

2.4 Simulators

Trace driven simulations were used to model memory behavior, I/O, and processor performance. Application code was compiled to an assembly language format with the commercially available HP C compiler version A.09.75, with maximum optimization set using the +O3 option. This assembly code was then instrumented using the RYO instrumentation tool. The instrumentation added additional assembly code so that external library functions are called for every memory access instruction. To save both execution time and disk space, discrete traces are not written to disk files. Instead, the cache simulator is executed concurrently with the instrumented executable so that address references are simulated dynamically, as sketched below. Because new traces are dynamically generated on every execution, variables returned from system calls may cause slightly different traces and some run-to-run variation.

Three separate simulators were used on three different types of traces: an I/O simulator on traces from system I/O calls, a processor simulator on traces from all computations, and a cache simulator on traces from all load and store operations. Complete miss rate data for MPEG applications over a range of line sizes, cache sizes, and associativities is presented in Appendix B.
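As a concrete illustration of the dynamic hookup, here is a minimal sketch in C. The callback name, line size, and direct mapped organization are our assumptions for illustration, not the actual RYO interface or the simulator used in this work.

```c
#include <stdint.h>

/* The instrumented executable calls a routine like this for every load
   and store, so addresses are consumed by the cache model as they are
   generated and no trace file is ever written to disk. */
#define LINE_BYTES 32u
#define NUM_LINES  1024u            /* 32KB, direct mapped */

static uint64_t tag[NUM_LINES];
static int      valid[NUM_LINES];
static uint64_t accesses, misses;

void simulate_access(uint64_t addr)  /* hypothetical callback name */
{
    uint64_t line = addr / LINE_BYTES;
    uint32_t idx  = (uint32_t)(line % NUM_LINES);

    accesses++;
    if (!valid[idx] || tag[idx] != line) {  /* miss: fill the line */
        misses++;
        valid[idx] = 1;
        tag[idx]   = line;
    }
}
```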


Chapter 3

System Level Overview

3.1 Introduction

In this chapter we present a system level view of MPEG applications run under a variety of conditions. We explicitly model the I/O, CPU, and memory system using parameters from a known system and compare the results to measured execution times. The simulation compares favorably with the measurements.

3.2 Methodology

For this chapter we have used ATOM to instrument DEC Alpha binaries. This tool allows the easy construction of several tools to measure a variety of performance constraints. We then verify the results from ATOM by comparing them with execution time results from the more traditional Unix time and gprof tools. Specifically, we use ATOM to measure three separate architectural components of execution time: time spent executing CPU instructions, time spent in the memory system, and time spent in I/O. The time spent in the execution pipeline is a simulated result generated by collecting a trace of each instruction executed. As the trace is generated, it is dynamically fed into a cycle by cycle pipeline simulator for the appropriate architecture; from this, total time spent in the execution pipeline is calculated. The time spent in the memory system is similarly calculated by tracing both data

3.2 Methodology For this chapter we have used ATOM for instrumentation of DEC Alpha binaries. The use of this tool allows the easy construction of several tools to measure a variety of performance constraints. We then verify the results from ATOM by comparing with execution time results from the more traditional Unix time and gprof tools. Speci cally, we use ATOM to measure three separate architectural components of execution time: time spent in executing CPU instructions, time spent in the memory system, and time spent in I/O. The time spent in the execution pipeline is a simulated result generated by collecting a trace of each instruction executed. As the trace is generated, it is dynamically fed into a cycle by cycle pipeline simulator for the appropriate architecture used. From this, total time spent in execution time is calculated. The time spent in the memory system is similarly calculated by tracing both data 11

CHAPTER 3. SYSTEM LEVEL OVERVIEW

3.6

| misc (1.1%)

3.2

misc_arith (5.7%)

other (12.0%)

io (8.2%)

| system (1.7%)

2.8

dithering (27.6%)

|

Execution Time (Seconds)

12

memsys (29.7%) 2.4

|

2.0

| reconstruction (32.8%)

1.6

|

1.2

|

user (86.3%)

pipe (62.1%) idct (14.1%) 0.8

|

0.4

| parsing (18.7%)

|

0.0 |

atom

time

gprof

Figure 3.1: Time breakdown for mpeg play hula -dither ordered.


and instruction references from the user code. Again, this data is dynamically input to a cache simulator with parameters appropriate for the system used to measure execution times. Finally, the time spent in I/O is calculated by counting the number of cycles spent in the I/O procedure calls; the accounting that combines the three components is sketched below.

Next, the Unix time command was used to measure the actual execution time on a particular machine. The time command divides the time spent into user, system, and other time. Finally, gprof was used as a third check on total execution time. Gprof gives additional information that indicates the temporal breakdown by procedure. We divide the procedures into functional blocks after [PSR93]. The procedural breakdown was double-checked by using ATOM to count execution cycles spent in individual routines and comparing that to the time reported by gprof. In general, the times were close but did not match exactly; the particular data is not presented here. The experimental error is introduced by the sampling technique used by gprof [Var93]. Because time spent in each procedure is calculated based on a random sampling technique, short procedures that are called a number of times tend to be undercounted by gprof, while long continuous procedures tend to be overcounted.

The machine used to measure timed execution is a DEC 3000/400. The memory subsystem modeled was an 8KB L1 cache with a 512KB L2 cache using a write thru, no write allocate policy, and the processor cycle time was 6.67 ns. Table 2.1 gives a summary of the movies used to generate the timing data. These movies were selected to represent a range of values for frame size, number of frames, and encoding pattern.
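The following C sketch shows the general form of the accounting. The additive combination and the single-level miss model are our simplification for illustration (the measured system has split L1 and L2 caches), not the exact simulator internals.

```c
#include <stdint.h>

/* Combine the three simulated components into one execution time.
   Assumed form: total cycles = pipeline + memory system + I/O. */
double total_seconds(uint64_t pipe_cycles,    /* pipeline simulator   */
                     uint64_t cache_misses,   /* cache simulator      */
                     uint64_t miss_penalty,   /* cycles per miss
                                                 (assumed parameter)  */
                     uint64_t io_cycles,      /* counted in I/O calls */
                     double cycle_time_ns)    /* 6.67 ns here         */
{
    uint64_t total = pipe_cycles + cache_misses * miss_penalty + io_cycles;
    return total * cycle_time_ns * 1e-9;
}
```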

3.3 Overview

Figure 3.1 shows the breakdown of execution time by architectural component, system time, and functional block for the movie hula 2.mpg using the ordered dithering algorithm. Table 3.1 compares the percentage of time spent in each functional block for figure 3.1 and those published by [PSR93]. Parsing includes all functions involved in bitstream parsing, entropy and motion vector decoding, and Huffman decoding.


Figure 3.2: Time breakdown for mpeg play easter -dither ordered.


Figure 3.3: Time breakdown for mpeg play mjackson -dither ordered.


Figure 3.4: Time breakdown for mpeg2play hula.


Figure 3.5: Time breakdown for mpeg2play tennis.

functional block    MPEG    MPEG2   reported
parsing             18.7%   20.8%   17.4%
idct                14.1%   16.7%   14.2%
reconstruction      32.8%   22.9%   31.5%
dithering           27.6%   29.4%   24.3%
misc. arithmetic     5.7%    1.1%    9.9%
misc.                1.1%    9.1%    2.7%

Table 3.1: Breakdown by functional block. (The MPEG and MPEG2 columns correspond to the measurements of figures 3.1 and 3.5; the reported column is from [PSR93].)

Reconstruction is composed of functions that perform predictive pixel reconstruction, including copying and averaging relevant pixels from reference frames. Dithering converts the reconstructed YUV image into RGB for display. The results are remarkably similar despite the use of two totally different compilers and machine architectures. Our results confirm results described earlier over a range of architectures and image types.

Figures 3.2 and 3.3 show the execution times for the movies easter.mpg and mjackson.mpg, respectively. All are displayed using the ordered dither. Comparing the three movies, one sees that although the different frame sizes and movie lengths result in total execution times that vary from 3 to 15 seconds, the proportion of time spent in each functional block remains approximately constant, even though each movie uses a different pattern for encoding frames. Furthermore, the proportion of time spent in I/O, memory system, and execution pipeline also remains constant. Figures 3.1 through 3.3 show that I/O time accounts for approximately 8%, memory system time for approximately 30%, and time in the execution pipeline for approximately 62%.

Figure 3.4 shows the same execution breakdown for the application mpeg2play, which was used as a second implementation to display MPEG1 encoded movies. Mpeg2play exhibits the same characteristics in terms of architectural and functional breakdown as the mpeg play application. Although the total execution time is longer for the mpeg2 implementation, the proportions of time spent in memory, execution


pipeline, and I/O remain constant. Figure 3.5 shows the execution breakdown for an mpeg2 encoded movie, tennis.m2v. Again, even though the execution time is even longer, the proportions remain similar.

3.4 Windowing Environment

In this section the effects of varying the X windowing environment are examined.

Figure 3.6: Time breakdown for mpeg play hula without using shared memory.


3.4.1 Shared Memory

Figure 3.6 shows the movie hula displayed without using shared memory. Shared memory refers to the method for sending the image data to the X windows server: it allows the X server and user application to share a common memory space, so that when the user application writes the image data, the X server can display it directly. Non-shared memory uses separate memory spaces for the user application and X windows server, and additional time must be spent copying the image data from one area to the other. This additional time is reflected in the increased system time in figure 3.6. The mpeg play application itself does not spend any extra time, so the ATOM and gprof bars are similar. Compared to figure 3.1, the percentage of time spent in I/O is decreased. This is because the synchronization that must be performed between the X server and user application is carried out using system I/O calls. When shared memory is not employed, there is no need to perform the synchronization, and consequently less time is charged to I/O.
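For reference, the shared memory path described here is the standard MIT-SHM X extension. The sketch below shows its canonical usage; it is illustrative, not code from mpeg play, and error handling is omitted.

```c
#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a frame image backed by a System V shared memory segment
   that the X server attaches directly, so displaying it avoids the
   copy that plain XPutImage makes through the X protocol stream. */
XImage *create_shared_frame(Display *dpy, Visual *vis, int depth,
                            int w, int h, XShmSegmentInfo *shminfo)
{
    XImage *img = XShmCreateImage(dpy, vis, depth, ZPixmap,
                                  NULL, shminfo, w, h);

    /* one shared segment sized for the whole frame */
    shminfo->shmid = shmget(IPC_PRIVATE,
                            img->bytes_per_line * img->height,
                            IPC_CREAT | 0600);
    shminfo->shmaddr = img->data = shmat(shminfo->shmid, NULL, 0);
    shminfo->readOnly = False;

    XShmAttach(dpy, shminfo);   /* server maps the same segment */
    return img;
}

/* Display without copying the pixels into the protocol stream:
   XShmPutImage(dpy, window, gc, img, 0, 0, 0, 0, w, h, False);
   The application must still synchronize with the server before
   reusing the buffer, which is the I/O-call synchronization the
   text refers to. */
```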

3.4.2 Remote Display Figure 3.7 shows the execution time for displaying the movie hula on a remote display. The decompression application is run on a di erent machine from which the movie is displayed. The other time is greatly increased. The X window server must transmit the image data across the network to display to a remote terminal. Additional time must be spent in executing system routines and in the X server. Again, since the application does not need to do any extra work, this additional time is not re ected in the ATOM or gprof bar. The percentage of time spent in I/O is decreased for the remote display compared to gure 3.1. Writing to the frame bu er is counted in the I/O time. Since the I/O bu er is no longer written to locally, this time is no longer counted and the percentage of time spent in I/O appears much decreased.

6.3

|

5.6

|

4.9

|

4.2

|

3.5

|

Execution Time (Seconds)

3.4. WINDOWING ENVIRONMENT

21

other (43.6%)

system (5.9%) misc (3.1%) misc_arith (6.2%)

io (1.0%)

|

2.8

memsys (32.4%) dithering (28.3%)

|

2.1

user (50.6%)

reconstruction (30.2%)

|

1.4

pipe (67.0%) idct (16.2%)

|

0.7

parsing (15.9%)

|

0.0 |

atom

time

gprof

Figure 3.7: Time breakdown for mpeg play hula displayed on a remote display.

CHAPTER 3. SYSTEM LEVEL OVERVIEW

22

10.0

| misc (1.0%) misc_arith (2.0%)

io (3.1%) 8.0

|

Execution Time (Seconds)

3.5 Dithering

memsys (14.2%)

|

6.0

dithering (75.0%)

|

4.0

pipe (82.6%)

|

2.0

reconstruction (10.1%)

idct (5.2%) parsing (6.7%)

|

0.0 |

atom

gprof

Figure 3.8: Time breakdown for mpeg play hula -dither fs4. Figures 3.8, 3.9, and 3.10 show three di erent dithering algorithms applied to the same movie in mpeg play. The dithering algorithms are fs4, color, and ordered2, respectively. The gures are generated by displaying the movie on a remote display. As in gure 3.7, the percentage of time spent in I/O appears much decreased as an artifact

3.6

23

| misc (5.4%)

3.2

io (0.3%)

|

misc_arith (5.4%)

2.8

|

Execution Time (Seconds)

3.5. DITHERING

memsys (33.4%) dithering (26.5%) 2.4

|

2.0

|

1.6

|

1.2

|

reconstruction (29.8%)

pipe (66.6%) idct (13.7%) 0.8

|

0.4

| parsing (19.0%)

|

0.0 |

atom

gprof

Figure 3.9: Time breakdown for mpeg play hula -dither color.

CHAPTER 3. SYSTEM LEVEL OVERVIEW

3.6

| misc (3.1%) misc_arith (5.3%)

io (0.9%) 3.2

|

2.8

|

Execution Time (Seconds)

24

dithering (34.5%)

memsys (42.6%) 2.4

|

2.0

|

1.6

|

1.2

|

reconstruction (25.8%)

pipe (56.5%) idct (13.7%) 0.8

|

0.4

| parsing (17.6%)

|

0.0 |

atom

gprof

Figure 3.10: Time breakdown for mpeg play hula -dither ordered2.

3.6. MEMORY SYSTEM

25

of how the time is accounted. The fs4 dither is a Floyd-Steinberg [FS75] dithering with 4 error terms propagated. The color dither is a full color dithering for a 16 or 24 bit color display. The ordered dither [FvDFH90] was presented earlier in gure 3.1. Ordered and ordered2 dither are di erent implementations of the same standard dithering. In the ordered dither three table lookups are performed and the results summed together. In the ordered2 dither a tradeo is made between memory access and arithmetic. The addition is calculated in advance and a single large table lookup is substituted for the three table lookups and two additions of the ordered dither. In the ordered2 dither, 42.6% of the time is spent in the memory system, while only 29.7% of time is spent in the ordered dither. Depending on the tradeo between memory access and arithmetic in the architecture used, the ordered2 dither may be a bene cial tradeo . In this case, the execution times are comparable. For the rest of this dissertation, mpeg play is always executed using the dithering with the best performance, the ordered dither.

3.6 Memory System The time spent in the memory subsystem account for the second largest percentage of execution time for the mpeg applications. Although the execution pipeline accounts for the greatest percentage of time now, as processor cycle time becomes faster and faster, and as machines incorporate more instruction level parallelism, the memory subsystem can become the dominating component of execution time. Figure 3.11 shows the percentage of time spent in each functional block after the time spent in the memory subsystem has been subtracted out. Interestingly, the percentage of time spent in each block remains almost unchanged. When the functional breakdown was described in [PSR93], the large amount of time spent in reconstruction and dithering was attributed to the great deal of time that is spent in the memory subsystem. Yet this graph indicates that the time spent in the memory subsystem is really fairly evenly divided across the functional blocks, so that reconstruction and dithering still account for the majority of execution time even with memory subsystem

CHAPTER 3. SYSTEM LEVEL OVERVIEW

3.6

| misc (1.1%) misc_arith (5.7%)

3.2

dithering (27.6%)

|

2.8

|

Execution Time (Seconds)

26

2.4

|

2.0

|

misc (0.8%) misc_arith (6.6%)

dithering (29.7%) reconstruction (32.8%)

1.6

|

1.2

|

reconstruction (34.1%) idct (14.1%)

|

0.8

idct (15.1%)

|

0.4

parsing (18.7%) parsing (13.7%)

|

0.0 |

gprof

gprof - miss cost

Figure 3.11: Functional breakdown compared to the same functional breakdown with time spent in the memory system subtracted out.

3.7. SUMMARY

27

cycles removed.

3.7 Summary In this chapter we have presented an overview of MPEG applications. We have broken down the applications both by software functional block and by architectural component. In the next chapter, we investigate I/O in greater detail.

28

CHAPTER 3. SYSTEM LEVEL OVERVIEW

Chapter 4 IO 4.1 Introduction Disk caches have been thoroughly reported in [Smi85] and Ousterhout [ODH 85]. A great body of work exists on le systems for multimedia systems. Examples include [RW93], [RV91], and [AOG92]. In general these works assume a complicated client/server I/O model with multiple disks, multiple clients, and several simultaneous I/O sessions. Issues such as Quality of Service and scheduling algorithms are thoroughly discussed. Because of assumptions made in this work, these issues are less relevant. We are focused on desktop client performance and assume that the video data le is already on the local disk. We also assume that there is only one client that has exclusive access to the disk. This work emphasizes a simple read ahead prefetching scheme based on attribute caches. Attribute caches were proposed by Richardson [RF93]. These are a type of I/O cache that take advantage of le attributes to selectively cache I/O data with a scheme tailored to the expected behavior of the le type. We propose using le attributes to determine the prefetching strategy in addition to the caching strategy. In general, relatively little work has been done on the combination of caching and prefetching. Exceptions are [CFKL95] which details a strategy to integrate prefetching and I/O caching with application supplied cache management advice. [PGG 95] uses these +

+

29

CHAPTER 4. IO

30 movie

movie read frame write misc total (seconds) (seconds) (seconds) (seconds) bicycle 0.70 0.74 0.11 1.58 coaster 1.64 0.93 0.20 2.83 easter 0.54 0.26 0.03 0.87

ower 0.66 0.80 0.12 1.60 hula 0.19 0.25 0.03 0.51 mjackson 0.73 0.66 0.36 1.77 sukhoi 1.70 0.84 0.50 3.07 Table 4.1: Measured I/O time breakdown where compressed movie data is read from disk. movie

movie read frame write misc total (seconds) (seconds) (seconds) (seconds) bicycle 0.03 0.71 0.14 0.88 coaster 0.05 0.75 0.23 1.04 easter 0.02 0.19 0.04 0.25

ower 0.03 0.70 0.13 0.87 hula 0.01 0.11 0.03 0.16 mjackson 0.03 0.61 0.38 1.02 sukhoi 0.05 0.88 0.51 1.46 Table 4.2: Measured I/O time breakdown where compressed movie data is read from main memory I/O cache. types of hints for both caching and prefetching.

4.2 I/O Breakdown Tables 4.1 and 4.2 show the measured I/O breakdown for di erent MPEG movies. There are two cases presented. Table 4.1 shows the time spent in I/O when there is no I/O caching and the compressed movie data must be read from disk. For this data, approximately half the time is spent in reading the compressed data from the disk and half the time is spent writing the uncompressed data to the frame bu er. Less than

4.3. SIMULATED I/O

31

20% of time is left in the miscellaneous category. This is largely composed of time to set up the color table and also includes time to open other les. There is additional time required to load the mpeg executable from disk, but this is not included in this table. If the movies were even longer, both the times to read the compressed data and write the uncompressed data would grow proportionately, but the color table time would remain unchanged. Thus, the miscellaneous portion becomes a smaller percentage of I/O time for larger movies. Table 4.2 shows the I/O breakdown in the case where the entire movie is cached in the main memory I/O cache. The system on which the I/O performance is measured uses a software managed I/O cache in which recently accessed I/O data is kept in memory. I/O data is cached in 8KB blocks on a demand miss. The movies used in this study are small enough that the entire movie can t in the I/O cache. The movie is contained completely in the I/O cache the second and subsequent times it is played. The rst time it is played, demand misses bring all the data into the I/O cache. The data in table 4.2 is obtained by measuring the second execution of a movie after the compressed data has been cached. Once the data is cached, accesses are extremely fast since the data is read directly from memory and no disk access is required. In this case, the majority of time is spent writing the nal image data to the frame bu er. This is not a typical case, but can occur when the same short clip is played repeatedly such as in a video editing application. In the following sections a prefetch strategy is proposed to improve execution time when the movie is not already cached. Because an I/O prefetch instruction is not available on the system studied, an I/O simulator is constructed to model the system with prefetch capability. In the next section, the I/O simulator is demonstrated using parameters obtained from empirical measurement of the system. The simulator is validated by comparing its output to measured execution times.

4.3 Simulated I/O Table 4.3 compares simulated I/O times to actual measured I/O times. A disk simulator is constructed so that hypothetical I/O prefetching schemes can be tested

CHAPTER 4. IO

32 movie

not cached measured simulated bicycle 1.58 1.94 coaster 2.83 3.00 easter 0.87 0.98

ower 1.60 1.94 hula 0.51 0.63 mjackson 1.77 2.17 sukhoi 3.07 3.56

cached  measured simulated 19% 0.88 1.03 6% 1.04 1.33 11% 0.25 0.33 17% 0.87 1.03 19% 0.16 0.39 18% 1.02 1.25 14% 1.46 1.68

 15% 22% 24% 15% 60% 18% 13%

Table 4.3: Measured vs. simulated times. without the need for modifying a functional operating system. To validate the simulator, measured I/O time is compared to simulated I/O time. As evidenced in the table, the simulated data is within 20% of the measured data in most cases. The I/O simulator explicitly models disk accesses by instrumenting calls to low level I/O functions. Additionally, X window library calls to write to the display and to allocate a color in the color table are instrumented and accounted for. The simulation parameters are based on empirical observations. Writes and uncached reads are performed on 8KB blocks and take 10 ms per block. A read on cached data is charged 4.64 cycles per byte, which equals 30 ns per byte. A frame write costs a constant 105,000 cycles per frame plus 9 cycles per pixel. This comes out to 0.7 ms plus 60 ns per cycle. Finally, a call to allocate a color map value (XAllocColor) is charged 134,000 cycles or 0.9 ms. File open and close calls are considered negligible and are ignored.

4.4 Prefetch Strategy Comparing tables 4.1 and 4.2 shows that caching the compressed movie data improves the I/O time by reducing the le read time to about 5% of the time required without caching. This applies only in cases where the entire movie is already in the I/O cache. Performance bene ts are not obtained the rst time the movie is played or if

4.4. PREFETCH STRATEGY

33

movie

movie read frame write misc total (seconds) (seconds) (seconds) (seconds) bicycle 0.13 0.85 0.13 1.12 coaster 0.09 1.14 0.13 1.36 easter 0.07 0.16 0.13 0.36

ower 0.07 0.85 0.13 1.06 hula 0.06 0.23 0.13 0.42 mjackson 0.07 1.07 0.13 1.28 sukhoi 0.20 1.47 0.13 1.80 Table 4.4: Simulated IO cache prefetching. the movie is too large to t into the I/O cache, as is the typical case. To remedy this problem, we propose I/O prefetching. Execution time pro les of the data show that there is enough time to overlap execution with disk access. Furthermore, the data is read from the disk in a highly predictable linear manner. We should know well in advance what data is needed and can prefetch it into the I/O cache. This could be accomplished with an explicit non-blocking le read call in the application program. Another approach solves the problem more automatically by using an attribute cache. Attribute caches [RF93] have been proposed to determine I/O caching strategy based on le attributes. It is possible to extend this concept to include prefetching strategy based on le attributes. We propose using the .mpg le extension as a le attribute to signal the operating system to prefetch data linearly forward. Speci cally, we simulate the case where for every open or read done on a le that ends in the .mpg sux, an additional prefetch is executed to maintain 100KB of data in the I/O cache. Since we assume the prefetch is happening in the background and can con ict with other I/O, we charge a latency of two times that for a blocking disk access. This le attribute based prefetching strategy is simulated using the I/O simulator described earlier. The results for this simulation are shown in table 4.4. Comparing the total column in table 4.4 with the totals in tables 4.1 and 4.2 the prefetching

34

CHAPTER 4. IO

scheme reduces the movie read time, although it is still more than the time required if the movie is already cached. The main di erence is the time for the rst le access. On the rst access in the prefetch case, no data has yet been prefetched so the access must go to disk. Every subsequent access, however, is read directly from memory. The prefetching scheme is not limited by large movies that do not t completely in the disk cache, since prefetching is done shortly before the data is used.

4.5 Summary In this chapter we presented measured I/O data to show how the time spent in the I/O system can be broken down. We next presented an I/O simulator based on the measured data. Finally, we used the I/O simulator to demonstrate an I/O prefetch strategy that can greatly reduce the I/O time spent in the MPEG applications. Applying this technique to overlap I/O and execution time reduces the I/O time as long as there is enough computation to hide the time spent in I/O. However, once the I/O becomes the dominating factor, other techniques such as striping the data across a number of disks can be used to increase disk bandwidth. I/O is not currently a bottleneck in MPEG decompression.

Chapter 5 Doubly Enhanced Arithmetic 5.1 Introduction MPEG applications require only 8 or 16 bit data words. Using the full 64 bit data path available in most computers to perform these low-precision calculations is an inecient use of resources. Many manufacturers have addressed this problem by introducing new instructions that allow the packing of sub-word quantities into a full data word. Hewlett-Packard's MAX1 and MAX2, Intel's MMX, and Sun's VIS instruction sets are examples of this kind of strategy. This chapter presents a new software-only technique that accomplishes this same objective by packing subword integer quantities into a double precision oating-point word. This technique works even with machines that have not been specially modi ed with new multimedia instructions. While the idea of packing subword integer quantities into a larger integer data word has been proposed before, this technique is unique in packing integer subwords into a single oating-point word with a shared exponent. Using the oating-point data path has a number of advantages. The traditional oating-point arithmetic operations of add, subtract, and multiply are used to pack and unpack the subword quantities. Therefore, the technique works on any machine that supports the IEEE double precision oating-point arithmetic standard with no machine-speci c code required. Furthermore, the methodology can be implemented in a high-level language 35

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

36 movie

frame size number of frames hula 2.mpg 352x240 40 mjackson.mpg 160x120 557 sukhoi.mpg 160x120 764 bicycle.mpg 352x240 148 easter.mpg 240x176 49

ower.mpg 352x240 148 coaster.mpeg 288x192 283

frame pattern size (bytes) bits/pel IPPIPPI IIIIIII IIIIIII IBBPBBP IBBPBBI IBBPBBP IBBIPBB

148076 724576 1534405 718897 507252 719510 1353919

0.35 0.51 0.78 0.45 1.96 0.45 0.69

Table 5.1: Benchmark movie characteristics for DE experiments. such as C. Additions, subtractions, and multiplication of two packed quantities by a single multiplicand are supported. Because the data is kept in oating-point format, multiplications can be by arbitrarily large or small numbers. Since the oatingpoint data path is used, the high-speed oating-point hardware available in many workstations can be better used. Finally, this technique has the advantage that no hardware modi cations are required. It works on any machine that supports the IEEE double precision oating-point arithmetic standard. In this chapter we describe this technique in general and then demonstrate its validity by implementing it in a public domain mpeg decoder application, mpeg play distributed by the Berkeley Plateau Multimedia Research Group. While there is some degradation in quality because calculations are carried out with lower precision, there is no noticeable di erence in image appearance. A quantitative comparison of the image quality is presented later in section 5.7.6. We achieved an average speed-up of 13.8% in mpeg play by implementing subword parallelism using only source level software modi cations to the original C code.

5.2 Related Work This work relies on two fundamental concepts. The rst is reuse of the oating point multiplier for a purpose other than simple multiplication. The second is the

5.2. RELATED WORK

37

concept of breaking a longer data path into multiple smaller segments to achieve SIMD parallelism. The earliest reference to the reuse of a dedicated hardware multiplier unit is [MHR90], in which the oating point multiplier was coupled with an adder to produce the fused multiply-add. This technique is now common in many commercial processors. Eric Schwarz [Sch93] later proposed reusing the oating point multiplier to provide seed values for division and other higher order functions. He reuses the multiplier summation tree to sum product terms, but replaces the standard multiplication product terms with special logic functions to generate the desired approximations. The basic concept of breaking a longer data path into multiple lower precision segments has existed in industry in one form or another for several years, for example residue arithmetic [ST67]. However, discussion in the literature has only recently resumed. This work [ZL94] is unique in that it reuses the oating point data path to do arithmetic on integer quantities. It furthermore, presents a general technique that can be implemented on any architecture in software only. Lee describes multimedia enhancements to the PA-RISC processors that include half word arithmetic instructions in [Lee94] and [Lee95]. These additional instructions are known as MAX-1. The operations supported are add, subtract, and shift-and-add instructions that serve as primitives for multiplication and division. The application of these instructions to speed an mpeg decoder was described in [BKL95]. The 64 bit multimedia enhancements were described in [LH96], and additional enhancements were described in [Lee96] and referred to as MAX-2. Stephan Eckart [Eck95] mentions a software only technique to achieve similar data path parallelism in the process of describing an MPEG1 decoder optimized for the Intel Pentium. He mentions partitioning the 32 bit ALU into 4 virtual 8 bit ALUs. However, he does not give many details describing the speci c methodology he used to achieve this. Furthermore, the techniques described are targeted towards the Intel Pentium architecture. Sun [KMT 95] described their Visual Instruction Set (VIS). This is a set of multimedia instruction enhancements similar in concept to MAX, but includes complete subword multiply instructions instead of multiply primitives. Application of these +

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

38

techniques to an MPEG decoder was described in [ZKK 95], and additional reports of VIS were made in [TONH96]. Finally, Intel described their multimedia instructions, known as MMX, in [PW96]. In addition to a number of additional multimedia enhancements, MMX supports arithmetic on packed data formats. +

5.3 Algorithm 5.3.1 Overview The basic idea is to use simple arithmetic operations to pack two operands into the left and right parts of a single register. Standard oating point hardware is then used to operate on these packed operands, now performing two operations in parallel. Only multiplication by a constant, addition, or subtraction will produce correct results. There are a number of criteria that must be met for this methodology to be useful as the algorithm is not suitable for all calculations. It is designed for DSP- style calculations in which these four criteria are met: 1. Because additional instructions must be added to pack and unpack operands before and after arithmetic calculations, it is desirable to have a long series of calculations between the packing and unpacking so the packing/unpacking cost is e ectively amortized over a large number of operations. 2. Because this technique relies on packing multiple lower precision operations into a single high precision unit, it is essential that the functions it calculates require only lower precision calculations. 3. Because careful attention must be paid to operand length and magnitudes in order to insure correct results, information about the range and domain of the data must be known beforehand. 4. Because calculations done on each operand concurrently must be the same, the application is best suited for multiple loop iterations done in parallel.

5.3. ALGORITHM

39

The DCT calculation, used in the JPEG and MPEG image compression standard, meets all these criteria. In it 16 1-D DCTs are calculated to perform a single 8x8 2-D DCT, so there are many identical loop iterations that can be done in parallel. Baseline JPEG is de ned for 8 bits of precision, so condition 2 is easily met. Furthermore, from the input and output restrictions on the range of data, and from the mathematical properties of the DCT, we have the necessary information referred to in condition 3. The DCT is used in this paper to illustrate the double enhancement technique, but it is generally applicable to similarly constrained DSP-style applications.

5.3.2 Packing Packing is performed by a multiplication by a constant and an addition. If two words, L and R, are to be packed into a single word X, then

X = L  2c + R

(5.1)

where c is a suitably chosen constant. In binary form, this might appear as X = eeeeee000000rrrrrrr

where e and r are the bits in L and R, respectively. One word is on the left, one word is on the right, and a bu er is between them. Assuming an architecture with a

oating point multiply-add instruction, packing can be performed with just a single instruction. Otherwise two instructions are needed. Multiplication and addition can be performed in this packed format by using the distributive property. For example, to multiply by a constant, k, we have X  k = (L  k)  2c + (R  k). So, both L and R are multiplied simultaneously by the same constant, k. Or to add Y, a similarly packed number, to X

Y = A  2c + B

(5.2)

X + Y = (L + A)  2c + (R + B ):

(5.3)

and

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

40

By using this simple algebraic method to pack two numbers, the sign information for the R term is automatically encoded in the empty bits between the L and R terms. The sign for the L term is encoded in the sign bit for the entire oating point quantity, and the bits between L and R indicate whether R is the same or opposite sign of L.

Example 1 Consider L=5, R=3, the constant of packing, c, equal to 4, and a decimal system. The packed quantity is obtained as: X = 5*10^4 + 3 = 50000 + 3 = 50003.

Since both numbers are positive, the bits between them are zero.

Example 2 If, now, R=-3, and L=5 and c=4 as before, then X = 5*10^4 - 3 = 50000 - 3 = 49997.

Now, the bits in the bu er area are non-zero, indicating that the R term is opposite in sign to the L term.

Example 3 Now consider the same example in binary: L = 101 , R = 11 , and the constant of packing, c, equals 5. The packed quantity is obtained as: 2

2

X = 101*2^5 + 11 = 10100000 + 11 = 10100011.

Again, since both numbers are positive, the bits between them are zero.

5.3. ALGORITHM

41

Example 4 If, now, R = ?11 , and L = 101 and c=5 as before, then 2

2

X = 101*2^5 - 11 = 10100000 - 11 = 10011101.

This time, the bits in the bu er area are non-zero, indicating that the R term is opposite in sign to the L term.

Example 5 If L=-5, R=3, and c=4, the packed quantity becomes: X = -5*10^4 + 3 = -50000 + 3 = -(50000 - 3) = -49997.

In this case, the minus sign for the entire fp number is stored in the sign bit. Also, R has the opposite sign from L is indicated by the non-zero digits in the bu er area between L and R.

5.3.3 Unpacking Unpacking is accomplished by rounding X just so that all bits of R are eliminated. L is recovered by a multiplication, and R is recovered with a subtraction.

L = round(X )  2?c

(5.4)

R = X ? round(X )

(5.5)

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

42

Example 1 Consider a base 10 system where L=12, R=-15, and C=4. X = 120000 - 15 = 119985.

To recover, X must be rounded to the 10 place, since R is 3 digits long (including the sign). This is one digit to the left of R. Then, 3

round(X) = 120000 L = 120000 * 10^-4 = 12 R = 119985 - 120000 = -15.

Example 2 If X were truncated instead of rounded, then an incorrect result may be obtained. For example for the above example with L=12, R=-15, and C=4: trunc(X) = 119000 L = 119000 * 10^-4 = 11.9 R = 119985 - 119000 = 985.

Example 3 For binary, consider L=7 (= 111 ), R=-6 (= ?110 ), and C=6. 2

2

X = 111000000 - 110 = 110111001.

To recover, X must be rounded to the 2 place, since R is 4 digits long. Then, 4

round(X) = 111000000 L = 111000000 * 2^-6 = 111 R = 110111001 - 111000000 = -110.

5.3. ALGORITHM

43

Example 4 Similarly, in binary, if X were truncated instead of rounded, then an incorrect result may be obtained. Consider the above example with L=7, R=-6, and C=6: trunc(X) = 110110000 L = 110110000 * 10-6 = 110.11 R = 110111001 - 110110000 = 1001.

44

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

5.4 Theory 5.4.1 Informal Discussion of Length, Magnitude, and Relative Magnitude It is extremely important to keep track of length and magnitude of the operands to ensure that the nal results can be successfully separated into the original components. The relative magnitude of R and L is de ned as jL=Rj and is a measure of the distance between the left most bits of the two parts. Since a shared exponent is used for L and R, only the relative magnitude is important and large changes in absolute magnitudes can be handled by simply changing the exponent eld. For a xed point system, on the other hand, absolute magnitude would also be important since care must be taken that the packed word does not over ow. Imagine a packed word, X, composed of two parts, L and R, with the bits in L represented by e, and the bits in R represented by r. X = eeeeee000000rrrrrrr

As L becomes longer the bu er space between the numbers becomes smaller. Therefore, one must assure that the bu er space is large enough to accommodate the growth of L. As R becomes longer there is no e ect in a oating point system, since the hardware left aligns the word to the leftmost bit of L, and rounds R so that it ts into the number of bits available in mantissa hardware. In a xed point system, the length of R would also have to be considered since X is right aligned to the rightmost bit of R. If R is too large, X would over ow. As the relative magnitude increases in a oating point system, precision in R is lost. That is, as the space between L and R increases, more and more of R has to be rounded o to t into the available hardware. As the relative magnitude decreases, the space between the parts shrinks and there is more of a tendency to overlap, though R has more bits of precision. In order to ensure that the nal result is separable, we must guarantee that the maximum length of L and the relative magnitude are such that the rightmost bits of L do not overlap with the leftmost bits of R.

5.4. THEORY

45

We can understand the requirements of DE arithmetic intuitively fairly simply. A more detailed proof is presented in the next section. Because the numbers are kept in oating point format, the absolute magnitude of the numbers is irrelevant. Only the relative magnitude of the numbers must be considered and this can be varied arbitrarily by picking the proper packing constant. For example, if two 5 bit numbers of equal magnitude are to be packed with 10 bits between them, then a packing constant of 2 is used. If, however, R is 8 times as large as L, then a packing factor of 2 is used and all subsequent calculations are equivalent. The larger packing factor causes L to be moved 3 bits further left to account for the extra 3 bits in R. Multiplications e ect only the length of the subwords. Changes in magnitude are handled in the shared exponent. Multiplications cause the subwords to grow on the right. Since the right subword is rounded using the regular oating point hardware, the important concern is the growth of the left subword. Therefore care must be taken that multiplies don't cause the left subword to grow over the right. The number and length of multiplies determines how large the bu er region must be between subwords. Addition of words that di er greatly in magnitude can cause contamination between the two subwords. It is easy to imagine a short very small number added to a short large number that results in a very long result, even if most of the bits in the middle are zero. For this reason, additions must be performed on quantities that are approximately the same magnitude. If this criterion is met, the number of additions and subtractions that can occur in a series is not limited in the same way as the number of multiplications that can occur in a series. In the next section a theoretical justi cation for this intuitive explanation is presented. 15

18

5.4.2 Length and Magnitude de nition 1 The most signi cant position of X, msp(X), is de ned for X 6= 0. msp(X ) = blog jX jc. 2

de nition 2 The least signi cant position of X, lsp(X) is also de ned for X 6= 0. lsp(X) is the largest n, where n is an integer, for which jX j mod 2n = 0. For the

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

46

purposes of this de nition, N mod m is de ned as the smallest positive M for which N ? k  m = M where k = 0; 1; 2; 3 : : :

corrollary 1 The length of X, len(X), is the number of bits between the most signi cant position and least signi cant position of a number inclusive. Len(X ) = msp(X ) ? lsp(X ) + 1:

For example, for X=110101100: msp(X)=8, lsp(X)=2, and len(X)=7.

Discussion Intuitively, the msp and lsp are the bit positions of the most signi cant and least signi cant non-zero bits, respectively. Counting begins with zero in the position immediately to the left of the radix point and increases to the left. The position immediately to the right of the radix point is -1 counting decreases to the right. The goal for these de nitions is a formal system to determine the limits of DE arithmetic. In this section we derive bit position relationships for the fundamental operations of addition, subtraction, and multiplication based on speci c values for data. Once we determine how individual data bits behave, we determine the limits of this behavior. By determining the greatest msp, and the least lsp, we de ne a bucket within which we are guaranteed the data bits of a given number lies. Eventually, we exploit the concept of the bucket to determine the requirements for implementing DE arithmetic in the DCT. When we ensure that two buckets do not overlap, then it is guaranteed that any particular instance for a given number similarly does not overlap. Zero is treated as a special case. Since zero has no non-zero bits in its representation, there is no clear value for msp or lsp. It would be possible to de ne msp(0) = ?1 and lsp(0) = 1 which would allow for the treatment of zero as any other number. In this case, from the de nition, len(0) = ?1. However, this approach would require de ning an unnecessarily complicated arithmetic of in nities. To avoid this problem, we simply handle the case of zero separately at the conclusion of this section.

5.4. THEORY

47

Addition We now de ne the e ect on length and magnitude of addition, subtraction, and multiplication. First we discuss addition. For all cases, we assume that msp(X )  msp(Y ), and when msp(X ) = msp(Y ), then lsp(X )  lsp(Y ). If this is not the case, then the operands can be switched and the results are the same. We also assume that X; Y 6= 0. The case for zero is treated separately at the end of the next section. Case I: no overlap lsp(X ) > msp(Y ) 1xxxxx100000000 +

1xxx1

msp(X + Y ) = msp(X )

(5.6)

lsp(X + Y ) = lsp(Y )

(5.7)

Discussion By de nition, there is guaranteed to be no overlap between the two numbers. The sum is simply a bitwise or of the two numbers. No carry can happen from one to the other, so the magnitude equals that of the larger number. Nothing can cancel the rightmost bit, so the rightmost bit remains that of the smaller number. Case II: partial overlap lsp(X )  msp(Y ) and lsp(Y ) < lsp(X ) 1xxxxx100 +

1xxx1

msp(X + Y )  msp(X ) + 1

(5.8)

lsp(X + Y ) = lsp(Y )

(5.9)

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

48

Discussion Because the numbers now overlap, a carry out can occur at the leftmost bit increasing the magnitude of the larger number by one. By de nition, since the overlap is only partial, the rightmost bit can not be canceled, and so it remains that of the smaller number. Case III: total overlap lsp(X )  msp(Y ) and lsp(Y )  lsp(X ) 1xxxxx1 +

1xxxxx1

msp(X + Y )  msp(X ) + 1

(5.10)

lsp(X + Y )  lsp(X )

(5.11)

Example 1011 +

101 _____ 10000

Discussion As before, a carry out can occur causing the magnitude to grow one bit beyond that of the larger number. If the rightmost bits of the numbers line up, then a cancellation of the lower bits can occur causing the rightmost bit to shift to the left. This is illustrated in the example. In the limit a cancellation can cause the rightmost bit to equal the leftmost bit with a length of one bit. If the rightmost bit of the smaller subword is to the left of the rightmost bit of the larger subword, then the rightmost bit of the sum is that of the larger subword.

5.4. THEORY

49

Subtraction Now do the same for subtraction. For subtraction we additionally assume X 6= Y . This results in a di erence of exactly 0 and is treated later as a special case. Case I: no overlap lsp(X ) > msp(Y ) 1xxxxx100000000 -

1xxx1

msp(X ? Y )  msp(X )

(5.12)

lsp(X ? Y ) = lsp(Y )

(5.13)

Discussion By de nition, the di erence at the rightmost bit will be 0 ? 1 = 1. This guarantees the rightmost bit is that of the smaller word. It is possible for a borrow to occur from the leftmost bit, so they magnitude is either equal to or less than the larger number. Case II: partial overlap lsp(X )  msp(Y ) and lsp(Y ) < lsp(X ) 1xxxxx100 -

1xxx1

msp(X ? Y )  msp(X )

(5.14)

lsp(X ? Y ) = lsp(Y )

(5.15)

Discussion The arguments for case I apply equally to case II.

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

50

Case III: total overlap lsp(X )  msp(Y ) and lsp(Y )  lsp(X ) 1xxxxx1 -

1xxxxx1

msp(X ? Y )  msp(X )

(5.16)

lsp(X ? Y )  lsp(X )

(5.17)

Discussion For case III, the numbers can be of approximately equal magnitude, so it is possible for a large cancellation to occur zeroing out the higher order bits. The cancellation must stop where the numbers no longer have overlapping bits. This is at the rightmost bit of X. The rightmost bit can end up anywhere within X. In the case when the numbers are exactly equal, they cancel and go to zero. Summarizing for all 6 cases above:

msp(X  Y )  msp(X ) + 1

(5.18)

lsp(X  Y )  min(lsp(X ); lsp(Y ))

(5.19)

msp(X ) + msp(Y ) ? 1  msp(X  Y )  msp(X ) + msp(Y )

(5.20)

Multiplication Proof Again, we consider later the case where X = 0 or Y = 0.

5.4. THEORY

51 A



S

KA S 4

10

5 9

KA KA S 3

5

4

4 8

KA KA KA S 2

5

3

4

4

3 7

KA KA KA KA KA S

KA KA KA KA S 1

5

2

4

3

3

4

2 6

A K KA KA KA KA KA S

5

4

4

0

5

1

4

2

3

3

2

4

1 5

0

4

1

3

2

2

3

1

4

0

A K KA KA KA KA

3

3

0

3

1

2

2

1

3

0

0

S

4

3

A K KA KA KA

2

2

0

2

1

1

2

0

A K KA KA

1

1

0

1

1

0 0

S

A K KA

0

0

0

0

0 0 0

S

2

0

0 0 0 0

S

1

0

Figure 5.1: Partial product terms for a 6 bit x 5 bit multiply.

msp(X  Y ) = = = =

blog jX  Y jc + 1 blog jX j + log jY jc + 1 blog jX jc + blog jY jc + 1; or blog jX jc + blog jY jc + 2 msp(X ) + msp(Y ); or msp(X ) + msp(Y ) ? 1 2 2

2

2

2

2

2

len(X ) + len(Y ) ? 1  len(X  Y )  len(X ) + len(Y )

(5.21)

Discussion The partial product terms for an exemplary 6 bit x 5 bit multiply drawn in gure 5.1 illustrates this point. This gure represents the mantissa portion of a typical multiply, so that it is guaranteed that bits A , A , K , and K are all ones. The rightmost sum bit, S , must be non-zero since the rightmost partial product, K A is one and only zeros are added to it. The leftmost sum bit, S , can be either a zero or one depending on whether there is a carry out. Counting the terms shows the length is either the sum of the lengths of the multiplicands, or one less than that. 5

0

4

0

0

0

0

10

lsp(X  Y ) = lsp(X ) + lsp(Y ) ? 1

(5.22)

52

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

Proof There are two cases that can occur: either a carry out did or did not occur in the leftmost position. If it did, msp(X  Y ) = msp(X ) + msp(Y ) and len(X  Y ) = len(X )+ len(Y ). If it did not, msp(X  Y ) = msp(X )+ msp(Y ) ? 1 and len(X  Y ) = len(X ) + len(Y ) ? 1

lsp(X  Y ) = msp(X  Y ) ? len(X  Y ) + 1 for carry out = msp(X ) + msp(Y ) ? len(X ) ? len(Y ) + 1 = (msp(X ) ? len(X ) + 1) + (msp(Y ) ? len(Y ) + 1) ? 1 = lsp(X ) + lsp(Y ) ? 1 for no carry out = msp(X ) + msp(Y ) ? 1 ? len(X ) ? len(Y ) + 1 + 1 = lsp(X ) + lsp(Y ) ? 1

Zeros We have shown the characteristics of DE arithmetic for X; Y 6= 0. We now treat the special case where X or Y can equal zero and show that the above conditions are unchanged. Consider rst addition and subtraction where only one operand equals zero. Assume that Y = 0. If this is not the case, X and Y are swapped and the same conditions hold by symmetry. With Y = 0, X  Y = X , and msp(X  Y ) = msp(X ) and lsp(X  Y ) = lsp(X ). This ts the bounds previously derived above in equations 5.18 and 5.19. Now consider a subtraction where X = Y . The result equals exactly zero. In this case, the msp and lsp are unde ned. Since these values are unde ned, the bounds in equations 5.18 and 5.19 are not strictly met. A similar result occurs in multiplication when X; Y = 0. Again, the product is 0. Msp(0) and lsp(0) are unde ned. However,

5.4. THEORY

53

because 0 contains no non-zero bits, 0 cannot negatively interfere with the bits of another calculation. This case is discussed in detail in section 5.4.3 below.

5.4.3 Requirements for DE Addition and Subtraction For DE arithmetic to work, we must ensure that the right bits of R do not interfere with the left bits of L. The analytic requirement for this is:

lsp(L) > msp(R)

(5.23)

Assume we are adding two packed numbers R and R , and L and L . Furthermore, we call the results R and L, respectively. That is, R + R = R and L + L = L. We combine equations 5.18 and 5.19 above to determine the conditions after an addition or subtraction. Notice that we use max(msp(R ), msp(R )) since in equation 5.18 it was assumed that msp(X )  msp(Y ). We no longer make this assumption. 1

2

1

1

1

2

2

2

1

2

msp(R)  max(msp(R ); msp(R )) + 1

(5.24)

lsp(L) > min(lsp(L ); lsp(L ))

(5.25)

1

2

and 1

2

Combining these two equations, we see that condition 5.23 is met when

min(lsp(L ); lsp(L )) > max(msp(R ); msp(R )) + 1 1

2

1

2

(5.26)

Multiplication Assume we wish to multiply two packed subwords by a constant k. From equation 5.20 for multiplication, we know:

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

54

msp(R) < msp(R ) + msp(k)

(5.27)

lsp(L) = lsp(L ) + lsp(k) ? 1

(5.28)

1

and from equation 5.22 1

So that condition 5.23 above is guaranteed to be met when

lsp(L ) + lsp(k) ? 1 > msp(R ) + msp(k)

(5.29)

lsp(L ) > msp(R ) + (msp(k) ? lsp(k) + 1)

(5.30)

lsp(L ) > msp(R ) + len(k)

(5.31)

1

1

and simplifying 1

1

1

1

Now we must consider the case where L or R equals zero. In this case, the msp and lsp is unde ned. Condition 5.23 is not strictly met since the values are unde ned. However, since zero contains no non-zero bits, we can guarantee that bits from the zero can not interfere with bits from the other packed word. At rst glance, it appears as if there is a problem when L equals zero. In this case,the oating point hardware normalizes the packed data to the leftmost nonzero bit of R. If this number is added to a normally packed number, it appears that L and R are now aligned and the addition is performed incorrectly with L being added to R. However, when the word with L=0 was left shifted to normalize to R, its exponent eld was adjusted for the shift. Now when the addition is performed, the standard oating point addition hardware will right shift the number to account for the di erence in exponents. This occurs in the mantissa alignment phase of the

oating point addition. Once the number has been correctly right shifted, the two R positions are aligned and the single L position is aligned with zeros. The addition is performed correctly.

5.4. THEORY

55

5.4.4 Conditions for Implementing DE in DCT Ln,x

+/Ln,y

X

Ln+1,z

X

Rn+1,z

K Rn,x

+/Rn,y K

Figure 5.2: DCT calculation kernel. For the DCT to work correctly using DE arithmetic, we must guarantee that inequality 5.23 (lsp(L) > msp(R)) holds. To do this, we are interested in the bucket, or container, for L and R. A container is described by the maximum msp and minimum lsp and describes the maximum extent of non-zero bits for a given point in the calculation. We therefore describe the bucket for a data word, X, by msp(X );max and lsp(X );min. Msp(X );max is the largest possible msp value for X and de nes the left boundary of the bucket. Lsp(X );min is the smallest possible lsp value for X and de nes the right boundary of the bucket. The oating point hardware need not be modi ed to recognize this bucket. Instead, it is determined in advance by analyzing the calculation path that the buckets of the two subwords will not overlap. The DCT calculation path is modeled as a tree of stages. A single stage is illustrated in gure 5.2. Each stage consists of an add or subtract followed by a multiply by a constant, Ln ;z = (Ln;x  Ln;y )  k. The same calculation is done in parallel for R. The integer, n, is used to designate the stage in the DCT calculation. The DCT calculation is modeled as a tree of calculation kernels. This is illustrated in gure 5.3. Each box in gure 5.3 represents the kernel shown in gure 5.2. An +1

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

56

L0,0 L1,0 L0,1 L2,0

L0,2 L1,1 L0,3

L3,0

L0,4 L1,2 L0,5 L2,1

L0,6 L1,3 L0,7 R0,0 R1,0 R0,1

R2,0

R0,2 R1,1 R0,3

R3,0

R0,4 R1,2 R0,5 R2,1

R0,6 R1,3 R0,7

Figure 5.3: DCT calculation tree. Each box represents the kernel shown in gure 5.2.

5.4. THEORY

57

actual DCT implementation will vary depending on the speci c algorithm used. Each DCT coecient will have its own calculation path similar to the one in gure 5.3. The calculation in gure 5.3 is chosen for its simplicity and symmetry to represent a typical DCT calculation. Notice in gure 5.3 that the data at every stage has gone through an identical calculation. For example, L ; and L ; have gone through the same combination of adds or subtracts and multiplies. From this, we conclude the data bucket for any given stage is the same. For example, msp(L ; );max = msp(L ; );max. In general, for any stage n, and for any 2 data elements i and j, msp(Ln;i);max = msp(Ln;j );max and lsp(Ln;i);min = lsp(Ln;j );min. The same is true for the calculations for R. Beginning with the rst addition step in the kernel of gure 5.2, from equation 5.19: 20

21

20

21

lsp(Ln;x  Ln;y )  min(lsp(Ln;x ); lsp(Ln;y ))

(5.32)

Recalling that, lsp(Ln;x);min = lsp(Ln;y );min, so

lsp(Ln;x  Ln;y );min = lsp(Ln;x);min

(5.33)

For msp(R), from equation 5.18:

msp(Rn;x  Rn;y )  max(msp(Rn;x ); msp(Rn;y )) + 1

(5.34)

Again by recalling, msp(Rn;x);max = lsp(Rn;y );max, so

msp(Rn;x  Rn;y );max = msp(Rn;x);max + 1 Now, consider the multiplication step in gure 5.2, from equation 5.22:

lsp((Ln;x  Ln;y )(k))  lsp(Ln;x  Ln;y ) + lsp(k) ? 1 lsp((Ln;x  Ln;y )(k));min = lsp(Ln;x  Ln;y );min + lsp(k);min ? 1 = lsp(Ln;x);min + lsp(k);min ? 1

(5.35)

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

58

= lsp(Ln

;z );min

+1

For msp(R), from equation 5.20,

msp((Rn;x  Rn;y )(k))  msp(Rn;x  Rn;y ) + msp(k)  max(msp(Rn;x ); msp(Rn;y )) + msp(k) + 1 msp((Rn;x + Rn;y )(k));max = msp(Rn;x);max + msp(k);max + 1 = msp(Rn ;z );max +1

So, summarizing for a single stage :

lsp(Ln ;z );min = lsp(Ln;x);min + lsp(k);min ? 1 msp((Rn ;z );max = msp(Rn;x);max + msp(k);max + 1 +1

+1

We combine these equations recursively to allow for 3 stages. Because of the symmetry, we can freely interchange terms for a single stage. For example, msp(L ;x );max = msp(L ;y );max = msp(L ;z );max. 1

1

1

lsp(L ;z );min = lsp(L ;x);min + lsp(k);min ? 1 lsp(L ;z );min = lsp(L ;z );min + lsp(k);min ? 1 = lsp(L ;x);min + lsp(k);min ? 1 + lsp(k);min ? 1 = lsp(L ;x);min + 2  lsp(k);min ? 2 lsp(L ;z );min = lsp(L ;z );min + lsp(k);min ? 1 = lsp(L ;x);min + 2  lsp(k);min ? 2 + lsp(k);min ? 1 1

0

2

1

0

0

3

2

0

5.4. THEORY

59 = lsp(L ;x);min + 3  lsp(k);min ? 3 0

Similarly for msp(R ;z );max. 3

msp((R ;z );max = msp(R ;x);max + msp(k);max + 1 msp((R ;z );max = msp(R ;z );max + msp(k);max + 1 = msp(R ;x);max + msp(k);max + 1 + msp(k);max + 1 = msp(R ;x);max + 2  msp(k);max + 2 msp((R ;z );max = msp(R ;z );max + msp(k);max + 1 = msp(R ;x);max + 2  msp(k);max + 2 + msp(k);max + 1 = msp(R ;x);max + 3  msp(k);max + 3 1

0

2

1

0

0

3

2

0

0

Summarizing,

lsp(L ;z );min = lsp(L ;x);min + 3  lsp(k);min ? 3 msp((R ;z );max = msp(R ;x);max + 3  msp(k);max + 3 3

0

3

0

Initially, we know lsp(L ;x);min = C , where C is the constant of packing, and msp(R ;x);max = 8. So, for the nal stage of the DCT calculation to work from condition 5.23, 0

0

lsp(L ;z );min > msp((R ;z );max 3

3

(5.36)

lsp(L ;x );min + 3  lsp(k);min ? 3 > msp(R ;x);max + 3  msp(k);max + 3 0

0

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

60

C + 3  (msp(k);max ? len(k);max + 1) ? 3 > 8 + 3  msp(k);max + 3 C > 8 + 3  len(k);max + 3 C > 11 + 3  len(k);max This is the condition required for the DE arithmetic to work in the DCT. Intuitively, this shows that C, the constant of packing, must equal the length of the initial data subword plus one bit per addition stage plus the length of the constant for every multiplication. At each addition stage, two numbers of the same maximum magnitude are added together, so that the resulting maximum is two times the magnitude of the input. This multiplication by two results in one bit of length for every addition step. Since both numbers are multiplied by the same constant the magnitude of the constant is not important, only the length. The left subword grows on the right by the length of the constant for every multiply step.

5.4.5 Error in DCT Max Error There are two sources of error in the DCT implementation using DE arithmetic. The rst comes from rounding the constants to 8 bits, and the second comes from the potential loss of precision in R if L is a maximum. We begin by looking at the error due to rounding the constant, k. We decompose k into two components, k = kr + kerr , where kr is the rounded value of k and kerr is the error resulting from the round. We calculate the maximum error thru a single stage of the type drawn in gure 5.2. We decompose L into rounded and error components similarly to what was done for k, Ln;x = Ln;x;r + Ln;x;err , and the same for Ln;y . Since we are calculating maximum error, we consider only addition. A subtraction could produce an error of the same absolute magnitude. After the addition and multiplication, (Ln;x + Ln;y )  k = (Ln;x;r + Ln;y;r + Ln;x;err + Ln;y;err )  (kr + kerr )

5.4. THEORY

61 = (|Ln;x;r + {zLn;y;r )  k}r + (|Ln;x;err + {zLn;y;err )  kr} error

value

+ (|Ln;x;r + Ln;x;err + L{zn;y;r + Ln;y;err )  kerr}

= Ln

;z

error

+1

The left quantity is the calculated value, and the right quantities are the error terms. Taking just the error term and generalizing for stage n+1

Ln

;z;err

+1

= (Ln;x;err + Ln;y;err )  kr + (Ln;x;r + Ln;x;err + Ln;y;r + Ln;y;err )  kerr (5.37)

We observe that Ln;x;r  Ln;x;err and so Ln;y;err  Ln;y;r that we can neglect the terms with Ln;x;err and Ln;y;err . This gives

Ln

;z;err

+1

 (Ln;x;err + Ln;y;err )  kr + (Ln;x;r + Ln;y;r )  kerr

(5.38)

and

Ln

;z;err;max

+1

 (Ln;x;err;max + Ln;y;err;max)  kr;max +(Ln;x;r;max + Ln;y;r;max)  kerr;max

(5.39) (5.40)

Recalling from the symmetry of the tree structure in the calculation that the data elements of a given stage have the same maximum values, we know Ln;x;r;max = Ln;y;r;max and Ln;x;err;max = Ln;y;err;max. This gives

Ln

;z;err;max

+1

 2  (Ln;x;err;max  kr;max + Ln;x;r;max  kerr;max)

(5.41)

This gives an expression for a given stage's error based on the previous stage's error. The constant k is represented as the 8 bit rounded value. If we assume that k < 1,

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

62

and k is rounded to the nearest 2? , then the maximum possible error, kerr;max = 2? . This maximum error occurs when k > :5. If k  :5, the error is less than maximum. When we assume k < 1, Ln ;z;max = 2  Ln;x;max; this occurs when two numbers with maximum magnitude are added together. For simplicity, we assume that kr;max = 1. Furthermore, we know L ;err;max = 0 and L ;max = 2 , so we can set up the following recursive relationships: 8

9

+1

0

8

0

Ln;z;max Ln;z;err;max = 2  Ln? ;x;max =2  (Ln? ;x;err;max + kerr;max  Ln? ;x;r;max) 0 2 0

n

1

1

1

8

1

2

2

2

3

2

9

10

11

2  (0 + 2?  2 ) =1 9

8

2  (1 + 2?  2 ) =4 9

9

2  (4 + 2?  2 ) = 12 9

10

Now we must consider the error due to the potential loss of precision in R. L can expand right into the bu er zone, so there is never any error in rounding for L. The error for R occurs when L reaches its maximum magnitude. We describe the speci c packing parameters in detail in section 5.5.1. When R uses its full 11 integer bits, L has 11 integer bits and no fractional bits available. In this case, R is rounded to the one's place on every multiply or add. This means that the maximum error for every operation is 2? . From before, we have 1

Ln

;z;err;max

+1

= 2  (Ln;x;err;max + kerr;max  Ln;x;r;max)

(5.42)

5.4. THEORY

63

Since we are not considering the error due to multiplication, we set kerr = 0. Furthermore, we know each stage can add in an error of 2? as described above. Then the error for a single stage is 1

Ln

;z;err;max

+1

= 2  (Ln;x;err;max) + 2?

1

n 0

Ln;err;max 0

1

2  (0) + 2? = 2?

(5.43)

1

1

2

2  (2? ) + 2? = 1:5

3

2  (1:5) + 2? = 3:5

1

1

1

Thus the maximum error due to rounding of R is less than the magnitude of error from rounding the constants of multiplication. In this section we derived the maximum possible error when implementing the DCT using DE arithmetic. The typical error, however, is much lower than the maximum error. In the next section we use empirical data from actual mpeg movies to measure the typical error in implementing the DCT with DE arithmetic.

Measured Error In this section we use data from actual MPEG encoded movies to measure typical error. The data presented comes from measuring a full 2D 64 element DCT. Furthermore, the Loeer DCT implementation [LLM89] is used so that each data element undergoes at most two multiplies per 1D DCT.

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

64 movie

DCTs

left subword right subword max abs err mean abs err max abs err mean abs err tennis.mpg 2449 1.00 0.07 1.00 0.08 hula.mpg 19414 1.00 0.06 1.00 0.08

owers.mpg 48627 1.00 0.09 1.00 0.10 easter.mpg 18915 1.00 0.12 1.00 0.13 Table 5.2: Error compared to Loeer integer DCT. In table 5.2 mean absolute error and maximum absolute error is calculated using a full precision Loeer DCT for reference. The same Loeer calculation path is used for both calculations, but one uses full precision calculations and constants, and the other uses the DE methodology with reduced precision. The error is shown for a comparison of individual DCT coecients of these two implementations. The mean absolute error is symmetric for the right and left sides and is always less than 0.13. The maximum error is at most 1.00. So when compared to the same calculation path, the DE methodology adds little additional error to the DCT calculation.

5.4.6 Comparison with State of the Art In this section we analyze a small code kernel to give a comparison between DE arithmetic and the state-of-the-art in industry. We compare DE arithmetic, HewlettPackard's MAX instructions, and Intel's MMX instructions. The code kernel analyzed is the calculation shown below, a multiplication by a constant followed by an addition. This type of calculation is common in multimedia algorithms.

Z =X k+Y The code for HP MAX is shown in gure 5.4. MAX allows the packing of four 16 bit subwords, so it is assumed the calculation is done on four words in parallel in the example. It is further assumed that the data is both loaded from memory before

5.4. THEORY

65

; load packed words LDD X(%r20), %r1 LDD Y(%r20), %r2 ; do calculation HSHRADD %r1, HSHRADD %r4, HSHRADD %r3, HADD,ss %r2,

2, %r1, %r4 2, %r1, %r3 3, %r4, %r4 %r4, %r4

; store packed result STD %r4, Z(%r20)

Figure 5.4: Code kernel for HP's MAX instructions. the calculation and stored to memory after the calculation in the appropriate packed format, 16 bit integers. This would be the case if the kernel shown is one step in a long series of calculations The calculation step requires 1 instruction for the add and 3 shift-right-and-adds for the multiply. The multiplication is shown for the constant 1.41406250. This is app proximately equal to 2, a common constant for the DCT calculation. The multiply is synthesized from three HSHRADD primitives as shown in gure 5.4. A description of this technique for synthesizing multiplication from shift-and-add primitives is available in [LM97]. The MAX instructions allow saturating arithmetic. This does not save any cycles for the kernel shown, but in an actual implementation, this feature could save additional execution cycles. The code for the same kernel executed using software only DE arithmetic is shown in gure 5.5. Since software DE arithmetic allows only two subwords, calculation is done on only two words in parallel. It is assumed that the initial operands and nal product are loaded from and stored to memory in oating point format. Therefore, the kernel includes code for packing and unpacking. The FMPYFADD instruction has a three cycle latency, but a thruput of just one cycle. DE takes advantage of existing oating point instructions so that the calculation is performed with a single

66

CHAPTER 5. DOUBLY ENHANCED ARITHMETIC

; load and pack words
FLDD      X1(%r20), %fr1            ; 1
FLDD      X2(%r20), %fr2            ; 2
FMPYFADD  %fr1, %frp, %fr2, %fr3    ; 3
FLDD      Y1(%r20), %fr7            ; 4
FLDD      Y2(%r20), %fr8            ; 5
FMPYFADD  %fr7, %frp, %fr8, %fr4    ; 6
FLDD      K(%r20), %fr5             ; 7

; do calculation
FMPYFADD  %fr3, %fr5, %fr4, %fr6    ; 8

; store and unpack results
FADD,DBL  %fr6, %frb, %fr1          ; 9  round
FSUB,DBL  %fr1, %frb, %fr1          ; 10 round
FMPY,DBL  %fr1, %fri, %fr2          ; 11 unpack L
FSTD      %fr2, Z1(%r20)            ; 12
FSUB,DBL  %fr6, %fr1, %fr7          ; 13 unpack R
FSTD      %fr7, Z2(%r20)            ; 14

Figure 5.5: Code kernel for software DE arithmetic.

; load packed words into MMX registers
movq    MM0, ptr[X]
movq    MM1, ptr[Y]
movq    MM2, ptr[K]

; do calculation
pmulhw  MM0, MM2
paddsw  MM1, MM0

; store packed result
movq    ptr[Z], MM1

Figure 5.6: Code kernel for Intel's MMX instructions.

                               4 way MAX   2 way DE   4 way MMX
4 pixels
  pathlength (instructions)        7          28           6
  execution time (cycles)          6          16           8
8 pixels
  pathlength (instructions)       14          56          12
  execution time (cycles)          7          21          12

Table 5.3: Kernel calculation comparison for the DE, MAX, and MMX code kernels.

Finally, we consider the same calculation executed using Intel's MMX instructions. MMX style code is presented in figure 5.6. The same assumption made for MAX is made here: the operands are stored in memory as 16 bit integers. MMX allows operations on four words in parallel, so the calculation is performed on four subwords in parallel. In the current MMX implementation there is only one multiplier, so the multiply does not perform truly parallel calculations and the instruction executes in 3 cycles. This can be corrected in future implementations by adding additional multiply units, but at significant area cost for each additional multiplier. MMX also has saturating arithmetic similar to MAX. MMX requires a final expensive instruction to clear the MMX state after completing MMX calculations; this is not shown in the example.

Table 5.3 compares instruction pathlength and execution time for the calculation code kernels presented. Data for MAX and MMX is shown for both a single iteration of the kernel, as would be the case when calculating values for 4 pixels, and for 2 iterations of the loop, as would be the case when calculating values for 8 pixels. Because DE has only 2 way parallelism, 2 iterations and 4 iterations are shown for the 4 and 8 pixel cases, respectively. Execution time is calculated assuming all instructions can be pipelined for single cycle thruput. Furthermore, it is assumed all multiply and multiply-add instructions have a 3 cycle latency, and the remainder of instructions complete in 1 cycle.

4 way MAX uses 4 one cycle arithmetic instructions to perform the calculation


and 3 memory operations to load and store the data. Assuming a PA-8000 architecture, where 2 arithmetic and 2 memory operations can be done in parallel, these instructions can be scheduled to complete in 6 cycles when data dependencies are considered. The 8 pixels requiring 2 iterations of the loop in figure 5.4 are calculated in 7 cycles: the 2 LDD instructions for the second iteration can occur in the same cycle as the first HSHRADD instruction of the first iteration, and the subsequent instructions of the second iteration follow in each successive cycle. 2 way DE uses a single floating point instruction with a single cycle thruput to perform the calculation. Packing requires 5 loads and 2 fused multiply-adds. Unpacking requires 4 arithmetic and 2 store operations, for a total of 7 arithmetic and 7 memory operations. Again assuming a PA-8000 architecture, the kernel can be completed in 14 cycles. 2 kernel iterations, to calculate 4 pixels, complete in 16 cycles; 4 iterations, to calculate 8 pixels, complete in 21 cycles. Code schedules for 1, 2, and 4 iterations are given in tables 5.4, 5.5, and 5.6, respectively. MMX uses two instructions for the calculation and 4 memory operations to load and store the data. The kernel completes in 8 cycles due to data dependencies and the 3 cycle pmulhw instruction [PWW97]. 8 pixels are calculated in 12 cycles. The Pentium architecture can do 1 ALU operation and 1 memory operation in parallel. For 2 iterations, the second iteration's load instructions are overlapped with the first iteration's multiply instruction. The second iteration's multiply is initiated concurrently with the first iteration's store, leaving 4 cycles to complete the second iteration. In general, DE takes advantage of extremely efficient packing, but unpacking and format conversion are expensive. Format conversion is not included in this example. DE utilizes the existing high speed floating point hardware to perform efficient calculations and requires no special hardware support. It has a significant disadvantage, however, in that only 2 way parallelism is possible. HP's MAX achieves the best performance while adding only a fraction of a percent of additional die area [Lee97]. MMX adds an entirely new functional unit, including a new 16 bit multiplier. Software DE with 2 way parallelism is not twice as slow as MMX with 4 way parallelism and added hardware. As the number of iterations of the code kernel increases (going from a 4 pixel calculation to an 8 pixel calculation), the ratio of DE


count   ld/st     ld/st     ALU           ALU
  1     1-FLDD    2-FLDD
  2     4-FLDD    5-FLDD    3-FMPYFADD
  3     7-FLDD              6-FMPYFADD
  4
  5
  6                         8-FMPYFADD
  7
  8
  9                         9-FADD
 10                         10-FSUB
 11                         11-FMPY       13-FSUB
 12     14-FSTD
 13
 14     12-FSTD

Table 5.4: Code scheduling for 1 iteration of DE kernel on PA-8000.

count   ld/st     ld/st     ALU           ALU
  1     1a-FLDD   2a-FLDD
  2     4a-FLDD   5a-FLDD   3a-FMPYFADD
  3     1b-FLDD   2b-FLDD   6a-FMPYFADD
  4     4b-FLDD   5b-FLDD   3b-FMPYFADD
  5     7a-FLDD   7b-FLDD   6b-FMPYFADD
  6                         8a-FMPYFADD
  7
  8                         8b-FMPYFADD
  9                         9a-FADD
 10                         10a-FSUB
 11                         11a-FMPY      9b-FADD
 12                         13a-FSUB      10b-FSUB
 13     14a-FSTD            11b-FMPY      13b-FSUB
 14     12a-FSTD  14b-FSTD
 15
 16     12b-FSTD

Table 5.5: Code scheduling for 2 iterations of DE kernel on PA-8000.


count   ld/st     ld/st     ALU           ALU
  1     1a-FLDD   2a-FLDD
  2     4a-FLDD   5a-FLDD   3a-FMPYFADD
  3     1b-FLDD   2b-FLDD   6a-FMPYFADD
  4     4b-FLDD   5b-FLDD   3b-FMPYFADD
  5     7a-FLDD   7b-FLDD   6b-FMPYFADD
  6     1c-FLDD   2c-FLDD   8a-FMPYFADD
  7     4c-FLDD   5c-FLDD   3c-FMPYFADD
  8     1d-FLDD   2d-FLDD   6c-FMPYFADD   8b-FMPYFADD
  9     4d-FLDD   5d-FLDD   9a-FADD       3d-FMPYFADD
 10     7c-FLDD   7d-FLDD   10a-FSUB      6d-FMPYFADD
 11                         11a-FMPY      9b-FADD
 12                         13a-FSUB      8c-FMPYFADD
 13     14a-FSTD            10b-FSUB      8d-FMPYFADD
 14     12a-FSTD            11b-FMPY      13b-FSUB
 15     14b-FSTD            9c-FADD
 16                         10c-FSUB      9d-FADD
 17     12b-FSTD            11c-FMPY      10d-FSUB
 18                         13c-FSUB      11d-FMPY
 19     14c-FSTD            13d-FSUB
 20     12c-FSTD  14d-FSTD
 21     12d-FSTD

Table 5.6: Code scheduling for 4 iterations of DE kernel on PA-8000.


execution time to MMX execution time decreases from 2 to 1.75. Furthermore, this example is a worst case for DE, since only one calculation instruction is performed in the "do calculation" portion of figure 5.5 against the full overhead of packing and unpacking. As the number of calculations in this section increases, the fraction of overhead spent in packing and unpacking decreases, and the relative performance of software-only DE improves. Finally, in all 3 cases only loop unrolling is employed. If software pipelining were additionally applied, relative DE performance would likely improve further, since data dependencies leave empty instruction slots in the code schedule; this, however, requires more loop iterations to be effective.


5.5 Implementation

5.5.1 Packed Formats

In order to determine how to pack the operands for the DCT examples, we must first determine the largest possible value. If the DCT is used in JPEG or MPEG, then the inputs are restricted to 8 bits or less. We do not consider here the higher precision specifications where the inputs can be 12 bits, as this format is not commonly implemented. We know that the terms can grow by at most 3 bits within an 8 point 1-D DCT, so at most 11 bits are required to hold the data. However, since the sign bit for L is held in the sign bit of the entire floating point number, only 10 bits are needed to represent L. We begin by allocating 21 bits to hold the integer portions of L and R. Since we have 53 bits available in the IEEE double precision format, this leaves 32 bits. Next, we must account for the increase in length of L after every multiply (remember that the R component is automatically rounded, so there is no problem there). From the above calculation, L can grow by at most 32 bits. Multiplication by a constant results in an increase in length equal to the length of the constant. To be able to perform a 1-D Lee's DCT, we must be able to do 4 sequential multiplies, so we express the constants as 32/4 = 8 bit quantities and the left component grows as shown below. The 11th bit of L is stored as the sign bit for the entire floating point quantity.

left part  mult 1   mult 2   mult 3   mult 4   right part
xxxxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx yyyyyyyyyyy

The implemented format shows that L must be spaced 43 bits over from R. Therefore, for these experiments, the constant of packing was set equal to 2^43.

5.5.2 Rounding Implementation

Any method may be used to implement the round operation required to unpack the operands as described in equations 5.4 and 5.5 above. For these experiments,


the round was implemented by performing an addition and a subtraction with a large predetermined constant. This method takes advantage of the round that must be performed by the hardware when a number becomes too large to fit into the available precision. First, the large constant is added to force the round at the desired position; second, the constant is subtracted to restore the correct value to the result.

Example 1. Assume a 6 digit decimal system where it is desired to round X to the nearest 1000. For this example, the correct constant is 100000000.

X = 498657
X + 100000000 = 100498657
X + 100000000 = 100499 * 10^3   (a round must be performed to fit the
                                 result into the available 6 decimal places)
(X + 100000000) - 100000000 = 499000

The correctly rounded result is obtained.

Example 2. The method is also applicable to binary. Assume a 6 bit binary system where it is desired to round X to the nearest 8 = 1000_2. For this example also, the correct constant is 100000000_2.

X = 110111
X + 100000000 = 100110111
X + 100000000 = 100111 * 2^3    (a round must be performed to fit the
                                 result into the available 6 binary places)
(X + 100000000) - 100000000 = 111000


The correctly rounded result is again obtained. Implementing the round in this manner, with an add and a subtract, means that 2 operands can be completely unpacked with a multiply, an add, and 2 subtracts. Some of these operations can be combined into multiply-add or multiply-subtract instructions.
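To make the pack, calculate, and unpack sequence concrete, here is a minimal C sketch (ours, not the original implementation), assuming the 2^43 packing constant of section 5.5.1 and IEEE double precision round-to-nearest; the names SCALE and BIG are illustrative:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double SCALE = ldexp(1.0, 43);  /* packing constant 2^43      */
    const double BIG   = ldexp(1.5, 95);  /* forces a round at bit 43   */

    /* pack: L rides in the high bits, R in the low bits */
    double X = 37.0 * SCALE + (-5.0);
    double Y = 11.0 * SCALE + 2.0;

    /* one SIMD-style step on both subwords at once: Z = X*k + Y */
    double Z = X * 3.0 + Y;

    /* unpack: adding and subtracting BIG rounds Z to the nearest
       multiple of 2^43, which isolates the left subword */
    double Zl = (Z + BIG) - BIG;
    printf("left  = %g\n", Zl / SCALE);   /* 37*3 + 11 = 122 */
    printf("right = %g\n", Z - Zl);       /* -5*3 +  2 = -13 */
    return 0;
}

The add/subtract pair is exactly the rounding trick of the two examples above, carried out at the boundary between the subwords.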

5.5.3 Method for Extracting Parallelism

The calculation of a 64 element 8 point two dimensional DCT can be done by performing 8 one dimensional DCTs on the rows followed by 8 one dimensional DCTs on the columns of the input matrix. Therefore, it might seem reasonable to use the doubly enhanced methodology to calculate two rows or two columns in parallel. However, a problem arises when the switch is made between doing calculations on the rows and on the columns: the operands would have to be unpacked and then packed again in a transposed layout. The amount of packing and unpacking overhead would be doubled, since each element would now have to be packed and unpacked twice. To avoid this additional overhead, calculations are done on two independent 2-D DCTs in parallel, rather than on two rows or two columns within a single DCT, as sketched below. The amount of parallelism achieved is identical, and no additional packing and unpacking is required. The disadvantage of performing the two 2-D DCTs in parallel is the slight initial delay in having to decode two 8x8 blocks before the DCT may be performed in parallel on both blocks. Also, fast storage is needed for two 64 element blocks rather than just one. For MPEG, the additional startup delay is not important, and the cache sizes of most workstations can usually accommodate two 64 element blocks quite easily, since 128 x 8 = 1 Kbyte.
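A sketch of the driver this scheme implies, in C; pack(), unpack(), and fp_dct_8x8() are hypothetical helper names (not from the original mpeg_play source), standing in for the 2^43 pack, the round-based unpack of section 5.5.2, and the existing floating point DCT path:

/* assumed helpers, declared here so the sketch is self-contained */
double pack(double l, double r);                 /* 2^43*l + r        */
void   unpack(double x, double *l, double *r);   /* add/subtract round */
void   fp_dct_8x8(double block[64]);             /* existing FP DCT    */

/* two independent 8x8 blocks, one DCT pass: block a rides in the
   left subwords, block b in the right subwords */
void dct_two_blocks(const short a[64], const short b[64],
                    short out_a[64], short out_b[64])
{
    double packed[64];
    for (int i = 0; i < 64; i++)
        packed[i] = pack((double)a[i], (double)b[i]);

    fp_dct_8x8(packed);              /* runs once for both blocks */

    for (int i = 0; i < 64; i++) {
        double l, r;
        unpack(packed[i], &l, &r);
        out_a[i] = (short)l;
        out_b[i] = (short)r;
    }
}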

5.6 Study 1: JPEG

Implementations of DE arithmetic have been completed for both the JPEG and MPEG image compression algorithms. The implementation of DE in JPEG was completed for an earlier work and is described completely in [ZL94]. That study described a


speedup of 15% obtained in the DCT portion of JPEG. The next section presents the results for a DE implementation in MPEG in detail.

5.7 Study 2: MPEG

5.7.1 Experimental Methodology

We compiled both the standard distribution and our modified code with the HP 92453-01 A.09.75 HP C compiler using +O3 optimization. The application was executed on an HP 9000/725/75 workstation. Each movie clip was played 40 times (20 each, base and enhanced), and the times were averaged to obtain the speedup data. Initially the execution profiling tool gprof was used to time individual procedures, but this proved too inaccurate to measure the parameters for the speedup model discussed below [Var93]. We also experimented with the gettimeofday() call, which reads the system clock and reports the time in seconds and microseconds, but the overhead of making the procedure call again made our measured results too inaccurate for the speedup model. Finally, we read directly from the system interval timer in order to measure wall clock time. The PA-RISC Architecture Manual states [Pac92]: "...the internal register is continually counting up by 1 at a rate which is implementation-dependent and between twice the 'peak instruction rate' and half the 'peak instruction rate'." By timing the same piece of code with both the interval timer and the Unix time command, the interval timer count is calibrated to time measured in seconds. Reading the interval timer directly, we can make an accurate measurement of wall clock time without the excessive overhead of system calls. The disadvantage is that any system interrupts that occur in the procedure being timed are included in the measured execution time; however, because the procedures we time are relatively short and because we average over 20 runs, this error is largely eliminated. We continue to use gprof to measure total execution time, since this allows us to measure only the time spent in user code. The execution time reported by gprof gives the same results as the Unix time command.
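For concreteness, a minimal sketch of the measurement harness (ours, not the original code); read_interval_timer() stands in for the platform specific read of the PA-RISC interval timer and is assumed to be provided elsewhere, and ticks_per_second comes from the calibration against the Unix time command described above:

extern unsigned long read_interval_timer(void);  /* platform specific */

static double ticks_per_second;  /* set once, by timing a known-duration
                                    region against the `time` command  */

/* time one call of a procedure, in wall clock seconds; averaging over
   many runs, as in the experiments, washes out interrupt noise */
double time_call(void (*proc)(void))
{
    unsigned long t0 = read_interval_timer();
    proc();
    unsigned long t1 = read_interval_timer();
    return (double)(t1 - t0) / ticks_per_second;
}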


5.7.2 Speedup Model

In the next sections, measured execution time data is presented for a variety of experimental runs. The observed speedup ranges from 1.08 to 1.49. In this section we present a speedup model to explain the nature of this observed speedup. Ultimately we define speedup in terms of four measured parameters, F, K, D, and V, which are defined below. We begin by stating that the speedup equals the total execution time of the base implementation, $T_{base}$, divided by the total execution time of the DE enhanced implementation, $T_{DE}$:

$$ \mathrm{Speedup} = \frac{T_{base}}{T_{DE}} \qquad (5.44) $$

We now express the speedup equation in terms of individual execution time parameters. $T_{oh}$ is defined as the total amount of execution time spent in overhead in the DE implementation. This overhead consists primarily of packing and unpacking. In this implementation, the unpacking is done in the FP DCT, so the overhead term consists only of packing. The data is packed after the Huffman decode. In this implementation a trick is used to reduce packing overhead: the data is tested to detect zeros, and when a zero coefficient is detected, the packing arithmetic is not performed, since the calculation would be simply an addition of zero. This eliminates unnecessary arithmetic; however, the overhead now depends on how many non-zero terms are in the data. This packing is done in the ParseReconBlock routine. We calculate the overhead by determining the average time spent per call in the base ParseReconBlock routine and subtracting the average time per call spent in the DEParseReconBlock routine. This difference is the average pack overhead per DEParseReconBlock call. There is additional overhead from the extra control instructions necessary for doing two DCTs in parallel; the added instructions consist of a modulo division, an additional shift operation, and several additional branches. This overhead is negligible. The term $T_{oh}$, then, refers to the overhead spent in the DEParseReconBlock routine.
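As an illustration of the zero-detection trick, a hedged sketch in C (the actual DEParseReconBlock code differs; SCALE and all names here are ours):

/* slot: the packed double holding this coefficient position for two
   blocks; coef: a dequantized coefficient from the Huffman decode;
   left: nonzero when the coefficient belongs to the left subword;
   SCALE: the 2^43 packing constant (illustrative name)            */
void pack_coef(double *slot, int coef, int left, double SCALE)
{
    if (coef == 0)
        return;                 /* adding zero: skip the arithmetic */
    if (left)
        *slot += SCALE * (double)coef;
    else
        *slot += (double)coef;
}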


$T_{iDCT,base}$ is the total amount of time spent in the integer DCT procedure in the base implementation. In the DE implementation, the DCTs are calculated using both integer and floating point arithmetic. $T_{iDCT,DE}$ and $T_{fDCT,DE}$ are the total amounts of time spent in the integer DCT and floating point DCT in the DE implementation, respectively. The difference in execution time is due to these different methods for computing the DCTs and the additional overhead in the DE implementation. We can now write

$$ T_{DE} = T_{base} - T_{iDCT,base} + T_{fDCT,DE} + T_{iDCT,DE} + T_{oh} \qquad (5.45) $$

The time to execute the DE implementation is the same as the time to execute the base implementation with the DE DCT time substituted for the base DCT time, plus the DE overhead time. From this,

$$ \mathrm{Speedup} = \frac{T_{base}}{T_{base} - T_{iDCT,base} + T_{fDCT,DE} + T_{iDCT,DE} + T_{oh}} \qquad (5.46) $$

and

$$ \mathrm{Speedup} = \frac{1}{1 - \dfrac{T_{iDCT,base}}{T_{base}} + \dfrac{T_{fDCT,DE}}{T_{base}} + \dfrac{T_{iDCT,DE}}{T_{base}} + \dfrac{T_{oh}}{T_{base}}} \qquad (5.47) $$

We now express each of the terms in the denominator of equation 5.47 in terms of new parameters that are relative to execution time. We begin by defining D as the fraction of time spent in the DCT calculation of the base implementation:

$$ D = \frac{T_{iDCT,base}}{T_{base}} \qquad (5.48) $$

V is the fraction of time spent in overhead:

$$ V = \frac{T_{oh}}{T_{base}} \qquad (5.49) $$

Substituting these parameters into equation 5.47 gives

$$ \mathrm{Speedup} = \frac{1}{1 - D + \dfrac{T_{fDCT,DE}}{T_{base}} + \dfrac{T_{iDCT,DE}}{T_{base}} + V} \qquad (5.50) $$


We continue doing the same with the remaining two terms. We first describe the total time spent in the DCT routines. The total time spent in the integer DCT in the base implementation is equal to the time spent in each integer DCT procedure call times the number of times the integer DCT is called in the base implementation:

$$ T_{iDCT,base} = C_{iDCT,base} \cdot t_{iDCT} \qquad (5.51) $$

where C is the number of times the procedure is called and t is the time to execute one procedure call. Similarly, for the DE implementation,

$$ T_{iDCT,DE} = C_{iDCT,DE} \cdot t_{iDCT} \qquad (5.52) $$

and

$$ T_{fDCT,DE} = C_{fDCT,DE} \cdot t_{fDCT} \qquad (5.53) $$

K is defined as the ratio of the execution time of one call to the floating point DCT procedure to that of one call to the integer DCT procedure:

$$ K = \frac{t_{fDCT}}{t_{iDCT}} \qquad (5.54) $$

Because of this particular implementation of the DE methodology, we execute DCTs in parallel only if they occur in the same macroblock. Therefore, when a macroblock has an odd number of DCTs, the last one is not done in parallel. Two integer DCTs are executed in parallel using a single floating point DCT; the DCTs that are not executed in parallel are executed using the original integer DCT. Formally,

$$ C_{iDCT,base} = 2 \cdot C_{fDCT,DE} + C_{iDCT,DE} \qquad (5.55) $$

We define F as the fraction of DCTs that can be doubly enhanced, and (1 - F) as the fraction of DCTs that cannot be doubly enhanced:

$$ (1 - F) = \frac{C_{iDCT,DE}}{C_{iDCT,base}} \qquad (5.56) $$


from which we derive

$$ F = \frac{2 \cdot C_{fDCT,DE}}{C_{iDCT,base}} $$

Also,

$$ \frac{T_{fDCT,DE}}{T_{base}} = \frac{t_{fDCT} \cdot C_{fDCT,DE}}{T_{base}}
= \frac{t_{fDCT}}{t_{iDCT}} \cdot \frac{t_{iDCT} \cdot C_{fDCT,DE}}{T_{base}}
= \frac{t_{fDCT}}{t_{iDCT}} \cdot \frac{t_{iDCT} \cdot C_{iDCT,base}}{T_{base}} \cdot \frac{C_{fDCT,DE}}{C_{iDCT,base}}
= K \cdot D \cdot \frac{F}{2} $$

and similarly

$$ \frac{T_{iDCT,DE}}{T_{base}} = \frac{t_{iDCT} \cdot C_{iDCT,DE}}{T_{base}}
= \frac{t_{iDCT} \cdot C_{iDCT,base}}{T_{base}} \cdot \frac{C_{iDCT,DE}}{C_{iDCT,base}}
= \frac{T_{iDCT,base}}{T_{base}} \cdot \frac{C_{iDCT,DE}}{C_{iDCT,base}}
= D \cdot (1 - F) $$

Substituting into equation 5.50 gives

$$ \mathrm{Speedup} = \frac{1}{1 - D + \left(K \cdot D \cdot \frac{F}{2}\right) + \left(D \cdot (1 - F)\right) + V}
= \frac{1}{1 + D \cdot \left(-1 + \frac{K \cdot F}{2} + (1 - F)\right) + V} \qquad (5.57) $$


$$ = \frac{1}{1 + D \cdot F \cdot \left(\frac{K}{2} - 1\right) + V} \qquad (5.58) $$

This is the equation used to model speedup. Speedup is expressed in terms of the variables D, F, K, and V.
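As a worked check (our numbers, using the hula_2.mpg row of the first table below: F = 0.94, K = 0.72, D = 0.45, V = 0.020):

$$ \mathrm{Speedup} = \frac{1}{1 + 0.45 \cdot 0.94 \cdot \left(\frac{0.72}{2} - 1\right) + 0.020} = \frac{1}{1 - 0.271 + 0.020} \approx 1.33 $$

which agrees with the calculated speedup of 1.34 in the table to within rounding.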

5.7.3 Base

The base mpeg_play code uses the original DCT algorithm, which has not been optimized with additional branches, and does not do any sparse DCT calculations. The sparse DCT is executed in the case when all the DCT coefficients except the first are zero; in this case an alternate DCT with far less computation can be executed. These options are set using the compiler directives -DORIGINAL_DCT and -DNO_SPARSE_DCT. By not optimizing the base implementation, the application executes slower than its peak performance; however, the speedup model is simplified, since only one type of DCT is used throughout the execution. Later sections add the DCT optimizations. The table below reports speedup in total execution time when no dithering or display is performed. We eliminate dithering by using the "-dither none" option. This makes the DCT a larger fraction of the total execution time, so the resulting speedups become much larger; however, the experimental variation due to dithering is eliminated.

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.94  0.72  0.45   0.020    2.78    1.98      1.34        1.40
mjackson.mpg  1.00  0.85  0.54   0.024   12.79    8.99      1.40        1.42
sukhoi.mpg    1.00  0.76  0.54   0.019   19.39   13.16      1.46        1.47
bicycle.mpg   0.93  0.72  0.43   0.019   11.94    8.91      1.31        1.34
easter.mpg    0.95  0.50  0.48   0.037    3.64    2.44      1.44        1.49
flower.mpg    0.86  0.60  0.38   0.021   10.95    8.30      1.26        1.32
coaster.mpeg  0.98  0.77  0.44  -0.006   14.76   10.74      1.37        1.37


The predicted speedup matches the observed speedup well. With this success in validating the model for no dithering, we measure speedups for ordered dithering below.

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.94  0.73  0.34   0.016    3.72    2.95      1.23        1.26
mjackson.mpg  1.00  0.84  0.45   0.013   15.34   11.52      1.33        1.33
sukhoi.mpg    1.00  0.76  0.46   0.012   22.84   16.64      1.38        1.37
bicycle.mpg   0.93  0.72  0.35   0.016   14.85   11.79      1.23        1.26
easter.mpg    0.95  0.50  0.42   0.031    4.14    2.93      1.37        1.41
flower.mpg    0.86  0.60  0.30   0.018   13.86   11.20      1.19        1.24
coaster.mpeg  0.98  0.77  0.35   0.013   18.48   14.67      1.25        1.26

Finally, we measure a third type of dithering, the color dither.

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.94  0.72  0.31   0.015    4.08    3.31      1.21        1.23
mjackson.mpg  1.00  0.85  0.42   0.012   16.34   12.55      1.30        1.30
sukhoi.mpg    1.00  0.76  0.43   0.011   24.31   17.99      1.34        1.35
bicycle.mpg   0.93  0.72  0.32   0.014   16.21   13.15      1.21        1.23
easter.mpg    0.95  0.50  0.40   0.036    4.37    3.17      1.34        1.38
flower.mpg    0.86  0.60  0.27   0.015   15.20   12.58      1.18        1.21
coaster.mpeg  0.98  0.77  0.32   0.009   20.12   16.34      1.23        1.23

It is interesting to note that when the DE methodology was applied to JPEG in [ZL94], this original DCT with no sparse DCT was what was used in the public distribution. For this implementation, we achieve an average speedup of 1.31 for the ordered dither and 1.28 for the color dither.

5.7.4 Faster DCT

We now apply our measurement methodology to slightly more complicated MPEG implementations. First we add the case of a more efficient DCT. This is the standard


DCT in the Version 2.3 mpeg_play. It introduces the technique of adding additional branch instructions to test whether the DCT coefficients are zero; if a zero is detected, that branch of the DCT calculation is not performed. Because the additional branch instructions introduce additional overhead, the total execution time is the result of a trade-off between the time saved by eliminated arithmetic and the time wasted on added overhead. For the integer DCT the trade-off is successful and the DCT executes faster. For the FP case, however, the DCT is typically slowed down. For this reason, when we use the DE methodology, we use the original DCT, which does not have the extra conditional statements. Data is presented below for no dither, ordered dither, and color dither. Generally the K term becomes larger, indicating that the integer DCT is sped up relative to the floating point DCT.

No Dither

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.94  1.14  0.34   0.025    2.33    1.97      1.13        1.19
mjackson.mpg  1.00  1.26  0.44   0.018   10.51    8.99      1.17        1.17
sukhoi.mpg    1.00  1.05  0.46   0.022   16.48   13.14      1.25        1.25
bicycle.mpg   0.93  1.05  1.03   0.065   10.35    8.93      1.64        1.16
easter.mpg    0.95  0.62  0.42   0.160    3.29    2.43      1.13        1.35
flower.mpg    0.86  0.85  0.30   0.025    9.71    8.29      1.14        1.17
coaster.mpeg  0.98  0.98  0.39   0.024   13.22   10.74      1.20        1.23

Experimental error is responsible for the discrepancy between the calculated and observed speedup for bicycle.mpg. During the benchmark execution of the unenhanced code for bicycle.mpg, 9 of the 20 execution iterations resulted in abnormally high execution times, caused by an unexpected load on the machine during benchmark timing. Because the runs are averaged over 20 execution iterations, this is normally not a problem; in this case, however, approximately half the runs had significant error. This caused an impossible value of 1.03 for D, and consequently a mismatch between calculated and observed speedup.

Ordered Dither

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.94  1.12  0.24   0.017    3.27    2.96      1.09        1.11
mjackson.mpg  1.00  1.26  0.35   0.023   13.10   11.60      1.12        1.13
sukhoi.mpg    1.00  1.05  0.39   0.018   20.08   16.79      1.20        1.20
bicycle.mpg   0.93  1.05  0.27   0.017   13.18   11.81      1.11        1.12
easter.mpg    0.95  0.62  0.37   0.032    3.75    2.91      1.27        1.29
flower.mpg    0.86  0.85  0.23   0.019   12.64   11.21      1.11        1.13
coaster.mpeg  0.98  0.98  0.30   0.013   17.03   14.70      1.16        1.16

Color Dither

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.94  1.13  0.22   0.015    3.59    3.30      1.08        1.09
mjackson.mpg  1.00  1.26  0.33   0.014   14.06   12.49      1.12        1.13
sukhoi.mpg    1.00  1.05  0.36   0.020   21.31   17.98      1.18        1.19
bicycle.mpg   0.93  1.05  0.24   0.015   14.60   13.20      1.10        1.11
easter.mpg    0.95  0.62  0.35   0.029    4.04    3.15      1.25        1.28
flower.mpg    0.86  0.85  0.21   0.015   14.01   12.61      1.10        1.11
coaster.mpeg  0.98  0.98  0.27   0.017   18.70   16.33      1.14        1.15

5.7.5 Sparse DCT

Finally we add in the sparse DCT. With this final addition, we can compare the speedup produced by the DE methodology against the fastest available distribution of the Berkeley mpeg_play. The sparse DCT introduces additional complications, so we introduce new terms into our speedup model. First, we now need to consider the ratio of the execution time of the integer sparse DCT to the floating point DCT, in addition to the K term defined earlier. Furthermore, not all unenhanced sparse DCTs can be calculated as enhanced sparse DCTs, since to do the calculation in the enhanced sparse form we must have two sparse 64 element blocks sequentially.


A 64 element block with all zero terms except for the first indicates a region of a single color. The sparse DCT, therefore, only has to calculate what this value should be and then copy it to all 64 output elements. The main execution time is spent in writing the output data, not in doing arithmetic. Since the enhanced sparse DCT must write data for two 64 element blocks, it takes approximately twice as long to complete. So, although it can do two in parallel, since each sparse DCT takes twice as long, there is no speedup in this case. It is not that the DE version takes twice as long to execute, but rather that, since there is relatively little arithmetic to do in parallel, the benefit of doing it in parallel is not significant. Instead of introducing new parameters that would unnecessarily complicate the speedup model derived earlier, we simply count the number of times each procedure is executed and multiply by the average amount of time required to execute that procedure. Data is presented below comparing the speedup obtained with the DE enhancement against the fastest available version of mpeg_play, for no dither, ordered dither, and color dither.

No Dither

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.86  1.06  0.31   0.017    2.24    1.94      1.12        1.15
mjackson.mpg  1.00  1.13  0.37   0.023    9.35    8.32      1.13        1.12
sukhoi.mpg    1.00  0.88  0.41   0.026   14.92   12.25      1.22        1.22
bicycle.mpg   0.92  0.97  0.31   0.024    9.89    8.72      1.13        1.13
easter.mpg    0.83  0.61  0.42   0.039    3.27    2.44      1.31        1.34
flower.mpg    0.75  0.83  0.29   0.023    9.65    8.27      1.14        1.17
coaster.mpeg  0.98  0.79  0.33   0.013   12.07    9.94      1.20        1.21

Ordered Dither

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.86  1.07  0.22   0.012    3.17    2.94      1.08        1.08
mjackson.mpg  1.00  1.12  0.29   0.025   11.84   10.82      1.09        1.09
sukhoi.mpg    1.00  0.89  0.33   0.003   18.44   15.91      1.20        1.16
bicycle.mpg   0.92  0.97  0.24   0.018   12.76   11.58      1.10        1.10
easter.mpg    0.83  0.61  0.36   0.031    3.77    2.95      1.26        1.28
flower.mpg    0.75  0.83  0.23   0.019   12.52   11.18      1.10        1.12
coaster.mpeg  0.98  0.79  0.25   0.020   15.89   13.97      1.13        1.14

Color Dither

movie          F     K     D     V       Tbase    TDE    Calculated  Observed
                                                          Speedup     Speedup
hula_2.mpg    0.86  1.06  0.20   0.016    3.51    3.26      1.07        1.08
mjackson.mpg  1.00  1.12  0.27   0.015   12.99   11.95      1.09        1.09
sukhoi.mpg    1.00  0.89  0.31   0.018   19.86   17.19      1.16        1.16
bicycle.mpg   0.92  0.97  0.22   0.016   14.19   12.98      1.08        1.09
easter.mpg    0.83  0.61  0.35   0.037    4.00    3.14      1.24        1.27
flower.mpg    0.75  0.82  0.20   0.015   13.94   12.56      1.09        1.11
coaster.mpeg  0.98  0.79  0.23   0.004   17.58   15.66      1.14        1.12

5.7.6 Quality

Because the DE methodology uses less precision to calculate the IDCT, we can expect some degradation in image quality. However, because MPEG is a lossy compression standard, there is some loss of image quality simply from compressing the image. Observation of the decoded movie shows that the quality loss due to low precision arithmetic is not significant. Figure 5.7 illustrates this point in a more quantitative fashion: the PSNR (peak signal to noise ratio) of the base implementation is plotted against the percentage difference in PSNR between the base and DE implementations.

Figure 5.7: Percent difference in PSNR for Y, U, and V frame components versus base PSNR. Circles are used to mark components compressed using typical quantization values. [Scatter plot; x axis: PSNR (dB), 21 to 48; y axis: percent difference in PSNR (base vs. DE), 0.0 to 1.4.]

To generate this data, a ten frame movie sequence is compressed five different times using different quantization matrices. For each frame, different values for each of the three Y, U, and V components are calculated, so a total of 150 points are plotted. The circles represent data generated using typical quantization values; the points to the right represent data that is more accurate but less compressed. For each quantization matrix, the movie is decoded using both a base and a DE implementation, and the percent difference in PSNR between the two is plotted on the vertical axis. For the data points plotted as circles, which were compressed using typical values, there is almost no difference between the base and DE decompression. Even for the higher PSNRs on the right side of the figure, the percentage difference is still only 1.4%. We conclude that the image quality is not significantly affected by the DE methodology.

5.7.7 Analysis

We summarize the observed speedups in figure 5.8 for the ordered dither. The bars show the speedup of the DE mpeg code compared to mpeg_play with the same DCT, and compared to the fastest mpeg_play implementation; in both cases the DE code is the same. The relative speedup obtained is less against the better public release, since the base mpeg_play with the same DCT runs slower, as discussed above. The speedup can vary significantly depending on the movie. Looking at the speedup compared to the best public domain code, speedups anywhere in the range from 1.08 to 1.34 are observed, depending on the dithering algorithm. This range is primarily due to two experimental constants: D and K. In figure 5.9, we assign F and V constant values of 1 and .2, respectively; these are fairly typical of the values observed. We then graph the fraction of time spent in the DCT, D, along the x axis, and plot a series of lines for K, the ratio of execution time between the integer and floating point DCT. The box indicates the range of observed values for D and K: D varies from about .2 to .4, and K from 1.1 to .6. Both quantities vary depending on the movie data.

Figure 5.8: Speedup for ordered dither. [Bar chart of per-movie speedup (hula_2.mpg, mjackson.mpg, sukhoi.mpg, bicycle.mpg, easter.mpg, flower.mpg, coaster.mpeg), shown both against mpeg_play with the same DCT and against the fastest mpeg_play; y axis: speedup, 1.0 to 1.5.]

Figure 5.9: Speedup for V=0.2 and F=1. [x axis: D, the fraction of time spent in the DCT, 0 to 0.45; y axis: speedup, 0.9 to 1.5; one curve per value of K.]


This graph shows the importance of these factors in determining the final speedup. Easter, for example, has both a low K and a high D, resulting in exceptionally high speedups. This is because easter.mpg is composed primarily of zooming, so a larger percentage of time must be spent in the DCT, there being little motion compensation in the compression. The K term measures the change in execution time when switching from an integer to a floating point implementation. If, for example, the floating point execution took twice as long, there would be no speedup from doing two DCTs in parallel. For this reason, it is essential to have support for high performance floating point for the DE methodology. Because the PA-RISC architecture does some integer multiplication in software, the factor K is data dependent and varies with the movie.

5.7.8 Increased Parallelism

Figure 5.10: Speedup for different numbers of subwords. [x axis: D, the fraction of time spent in the DCT, 0 to 0.45; y axis: speedup; one curve per number of packed subwords S.]


In figure 5.10, we show the effect of varying S, the degree of parallelism. Until now, we have considered only the case where S=2, that is, two data words operated on simultaneously. To model increased parallelism, we must substitute S for 2 in equation 5.58. Furthermore, with more operands packed into a single floating point word, the cost of packing and unpacking also increases: V must be replaced by (S - 1) * V. This results in the new speedup equation:

$$ \mathrm{Speedup} = \frac{1}{1 + D \cdot F \cdot \left(\frac{K}{S} - 1\right) + (S - 1) \cdot V} \qquad (5.59) $$

This graph shows that an increase to a parallelism of 4 operands packed into a single data word would approximately double the speedup obtained. For this graph, V=.2 and F=1. Interestingly, if we were to try to pack 8 data words into a single floating point word (S=8), the increase in overhead would offset the increase in parallelism, and there would be no speedup over S=4 for D < 0.35. For D > 0.35, the S=8 case does do better than the S=4 case, but the improvement is not significant in the range plotted. This is interesting since, if we assume we need 11 bits per word, then 4 x 11 = 44 bits is achievable using today's IEEE floating point; minor hardware modifications, however, would be necessary to prevent the words from overflowing into each other. For S=8, though, we would need 8 x 11 = 88 bits, which is more than is available in a 64 bit word. It is therefore not necessary to go through the extra trouble of trying to achieve 8 way parallelism, since there is no significant benefit in going beyond 4.

5.8 Summary

In this chapter we presented a general technique to achieve fixed point subword parallelism using floating point hardware with only software modifications. We then applied the technique to an mpeg decoder application and achieved an average speedup of 13.8%. To use this technique, the algorithm it is applied to must be well understood, so that we can be certain the operands do not overflow into each other, and it must


be amenable to SIMD style parallelization. Furthermore, it must use only addition, subtraction, or multiplication of the packed operands by a single constant; multiplication of each part by a different constant is not supported. Finally, it requires a system that supports high speed IEEE floating point. This technique relies only on standard arithmetic primitives, so no machine specific instructions are required, and it is implementable completely in high level C source code. Furthermore, because the numbers are kept in floating point format, multiplication by arbitrarily large and small constants is supported; there is no need to convert fractional quantities to integers as would be required in integer arithmetic. The IDCT in the MPEG video compression standard fits all the requirements necessary to successfully utilize this speedup technique. After presenting measured execution time improvements on a Hewlett-Packard 9000/725/75 workstation, we presented a model that explains the wide range of speedups we observed. Finally, we used this model to illustrate the effects of increasing the parallelism by packing more than two subwords into a single floating point word.

Chapter 6

Hardware Modifications for DE Arithmetic

6.1 The Reasons for Hardware DE Modifications

In software DE arithmetic, numbers are "relatively fixed point." By this it is meant that the numbers are fixed precision only in relation to each other. Multiplying both by very large or very small constants is still possible. The decimal point location must be known to do packing and unpacking. No shifting or scaling is needed to adjust calculations, as there is when using integer arithmetic to do fixed point calculations. The great benefit of the software-only DE methodology is that it works on any architecture and requires no vendor specific instructions; the methodology is implemented completely in a high level language. The problem is that the precision and magnitude of the operands must be carefully understood in order to ensure that one subword does not overflow into the other. In this chapter we investigate three possible hardware modifications to alleviate these weaknesses. The drawback is that all of these techniques introduce hardware specific modifications to the floating point hardware: DE code relying on these modifications runs correctly only on machines that have implemented them.


6.1.1 Problems with Software DE Arithmetic

In the following sections we discuss the specific limitations of DE arithmetic and the motivations for adding hardware modifications. Throughout the discussion, L and R represent the left and right packed subwords, respectively; l and r represent the individual bits of L and R; and S represents the sign bits in the buffer space between L and R.

Addition and Subtraction

Growth on the left of L can cause loss of precision in R. The more significant problem is growth on the left of R, causing the numbers to overlap. This problem occurs in two cases. The first is when the numbers are of similar magnitude, but the length of L is longer than allowable for non-overlap with R. The second occurs when numbers of different magnitudes are added: when the two numbers are shifted for correct alignment, the subwords can end up misaligned. In software DE we made sure this was not a problem by ensuring that adds and subtracts were done only between numbers with similar exponents, and that overflow could not occur. Hardware can be added to guarantee that these conditions do not occur.

Multiplication

Because multiplication of an n bit number by an m bit number produces an n+m bit result, L tends to grow on the right with every multiply and eventually overflow into R. Therefore, an unlimited number of multiplications is not possible. In software DE we solved this problem by making sure that the buffer space between L and R was greater than the number of multiplies times the length of the multiplicand. In the DCT case, for example, we ensured the buffer was large enough for 4 multiplies by 8 bit constants: the buffer space had to be at least 32 bits. A calculation path that requires a large number of multiplies does not work using software DE.


6.1.2 Three Specific Implementations

In the following sections we present three different schemes for modifying the floating point hardware to correct some of the problems with DE arithmetic. The proposals are presented in order of increasing hardware complexity. The first is an additional hardware reset instruction. This requires the least additional hardware and does not interfere with the base multiplier design. The new instruction resets the bits in the buffer region between L and R after every multiply, so that L can no longer overflow into R and an unlimited chain of sequential multiplies becomes possible. The second scheme is 4 way DE arithmetic. This scheme modifies the base multiplier, so some impact on the performance of the base multiplier may result. The problem of L overflowing into R is solved by zeroing certain partial product terms to prevent the overlap. Because the large buffer region is no longer needed once the multiplier has been modified, four subwords can be packed into a single data word instead of the two in software DE arithmetic. Only multiplies of the 4 packed words by a single constant are allowed. The final scheme adds the most hardware complexity and allows 2 concurrent single precision calculations by reusing the existing double precision hardware. Fully independent variable times variable multiplies are allowed. This is useful for algorithms such as audio compression that require the full dynamic range of independent floating point.

6.2 Reset Instruction

The software DE methodology relies on a buffer zone between the two packed subwords. An instruction that resets this buffer zone to the proper value allows an unlimited sequence of multiplies. The previous conditions concerning length and magnitude for software DE arithmetic still apply; the reset instruction removes only the restrictions on long chains of multiplies. One reset technique can be done completely in software


and does not require any special hardware modifications. The reset instruction relies on the same rounding procedure as the unpack calculations. Suppose we have numbers packed such that L is 4 bits, R is 4 bits, and there is a 4 bit buffer zone. The buffer bits must be reset to the sign bit of R. This can be done algebraically with six adds:

llllxxxxrrrr
    |   |
    B   A

First, the number is rounded to position A with an add and a subtract; next, R is extracted by subtracting the rounded result from the whole number. Then the number is rounded to position B with an add and a subtract, and finally, this result is added back to R. Six adds and subtracts are required. Since the magnitude of the numbers must be known to do the rounding, multiplication by arbitrarily large and small numbers with respect to L and R is not allowed in conjunction with this approach.
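A C rendering of the six-operation sequence (our sketch; big_a and big_b are the rounding constants of section 5.5.2 that force rounds at positions A and B, and must match the current magnitude of the packed operand):

/* reset the buffer bits between L and R using six adds/subtracts */
double de_reset(double x, double big_a, double big_b)
{
    double xa = (x + big_a) - big_a;  /* round at A: drops R           */
    double r  = x - xa;               /* extract R                     */
    double xb = (xa + big_b) - big_b; /* round at B: clears the buffer */
    return xb + r;                    /* recombine with R              */
}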

Figure 6.1: A multiply reset instruction. [Bit-level diagram: the top line shows the packed word after L has grown to 7 bits; a constant is added, and the sign bit S of R is copied across the buffer zone; the bottom line shows the restored format of 4 bit subwords with a 4 bit buffer.]

A second approach is to copy the sign bit of R directly over the buffer space, resetting the buffer region with a bitwise logic operation. This relies on knowing the precise bit position of the packed number. Copying the bits directly into the buffer region, however, simply truncates L, producing a biased result; by adding a constant and then truncating, a type of round is performed. Figure 6.1 illustrates a hardware instruction implementation that resets the packed word to the desired format of 4 bit words with a 4 bit buffer space. The top line illustrates the


bit pattern of the packed subwords after L has grown to 7 bits; the next multiply would cause L to overflow into R. First the constant is added in preparation for the truncation, and then the sign bit of R, S, is copied across the buffer zone. The correct format is obtained; the lower line in the figure indicates the reset format. This operation can be implemented as a discrete DE reset instruction, or added as an extra stage to the multiplier for a DE multiply instruction. The latter increases the latency of the multiply instruction but maintains the same thruput. Adding a reset instruction has the advantage of leaving the multiplier unmodified: regardless of what type of multiplier is implemented, the reset instruction works in the same manner. The disadvantage is that relying on a reset capability requires the use of a buffer zone between the packed subwords. This makes higher degrees of parallelism more difficult to implement, since a number of bits must be reserved for the buffer space. In the next section, we look at modifying the multiplier itself.

6.3 4 way DE Arithmetic

This is an example of limited hardware modifications that eliminate only the worst aspects of DE arithmetic. The multiplier is modified so that an unlimited number of sequential multiplies can occur. Furthermore, since there is no need for the buffer zone, four subwords can be packed instead of only two. Except for multiplication, the same restrictions on length and magnitude for software DE arithmetic still apply. Since it is assumed that the length and magnitude restrictions guarantee overflow will not occur, overflow is simply ignored; if the subwords do overlap, incorrect results are obtained.

6.3.1 The Bit Alignment Problem

It is a difficult problem to know where to divide the subwords. When L and R are packed, they are positioned based on an assumed maximum number of integer bits for each; however, the numbers need not be the maximum length. Base floating point


hardware normalizes the packed quantity so that it is left aligned to the MSB of L; the size of L determines the position of the breakpoint. When L is equal to zero, the packed number is shifted all the way to the left, so that the MSB of R becomes the bit to which the packed quantity is normalized. The breakpoint is defined as the position where the carry chain is broken so that R does not overflow into L and the two independent subword calculations are performed correctly. In the software-only DE methodology there is no need to determine the exact position of the breakpoint, since only existing floating point operations are used. If the hardware, either the adder or the multiplier, is to be modified, the breakpoint must be precisely known. At first glance it seems that inspection of the exponent field should give some information about the position of the breakpoint. However, consider that multiplication of the packed quantity by 2^n arbitrarily changes the exponent value but leaves the breakpoint unchanged; the exponent really tells nothing about the breakpoint, and additional information is required to keep track of it. We ensure that the packed number has the same alignment regardless of the bits in L by adding a large constant, so that the alignment of the data is determined by normalizing to the constant, not to the data. For example, if L and R are at most 4 bits, we want a 4 bit buffer, and we have a 16 bit multiplier, we will use a packing constant of 2^8:

X = 2^8 * L + R
X = llll0000rrrr.0000

Now, if L is really only 2 bits, X is normalized as such:

ll0000rrrr.000000

The alignment is different from what we expected. We fix this simply by adding the constant 2^12. Now we always know the MSB, and the alignment, of X. The two cases above are now normalized as:

1llll0000rrrr.0000
100ll0000rrrr.0000


The number is now normalized to the leading bit, 2^12. However, there are still some problems. 1) The constant must have the same sign as L: if L is greater than or equal to zero, the constant equals +2^12; if L is less than zero, the constant equals -2^12. This is because L is encoded as a signed magnitude number. An extra conditional branch is therefore needed when doing the packing and unpacking, though it is possible to pipeline or combine the test and add into a single new pack or unpack instruction. 2) Packing now requires an additional add, and unpacking an additional subtract, to apply the new constant required for normalization. 3) One bit of precision is sacrificed to ensure correct alignment. 4) The adder must be modified to account for the leading one. This can be done simply by forcing the leading sum bit to always equal one and ignoring the carry out; it can also be accounted for arithmetically, subtracting the constant after an effective addition and adding it after an effective subtraction. The DE software restriction on the maximum length of L is still in effect, since if L overflows into the leading one an incorrect result will occur. In fact, adding the leading one could actually speed the addition algorithm, since we are now guaranteed that no post-normalization is required. The multiplier would also have to be modified so that the leading one is removed from partial product generation. This technique has the advantage of maintaining constant bit positioning without significant hardware modifications: it relies on utilizing arithmetic properties as much as possible to force the correct positioning, and no new instructions are required. However, it requires an additional test and add step in packing, and a test and subtract step in unpacking.

Alternative Approaches

Another option is to turn off normalization for all DE adds, subtracts, and multiplies. This solution is significant in terms of hardware modifications, since the base adder and multiplier are designed to expect normalized operands. In this design approach,

we have the ability to align the packed word to a known position and then turn off all normalization hardware so that the bit positioning is maintained. With normalization turned off, multiplications by large or small constants are still possible, but the breakpoint is always known. There are only three cases where normalization can cause a shift: a carry out after a multiply or an add can cause a 1 bit right shift, and a cancellation after a subtract can cause a left shift. These 3 cases are turned off so that shifting cannot occur. A carry out in the case of a multiply or add then causes an overflow, and incorrect results ensue. This forces the same requirement as the software DE methodology: the maximum number of bits to the left of the decimal point must be known. If normalization is turned off, a new instruction must be added to allow the floating point number to be shifted to a known exponent value. This is necessary in the case where L is first packed with fewer than the maximum allowable number of bits to the left of the decimal point; the packed number must then be manually shifted to the right to allow for extra growth of L to the left. As with software DE, this requires that the maximum magnitude of L be known in advance. Furthermore, a new instruction to renormalize the packed number before it is unpacked has to be added; this can be implemented simply as an addition of zero followed by a standard normalization. The disadvantage of this technique is that we are introducing a new floating point representation, since we have to ignore the implicit one in the IEEE format. Depending on the base hardware implementation, it might be difficult to use a format that does not assume an implicit one. Furthermore, there would still be problems with overflow, in that the maximum number of bits to the left of the decimal point must be known in advance. However, turning off normalization allows for a constant breakpoint, with which both the multiplier and the adder can be modified to remedy some of the limits of DE arithmetic, and multiplication by large or small constants can still be calculated correctly. In general the former approach is simpler, but specific implementations may lend themselves better to the latter. Both bit alignment strategies described so far rely on aligning the position of the data to fit the breakpoint in the hardware. It is also possible to allow the data to


normalize normally and instead shift the hardware breakpoint to fit the data.

Figure 6.2: Adder modifications for variable breakpoint. [A 4 bit adder slice with inputs A3-A0 and B3-B0, sum outputs S3-S0, and carry out; a decoder driven by a breakpoint counter selects the bit position at which the carry chain is broken.]

Extra information must be maintained so that the breakpoint location can be determined at every calculation. Figure 6.2 shows how muxes could be inserted between every bit in an adder to allow the carry chain to be broken anywhere; control logic based on a new counter field determines where the chain is broken. When the numbers are first packed, the breakpoint can be determined completely from the exponent. Only when a normalization occurs are the numbers shifted so that the breakpoint can no longer be determined: a single right normalization shift can occur after an add or multiply, or a large left normalization shift can occur after a subtraction. By keeping a counter that decrements on a right shift and increments by the shift amount on a left shift, we obtain the extra information needed to determine where the breakpoint is. The breakpoint counter bits would have to be added to the floating point number format, so that each packed number carries its own breakpoint counter field. This approach is cumbersome because of the extra hardware needed to allow the breakpoint to shift with the data: as many muxes as there are bits in the adder would have to be added to the carry propagate path, which would slow the adder's critical path. This does not seem to be a promising technique and is included only for completeness.
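A bit-level C model of the breakable carry chain in figure 6.2 (ours; a 4 bit ripple adder in which the mux at the breakpoint forces the carry-in to zero):

#include <stdint.h>

/* add two 4 bit values; if 0 < breakpoint < 4, the carry into bit
   `breakpoint` is forced to zero, splitting the adder into two
   independent fields; the carry out of bit 3 is discarded */
uint8_t add_with_breakpoint(uint8_t a, uint8_t b, int breakpoint)
{
    uint8_t sum = 0;
    int carry = 0;
    for (int i = 0; i < 4; i++) {
        if (i == breakpoint)
            carry = 0;                    /* the mux breaks the chain */
        int s = ((a >> i) & 1) + ((b >> i) & 1) + carry;
        sum |= (uint8_t)((s & 1) << i);
        carry = s >> 1;
    }
    return sum;
}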


6.3.2 Dividing the Multiplier

Multiplication is the main limitation of DE arithmetic. Even if software-only DE addition is acceptable, software-only DE multiplication allows only a limited number of multiplies in the calculation path. This is because a multiplication of an n bit number by an m bit number produces an n+m bit result: when a packed number is multiplied by a k bit constant, both L and R grow by k bits. R is rounded back to the correct number of bits by the floating point hardware, but L grows across the buffer zone and eventually corrupts R. In this section the strategies for dividing a direct multiplier and a Booth multiplier are discussed, and two specific implementations are then presented. The first multiplier is modified to allow 4 way DE arithmetic: product terms are zeroed to allow increased parallelism and to remove the limit on the number of sequential multiplies, extending the usefulness of software DE arithmetic. Next, a double precision multiplier is divided to allow two concurrent single precision calculations. This is useful for algorithms such as audio compression that require variable times variable floating point multiplies.

Dividing a Direct Multiplier

In this section we look at techniques for dividing a direct multiplier to produce a subword multiply. The breakpoint is fixed by using the methodology described in section 6.3.1 to add a large alignment constant.

Dividing Vertically

Assume that A is composed of two packed subwords, both of which are to be multiplied by the constant K. It is possible to divide the multiplier vertically to perform the two multiplies, as shown in figure 6.3. The problem occurs for sum terms S_7 thru S_3, where the two products overlap. This can be solved by cancelling the terms of the left product where they overlap on the right. This reduces the length of the left result, but that is acceptable since we are interested only in the higher order bits: as with typical floating point multipliers, the product has the same number of bits as the multiplicands.


Figure 6.3: A vertically divided multiplier.

Figure 6.4: A modified multiplier.

Since we assume that the right word is complement encoded, the partial products for the right subword must be sign extended so that the sign calculation is correct. Finally, the carry chain must be broken between sum terms S7 and S8 so that the carry out from the right calculation does not overflow into the left. The desired partial product layout is shown in figure 6.4. This is implemented simply by allowing each partial product input to the summation tree to be selected between the base partial product term and the new partial product term specified for a subword multiply. Additional delay is caused by the additional multiplexor in the partial product generation logic. Furthermore, there is a penalty in both time and area to perform the necessary sign extension, since the sign must be propagated from the right.

If the lower order bits of the left calculation are simply truncated, a biased product will result. To correct for this, a constant bias is added to the calculation. This bias is equal to one-half of the maximum value of the truncated bits. It is added by forcing carry in terms into the left calculation where the carry chain has been broken. This is described in detail for a specific implementation in section 6.3.3.

This technique allows parallel subword multiplies to be carried out with no need for a buffer zone between the subwords. With this constraint gone, denser packing of subwords is possible and greater than two way parallelism is possible. A disadvantage is that the order of the multiplicands must be carefully controlled: the calculation would not be the same if A and K were switched. This was not required for the software only DE methodology.

It is also possible to maintain full precision for each calculation by duplicating the final carry propagate adder (CPA). In this case, the partial product tree would be divided along the line indicated in figure 6.3 so that the two separate sums would be routed to two separate CPAs. By duplicating the rounding logic as well, two symmetrically rounded results would be obtained. Alternatively, area could be saved by using the same CPA in two pipelined steps to do the two separate calculations. This produces the same results but requires only one CPA. It would, however, add an extra pipeline stage to the multiply and slow down the calculation.
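The bias correction described above can be illustrated numerically. In the sketch below (our own, with arbitrary example values), the low T bits of each product are truncated, which biases the result low by roughly half the maximum value of the truncated bits; adding that half-maximum as a constant re-centers the error near zero.

```c
#include <stdio.h>
#include <stdint.h>

#define T 8                      /* number of truncated low-order bits */

int main(void)
{
    int64_t bias = ((1LL << T) - 1) / 2;     /* half the max truncated value */
    double err_trunc = 0, err_biased = 0;
    for (int64_t a = 1; a <= 1000; a++) {
        int64_t p  = a * 12345;              /* "full" product             */
        int64_t pt = p & ~((1LL << T) - 1);  /* product with bits dropped  */
        err_trunc  += (double)(pt - p);
        err_biased += (double)(pt + bias - p);
    }
    printf("mean error, truncated: %.1f\n", err_trunc / 1000);  /* ~ -127.5 */
    printf("mean error, biased:    %.1f\n", err_biased / 1000); /* ~ -0.5, i.e.
                                      centered (0.5 remains because the ideal
                                      bias 127.5 is floored to an integer)   */
    return 0;
}
```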




Figure 6.5: A horizontally divided multiplier.

Figure 6.6: A horizontally divided multiplier with zeroed partial product terms.


Figure 6.7: A multiplier modified to perform two independent multiplies.

Dividing Horizontally

The complement to dividing the multiplier vertically is to divide the calculation horizontally. In this case, we assume that K is composed of two subwords that are each to be multiplied by the constant A. This time we assume each subword is sign and magnitude encoded, so that no sign extension is required. The calculation is divided as shown in figure 6.5. The top and bottom calculations can be routed to separate CPAs as described above to maintain full precision. Rounding and pipelining, as described above, apply equally to the horizontally divided multiplier. Figure 6.6 shows how zero terms are forced into the partial product array to allow the calculation to be performed using a single CPA. A constant is added to the left product by forcing carry in terms to correct for the biased product that would result from simple truncation. This is described in detail for a specific implementation in section 6.3.3.

Dividing in Both Directions

Finally, we consider dividing the multiplier in both directions. This produces two fully independent 3 bit by 3 bit multiplies. Figure 6.7 shows where zero terms must be added to prevent the calculations from interfering. The rounding methodology must be modified from the base technique, or else the right calculation would be almost completely rounded off while the left calculation would not be rounded at all. This is shown in figure 6.8. The top line of the figure represents the product term generated at the bottom of figure 6.7. In a base multiplier, the product is rounded at S6 as indicated, so that the product has the same number of bits as the multiplicands.


Figure 6.8: Modification of rounding to divide a multiplier in both directions. Prime notation is used to indicate where rounding is performed.

If this rounding were left unchanged, bits S5 through S3, the significant bits of the right subword, would be rounded off. To correct this, rounding is performed in DE mode as shown at the bottom of the figure. The rounding for the left subword is done at S9, and the rounding for the right subword is done at S3. Now, however, bits S5 through S3 are shifted to the left as shown in the figure so that the subwords are correctly positioned in the final product. The advantage of dividing the multiplier in both directions is the capability of doing variable x variable multiplies: the software DE methodology allowed only two packed variables to be multiplied by the same constant.

Dividing a Booth Multiplier

Booth multiplication [Boo51] is a technique to generate partial products in a more efficient format. It is beyond the scope of this thesis to describe Booth encoding in great detail; the reader is directed to [WF82] for a more complete treatment of the subject. Essentially, Booth divides the multiplier into 3 bit groups and encodes two of these bits into a row of partial product terms; there is one bit of overlap between two groups of three bits. This results in a partial product tree approximately half the height of a direct multiplier's. Based on the three bit group, 0, 1, or 2 times the multiplicand is added or subtracted in the partial product array. Techniques to divide the Booth multiplier, similar to those used for the direct multiplier above, are presented next.

Dividing Horizontally

Figure 6.9 shows how a Booth multiplication is divided horizontally for an 8 bit multiply divided into a multiplication of a constant by two subwords.


Figure 6.9: Booth encoding example (with complement formation not shown).

The X's represent the multiplier that is Booth encoded. The three bit groups are indicated in the figure. The X term consists of two independent 4 bit subwords divided between bits X4 and X3. The partial product array is shown in the same figure, where each row corresponds to a numbered group of three bits. The calculation is split between partial product words 2 and 3, as indicated by the horizontal line in figure 6.9. This is similar to dividing a direct multiplier as described above. The CPA would have to be duplicated as described in section 6.3.2.

Yi+1  Yi  Yi-1   Action
 0    0    0      +0
 0    0    1      +X
 0    1    0      +X
 0    1    1      +2X
 1    0    0      -2X
 1    0    1      -X
 1    1    0      -X
 1    1    1      -0

Table 6.1: Truth table for typical Booth decoding, where X represents the multiplicand.

The problem is in decoding group of bits number 3. Part of this group lies in one subword and part lies in the other. The solution is to modify the decode logic for this one group of bits so that the left subword always sees a zero in the X3 position.
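The recoding in table 6.1 is small enough to show directly in code. The sketch below (our illustration, with hypothetical function names) applies the table to an 8 bit two's complement multiplier y, one overlapping 3 bit group per partial product row, and sums the selected multiples.

```c
#include <stdint.h>
#include <stdio.h>

/* Table 6.1: each overlapping group (y[i+1], y[i], y[i-1]) selects a
 * multiple of the multiplicand X in {-2X, -X, 0, +X, +2X}. */
int booth_select(int y_ip1, int y_i, int y_im1)
{
    switch ((y_ip1 << 2) | (y_i << 1) | y_im1) {
        case 0: return  0;  /* 000: +0  */
        case 1: return  1;  /* 001: +X  */
        case 2: return  1;  /* 010: +X  */
        case 3: return  2;  /* 011: +2X */
        case 4: return -2;  /* 100: -2X */
        case 5: return -1;  /* 101: -X  */
        case 6: return -1;  /* 110: -X  */
        default: return 0;  /* 111: -0  */
    }
}

/* Radix-4 Booth multiply; y is interpreted as 8-bit two's complement. */
int booth_multiply(int x, int8_t y)
{
    int bits = (uint8_t)y;
    int prod = 0;
    for (int g = 0; g < 4; g++) {    /* four groups for an 8-bit multiplier */
        int ym1 = g ? (bits >> (2 * g - 1)) & 1 : 0;  /* overlap bit; y[-1]=0 */
        int yi  = (bits >> (2 * g)) & 1;
        int yp1 = (bits >> (2 * g + 1)) & 1;
        prod += booth_select(yp1, yi, ym1) * x * (1 << (2 * g));
    }
    return prod;
}

int main(void)
{
    printf("%d\n", booth_multiply(37, 6));    /* prints 222  */
    printf("%d\n", booth_multiply(37, -5));   /* prints -185 */
    return 0;
}
```

Note how only half as many rows are generated as in a direct multiplier, which is the motivation given in the text.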


Yi+1  Yi  Yi-1   Action
 0    0    0      +0
 0    0    1      +0
 0    1    0      +X
 0    1    1      +X
 1    0    0      -2X
 1    0    1      -2X
 1    1    0      -X
 1    1    1      -X

Table 6.2: Truth table for modified Booth decoding, where X represents the multiplicand. This logic is for group 3 of figure 6.9 when the multiplication is divided into two subwords between bits X4 and X3. Yi-1 is treated as if it is always zero.

Yi+1  Yi  Yi-1   Action
 0    0    0      +0
 0    0    1      +X
 0    1    0      -X
 0    1    1      -0
 1    0    0      +0
 1    0    1      +X
 1    1    0      -X
 1    1    1      -0

Table 6.3: Truth table for modified Booth decoding, where X represents the multiplicand. This logic is for group 2 of figure 6.9 when the multiplication is divided into two subwords between bits X3 and X2. Yi+1 is treated as if it is identical to Yi.


Figure 6.10: A Booth multiplier modified to perform two independent multiplies.

Table 6.1 shows what action is taken to Booth encode in the base case, and table 6.2 shows how the table is modified in the case of group 3. If the breakpoint were instead between bits X3 and X2, then group 2 would have to be modified so that the right subword always sees bit X2 in the X3 position, so that the sign is correctly extended. For this case, the truth table for group 2 is given in table 6.3. This assumes that the right subword is complement encoded.

Dividing in Both Directions

Similar to the direct multiplier, it is possible to divide a Booth multiplier to perform two independent variable x variable multiplies. A divided Booth multiplier is shown in figure 6.10. As before, the X term is divided between bits X4 and X3. But now the multiplicand is also composed of two independent subwords divided between bits 4 and 3, and zero terms are substituted in the partial product array to account for this. The Booth algorithm requires that the complement of the multiplicand be formed in cases where an action of -X or -2X is taken. For this reason, care is taken to ensure that complement formation is performed correctly for each subword independently. The duplication of A3 in figure 6.10 ensures that the complement is formed correctly for the right subword; the construction of the base Booth multiplier ensures that this sign extension is already handled correctly for the left subword. Furthermore, to form the correct complement in the left calculation, it must be possible to add a one (as a carry in) at each of positions C4 and D4. This is because complement formation requires complementing the bits individually and incrementing the result. A full description of the need for complement formation in Booth multiplication is given in [WF82].


6.3.3 Implementation

We now present a specific implementation of 4 way DE arithmetic. Only the multiplier is modified, to allow for increased parallelism and to prevent L from overflowing into R. As in software DE arithmetic, only multiplication of all 4 subwords by a single constant is allowed. Breakpoint alignment is guaranteed by adding a constant when packing, as described in section 6.3.1. 53 bits of mantissa are available in a double precision word, which leaves 13 bits per subword plus 1 bit extra. This is enough for 1 sign bit, 11 integer bits, and 1 guard bit per subword. The leftmost number, however, does not need a sign bit, since its sign is encoded with the sign bit of the entire floating point word. This means there are 2 extra bits, which we arbitrarily assigned to the rightmost word. The floating point word, then, is broken into 4 subwords of 12, 13, 13, and 15 bits each.

The multiplier is modified as shown in figure 6.11. Three triangular regions of modified partial product terms are now required instead of just the one shown in figure 6.4. The dots represent the unmodified partial product terms, and the zeros represent partial product terms that can be eliminated. The stars indicate bits that are sign extended from the subword partial product to the right. For clarity, the upper left corner is shown with a full array of sign bits; in an actual implementation, the value of this region is calculated in advance and only the lower order bits are added when sign extension is required. As in section 6.3.2, a constant is added to offset the biased product that would result from truncating the lower order bits. This bias is equal to one-half of the value of the truncated bits and is added by forcing bits into the carry in terms.

Figure 6.11: Modification for 4 way DE arithmetic. Dots represent unmodified partial product terms, zeros are zeroed terms, and stars are sign extended terms from the partial product term on the right.

Packing and Unpacking

Packing is accomplished using the same arithmetic instructions used in DE software arithmetic. Now, instead of one multiply and one add, three multiplies and three adds must be used to pack four subwords, and three different packing constants are required. Assume the subwords are named from the left L1, L2, L3, and R, and the packed number is X; then

\[ X = L_1 \cdot 2^{C_1} + L_2 \cdot 2^{C_2} + L_3 \cdot 2^{C_3} + R \]

where C1 = 41, C2 = 28, and C3 = 15. Unpacking for two subwords was done with one round, one multiply, and one subtract; using the same methodology serially, unpacking four subwords requires three rounds, three multiplies, and three subtracts. Assume points A, B, and C are the bit positions dividing words L1 and L2, L2 and L3, and L3 and R, respectively, and let round(P) denote X rounded at bit position P. Then we can recover the packed subwords as follows:

\[ L_1 = \mathrm{round}(A) \cdot 2^{-C_1} \]
\[ L_2 = (\mathrm{round}(B) - \mathrm{round}(A)) \cdot 2^{-C_2} \]
\[ L_3 = (\mathrm{round}(C) - \mathrm{round}(B)) \cdot 2^{-C_3} \]
\[ R = X - \mathrm{round}(C) \]
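As a concrete illustration of these formulas, the sketch below simulates the 4 way packing and unpacking with ordinary IEEE doubles. It is an illustration only: the actual software DE technique performs the rounding with the floating point hardware itself (for example by adding a large alignment constant), whereas here rint() is applied to an explicitly scaled value, and the function names are our own.

```c
#include <math.h>
#include <stdio.h>

#define C1 41
#define C2 28
#define C3 15

/* Pack with three multiplies and three adds, as described above. */
static double pack4(double l1, double l2, double l3, double r)
{
    return l1 * exp2(C1) + l2 * exp2(C2) + l3 * exp2(C3) + r;
}

/* round(P): round x at bit position p, keeping bits of weight >= 2^p. */
static double round_at(double x, int p)
{
    return rint(x * exp2(-p)) * exp2(p);
}

static void unpack4(double x, double out[4])
{
    double a = round_at(x, C1), b = round_at(x, C2), c = round_at(x, C3);
    out[0] = a * exp2(-C1);          /* L1 */
    out[1] = (b - a) * exp2(-C2);    /* L2 */
    out[2] = (c - b) * exp2(-C3);    /* L3 */
    out[3] = x - c;                  /* R  */
}

int main(void)
{
    double y[4];
    double x = pack4(3, -7, 11, 2);
    unpack4(x * 5.0, y);             /* one multiply scales all four subwords */
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);   /* 15 -35 55 10 */
    return 0;
}
```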

Overflow Detection

The same conditions on length and magnitude as for software DE are required to ensure that there is no overflow. Overflow could be detected for the complement encoded subwords (the three right ones) when the carry out bit differs from the sign bit. Overflow could also be detected for the leftmost word, which is sign and magnitude encoded, whenever there is a carry out bit. We assume that the length and magnitude constraints guarantee overflow does not occur, so overflow is ignored.

Accuracy

The worst case error occurs when all the truncated bits would have been equal to one. The bits that are truncated, or arbitrarily set to zero, are indicated for the first subword in the trapezoidal region of zeros in figure 6.12. Let S be the number of bits in the subword, in this case S = 12, and let k be the number of bits in the constant term, in this case k = 53. In the case of the DCT, the least significant calculation column has a weight of 2^{-1}; the weight of the first truncated column, then, is 2^{-2}. We sum the maximum error by column: each column term is calculated by multiplying the column weight by the number of partial product terms in the column, and the columns in the trapezoidal region are summed from left to right.

\[ \text{Max Error} = 2^{-1}\big( (S)(2^{-1}) + (S)(2^{-2}) + (S)(2^{-3}) + \cdots + (S)(2^{-(k-S+1)}) + (S-1)(2^{-(k-S+2)}) + \cdots + (1)(2^{-k}) \big) \]

\[ \text{Max Error} = 2^{-1}\left( S \sum_{i=1}^{k-S+1} 2^{-i} \;+\; \sum_{i=k-S+2}^{k} (k+1-i)\,2^{-i} \right) \]
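As a quick numerical check of this bound (using our reconstruction of the summation limits above), the following program evaluates the sum for S = 12 and k = 53:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    int S = 12, k = 53;
    double err = 0.0;
    for (int i = 1; i <= k - S + 1; i++)
        err += S * pow(2.0, -i);               /* full-height columns */
    for (int i = k - S + 2; i <= k; i++)
        err += (k + 1 - i) * pow(2.0, -i);     /* tapering columns    */
    err *= 0.5;                                /* leading 2^-1 factor */
    printf("max error = %.2f\n", err);         /* prints 6.00 */
    return 0;
}
```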

Substituting for k and S, we find that the maximum error equals 6.00 for the DCT calculation. Adding one-half of the maximum error, or 3.00, as a bias causes the error to range between -3.00 and +3.00. This is shown symbolically below. Let P_full equal the entire product value, let P_tri be the triangular region on the left indicated by the solid dots in figure 6.12, and let P_trap be the trapezoidal region indicated by the zeros in the figure. Then

\[ P_{full} = P_{tri} + P_{trap} = \begin{cases} P_{tri} + 0 & \text{minimum error} \\ P_{tri} + 6 & \text{maximum error} \end{cases} \]

\[ P_{tri} = \begin{cases} P_{full} - 0 & \text{minimum error} \\ P_{full} - 6 & \text{maximum error} \end{cases} \]

Adding the bias constant to both sides:

\[ P_{tri} + 3 = \begin{cases} P_{full} + 3 & \text{minimum error} \\ P_{full} - 3 & \text{maximum error} \end{cases} \]

Thus, we see that when the bias value is added to P_tri, the resulting error is centered about zero. The bias is added by forcing carry in terms into the left triangular region of figure 6.12. The rightmost column in this region has a weight of 2^{-1}, so 6 carry in terms are forced always to one in this column, giving a bias value of 6 \cdot 2^{-1} = 3. Notice that the column has a height of 12 terms, so exactly half the terms in this column have forced carries.

Figure 6.12: Error due to truncation for the first subword.

Effect on Area and Latency

Modifications to the multiplier add one multiplexor delay to the partial product generation tree. Additionally, there is a penalty for the sign extension logic, since the sign needs to be propagated from the right. If the signal is forwarded directly, then an area penalty is paid, since additional wires are added for the forwarding; if the sign is propagated, then a penalty is paid in time. Furthermore, there is an additional penalty for the extra partial product terms that are added to the upper left part of the partial product tree.

6.4 Double Floating Point

In this section we look at the option of adding maximum hardware enhancements to provide full flexibility for DE arithmetic. That is, we allow two floating point calculations to occur on packed words in parallel. This is useful, for example, in audio compression algorithms where independent variable x variable multiplies are required. In DE arithmetic only multiplication of two packed subwords by a single constant is allowed; in this case, two similarly packed words are multiplied together, with the two left subwords and the two right subwords independently multiplied. If multiplication of both subwords by the same constant is desired, then the constant must be formed in advance by packing the same value into both the left and right subwords.

In order to conform to existing standards, we choose to implement the packing of two single precision floating point words in the existing floating point double precision hardware. Each subword calculation will have full rounding and overflow detection as prescribed by the IEEE standard, and each will have an independent exponent field.

Addition

We begin by describing the steps required in a base addition, and then discuss what changes must be made to accommodate DE arithmetic. A block diagram is shown in figure 6.13. These are the steps required for an IEEE compliant addition or subtraction [Fly93]:

1) Subtract the exponents to compute the absolute difference.
2) Swap the mantissas so that the smaller operand is on the left.
3) Shift the smaller operand to the right by the amount calculated in step 1 to align the mantissas.
4) Add the aligned mantissas.
5) Normalize the result. This could require a right shift by one place, or a left shift by up to 53 places.
6) Round the result using the guard, round, and sticky bits.
7) Renormalize the result if necessary. This step requires only a one bit right shift.
8) Compute the exponent adjustment and adjust with an addition or subtraction to the original exponent.
9) Detect exceptions.
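The following much-simplified sketch (our illustration, not the dissertation's hardware and not IEEE-complete) walks the mantissa data path of steps 1 through 5 for two positive, normalized operands; rounding, sticky logic, and exceptions are omitted. The Fp struct and field widths are assumptions made for the sketch.

```c
#include <stdint.h>

#define W 53   /* mantissa width, including the implicit leading one */

/* value represented = mant * 2^(exp - (W-1)), with 2^(W-1) <= mant < 2^W */
typedef struct { int exp; uint64_t mant; } Fp;

Fp fp_add(Fp a, Fp b)
{
    if (a.exp < b.exp) { Fp t = a; a = b; b = t; }     /* steps 1-2: compare, swap     */
    int d = a.exp - b.exp;                             /* exponent difference          */
    uint64_t m = a.mant + (d < 64 ? b.mant >> d : 0);  /* step 3: align; step 4: add   */
    int e = a.exp;
    if (m >> W) { m >>= 1; e++; }                      /* step 5: one-place right norm */
    /* steps 6-7 (round/renormalize) and 9 (exceptions) omitted; the step 8
     * exponent update is folded into e above */
    return (Fp){ e, m };
}
```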


Figure 6.13: Base adder (after Waser and Flynn).


We now discuss what changes must be made in each step to accommodate two floating point numbers in parallel. This split mantissa data path is shown in figure 6.14. When DE select is low, both paths have the same control, so that the adder behaves just as in the base case; setting DE select high allows independent control of the left and right subwords.

1) A second 8 bit exponent subtracter unit must be added so that the absolute difference can be calculated for both subwords simultaneously. Alternatively, the 11 bit double precision exponent subtracter can be extended by 5 bits and then divided in half by breaking the carry chain to do the required calculations.

2) The multiplexors must be split in half to allow the swapping of each subword independently. This is accomplished with relatively minor hardware modifications.

3) The shifter must be broken in half to allow each subword to be calculated independently. Sticky bit logic must be duplicated for the left subword, so that bits shifted out of the add calculation are not shifted into the right subword but are accumulated in the new sticky bit logic. Splitting the shifter is simple, since it requires only adding a second control signal, but duplicating the sticky bit logic requires additional area, since a new series of or gates is needed in the middle of the double precision word where it previously did not exist. The truth table for the DE right shifter is given in table C.1 in appendix C. The guard and round bits are included in the addition for a total of 55 bits. The shifter is similar to a standard shifter, but now a DE mode input is added. When DE mode is low, the shifter behaves as before; when DE mode is high, the left and right words are shifted independently, and each has its own guard and round bits. The sticky bit is needed for certain rounding modes: it detects whether any ones have been shifted to the right of the guard bit. The truth table for the new sticky bit logic is shown in tables C.2 and C.3 in appendix C. Again, a DE mode signal is added to the standard sticky bit logic, and when DE mode is high, an additional sticky bit calculation is made for the left subword.

4) The original double precision adder must accommodate 1 implicit one bit, 52 mantissa bits, a guard bit, and a round bit for a total of 55 bits. The single precision mantissa is 1 implicit one bit, 23 mantissa bits, a guard bit, and a round bit for 26 total bits. This means two single precision words fit into the double precision mantissa adder with no extension required. The carry path must be broken as before.

5) The normalization step requires splitting the shifter so that the two subwords can be shifted independently. Separate control signals are required for each subword.

6) The round logic must be duplicated for each subword.

7) Again, the shifter must be split in half to allow each subword to be independently shifted.

8) The exponent arithmetic logic must be duplicated so that each exponent can be calculated and adjusted independently.

9) The exception detection logic must be duplicated to handle each subword independently.

In summary, shifters, multiplexors, and adders, which use bitwise logic and are easily broken into smaller subunits, are divided in two to allow two smaller mantissas to fit in parallel. The exponent arithmetic, which does not have enough bits to fit both exponents, and control logic, such as rounding and exception handling, must be duplicated. Specifically, two new exponent arithmetic units are added in steps 1 and 8, and sticky bit logic, rounding logic, and exception handling are duplicated for steps 3, 6, and 9. Three shifters must be split in steps 3, 5, and 7. A multiplexor is split in step 2, and the adder is split in step 4.

An additional consideration concerns possible adder optimizations. Certain optimization techniques rely on the mutual exclusivity of combinations of shifts to speed the worst case path [QF90]. Since two calculations are now done in parallel, this mutual exclusivity can no longer be guaranteed and these optimizations are no longer applicable.

Figure 6.14: Modified mantissa portion of adder.

Multiplication

The steps for floating point multiplication are:

1) Add the two exponents.
2) Multiply the mantissas.
3) Normalize the result.
4) Round the normalized result.
5) Adjust the exponent as necessary.
6) Handle any exceptions.


Modifications in this case are very similar to those for addition. For steps 1 and 5 the exponent arithmetic hardware must be duplicated. The rounding and exception logic must be duplicated for steps 4 and 6. The shifter is split for step 3, and the multiplier is divided as described in section 6.3.2 to produce two independent multiplies. A final shift is required to realign the mantissas, since the right subword ends up pushed to the right as shown in figures 6.7 and 6.8.

Packing and Unpacking

The packed subword format is defined simply as two 32 bit single precision floating point words concatenated together. This choice means that no new instructions are required to do packing and unpacking. However, hardware must be added to the input and output of the arithmetic unit to handle this packed format. This additional hardware would consist only of wires used to route the correct exponent, sign, and mantissa fields to the correct functional units.
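Since the format is just the two single precision bit patterns side by side, the packing is pure wiring. A sketch of the equivalent software view (function names are our own):

```c
#include <stdint.h>
#include <string.h>

/* Concatenate two IEEE single precision words into one 64-bit value,
 * left subword in the high half. */
uint64_t pack2(float left, float right)
{
    uint32_t l, r;
    memcpy(&l, &left, 4);            /* raw single-precision bit patterns */
    memcpy(&r, &right, 4);
    return ((uint64_t)l << 32) | r;
}

void unpack2(uint64_t x, float *left, float *right)
{
    uint32_t l = (uint32_t)(x >> 32), r = (uint32_t)x;
    memcpy(left, &l, 4);
    memcpy(right, &r, 4);
}
```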

Accuracy and Overflow

Since IEEE floating point rounding and exception handling is duplicated for each subword, accuracy and overflow are handled as expected for single precision arithmetic.

Effect on Area and Latency

Hardware modifications to the adder consist of splitting bitwise logic, such as shifters and adders, and duplicating exponent arithmetic, rounding, sticky bit, and exception logic. This has only a small impact on cycle time. Modifications to the multiplier include duplicating the exponent arithmetic and muxing zero terms into the partial product array; the only impact on latency here is an extra multiplexor in the partial product generation tree. There is no need to perform sign extension as when multiplying by a single large constant (as in sections 6.3.2 and 6.3.3), so there is less area penalty than for dividing the multipliers as described earlier. However, only two way parallelism is achieved instead of the four way parallelism described earlier.

6.5 Summary

In this chapter we have presented hardware modifications that remove some of the restrictions of software DE arithmetic. Three specific implementations are described. First, a reset instruction removes the constraints on chains of multiplies without modifying any of the arithmetic hardware; the other restrictions on length and magnitude, however, still apply. Second, a modified multiplier, in addition to removing the restrictions on chains of multiplies, allows higher parallelism than software DE: 4 way parallelism rather than the 2 way parallelism possible with software only DE. Again, the software restrictions on length and magnitude still apply, and some time and area penalty is incurred in the multiplier modifications. Finally, a scheme to reuse double precision floating point hardware to perform two concurrent single precision floating point calculations is described. In addition to removing the constraints on chains of multiplies, the restrictions on length and magnitude are also removed. The area penalty in this case is less than before, since there is no need to add extra sign extension hardware; however, the benefit is not as great, since only two way parallelism is achieved.


Chapter 7

Cache Prefetching

7.1 Introduction

This chapter is targeted at the memory hierarchy, which is of fundamental importance to multimedia system performance. While there has been much work studying memory performance for scientific and general purpose applications, there has been little work on the needs of multimedia applications. Relatively simple prefetching techniques can significantly improve the memory hit rates for multimedia applications. As processors become faster and utilize increasing instruction level parallelism, memory performance has a dominating effect on overall processor performance.

7.2 Related Work

A number of techniques exist for cache prefetching. The idea of prefetching is to predict data access needs in advance so that a specific piece of data is loaded from main memory before it is actually needed by the application. While a number of papers have studied both hardware and software prefetching techniques, no work has looked specifically at the memory behavior of multimedia applications.

The earliest hardware prefetching work was reported by Smith [Smi82], who proposed a one-block-lookahead (OBL) scheme for prefetching cache lines: when a demand miss brings block i into the cache, block i+1 is also prefetched. Jouppi [Jou90] expanded this idea with his proposal for stream buffers. In this scheme, a miss that causes block i to be brought into the cache also causes prefetching of blocks i+1, i+2, ..., i+n into a separate stream buffer. Jouppi also recognized the need for multi-way stream buffers so that multiple active streams can be maintained for a given cache, and he reported significant miss rate reduction. Palacharla and Kessler [PK94] proposed several enhancements to the stream buffer: a filtering scheme to limit the number of unnecessary prefetches, and a method for allowing variable length strides in prefetching stream data.

Another hardware approach to prefetching differs from the stream buffer in that data is prefetched directly to the main cache. In addition, some form of external table is used to keep track of past memory operations and predict future requirements for prefetching. This has the advantage of efficiently handling variable length striding, that is, data accesses that linearly traverse a data set by striding through in non-unit steps. Fu and Patel [FP91] proposed utilizing stride information available in vector processor instructions to prefetch relevant data. They later [FPJ92] expanded the application to scalar processors by use of a cache-like look-up table called the stride prediction table. Chen and Baer [BC91] have proposed a similar structure called the reference prediction table. Their scheme additionally includes state bits, so that state information can be maintained concerning the character of each memory operation; this is then used to limit unnecessary prefetching. Further analysis of this scheme [CB95] investigates the timing issues of prefetching by use of a cycle-by-cycle processor simulation. Sklenar [Skl92] presents a third variation on the same theme of using an external table to predict future memory references.

A number of techniques also exist for software prefetching. Porterfield [Por89] proposed a technique for prefetching certain types of array data. Mowry, et al. [MLG92] is generally recognized as having the most practical software prefetch scheme. While software prefetching clearly has a cost advantage, it does introduce additional overhead to the application: extra cycles must be spent to execute the prefetch instruction, and the loop unrolling required in Mowry's scheme causes code expansion and may result in negative side effects such as increased register usage. Furthermore, software prefetching must be optimized for a given memory architecture, implementation, and algorithm. Despite these disadvantages, though, software prefetching is still a promising technology.

7.3 Methodology

7.3.1 Simulation Methods

A cache simulator was linked into memory address traces generated by RYO. Only data references are modeled; instruction accesses are ignored. The simulator provides data over a wide range of data cache sizes and associativities. A line size of 16 bytes was chosen for all simulations, so as to better expose the potential benefits of prefetching. Because only a single process was simulated for each cache configuration, it is expected that the performance for the cache sizes reported corresponds to a larger cache size in a real system. Miss rates for these applications run in a baseline cache with no enhancements are shown in figure 2.2, and characteristics of the movies used in the benchmark executions are summarized in table 2.1.

7.3.2 Performance Metrics

Fraction of misses eliminated is the primary performance metric reported. This metric judges the performance of a given prefetch scheme independent of the particular cache implementation. A perfect prefetching scheme would eliminate all memory misses and would have a fraction of misses eliminated value of 1.0; similarly, an architecture that eliminates half of all the misses of a cache with similar size and associativity would have a fraction of misses eliminated value of 0.5.

In the case of a second level cache, the fraction of misses eliminated metric is identical to the hit rate of the second level cache: of all the misses that occur in the first level cache, the fraction that hit in the second level cache is, by definition, equal to the fraction of misses eliminated. The reason for using fraction of misses eliminated instead of second level hit rate is that configurations such as the stride prediction table and parallel stream cache have no discrete second level cache. In this way, comparisons with a common metric can be made across all cache configurations in the study.

This metric is desirable for a number of other reasons as well. Performance improvement can be judged independently of other cache design considerations such as main cache size and associativity. The size of the main cache has a dominating effect on miss rate, so that if results were simply compared in terms of absolute miss rates, the variation due to cache size would tend to mask out the variation due to prefetching scheme. Furthermore, performance can also be judged independently of memory implementation parameters such as time to access main memory; if this were not the case, varying memory parameters such as cycles to fill a main cache line could have a significant impact on results.

Results are also reported for execution time in numbers of cycles. For these results, an aggressive memory-limited processor model is assumed: an n-way superscalar processor with sufficient resources to perform any non-memory operation in a single cycle. In this model, computation time is limited only by memory operations, which shows the maximum impact that the memory system architecture has on performance.
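Stated as an explicit formula (the text defines the metric only in words; the notation here is ours), for a prefetching configuration measured against a baseline cache of the same size and associativity:

\[ \text{fraction of misses eliminated} \;=\; \frac{M_{\text{baseline}} - M_{\text{prefetch}}}{M_{\text{baseline}}} \]

where M denotes the number of data cache misses.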

7.3.3 Memory Bandwidth

For the purposes of this study, memory bandwidth is assumed to be large enough that it is not a limiting factor on performance. This assumption is made to study the effects of differing prefetch strategies independent of memory bus architectures. It is recognized that this assumption may not be valid in terms of today's architectures; however, the trend toward wider bandwidth to memory indicates that this may not be a problem in the future.

All architectures studied in this chapter rely on significantly increasing accesses to main memory in the form of increased prefetching. Techniques for filtering [PK94, CB95] exist to address this issue; however, they may have a negative impact on the total number of misses captured. Since the goal here is to compare the upper limit of cache performance between the differing cache configurations, these filtering techniques have not been implemented in the cache models used.

7.4 Stream Buffers

The stream buffer architecture simulated is shown in figure 7.3. As proposed by Jouppi, the stream buffer is a FIFO queue that sits on the refill path to the main cache. A new stream is allocated at each successive data cache miss. When all stream buffers are allocated, the next data cache miss replaces the stream least recently accessed (LRU replacement). A memory access that misses in the main cache but hits in the stream buffer is counted as a hit; since the refill time from the stream buffer can be an order of magnitude faster than a refill from main memory, this assumption should not significantly affect the reported results. Our simulations assume up to 16 parallel stream buffers. This number is selected to be large enough that the number of stream buffers is not a limiting factor in performance. We also simulate a stream buffer depth of 5 entries. Palacharla [PK94] proposed an enhancement to the stream buffer to filter unnecessary excess prefetches; however, because memory bandwidth is not a limiting factor in our model, this could only potentially hurt performance and is not included. He also proposed a mechanism to allow non-unit striding through the data, which is likewise not included in our simulation model.

Performance data across a range of cache sizes, with direct mapped and 4-way associativities, are shown in figures 7.1 and 7.2. For most applications, the stream buffer tends to peak at eliminating about 50% of the misses. mpeg_play playing easter.mpg is the single exception and can eliminate 80% of the misses for very large caches; this appears to be an outlying data point, however. Thus, for most cases, only 50% of misses are eliminated with the stream buffer. This is because the relatively complicated algorithms involved tend to access the data in non-unit strides, and the stream buffer is designed to aid only in cases of unit strides.
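To make the simulated structure concrete, here is a minimal sketch (our own illustration, with a stub standing in for the memory model) of the multi-way FIFO stream buffer just described: a hit at the head of a buffer supplies the line and advances the FIFO, and a miss allocates a new stream over the least recently used buffer.

```c
#include <stdint.h>

#define N_WAYS 16          /* parallel stream buffers, as in the simulations */
#define DEPTH   5          /* entries per buffer */

typedef struct {
    uint64_t line[DEPTH];  /* line addresses held, in FIFO order */
    int      valid;
    uint64_t stamp;        /* last-access time, for LRU replacement */
} StreamBuf;

static StreamBuf sb[N_WAYS];
static uint64_t now;

/* Stub standing in for the memory system: fetch one line into the buffer. */
static void fetch_line(uint64_t line_addr) { (void)line_addr; }

/* Called on a main-cache miss; returns 1 if a stream buffer supplies the
 * line (counted as a hit), 0 if it must come from main memory. */
int stream_buffer_access(uint64_t line_addr)
{
    int victim = 0;
    now++;
    for (int w = 0; w < N_WAYS; w++) {
        if (sb[w].valid && sb[w].line[0] == line_addr) {  /* hit at FIFO head */
            for (int d = 0; d < DEPTH - 1; d++)           /* shift the FIFO   */
                sb[w].line[d] = sb[w].line[d + 1];
            sb[w].line[DEPTH - 1] = sb[w].line[DEPTH - 2] + 1;
            fetch_line(sb[w].line[DEPTH - 1]);            /* refill the tail  */
            sb[w].stamp = now;
            return 1;
        }
        if (!sb[w].valid || sb[w].stamp < sb[victim].stamp)
            victim = w;
    }
    /* Miss: allocate a new stream over the LRU buffer, prefetching the
     * next DEPTH sequential lines. */
    sb[victim].valid = 1;
    sb[victim].stamp = now;
    for (int d = 0; d < DEPTH; d++) {
        sb[victim].line[d] = line_addr + 1 + d;
        fetch_line(sb[victim].line[d]);
    }
    return 0;
}
```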

Figure 7.1: Fraction of misses eliminated for 4 and 8 way stream buffers: (a) is a 4 way stream buffer with a direct mapped main cache, (b) is a 4 way stream buffer with a 4 way associative main cache, (c) is an 8 way stream buffer with a direct mapped main cache, and (d) is an 8 way stream buffer with a 4 way associative main cache.

Figure 7.2: Fraction of misses eliminated for 16 way stream buffers: (a) is a 16 way stream buffer with a direct mapped main cache and (b) is a 16 way stream buffer with a 4 way associative main cache.

Figure 7.3: 5 way stream buffer architecture.

Even with a 16-way stream buffer, only approximately 50% of misses are eliminated, and therefore the stream buffer is not a totally effective prefetching technique.

7.5 Stride Prediction Table


Figure 7.4: Stride prediction table architecture.

The structure of the stride prediction table (SPT) simulated is shown in figure 7.4. A table, indexed by instruction address, is maintained for all memory operations executed; each entry holds the address of the last memory location that instruction accessed. When a memory instruction is executed, its address is compared to the instruction addresses stored in the SPT. When the instruction does not match an instruction stored in the SPT, an SPT miss occurs; on an SPT miss the new entry, composed of the instruction address and data memory address, is added to the SPT, replacing the least recently used (LRU) entry. When a memory access is made by an instruction already contained in the stride prediction table, an SPT hit occurs. The previously stored memory address is subtracted from the current memory access address to calculate a data stride value. If this value is non-zero, a prefetch is issued; the prefetch address is calculated by adding the stride value to the current memory address. The data is prefetched into the main cache.
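A minimal sketch of this logic (our own illustration; the prefetch hook and naming are assumptions) makes the hit and miss behavior concrete:

```c
#include <stdint.h>

#define SPT_ENTRIES 128    /* table size used in most of the simulations */

typedef struct {
    uint64_t pc;           /* instruction address (the table tag) */
    uint64_t last_addr;    /* last data address this instruction touched */
    int      valid;
    uint64_t stamp;        /* last-access time, for LRU replacement */
} SptEntry;

static SptEntry spt[SPT_ENTRIES];
static uint64_t now;

/* Stub standing in for the cache model: prefetch one address into it. */
static void prefetch(uint64_t addr) { (void)addr; }

/* Called once per executed memory instruction. */
void spt_access(uint64_t pc, uint64_t addr)
{
    int victim = 0;
    now++;
    for (int i = 0; i < SPT_ENTRIES; i++) {
        if (spt[i].valid && spt[i].pc == pc) {           /* SPT hit */
            int64_t stride = (int64_t)(addr - spt[i].last_addr);
            if (stride != 0)
                prefetch(addr + stride);                 /* predicted next access */
            spt[i].last_addr = addr;
            spt[i].stamp = now;
            return;
        }
        if (!spt[i].valid || spt[i].stamp < spt[victim].stamp)
            victim = i;
    }
    /* SPT miss: install a new entry over the LRU victim; no prefetch. */
    spt[victim] = (SptEntry){ pc, addr, 1, now };
}
```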

Figure 7.5: Fraction of misses eliminated for 64 and 128 entry stride prediction tables: (a) is a 64 entry SPT with a direct mapped main cache, (b) is a 64 entry SPT with a 4 way associative main cache, (c) is a 128 entry SPT with a direct mapped main cache, and (d) is a 128 entry SPT with a 4 way associative main cache.

Figure 7.6: Fraction of misses eliminated for 512 and 1024 entry stride prediction tables: (a) is a 512 entry SPT with a direct mapped main cache, (b) is a 512 entry SPT with a 4 way associative main cache, (c) is a 1024 entry SPT with a direct mapped main cache, and (d) is a 1024 entry SPT with a 4 way associative main cache.


The SPT requires access to the PC (program counter) and may therefore be slightly disadvantageous compared to the stream buffer. The stream buffer, since it relies only on external data requests, may be added more easily than the stride prediction table to an existing commercial processor.

Data obtained from simulations using a variety of stride table sizes is shown in figures 7.5 and 7.6. All applications perform well with a stride cache of 128 entries. For large main cache sizes, between 70% and 90% of misses are eliminated relative to a cache with the same size and associativity but no stride prediction mechanism. A knee in the curve appears, however, at a cache size of approximately 32KB, below which the stride prediction table rapidly becomes less effective. Surprisingly, this is not as significant a factor for mpeg_encode, for which performance does not appreciably decay for small cache sizes. This is due to the memory intensive motion estimation that must be done for mpeg encoding: the fairly large search required for motion vector encoding must be repeated for each 16x16 macroblock in the movie. Although this requires a very large total number of memory references, the memory locality is quite good, and the traditional cache structure performs well even for very small caches. Therefore, the smaller number of remaining misses that are not captured by the traditional main cache are handled more easily by the stride prediction mechanism.

Finally, an interesting effect is observed for stride prediction tables of greater than 128 entries. In these cases, the stride prediction actually harms memory performance for relatively small cache sizes: the large number of non-useful prefetches begins to remove useful data from the cache. This problem could potentially be solved by the use of filtering techniques.

The stride prediction table works very well for middle and large cache sizes; indeed, it would be difficult to do better than eliminating 90% of the misses, and in this range the SPT is an effective means of prefetching. However, the stride prediction table has two significant problems with the smaller cache sizes. First, a large stride prediction table and small main cache results in unpredictable performance and may be undesirable in a real system: certain applications may improve, while others would be unpredictably degraded. The second problem is that even when the performance of smaller cache sizes is not degraded, it is hardly improved. It is the smaller cache range, where the main cache is exhibiting a high miss rate, that has the most to benefit from successful prefetching; the larger caches, where the stride prediction table eliminates most of the misses, actually have fewer misses to eliminate, so that total execution time would be less impacted. Thus, it is in this range, where the performance benefit potential is greatest, that the stride prediction table is least effective.

7.6 Stream Cache

7.6.1 Series Stream Cache

Figure 7.7: Series stream cache architecture.

The stream cache overcomes the problems of the SPT by improving performance for the small cache sizes. The stream cache is an independent cache into which data is prefetched on an SPT hit. On an SPT miss, no data is prefetched to the stream cache, but the instruction that missed in the SPT is added to the SPT as described in section 7.5. The SPT does a good job of predicting which data to prefetch, but fails for smaller cache sizes because it prefetches a large amount of unnecessary data. With the stream cache, data is prefetched not to the main cache, but to an independent stream cache.


Figure 7.8: Fraction of misses eliminated for 128 and 256 entry series stream cache using a 128 entry SPT: (a) is a 128 entry series stream cache with a direct mapped main cache, (b) is a 128 entry series stream cache with a 4 way associative main cache, (c) is a 256 entry series stream cache with a direct mapped main cache, and (d) is a 256 entry series stream cache with a 4 way associative main cache.


Figure 7.9: Fraction of misses eliminated for a 512 entry series stream cache using a 128 entry SPT: (a) is a 512 entry series stream cache with a direct mapped main cache and (b) is a 512 entry series stream cache with a 4 way associative main cache.

Because the data is not prefetched directly to the main cache, polluting the main cache is not a problem. A series stream cache architecture is shown in figure 7.7. The term series is used since the stream cache is connected in series with the main cache: the series stream cache is queried after a main cache miss and is used to fill the main cache with the desired data. If the data missed in the main cache is not in the stream cache, it is brought from main memory directly to the main cache. The series stream cache simulated is fully associative. Because it is unlikely the data will be reaccessed from the series stream cache once it is copied from the stream cache to the main cache, a most recently used (MRU) replacement policy is used when fetching new data into the stream cache. Using an LRU replacement policy would cause the data that has most recently been copied to the main cache to linger in the stream cache; this data is now in the main cache, and keeping a copy in the stream cache is an inefficient use of stream cache storage. New data is fetched into the stream cache only on an SPT hit, and the prefetch is not completed if the data at the prefetch address is already contained in the main cache.


The stream buffer works well only for unit strides and is inherently configured for a fixed number of streams: if a 16-way stream buffer is used, there should be 16 separate streams of application data for the cache memory to be efficiently utilized. The stream cache solves the problems of the stream buffer by uniting the separate FIFOs of multiple stream buffers into one relatively small, fully associative stream cache. The stride prediction table is used as before to predict which data to prefetch, but the data is prefetched to the stream cache instead of the main cache. Because the stream cache is unified, the specific number of streams in the application is irrelevant.

Performance data for 128, 256, and 512 entry stream caches are shown in figures 7.8 and 7.9. The dip in performance at 16KB for the 128 entry stream cache and at 64KB for the 256 entry stream cache occurs only in the configurations with a direct mapped main cache and not for comparable cache sizes with 4 way associativity; this indicates that the dip is likely due to a cache mapping effect. The 512 entry stream cache appears large enough to give a fairly uniform performance improvement of between 60 and 80% across most cache sizes and both associativities. Performance for main cache sizes of less than approximately 32KB is significantly improved over the same cache configurations using only a stride prediction table. This region on the left part of the graph is significant, since this is where the smaller main caches are not performing as efficiently and memory performance is a much higher percentage of execution time. Assuming a 16 byte line size, however, the 512 entry stream cache would require 8KB of additional die area and may not be an effective trade-off.

7.6.2 Parallel Stream Cache

The parallel stream cache is similar to the series stream cache except that the stream cache is moved from the refill path of the main cache to a position parallel to the main cache, as shown in figure 7.10. This design is based on the hypothesis that multimedia applications tend to operate on a relatively small workspace of data that marches through the movie. The data in this workspace is operated on for a short time, but is not frequently reused afterwards. The goal of the modified stream cache is to isolate this local workspace in the stream cache.

[Figure 7.10 block diagram: Processor, SPT, main Cache, Stream Cache, and Memory, with the stream cache placed in parallel with the main cache.]

Figure 7.10: Parallel stream cache architecture.

Prefetched data is brought into the stream cache, but is not copied into the main cache. A cache access must search both the main cache and the stream cache in parallel. On a cache miss that cannot be satisfied from either the main cache or the stream cache, the data is fetched from main memory directly to the main cache. Because the data in the stream cache can be accessed multiple times before it becomes stale, a least recently used (LRU) replacement scheme is now employed; an MRU replacement scheme would prematurely discard recently accessed data before it became stale. Like the series stream cache, data is prefetched to the stream cache on an SPT hit, and the prefetch is not completed if the data is already contained in the main cache. On an SPT miss no data is prefetched to the stream cache, but the instruction that missed in the SPT is added to the SPT as described in section 7.5.

Miss rate data for 128 and 256 entry parallel stream caches with a 128 entry SPT is shown in figure 7.11. The smaller caches show a greater enhancement than mid sized caches, since there is a greater benefit from keeping less frequently used data out of the main cache. For small cache sizes, performance is better than for the 128 entry series stream cache described in section 7.6.1. Furthermore, this is the region where the main cache suffers from high miss rates, so the improvement is particularly beneficial.
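A corresponding sketch of the parallel access path, reusing the illustrative stream_cache_t structure from the series sketch above, highlights the two differences: both caches are probed on every access, and the stream cache is refreshed LRU-style because its lines may be reused. The main cache helpers are assumed stand-ins, not simulator routines.

    extern bool main_cache_lookup(uint32_t line); /* conventional cache model */
    extern void main_cache_fill(uint32_t line);

    /* A stream cache hit refreshes the LRU timestamp but does NOT copy the
     * line into the main cache, so the streaming workspace stays isolated.
     * A prefetch fill here would evict the LEAST recently used entry, the
     * opposite of the series policy above. */
    bool parallel_access(stream_cache_t *sc, uint32_t addr) {
        uint32_t line = addr / LINE_BYTES;
        if (main_cache_lookup(line))
            return true;                /* main cache hit             */
        int i = sc_lookup(sc, line);
        if (i >= 0) {
            sc->last_use[i] = ++now;    /* may be reaccessed soon     */
            return true;                /* served from stream cache   */
        }
        main_cache_fill(line);          /* true miss: main cache only */
        return false;
    }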

[Figure 7.11 plots Fraction of Misses Eliminated against Cache Size (KBytes, 2 through 1024) in four panels for mpeg_encode, mpeg2play, mpeg_play-hula, and mpeg_play-easter.]

Figure 7.11: Fraction of misses eliminated for 128 and 256 entry parallel stream caches using a 128 entry SPT: (a) is a 128 entry parallel stream cache with a direct mapped main cache, (b) is a 128 entry parallel stream cache with a 4 way associative main cache, (c) is a 256 entry parallel stream cache with a direct mapped main cache, and (d) is a 256 entry parallel stream cache with a 4 way associative main cache.


7.7 Time/Area Tradeoffs

The previous sections reported significant improvements in miss rates. These improvements did not come for free, however: there is a cost in the additional die area required for the SPT and stream cache. In this section performance comparisons are presented after accounting for this additional area.

7.7.1 Additional Area Requirement

Actual die area is highly implementation dependent and is difficult to model accurately. Each stride prediction table entry must hold a complete instruction address, a data address, and a valid bit. The SPT area is therefore modeled assuming two 32 bit words, or 8 bytes, per entry. Additional area is also required for associated logic such as adders and comparators; this is not considered in the model. The additional area for the stream cache is calculated assuming 16 bytes per entry. This counts only the 16 byte data line per entry and neglects the tag bits. It may be possible to reduce the size of the SPT by storing only the lower 16 bits of the instruction address. This will cause some aliasing between unlike instructions, but we hypothesize that the effect is not significant. Similarly, it may be possible to store only the lower 16 bits of the data address, based on the assumption that data strides of more than 64KB are unlikely; again, this will cause some address aliasing. These reduction techniques are not considered in the area model employed.
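The area model reduces to simple arithmetic. A minimal illustrative sketch, using only the entry sizes stated above (the function is ours, not part of the simulator):

    #include <stdio.h>

    /* Bytes of extra die area under the stated model: 8 bytes per SPT entry
     * (two 32 bit words) and 16 bytes per stream cache entry (data lines
     * only; tags and logic are ignored, as in the text). */
    static unsigned extra_area_bytes(unsigned spt_entries, unsigned sc_entries) {
        return spt_entries * 8 + sc_entries * 16;
    }

    int main(void) {
        /* 512 entry stream cache: 512 * 16 = 8192 bytes (8KB), as quoted */
        printf("512 entry stream cache: %u bytes\n", extra_area_bytes(0, 512));
        /* 128 entry SPT alone: 128 * 8 = 1024 bytes (1KB) */
        printf("128 entry SPT: %u bytes\n", extra_area_bytes(128, 0));
        return 0;
    }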

7.7.2 Execution Time

Absolute execution times for a single application are shown in figure 7.12. Execution time is calculated assuming a main memory latency of 25 cycles for fills to both the main and stream caches. If data is needed while it is still in the process of being loaded into the cache, the balance of cycles remaining is counted in total execution time. Memory is always charged a constant latency of 25 cycles, and conflicts between requests are not simulated. The horizontal axis is adjusted so that total area, including both the main cache and the stream cache, is shown for the enhanced


cache. For cache sizes of up to about 256KB, less than 80% of the original execution time is required. For very large cache sizes, the traditional cache design does a fairly good job of capturing the working set, and the stream cache is proportionately less beneficial, or in some cases even detrimental. For large caches, then, the stride prediction table alone is an effective means of prefetching. As movie sizes grow, however, this break point will shift to the right and the stream cache will be useful over a larger range of cache sizes.
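The accounting rule for data that is still in flight can be made concrete with a small sketch. This is our reconstruction of the model just described, with illustrative names; it is not simulator source.

    #include <stdint.h>

    #define MEM_LATENCY 25   /* constant charge per memory request */

    static uint64_t cycles;  /* running execution time in cycles */

    typedef struct { uint64_t ready_at; } fill_t;  /* one outstanding fill */

    /* An ordinary demand miss pays the full latency. */
    void demand_miss(void) { cycles += MEM_LATENCY; }

    /* A prefetch starts a fill but charges nothing up front. */
    void start_prefetch(fill_t *f) { f->ready_at = cycles + MEM_LATENCY; }

    /* A demand access that finds its line still in flight is charged only
     * the balance of cycles remaining on that fill. */
    void demand_hits_inflight(const fill_t *f) {
        if (f->ready_at > cycles)
            cycles += f->ready_at - cycles;
    }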

[Figure 7.12 plots Execution Time (Millions of Cycles) against Combined Cache Size (KBytes) in two panels for mpeg_play-hula base and mpeg_play-hula enhanced.]

This data suggests that the stream cache is effective at improving execution time for either a very small on-chip cache or a low cost multimedia system using only a small cache. A 128 entry stride prediction table with a 128 entry stream cache adds only about 2.5KB of extra area, but causes the 2KB main cache to perform as a baseline 16KB cache, or a 4KB cache to perform as a baseline 128KB cache, for the application shown.


Figure 7.12: Execution times for mpeg_play-hula with a 128 entry parallel stream cache and 128 entry stride table, adjusted for the extra area required: (a) is a direct mapped main cache and (b) is a 4 way associative main cache.


7.8 Effectiveness of Stride Prediction


In this section the effectiveness of hardware based stride prediction is further investigated. The advantage of hardware based stride prediction is that the stride value can change dynamically: a single instruction can prefetch with several different stride values over the duration of the program. Furthermore, the SPT described in section 7.5 is accessed for every load and store executed. Perhaps it is necessary to track only a small subset of load and store instructions to determine when best to issue a prefetch.

[Figure 7.13 plots total useful prefetches (thousands) against sorted instruction id.]

Figure 7.13: Histogram of useful prefetches for movie flower using a 32KB direct mapped cache and parallel stream cache.

The following graphs explore the effectiveness of stride-based prefetching over a wide design space. The stride prediction is always done by means of a 128 entry SPT. Figure 7.13 shows a histogram of total useful prefetches for a 32KB direct mapped main cache and 128 entry parallel stream cache running the movie flower. Total useful prefetches are shown on the y axis for a given instruction on the x axis. We define a useful prefetch as one for which the prefetched data is subsequently used by the application. Prefetches are counted for the instruction that predicted a given stride. The graph further sorts the instructions from those causing the most prefetches


to those causing the fewest prefetches. From this histogram it is observed that a relatively small number of instructions cause most of the useful prefetches. Figures 7.14 through 7.17, the remainder of the graphs in this chapter, present the integral of total useful prefetches, that is, the area under a curve of the type shown in figure 7.13. Thus, when the line indicating total prefetches becomes level, the instructions at that point are causing no significant prefetches. In general, a relatively small number of instructions, on the order of 200, is all that is needed to cause most of the prefetches. These graphs further divide the prefetches into three separate categories indicated by the three lines on each graph. The top-most line is the total number of useful prefetches. The bottom-most line counts only those prefetches that had constant strides; for these, the stride prediction table seems unnecessarily complex, since the strides never change dynamically, and the capability to change strides dynamically is one of the key features of the SPT. Finally, the middle line shows all the prefetches that would result if each instruction had only a static stride associated with it; the single stride value is selected as the most common stride value from the run. In general, static prediction performs only slightly worse than dynamic prediction. The following sections show the effect of varying a single design space parameter while leaving the rest constant.
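The construction of these cumulative curves can be sketched as follows. The function is illustrative and assumes the per-instruction useful prefetch counts have already been collected from the trace.

    #include <stdlib.h>

    /* Sort per-instruction useful prefetch counts in descending order. */
    static int desc(const void *a, const void *b) {
        long x = *(const long *)a, y = *(const long *)b;
        return (x < y) - (x > y);
    }

    /* integral[i] is the y value plotted at sorted instruction id i; the
     * curve flattens once the remaining instructions contribute nothing. */
    void cumulative_useful(long *per_inst, long *integral, int n) {
        qsort(per_inst, (size_t)n, sizeof per_inst[0], desc);
        long running = 0;
        for (int i = 0; i < n; i++) {
            running += per_inst[i];
            integral[i] = running;
        }
    }

The same computation, restricted to constant-stride prefetches or to prefetches a best static guess would have issued, yields the bottom and middle lines respectively.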

7.8.1 Effect of Application Movie

Figure 7.14 illustrates the effect of varying the movie while holding associativity, cache size, and cache configuration constant. The configuration selected is a 32KB direct mapped main cache with a parallel stream cache. Data is shown for the movies hula, flower, and easter. For all three movies, the percentage of prefetches with constant strides and the percentage that could be captured with a static prediction both remain approximately constant regardless of movie. Furthermore, most of the useful prefetches can be captured with about 200 instructions for all three movies. The total number of prefetches is almost identical between easter and hula. This is


because the number of frames in the two movies is comparable. Flower has almost four times as many useful prefetches because it contains almost four times as many frames as hula or easter. Since total prefetches are counted over the entire movie execution, the total number of frames affects the total number of prefetches reported.

7.8.2 Effect of Cache Size

Figure 7.15 illustrates the effect of varying cache size while holding associativity, cache configuration, and movie constant. The configuration selected is a parallel stream cache with a direct mapped main cache, running the movie hula. Data is shown for cache sizes of 2KB, 4KB, 32KB, and 1MB. The cumulative number of total useful prefetches varies with cache size: approximately 34 million, 31 million, 19 million, and 8 million useful prefetches are issued for the 2KB, 4KB, 32KB, and 1MB caches respectively. For all cache sizes, the total number of instructions issuing useful prefetches is approximately equal, at fewer than 600 instructions. For the two larger cache sizes, most of the useful prefetches are captured by the first 150 instructions, while the smaller caches require 350 instructions before all useful prefetches are captured. For all four caches, useful prefetches with constant strides compose approximately 50% of total prefetches, rising to 60% of total prefetches for the 1MB cache. For the large 1MB cache, 90% of the useful prefetches could have been issued with a static prefetch prediction. This percentage decreases to about 75-80% for the 2KB, 4KB, and 32KB caches.

7.8.3 Effect of Cache Type

Figure 7.16 illustrates the effect of varying the stream cache configuration while holding associativity, cache size, and movie constant. The configuration selected is a 32KB direct mapped main cache, and the movie used is flower.mpg. Data is shown for series and parallel stream caches.

[Figure 7.14 plots cumulative total useful prefetches (millions) against sorted instruction id in three panels, with separate lines for total useful prefetches, best static guess, and prefetches with constant stride.]

Figure 7.14: Effect of movie file on a 32KB direct mapped main cache and parallel stream cache. (a) is hula, (b) is flower, and (c) is easter.

[Figure 7.15 plots cumulative total useful prefetches (millions) against sorted instruction id in four panels, with separate lines for total useful prefetches, best static guess, and prefetches with constant stride.]

Figure 7.15: Effect of main cache size. All graphs are for movie hula with a direct mapped main cache and a parallel stream cache. (a) is a 2KB main cache, (b) is a 4KB main cache, (c) is a 32KB main cache, and (d) is a 1MB main cache.



The total number of useful prefetches is much greater for the parallel stream cache. Approximately 200 instructions are needed to capture all the useful prefetches, approximately the same number required for hula in the previous graphs. For the graphs shown, the percentage of prefetches that can be correctly captured with a static prediction is far lower for the series stream cache than for the parallel stream cache (note the scale on the y axis). This effect is even greater for a 2KB cache, although that data is not shown here.

[Figure 7.16 plots cumulative total useful prefetches (millions) against sorted instruction id in two panels with different y axis scales, with separate lines for total useful prefetches, best static guess, and prefetches with constant stride.]

Figure 7.16: Effect of cache type for a direct mapped 32KB main cache and movie flower. (a) is a parallel stream cache and (b) is a series stream cache.

7.8.4 Effect of Associativity

Finally, figure 7.17 illustrates the effect of varying the main cache associativity while holding movie, cache size, and cache configuration constant. The configuration selected is a 2KB main cache with a parallel stream cache, running the movie hula. Data is shown for a direct mapped, 2 way, and 4 way associative main cache. The three curves have approximately the same shape and breakdown of stride composition. The main difference is that the total number of useful prefetches decreases as the associativity becomes higher. This is expected: as the associativity


of the main cache increases, the base performance improves, and fewer prefetches can bring in useful data. There is also a small decrease in the percentage of prefetches that can be captured with a static prediction for the 4-way associative case: the ability to dynamically predict the stride becomes increasingly important as associativity increases.

7.9 Summary

In this chapter we investigated a number of hardware prefetching techniques for multimedia benchmarks. The regular memory access pattern of these applications makes some form of data prefetching an attractive strategy for improving memory performance. Stream buffers can eliminate up to about 50% of data misses for small and moderately sized caches. It is at the small cache sizes, where the large number of misses contributes significantly to total execution time, that a large reduction in misses is most desirable. The series stream cache added improvement over the stride prediction table for smaller caches, and left the performance improvements intact for large caches. Finally, the parallel stream cache resulted in very good performance enhancements for small cache sizes with a small amount of additional hardware, but in some cases did slightly worse than the stride prediction table for large cache sizes.

Data is presented for both direct mapped and 4 way associative caches to show that the same trends exist regardless of associativity. 4 way associativity was chosen because simulations done but not published here showed that 4 way associative caches behave approximately the same as fully associative ones. The stride prediction table was shown to perform very well for large caches, and the series and parallel stream caches perform very well for small cache sizes. In these cases, performance improvements result from an increase in hardware. Extremely cost or area sensitive applications, where a small cache is required, can benefit significantly from employing such a technique even though die area is increased to add logic for the stride prediction.


[Figure 7.17 plots cumulative total useful prefetches (millions) against sorted instruction id in three panels, with separate lines for total useful prefetches, best static guess, and prefetches with constant stride.]

Figure 7.17: Effect of main cache associativity for movie hula with a 2KB main cache and a parallel stream cache. (a) is a direct mapped main cache, (b) is a 2 way associative main cache, and (c) is a 4 way associative main cache.


Finally, an analysis of the effectiveness of hardware directed stride prediction was presented. Only a few hundred prefetching instructions out of the thousands executed generate the large majority of useful prefetches. Furthermore, although the dynamically changing stride value possible with hardware directed stride prediction is necessary to achieve maximum performance, the majority of useful prefetches can be executed with a best guess static prediction scheme. This is shown to be true for three different movies. The percentage of useful prefetches that can be captured with a best guess static prediction remains approximately unchanged when the movie or the cache associativity is varied. There is some change when the cache type is varied, although this does not appear to be significant.

The cache sizes reported in this chapter perform as slightly larger caches would in a real system, because only a single application was run against each cache. The movie frame size used also has a significant effect on the break point in a given cache performance curve. For this reason, results should be considered relative to one another rather than fixed at a given cache size. Although a specific cache simulated here may perform as a smaller cache would in a given system, the relative trends remain constant regardless. In the next chapter we investigate a software based stride prediction mechanism motivated by this study of hardware stride prediction effectiveness.

Chapter 8

Software Directed Stride Based Prefetching

8.1 Introduction

In the previous chapter the effectiveness of hardware directed stride-based prefetching was investigated. The results indicate that although the hardware mechanism allows us to fully exploit a dynamically changing stride value, using a best guess static stride value works almost as effectively. Furthermore, only a few hundred individual instructions, not the thousands that are in the executable, cause most of the useful prefetches. In this chapter, we investigate a software directed stride based prefetching technique that replaces the hardware SPT. In sections 8.2 and 8.3 the technique is evaluated as a strict replacement for the SPT, and performance is compared for three cases: no stream cache, series stream cache, and parallel stream cache. In section 8.4, software prefetching is investigated assuming no additional stream caches. In this case, no additional hardware is required for the stream cache or SPT, so the only hardware support needed is a software controlled prefetch instruction.

The software prefetching technique works by gathering execution profile information from a simulation of the hardware SPT. A prefetch hint file is generated by tracing which instructions caused the most useful prefetches in the hardware SPT

simulation. The hint file is then used to insert software prefetch instructions. This could be done automatically in a compiler by first profiling, then inserting prefetch instructions into the code, in two separate steps.

The profile step simulates a hardware SPT. By tracking which instructions caused which cache lines to be prefetched, and then recording which prefetched data is actually used by the application, we determine which instructions were useful in prefetching data that is subsequently used. Furthermore, by tracking the stride value that was used to prefetch the data, we can determine the best value to use for a static stride prediction. The data is gathered into a table like the one shown in figure 8.1; the data shown is from the movie hula. The leftmost column is the instruction address; the next shows the total number of times that the cache lines prefetched by that instruction were accessed. The next column shows the last stride value that was used, and the next shows the number of times the stride value changed. A value of zero in this column means the stride never changed, so a static stride prediction would work perfectly for that instruction. If the stride value did change, the rightmost column indicates the single stride value that prefetched the greatest number of useful cache lines, and the second column from the right shows the number of times those cache lines were accessed. A similar table can be generated describing the prefetching effectiveness for any combination of main cache and stream cache sizes, and data can be collected for both parallel and series stream caches. In the example shown, 704 appears to be a common stride value. This is exactly twice the movie frame width and supports the hypothesis that movie frame size is an important stride parameter.

After obtaining data describing which prefetches are useful, we can selectively insert software prefetch instructions into the executable code using a static stride prediction. The results in this chapter are generated by simulating a discrete software prefetch instruction. The particular prefetch instruction could be implemented in a variety of ways.


      opid    used count   last stride   stride change   grtst freq   most common
                                                 count        count        stride
   1241088           17            32              0            0             0
   1241099          355           -31            307           15           -17
   1241101          514           352             83          412           704
   1241103         1985            32              0            0             0
   1241104          369            22            292           21           -10
   1241106          375           704             55          302           704
   1241109          375            42            327           18            -6
   1241111          320           704             44          251           704
   1241114          348             3            295           18            -3
   1241116          390           704             62          317           704
   1241119          397            -9            349           15            -4
   1241121          412           704             56          346           704
   1241124          432            13            365           25            -3
   1241126          656           704             80          561           704
   1241129          493            -2            443           16           -19
   1241131        17025           704           1216        14737           704
   1241132         7367           704            625         6217           704
   1241134          677            35            609           17             8

Figure 8.1: Trace output from hardware simulation for movie hula.
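The profiling pass that produces a table with the fields of figure 8.1 can be sketched as follows. The structures and function names are hypothetical; in the actual methodology this bookkeeping happens inside the trace driven SPT simulation.

    #include <stdint.h>

    #define MAX_STRIDES 64   /* distinct strides tracked per instruction */

    typedef struct {
        uint32_t opid;             /* instruction address                  */
        long     used_count;       /* accesses to lines this op prefetched */
        int32_t  last_addr;
        int32_t  last_stride;
        long     stride_changes;   /* 0 => static prediction is perfect    */
        int32_t  stride_val[MAX_STRIDES];
        long     stride_used[MAX_STRIDES]; /* useful lines per stride      */
        int      nstrides;
    } hint_entry_t;

    /* Called for every load/store this entry tracks. */
    void record_access(hint_entry_t *h, int32_t addr) {
        int32_t stride = addr - h->last_addr;
        if (stride != h->last_stride) {
            h->stride_changes++;
            h->last_stride = stride;
        }
        h->last_addr = addr;
    }

    /* Called when a line prefetched earlier by this op (with this stride)
     * is actually used by the application. */
    void record_useful(hint_entry_t *h, int32_t stride) {
        h->used_count++;
        for (int i = 0; i < h->nstrides; i++)
            if (h->stride_val[i] == stride) { h->stride_used[i]++; return; }
        if (h->nstrides < MAX_STRIDES) {
            h->stride_val[h->nstrides]    = stride;
            h->stride_used[h->nstrides++] = 1;
        }
    }

    /* The "most common stride" and "grtst freq count" columns. */
    int32_t best_static_stride(const hint_entry_t *h, long *freq) {
        int best = 0;
        for (int i = 1; i < h->nstrides; i++)
            if (h->stride_used[i] > h->stride_used[best]) best = i;
        *freq = h->nstrides ? h->stride_used[best] : 0;
        return h->nstrides ? h->stride_val[best] : 0;
    }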

For this simulation, an atomic prefetch-by-stride instruction is assumed. The instruction prefetches into either the special purpose stream cache or directly into the main cache, depending on the cache configuration simulated. The prefetch-by-stride instruction is invoked with an immediate stride value: the last executed load or store address is added to the stride value, and a prefetch from this new address is initiated. The stride value is available at compile time and is derived from the hint file generated at the profiling step.

The task now is to determine which software prefetch instructions to insert. Notice that the software inserted prefetch instructions do nothing but emulate the hardware stride prediction table. For smaller caches, software prefetching often does better than hardware prefetching. Recall that for small caches, the hardware will prefetch an excessive amount of data, knocking out useful data and degrading performance. Software directed prefetching eliminates this problem, since only the more effective prefetches are inserted. Another advantage of software prefetching is earlier initiation of prefetching. In hardware prefetching, a given load or store instruction must access the SPT twice before a stride value can be calculated and a prefetch issued. Assuming the stride prediction works as expected, the first two memory accesses for a given instruction will miss in the cache, and the third and subsequent accesses will hit. With software prefetching, since the stride is known at compile time, the prefetch can be issued immediately after the first access: although the first access will miss, the second and subsequent accesses will hit. This is probably why software prefetching performs better than hardware prefetching in some cases.

The design space that must be explored to evaluate this technique is complicated. Not only do we have the same set of main cache parameters that were investigated for hardware prefetching effectiveness in the previous chapter, but now we must also examine the whole range of parameters used to collect the stride effectiveness history. For example, one might use stride data from a 2KB parallel stream cache to do software prefetching into an 8KB series stream cache; this occurs if an executable compiled with a profile from one machine is executed on another machine. The total number of combinations quickly becomes large.
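To make the assumed prefetch-by-stride primitive concrete, the sketch below shows how such an instruction might be emulated. The helper names are hypothetical; the text assumes only an atomic instruction taking an immediate stride operand.

    #include <stdint.h>

    static uint32_t last_mem_addr;  /* updated on every load/store          */

    extern void issue_prefetch(uint32_t addr); /* to stream or main cache   */

    /* Emulation of the atomic instruction: the immediate stride from the
     * hint file is added to the address of the most recent load or store,
     * and a prefetch of that line is initiated. */
    static inline void prefetch_by_stride(int32_t imm_stride) {
        issue_prefetch(last_mem_addr + (uint32_t)imm_stride);
    }

    /* Example: for the hula rows of figure 8.1 with stride 704, the hint
     * directs a prefetch two frame-widths ahead (704 = 2 * 352). */
    void example(void) {
        prefetch_by_stride(704);
    }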

The following sections investigate the effect of different parameters by varying a given parameter while holding the other parameters constant. In this chapter, all prediction files have been generated modeling a 128 entry SPT, and the stream caches used also have 128 entries. The best static guess is derived using the methodology of section 7.8. In sections 8.2 and 8.3, other than section 8.2.1 where the effect of the movie file is evaluated, the same movie is used to generate the hint file and then executed to measure performance. Beginning with section 8.4, three different movies are executed to measure performance with a hint file generated from a single movie.

8.2 Effect of Prediction File

In this section we investigate the effects of varying the simulation parameters used to generate the prefetch hint information.


8.2.1 Effect of Prediction File Movie

[Figure 8.2 plots Fraction of Misses Eliminated against Cache Size (KBytes) for hints generated from hula, flower, and easter.]

Figure 8.2: Effect of varying the movie used to make the prediction file. The prediction file was generated using a 2KB direct mapped main cache and series stream cache. Only the 200 most effective instructions issue prefetches. Fraction of misses eliminated is reported for execution of hula performed with a direct mapped main cache and parallel stream cache.

Figure 8.2 illustrates the effect of varying the movie displayed while holding constant the cache size, associativity, and configuration of the cache used to collect the prefetch data. The movie hula with a series stream cache and a 2KB direct mapped main cache was used to collect the prefetch statistics. The 200 most useful instructions are used to execute prefetches. Data is shown across a range of cache sizes for a direct mapped main cache with a parallel stream cache for the movies hula, flower, and easter, presented in terms of fraction of misses eliminated.

The movies hula and flower perform approximately the same, while easter is significantly worse. This is due to the frame size of the movies: hula and flower share the same frame size of 352 x 240, while easter has a frame size of 240 x 176. We hypothesize that the frame dimensions are important components of the stride information, though we have not yet tested this hypothesis. With some manual manipulation of the stride data, it may be possible to parameterize the stride in terms of frame size and thereby achieve the same performance for differently sized movies. Although easter did not perform as well as hula or flower, it still did better than no prefetching.

8.2.2 Effect of Prediction File Cache Type

Figure 8.3 illustrates the effect of varying the cache configuration used to collect the prefetch effectiveness data while holding constant the movie, associativity, and size of the cache used to collect the data. The movie hula in a direct mapped 2KB cache was used to generate the hint file, and the 200 most useful instructions are added as software prefetch instructions. Data is shown across a range of cache sizes for the three different types of cache configurations in (a), (b), and (c), presented in terms of fraction of misses eliminated. Data for comparable hardware directed prefetching is also given.

In general the shapes of the three curves for no stream cache, parallel, and series stream caches match the shape of the same curve for hardware based stride prediction. For the no stream cache case, the parallel stream cache prediction data does better for the 2KB and 4KB sizes, and the hardware prefetching does better

[Figure 8.3 plots Fraction of Misses Eliminated against Cache Size (KBytes) in three panels, comparing software prefetch hints from a parallel stream cache, software prefetch hints from a series stream cache, and hardware prefetching.]

Figure 8.3: Effect of varying the cache type used for generating the prediction file. All graphs compare the effects of predictions generated using a parallel stream cache, a series stream cache, and hardware directed prefetching using the movie hula and a 2KB direct mapped main cache. Only the 200 most effective instructions issue prefetches in the software case. Fraction of misses eliminated is reported for execution performed with a direct mapped main cache and in (a) no stream cache, (b) a parallel stream cache, and (c) a series stream cache.

than software prefetching for the 8KB and 16KB caches. The series prediction data does better for 32KB and larger. For the parallel stream cache, one would expect the parallel prediction data to do better than the series prediction data in every case; in fact, the parallel data does worse than the series prediction data for the 2KB and greater-than-32KB main cache sizes. The hardware directed prefetching does best for cache sizes up to 32KB. The series stream cache behaves as expected, with the series prediction data resulting in better performance than the parallel prediction data in every case. The hardware directed stride prediction does best for 4KB through 32KB caches, and again for 512KB and 1MB caches.

In general, the software prediction results in performance similar to hardware directed prefetching without the cost of a hardware stride prediction table; indeed, the software directed prefetching does even better in a number of cases. Assuming the two largest cache sizes in figure 8.3c are outlying points, we conclude that hardware prefetching in these graphs is best for cache sizes of 8KB and 16KB, and the software series stream prefetch hints do best for cache sizes of 32KB and larger. The 2KB cache results vary based on the execution cache configuration. Finally, all three graphs look approximately the same in the range from 32KB to 512KB, indicating that the execution cache configuration is less sensitive to the cache type used to generate hints in this range.

We should expect the hardware directed prefetching to do better for the smaller caches. In the previous chapter it was shown that a smaller percentage of all useful prefetches could be captured by a best static guess stride prediction for the small cache sizes than for the large cache sizes. Therefore the dynamic stride prediction possible with the hardware based prediction is more beneficial for the smaller cache sizes. For the larger caches, the best static guess does almost as well at stride prediction as the hardware stride prediction table.

[Figure 8.4 plots Fraction of Misses Eliminated against Cache Size (KBytes) in three panels, comparing hint files generated from 32KB, 4KB, and 2KB main caches.]

Figure 8.4: Effect of varying the cache size used for generating the prediction file. All graphs compare the effects of predictions generated using 2KB, 4KB, and 32KB direct mapped main caches with a parallel stream cache using the movie hula. Only the 200 most effective instructions issue prefetches. Fraction of misses eliminated is reported for execution performed with a direct mapped main cache and in (a) no stream cache, (b) a parallel stream cache, and (c) a series stream cache.


8.2.3 Effect of Prediction File Cache Size

Figure 8.4 illustrates the effect of varying the cache size used to collect the prefetch effectiveness data while holding constant the movie, associativity, and configuration of the cache used to collect the data. The movie hula with a parallel stream cache and direct mapped main caches of 2KB, 4KB, and 32KB was used to generate the prediction files. The 200 most useful instructions are used to execute prefetches. Data is shown across a range of cache sizes for the three different types of cache configurations executed with the movie hula, presented in terms of fraction of misses eliminated.

As in the section above, the shapes of the curves for the three types of cache configurations are approximately the same as for the hardware based prefetching methodology. For the no stream cache case, the 32KB prediction data always does best, followed by the 4KB and 2KB data respectively. This is surprising, since one would expect the 4KB and 2KB data sets to do better at their respective cache sizes. For the parallel stream cache, the 2KB and 4KB prediction data tie for the best performance at the 2KB cache, the 4KB prediction data does best for the 4KB and 8KB caches, and the 32KB prediction data does best for 16KB and larger caches. The data is similar for the series stream cache: the 2KB prediction data does slightly better at the 2KB cache size, the 4KB prediction data does better for the 4KB and 8KB caches, and the 32KB prediction data does best for 16KB and larger caches. Each prediction data set does best at its own cache size, and the data set with the largest cache size does best for the large caches. Note, however, that there is much less difference in performance between the 4KB and 32KB prediction data than between the 2KB and 4KB data. This suggests that using at least a 4KB cache for generating software predictions is desirable, and a 16KB or 32KB cache even better.


8.3 Effect of Cache

In this section, the cache configuration used to collect the prefetch hint file is fixed and the configuration on which the benchmark is executed is varied. In section 8.3.1 the execution cache type is varied, and in section 8.3.2 the execution main cache associativity is varied.


8.3.1 Effect of Cache Type

[Figure 8.5 plots Fraction of Misses Eliminated against Cache Size (KBytes) in two panels, comparing a parallel stream cache, no stream cache, and a series stream cache.]

Figure 8.5: Effect of main cache type. Both graphs use hint files generated using a direct mapped main cache and parallel stream cache with the movie hula. Only the 200 most effective instructions issue prefetches. (a) used a 2KB cache to generate the hint file, and (b) used a 32KB cache.

Figure 8.5 compares the effect of different prefetch cache types across a range of cache sizes while holding the prefetch hint data constant. The graphs compare no stream cache, series stream cache, and parallel stream cache using a single movie and a single prefetch data file. Two graphs are shown for two sets of prefetch history data, collected using the movie hula with a parallel stream cache and direct mapped main caches of 2KB and 32KB. The 200 most useful instructions are used to execute prefetches.

These graphs show that the parallel stream cache consistently performs best, followed by the series stream cache, with software directed prefetch directly into the main cache performing worst, up to a 32KB cache size. For cache sizes of 64KB and larger, the no stream cache configuration surprisingly does better than the series stream cache, but still worse than the parallel stream cache. At 1MB there is a small crossover point where the no stream cache configuration does best, but this is not very significant since there are so few total misses at this large cache size. In the graphs shown, the larger caches do better with the data set collected from the 32KB cache, as might be expected, but in general the difference does not appear to be significant. Furthermore, in this set of graphs the series stream cache always performs worse than the parallel stream cache or no stream cache, indicating that the series stream cache is not useful for software directed prefetching.

8.3.2 Effect of Cache Associativity

Finally, figure 8.6 compares the effect of different main cache associativities while holding the cache configuration and prefetch data profiling constant. Three graphs are presented for the three types of cache configurations. The prefetch history data was collected using the movie hula with a parallel stream cache and a 2KB direct mapped main cache. The 200 most useful instructions are used to execute prefetches.

For no stream cache with software directed prefetching, the 4-way associative cache has the highest fraction of misses eliminated, followed by the 2-way and direct mapped caches. There is a small crossover at the 2KB cache size, but this is probably an exception since this is the cache size for which the data was collected. The performance for no stream cache with 2 way associativity is better than either stream cache for cache sizes greater than 64KB, except for the 512KB parallel stream cache. This indicates that no stream cache with an associativity greater than one is an attractive configuration.

For the parallel stream cache, the different associativities all result in approximately the same fraction of misses eliminated. This means that, compared to the no

[Figure 8.6 plots Fraction of Misses Eliminated against Cache Size (KBytes) in three panels, comparing 4 way associative, 2 way associative, and direct mapped main caches.]

Figure 8.6: Effect of varying main cache associativity. All graphs use hint files generated using a direct mapped 2KB main cache and parallel stream cache with the movie hula. Only the 200 most effective instructions issue prefetches. (a) is executed using no stream cache, (b) a parallel stream cache, and (c) a series stream cache.

stream cache, the 2-way and direct mapped caches do better than expected. The parallel stream cache acts as a form of added associativity and helps eliminate more of the conflict misses for the 2-way and direct mapped caches. The series stream cache behaves much like the no stream cache case, except that the direct mapped cache does better for the 2KB and 4KB sizes. This is probably an artifact of the prefetch data set, since it was collected from a direct mapped cache. Both the series and parallel stream caches show a decrease in fraction of misses eliminated for large cache sizes. This is not significant, since the large cache sizes have only a small number of misses to begin with; the apparent decrease is caused by only a small change in the number of misses.

In these sections we have studied replacing the hardware SPT with software stride prediction. We have shown that it is important that the movie used to generate the hint file have the same frame size as the movie executed. Furthermore, we have shown that for mid range cache sizes, a series stream cache with a main cache of at least 4KB should be used to generate the hint file. Execution with a parallel stream cache results in the best performance for direct mapped caches. However, for mid range cache sizes with at least 2 way associativity, performance can be superior with no stream cache at all. This is especially attractive since abandoning the stream cache saves die area. In the next section we investigate software prefetching with no stream cache in more detail.

8.4 Software Prefetching with No Stream Cache

In this section we look in detail at software prefetching with no stream cache and no SPT. This is an attractive strategy since only a simple software controlled prefetch instruction, with no additional hardware enhancements, is required. In this case, the software prefetch instruction prefetches data directly into the main cache, since there is no stream cache. We begin by determining the best parameters to select when generating the hint file. Next we confirm this selection for three movies and a range of cache parameters. Finally we investigate the effect of modifying the parameters used to model execution


time. A complete set of data is presented in appendix D.

As described in section 8.1, a number of parameters must be chosen when performing the simulation that generates the hint file: the main cache size, main cache associativity, stream cache type, and movie displayed all determine a particular hint file. Furthermore, once the hint file is generated, the number of instructions to add as prefetches to the enhanced executable must be determined. The previous section showed that a larger cache size is preferable for generating the hint file, so in this section the cache used to generate the hint file is fixed as a 32KB direct mapped cache in all simulations. The previous section also showed that a movie with the same frame size as the executed movie should be chosen to generate the hint file; here, the movie used to generate the hint file is always easter. Although easter often has a different frame size from the movie executed, this choice illustrates that good performance is still achieved across a variety of frame sizes. In the next two sections, the stream cache type and the fraction of prefetch instructions to insert are chosen.

Execution time is calculated assuming a constant memory access time of 50 cycles for both misses and prefetches, and a fully interleaved memory so that multiple outstanding accesses are allowed. We continue to follow the load/store limited superscalar processor model, in which execution time consists only of one cycle to execute each memory operation plus the time required to service cache misses. Relative execution time compares execution time to a cache with the same configuration but no prefetching; a relative execution time of less than one indicates a performance benefit from prefetching.
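A minimal sketch of this metric, assuming the miss counts have already been measured by simulation (names are illustrative, and the partial overlap of in-flight fills is ignored here for brevity):

    #include <stdint.h>

    #define MISS_COST 50   /* constant memory access time in cycles */

    /* One cycle per memory operation and per inserted prefetch instruction,
     * plus MISS_COST cycles per remaining miss; the baseline run has no
     * prefetch instructions and its own (larger) miss count. */
    double relative_execution_time(uint64_t mem_ops, uint64_t misses,
                                   uint64_t pf_insns, uint64_t misses_base) {
        double enhanced = (double)(mem_ops + pf_insns)
                        + (double)misses * MISS_COST;
        double baseline = (double)mem_ops
                        + (double)misses_base * MISS_COST;
        return enhanced / baseline;   /* < 1.0 means prefetching helped */
    }

The pf_insns term is what produces the overhead effect discussed below: inserting more prefetch instructions removes more misses but also adds cycles of its own.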

8.4.1 Choosing Hint File Cache Type

In the previous sections a simple emulation of the hardware SPT was used to generate the hint file: the useful prefetches brought in by the SPT's dynamic striding mechanism are sorted, and the first 200 are inserted as simulated prefetch instructions in the code. In this section we investigate a new method for sorting the useful prefetches. The

hint table in this section is generated by sorting only the useful prefetches that would have been performed by a best guess static prefetch. Data that was prefetched by the SPT's dynamic striding mechanism is not included. There are two possible cache configurations for generating the prefetch hint file: the series and the parallel stream cache. The no stream cache case is not investigated; in that case, a large number of non-useful prefetches bring excess data into the main cache, the main cache is unnecessarily polluted, and atypical cache behavior may result.

In the previous section we learned that usually only a small number of prefetch instructions is necessary to capture the majority of useful prefetches. Table 8.1 shows the number of prefetch instructions that must be inserted to capture various percentages of useful prefetches for the two types of stream caches for the movie easter in a 32KB direct mapped main cache. The series stream cache always requires fewer instructions to capture the same percentage of useful prefetches. In both cases, 95% of useful prefetches are captured with less than 20% of the total prefetch instructions.

  % of useful prefetches   50%   60%   70%   80%   90%   95%   100%
  series cache              12    15    20    28    50   106    777
  parallel cache            22    32    44    65   121   157    872

Table 8.1: Number of prefetch instructions that capture the indicated percentage of useful prefetches for movie easter in a 32KB direct mapped main cache.

Figure 8.7 shows performance with different fractions of prefetch instructions inserted for execution of the movie hula in a 128KB direct mapped main cache. Performance of the two stream cache configurations is compared. Relative execution time and fraction of misses eliminated are both reported relative to a cache of the same configuration with no prefetching. A range of graphs of this type is presented in appendix D.1.

Fraction of misses eliminated continues to improve as more and more instructions are inserted, yet relative execution time reaches a minimum and then increases again. This illustrates the tradeoff between increased overhead and increased performance when adding prefetch instructions. At some point the benefit gained from adding


prefetch instructions is offset by the overhead of executing the additional instructions. This effect is not visible in figure 8.7a; for this reason, fraction of misses eliminated is not reported for the remainder of the chapter. For the 128KB cache shown, the series stream cache is superior, and adding instructions to capture 90% of available prefetches results in the best performance.

In figure 8.8, the comparison between series and parallel stream caches is made across a range of cache sizes with the fraction of prefetch instructions inserted fixed at 90%. Relative execution time is shown for both a direct mapped and a 2 way associative cache. In all cases the series stream cache results in superior performance.

This is because the series stream cache orders prefetches by the binary attribute of whether or not the data was used, while the parallel stream cache orders prefetches both by whether the data was used and by how many times it was used once prefetched. With the series stream cache, data is copied to the main cache once it has been accessed; every time a piece of data is copied to the main cache, it is counted as one useful prefetch. With the parallel stream cache, the data is not copied to the main cache but can be reaccessed repeatedly from the parallel stream cache, and every such access is counted as a useful prefetch. The number of times the data is used is not as relevant a metric for prioritizing prefetches, since after a single initial miss the data is already in the main cache or parallel stream cache. An example constructed to illustrate this point is shown in figure 8.9. Six instructions are shown, each causing the number of useful prefetches indicated for the series and parallel stream caches. If 3 instructions are to be inserted, the series stream cache ranks instructions 103, 100, and 101 as the best; these instructions cause 15 + 8 + 7 = 30 misses to be eliminated. If, however, the parallel stream cache simulation is used, instructions 103, 104, and 102 will be chosen, eliminating only 15 + 5 + 5 = 25 misses. The series stream cache ordering is superior.

[Figure 8.7 plots Fraction of Misses Eliminated and Relative Execution Time against Percent of Total Prefetches Inserted, comparing series and parallel stream cache sorts.]

Figure 8.7: Performance comparison for series vs. parallel prefetch insertion, executed with a 128KB direct mapped main cache and movie hula. (a) shows fraction of misses eliminated and (b) shows relative execution times.

[Figure 8.8 plots Relative Execution Time against Cache Size (KBytes) in two panels, comparing hints from a parallel stream cache and hints from a series stream cache.]

Figure 8.8: Relative execution time with 90% of available prefetches inserted, comparing prediction files generated from series and parallel stream caches. The movie displayed is hula. (a) is a direct mapped cache, and (b) is a 2 way associative cache.


  opid   useful series prefetches   useful parallel prefetches
   100               8                         100
   101               7                         199
   102               5                         200
   103              15                         300
   104               5                         212
   105               3                          50

Figure 8.9: Hypothetical example illustrating the difference between series and parallel stream cache data generation.
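The selection step implied by figure 8.9 and table 8.1 can be sketched as follows; the structures and names are illustrative.

    #include <stdlib.h>

    typedef struct { unsigned opid; long useful; } prof_t;

    static int by_useful_desc(const void *a, const void *b) {
        const prof_t *x = a, *y = b;
        return (x->useful < y->useful) - (x->useful > y->useful);
    }

    /* Returns how many instructions must receive a prefetch to capture
     * `percent` of all useful prefetches (one column of table 8.1). */
    int select_prefetch_insns(prof_t *p, int n, int percent) {
        qsort(p, (size_t)n, sizeof *p, by_useful_desc);
        long total = 0, covered = 0;
        for (int i = 0; i < n; i++) total += p[i].useful;
        for (int i = 0; i < n; i++) {
            covered += p[i].useful;
            if (covered * 100 >= total * percent)
                return i + 1;      /* insert prefetches for p[0..i] */
        }
        return n;
    }

Applied to the series column of figure 8.9, the top three instructions cover 30 of the 43 useful prefetches, roughly 70%, which is why the series ordering eliminates more misses for the same insertion budget.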

8.4.2 Choosing Fraction of Prefetch Instructions to Add

In this section we determine the best number of prefetch instructions to insert. Figure 8.10a shows the relative execution time for different fractions of prefetch instructions inserted across a range of direct mapped cache sizes for the movie hula; the same data is presented in tabular form in table 8.2. Clearly, inserting 100% of available prefetch instructions results in the worst performance. The other choices perform similarly, with 90% better than the others for cache sizes from 16KB to 512KB. Figure 8.10b fixes the cache size at 128KB and shows similar data for the movies easter and bicycle as well as hula. Since the hint file was generated using the movie easter, the similar performance of the three movies shows that the hint file is not movie specific when reporting execution time. The curve minimum at 90% confirms that this is a correct fraction of prefetch instructions to insert for this cache size. A full set of graphs in the format of figure 8.10b is presented in appendix D.2. For a given cache size, the minimum point of the relative execution time curve shows the optimal relative execution time achievable, and indicates the optimal percentage of prefetch instructions to insert. Inserting instructions to capture 90% of available prefetches is the optimal choice for most cache sizes.

[Figure 8.10 plots Relative Execution Time (a) against Cache Size (KBytes) for 50%, 70%, 90%, and 100% of prefetches inserted, and (b) against Percent of Total Prefetches Inserted for hula, easter, and bicycle.]

Figure 8.10: Relative execution time for different numbers of prefetch instructions inserted in a direct mapped cache. (a) shows execution of the movie hula, and (b) shows three movies for a cache size of 128KB.

                    fraction of useful prefetches captured
              50%     60%     70%     80%     90%     95%    100%
  2KB       0.987   0.970   0.975   0.923   0.888   1.055   1.103
  4KB       0.867   0.763   0.826   0.855   0.835   0.910   1.093
  8KB       0.944   0.902   0.957   1.016   0.955   1.034   1.256
  16KB      0.895   0.888   0.936   0.825   0.814   0.923   1.118
  32KB      0.850   0.856   0.813   0.787   0.779   0.785   1.099
  64KB      0.845   0.849   0.808   0.764   0.748   0.762   1.077
  128KB     0.869   0.871   0.842   0.790   0.773   0.789   1.010
  256KB     0.878   0.879   0.871   0.810   0.792   0.810   1.055
  512KB     0.979   0.980   0.983   0.946   0.935   0.961   1.298
  1024KB    0.916   0.916   0.921   0.919   0.924   0.957   1.338

Table 8.2: Relative execution time for inserting different numbers of prefetch instructions for movie hula in a direct mapped cache.


8.4.3 Confirmation of Hint File Parameter Selection

In figure 8.11, we confirm our parameter selection by showing relative execution time across a range of cache sizes for 90% of prefetches inserted, generated with a series stream cache. For the parameters chosen for this model, we can achieve up to a 20% improvement in execution time for a range of caches between 16KB and 256KB by adding only software prefetch instructions. For the smaller direct mapped cache sizes, prefetching is not effective at reducing execution time. For the larger cache sizes, the overhead of adding prefetch instructions offsets the performance gains of prefetching. Furthermore, performance benefit is enhanced when using a 2 way associative cache. In the next section we investigate how adjusting our memory access model affects performance.

Figure 8.11: Relative execution time for 90% of prefetches inserted. (a) is executed with a direct mapped cache, and (b) with a 2 way associative cache.

8.4.4 Effect of Modifying Execution Model Assumptions

Finally, in this section we investigate how altering parameters in our execution model affects relative execution time. First we investigate the effect of modifying the cache miss penalty and modifying the instruction mix parameters. Next, we investigate the effect of modifying the assumptions concerning the number of outstanding prefetches that can occur in parallel.

Effect of Execution Time Parameters

Through this point we have assumed that accesses to memory for both demand and prefetch misses cost 50 cycles, and that instructions consist only of loads and stores. In this section we consider the impact of altering the memory access cost to 100 cycles. We also consider an instruction mix in which loads and stores comprise only 30% of instructions. This is the measured instruction mix for mpeg play as determined in [FZF96]. We assume that all instructions except cache misses execute in one cycle. We include data for three movies, and different combinations of access time and instruction mix, for a direct mapped cache. A complete set of data for memory access costs of 10, 25, 50, and 100 cycles and instruction mixes of 30%, 50%, and 100% loads and stores is presented in Appendix D.3.

In general, the limit on how much benefit is added by prefetching increases as the memory access time is increased. As the memory access time is increased, the fraction of execution time spent in the memory system is also increased. The more time that is spent in the memory system, the more potential there is for prefetching to improve performance. This is indicated in figure 8.12 in the way the curves tend to dip lower and lower for the higher memory access costs. At a memory access cost of 100 cycles, a 60% improvement in performance is possible.

Changing the instruction mix from 100% to 30% loads and stores similarly affects the fraction of execution time that is spent in the memory system. With less time spent in the memory system, there is less potential for improvement: the downward dip on the graphs becomes smaller as the proportion of non-memory instructions increases. This also decreases the relative overhead of the added prefetch instructions. Since there are additional instructions executed in the 30% mix, the added overhead of the prefetch instructions is less noticeable.

These graphs show everything from prefetching having almost no effect on execution time in figure 8.12b to a 60% improvement in execution time in figure 8.12c. In all cases, performance can be degraded for cache sizes less than 16KB or greater than 512KB. In the next section we investigate the effect of a more complicated memory access model.

label   simultaneous outstanding prefetches   demand references given priority   pf queue size
A       1                                     no                                 32
B       1                                     yes                                infinite
C       3                                     yes                                infinite
D       25                                    yes                                infinite

Table 8.3: Summary of memory access models simulated.
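The simple execution time model used in this subsection can be written out as a short sketch; the function and parameter names are illustrative, and the assumptions are the stated ones (one cycle per instruction, a fixed per-miss penalty, and one cycle per added prefetch instruction):

    /* Relative execution time: < 1.0 means prefetching helps. */
    double relative_exec_time(double n_inst, double ls_mix,
                              double base_miss_rate, double pf_miss_rate,
                              double n_pf_inst, double penalty) {
        double refs   = n_inst * ls_mix;              /* loads + stores */
        double t_base = n_inst + refs * base_miss_rate * penalty;
        double t_pf   = n_inst + n_pf_inst            /* one cycle each */
                      + refs * pf_miss_rate * penalty;
        return t_pf / t_base;
    }

Raising `penalty` increases the memory-system share of the base time, which is why the curves dip lower at the higher access costs; lowering `ls_mix` has the opposite effect.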

More Complete Memory Access Models

In this section, we present data taking into account different memory access models. Previously, we simply assumed both full and partial cache hits completed in one cycle. That is, as long as a memory request had been issued it was counted as a hit, regardless of how many cycles remained for the data to return from memory. To calculate total execution time, we simply multiplied the total number of misses by a constant miss penalty. In this section, we now fully account for the case of partial hits. If the miss penalty is 25 cycles, and a memory access occurs for that data 10 cycles after the prefetch miss, the balance of 15 cycles is charged to the execution time.

Furthermore, we investigate different possibilities for the number of simultaneous outstanding prefetches. Previously, we assumed that the memory system was fully interleaved, such that there was no limit to how many outstanding prefetches are allowed at one time. We now limit accesses such that only a finite number of accesses are allowed at once. In the limit, where there are as many prefetches allowed as memory access cycles, the result should be the same as previously, where a constant memory access penalty was assumed. We also look at different techniques for prioritizing memory accesses and prefetches.
Figure 8.12: Relative execution time for a direct mapped cache with 90% of prefetches inserted. (a) is a memory access cost of 25 cycles and an instruction mix of 100% loads and stores, (b) is a memory access cost of 25 cycles and 30% loads and stores, (c) is a memory access cost of 100 cycles and an instruction mix of 100% loads and stores, and (d) is a memory access cost of 100 cycles and 30% loads and stores.

Figure 8.13: Relative execution time for different memory models. (a) is for configuration A and a direct mapped cache, (b) is for configuration A and a 2-way associative cache, (c) is for configuration B and a direct mapped cache, and (d) is for configuration B and a 2-way associative cache.

Figure 8.14: Relative execution time for different memory models. (a) is for configuration C and a direct mapped cache, (b) is for configuration C and a 2-way associative cache, (c) is for configuration D and a direct mapped cache, and (d) is for configuration D and a 2-way associative cache.


The configurations simulated are summarized in table 8.3.

In configuration A, we assume demand loads and stores have no priority over prefetches. In other words, the memory system services all requests in the order received. This has an advantage in implementation simplicity: since all memory accesses are equivalent, no special considerations must be made for prefetches. It has a significant disadvantage in performance in that prefetches will stall the processor even if the data requested is not necessary. We further assume that a maximum of 32 pending prefetches can be queued; when the queue becomes full, additional prefetches are discarded. If we allowed an infinite number of prefetches to queue up, then the performance with prefetching would always be greatly degraded from the base case, since large numbers of waiting prefetches hopelessly stall the servicing of demand misses. Only one outstanding memory access can occur at a time. Performance for configuration A is shown in figures 8.13a and 8.13b.

In configuration B, we continue to assume that only one memory access can occur at a time, but we now assume loads and stores have priority over prefetches. Now even if there is a queue of prefetches waiting to be executed, a load or store will jump to the front of the queue to be executed. If there is an outstanding memory operation, the balance of the memory penalty must still be incurred as the prefetch returns, but the load or store will be executed immediately afterwards. Because demand loads and stores have priority over prefetches, there is no need to limit the pending queue to 32 entries. Performance data for this case is shown in figures 8.13c and 8.13d.

Configuration C makes the same assumptions but now assumes the memory is three way interleaved so that three memory accesses can be outstanding at once. Finally, configuration D allows for a maximum interleaving so that all memory accesses return in a constant time of 50 cycles. Performance for these configurations is shown in figure 8.14.

These figures indicate that as long as demand loads and stores are given priority over prefetches, the number of simultaneous outstanding prefetches does not significantly affect performance. Furthermore, the figures confirm that the performance benefit from prefetching is enhanced when using 2 way associativity.
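The partial-hit accounting and demand-priority rules above can be made concrete with a small sketch; the types and names are illustrative, not the simulator source:

    #define PF_QUEUE_MAX 32      /* configuration A only; B-D are unbounded */

    typedef struct {
        long ready_cycle;        /* cycle when the prefetched line arrives */
    } InFlight;

    /* A demand reference to a line whose prefetch is still in flight
       (a partial hit) is charged only the balance of the penalty. */
    long partial_hit_cost(long now, const InFlight *pf) {
        long remaining = pf->ready_cycle - now;
        return remaining > 0 ? remaining : 0;
    }

Under configurations B through D, a demand load or store is serviced ahead of any queued prefetches, so the only demand-visible cost of prefetching is this balance plus the one-cycle prefetch instructions themselves.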


8.5 Summary

In this chapter we present a design space exploration for software directed stride based prefetching. We first study software directed prefetch instructions to replace the hardware SPT. Prefetching data into both a specialized prefetch cache and directly into the general purpose cache is investigated. We study the effect of varying different parameters, including the effects of cache type, size, and associativity, both when generating the hint file and when executing the benchmark.

We also present a methodology to use software directed prefetching as a post processing tool to insert software prefetches into existing code. This is done with a profile step followed by a prefetch insertion step. We study the effects of different cache configurations used to generate the prefetch hint file, different numbers of prefetch instructions inserted, and different execution time parameters. We conclude that software prefetching reliably improves performance for cache sizes from 16KB to 512KB. Furthermore, a 2 way associative cache is required to achieve maximum prefetch benefits. We find that using a series stream cache is best for generating prefetch hint data, and that inserting prefetch instructions to capture 90% of total prefetches results in the best performance.

Throughout this study we have focused only on MPEG applications. Application of these prefetching techniques to other applications is a promising area for future research.

Chapter 9

Conclusions

Figure 9.1: Summary of speedup techniques.

In this thesis we have presented a number of techniques for improving mpeg decompression applications on a desktop workstation. Figure 9.1 summarizes the effect of these techniques for a typical case. The base bar shows the execution time on a DEC 3000/400 Alpha workstation for mpeg play. The I/O bar shows the execution time after the I/O prefetching technique has been applied. Additionally, there is some time savings from a small software change to the benchmark: by releasing some of the synchronization restraints in writing the frames, some additional time is saved. The next bar shows the savings in using software only DE arithmetic. Additional time could be saved by using 4 way hardware assisted DE arithmetic. Finally, the last bar shows the effect of adding a 128 element stride prediction table and a 128 element parallel stream cache.


The misses consist of approximately 50% instruction misses and 50% data misses. Since no enhancements were applied to the instruction cache, approximately half of the misses remain. Mpeg play performance was improved by approximately 1.5 times. In this work, the following contributions have been presented:

- A system level decomposition and measurement of mpeg application execution time

- An I/O improvement strategy based on I/O prefetching and attribute caching

- A software-only technique to achieve subword floating point data path parallelism

- An implementation of this software only technique to validate speedup in a real system

- Hardware extensions to make this technique more generally applicable and utilize higher parallelism

- A comparison of hardware based prefetching techniques

- Defined the stream cache, which was shown to reduce miss rates by approximately 80%

- Proposed a software directed prefetch mechanism to greatly reduce stream cache hardware costs

- Developed extensive new capabilities for RYO, a PA-RISC assembly instrumentation software tool

Digital multimedia will be a significant component of computing workloads in the future. Processors and algorithms are improving so rapidly that it is difficult to predict what will be possible. Just in the course of this work, digital video has been transformed from an expensive high end technology to a commonplace component of new personal computers.

Appendix A

RYO

RYO expects an assembly level, or .s, file for input. It makes two passes through the program and replaces memory operations (in the case of RYOLS) or floating point operations (in the case of RYOFP) with an unconditional jump to the RYO library calling stub. This stub saves the state of the machine, does any necessary parameter passing, calls the proper RYO library procedure, and finally unconditionally jumps back to the original execution sequence (figure A.1). The overall design simplicity is an attractive feature, yet actual implementation involves overcoming a number of pesky problems. For this purpose, a number of small companion programs (fixtargets, offset.awk, and lbranch.awk) have also been implemented.

A.1 Advantages and Disadvantages

Using this relatively simple substitution strategy has a number of advantages and disadvantages. On the plus side, substituting a single branch instruction for the single memory instruction results in an exact one for one instruction substitution that leaves the relative positions of the instructions unchanged. Of course, code must be inserted to build the library calling stubs, but this code is placed between procedures where it can do little harm. Keeping the relative instruction positions constant not only eliminates the need to translate a branch to a target plus a constant offset to a corrected offset, but is also the only way to ensure correct indirect branching.


original code:

    ...
    .SPACE $TEXT$,SORT=8
    .SUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=0x2c,CODE_ONLY,SORT=24
    DoDitherImage
    .PROC
    .CALLINFO CALLER,FRAME=16,SAVE_RP,ARGS_SAVED
    .ENTRY
            STW %r2,-20(%r30)
            LDO 64(%r30),%r30
            ADDIL LR'ditherType-$global$,%r27
    ...

modified code:

    ...
    .SPACE $TEXT$,SORT=8
    .SUBSPA $CODE$,QUAD=0,ALIGN=8,ACCESS=0x2c,CODE_ONLY,SORT=24
    DoDitherImage
    .PROC
    .CALLINFO CALLER,FRAME=16,SAVE_RP,ARGS_SAVED
    .ENTRY
    $n1     b,n $m1             ; STW %r2,-20(%r30) ; substituted instruction
            LDO 64(%r30),%r30
            ADDIL LR'ditherType-$global$,%r27
    ...

procedure calling stub:

    $m1     stw %r2,156(0,%r30)
            bl $saveall,%r2         ; save machine state
            nop
            copy %r30,%r25          ; copy basereg to %r25
            bl .+12,%r26            ; pass parameters
            ldo -3(%r26),%r26
            STW %r2,-20(%r30)       ; copy op to pass to ryo_word
            bl ryo_word,%r2         ; call instrumentation
            ldo 320(%r30),%r30      ;   routine
            bl $restall,%r2         ; restore machine state
            ldo -320(%r30),%r30
            STW %r2,-20(%r30)       ; do mem op
            b,n $n1+4               ; return to normal program flow

Figure A.1: Simple instruction substitution.


In the case of indirect branching, the target address cannot be known until runtime, so the offset cannot be translated when doing the instrumentation. Another advantage is the use of high level C-language routines for instruction substitution. This makes it relatively simple to implement arbitrarily complex on-the-fly analysis. Finally, because the application code is actually executed along with the extra analysis routines, verification is performed simply by comparing program output before and after instrumentation. For example, when instrumenting an mpeg decoder, many subtle bugs were noticed only because the video image displayed did not look quite the way it did when running the uninstrumented code.

The major disadvantage of this tool is that source code is required to do the instrumentation. This relies on the compiler to correctly generate an intermediate assembly output, which does not always happen. Furthermore, library calls and other pieces of code dynamically linked in to the executable are not instrumented. Finally, in the current implementation, no information is maintained concerning basic blocks and cycles between instrumented instructions. This strategy was chosen to make implementation simpler. However, the high level instrumentation routine does not know any global information about the program state; it only knows that a given instruction was executed. These drawbacks could be corrected in future revisions if necessary.

Another disadvantage is that the library routine implementation can actually affect the execution characteristics of the instrumented executable. This is illustrated by comparing hula in figure 8.11 and the lower half of figure 8.14. Ideally, these curves should look the same, but there is a difference. Because the simulator is actually linked to the executable, changes in the simulator code affect the memory placement of the application. Changes in memory placement lead to memory mapping effects, which are particularly noticeable in the smaller cache sizes. The simulators used to generate these two curves would produce the same data, but because the simulator data structures are significantly enlarged between the two figures, the memory behavior of the underlying application is modified and the simulation results are affected.


A.2 Implementation

In order to cause a one for one instruction substitution, the basic strategy is to replace a targeted instruction with an unconditional jump to a stub, save the machine state, and unconditionally jump back to the instruction following the targeted instruction (see figure A.1). If the return address were not hard coded at the time of instrumentation, it would be necessary to store a return address in a register, and thereby alter the machine state.
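A schematic of the substitution pass might look like the following C sketch; is_targeted_mem_op, the opcode matching, and the emitted stub text are simplifications for illustration, not the actual RYO source:

    #include <stdio.h>
    #include <string.h>

    /* Crude match for an illustrative subset of PA-RISC memory opcodes. */
    static int is_targeted_mem_op(const char *line) {
        return strstr(line, "STW") != NULL || strstr(line, "LDW") != NULL;
    }

    static void emit_stub(FILE *stubs, int n, const char *memop) {
        fprintf(stubs, "$m%d\n", n);
        fprintf(stubs, "\tbl $saveall,%%r2\t; save machine state\n");
        fprintf(stubs, "\t; ... pass parameters, call instrumentation ...\n");
        fprintf(stubs, "\tbl $restall,%%r2\t; restore machine state\n");
        fprintf(stubs, "\t%s", memop);          /* do the original mem op */
        fprintf(stubs, "\tb,n $n%d+4\n", n);    /* hard-coded return point */
    }

    /* Replace each targeted memory instruction with a branch to a
       uniquely numbered stub; stubs are emitted between procedures. */
    void substitute(FILE *in, FILE *out, FILE *stubs) {
        char line[512];
        int n = 0;
        while (fgets(line, sizeof line, in)) {
            if (is_targeted_mem_op(line)) {
                n++;
                fprintf(out, "$n%d\tb,n $m%d\t; %s", n, n, line);
                emit_stub(stubs, n, line);
            } else {
                fputs(line, out);
            }
        }
    }

Because the branch replaces the memory instruction one for one, every other instruction keeps its original address, which is what makes the hard-coded return in the stub safe.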

A.2.1 Branch Delay Slot

A problem arises when a load or store instruction falls in the branch delay slot of a preceding branch instruction. In this case, a branch would be substituted in the branch delay slot. The instruction instrumentation routine is correctly called as anticipated, but the final branch is to the return point statically calculated at instrumentation time. Thus, the control flow returns to the second instruction following the branch rather than the correct branch target (figure A.2). The branch placed in the branch delay slot of the initial branch correctly modifies control flow when a nullifying branch is used.

This problem is corrected as shown in figure A.3. Now the branch instruction with a load or store following it is also targeted for replacement. It now has its own stub with an appropriate instrumentation call. This time, though, the final instruction in the stub is not the load or store instruction as before, but instead the branch with the load or store in the delay slot. The final branch returns to the second instruction past the branch, since the instruction immediately following was already executed in the delay slot. Furthermore, the initial instruction in the stub is the same branch, followed by another branch in the delay slot that branches to the next sequential instruction. This determines if the instruction in the delay slot has been nullified. If it was nullified, then the branch to the next sequential instruction is now nullified instead, and the branch goes to the intended target. If the instruction was not nullified, then the branch in the delay slot is executed, and the control flow continues sequentially. Additionally, the load or store instruction in the main program body is replaced with an unconditional branch to its own stub as before.


original code:

    ...
            ADDIL LR'ditherType-$global$,%r27
            BL %r2,$target$
            STW %r2,-20(%r30)
            LDO 64(%r30),%r30
    ...
    $target$
            ADDIL LR'M$2-$global$,%r27
            LDO RR'M$2-$global$(%r1),%r26
    ...

modified code:

    ...
            ADDIL LR'ditherType-$global$,%r27
            BL %r2,$target$
    $n1     b,n $m1
            LDO 64(%r30),%r30
    ...
    $target$
            ADDIL LR'M$2-$global$,%r27
            LDO RR'M$2-$global$(%r1),%r26
    ...
    $m1     stw %r2,156(0,%r30)
            bl $saveall,%r2
            nop
            copy %r30,%r25          ; copy basereg to %r25
            bl .+12,%r26
            ldo -3(%r26),%r26
            STW %r2,-20(%r30)
            bl ryo_word,%r2
            ldo 320(%r30),%r30
            bl $restall,%r2
            ldo -320(%r30),%r30
            STW %r2,-20(%r30)
            b,n $n1+4

Figure A.2: Mistake in control flow for branch delay slot.


initial code:

    ...
    $L0     B $00000005
            STW %r29,-52(%r30)
    ...

modified code:

    ...
    $L0
    $n5     b,n $m5             ; B $00000005
    $n6     b,n $m6             ; STW %r29,-52(%r30)
    ...

procedure calling stubs:

    $m5     B $00000005
            b,n .+4                 ; test for branch nullification
            stw %r2,156(0,%r30)
            bl $saveall,%r2
            nop
            copy %r30,%r25
            bl .+12,%r26
            ldo -3(%r26),%r26
            STW %r29,-52(%r30)
            bl ryo_word,%r2
            ldo 320(%r30),%r30
            bl $restall,%r2
            ldo -320(%r30),%r30
            B $00000005             ; execute branch and instruction
            STW %r29,-52(%r30)      ;   in branch delay slot
            b,n $n5+8

    $m6     ...                     ; simple substitution routine as in fig 1
            ...
            b,n $n6+4

Figure A.3: Unconditional branch instrumentation.


At this point one might wonder why it is necessary to replace both the branch and the instruction in the delay slot with branches to stubs. Since the instrumentation routine for the load or store in the delay slot is called from the stub for the branch, why is it necessary for the instruction in the delay slot to have its own calling stub? It is possible for the instruction in the delay slot to be a target for some previous jump. In this case it should be executed as a standalone instruction and not in the delay slot of a branch.

A final complication is added in the case of an unconditional branch. Recall that the first instructions in the calling stub are the original branch and a branch in its delay slot (see figure A.3). This was done to determine if the delay slot instruction should be nullified. For unconditional branches, the nullification is dependent upon the direction of the branch, so care must be taken to ensure that the branch direction is the same as it was in the original code. This is done by having the branch go to a forward or reverse stub, which then branches to the original target (see figure A.4). The decision whether to use a forward or reverse stub can be determined at instrumentation time by comparing the address of the branch instruction with the target. In this final form, all branches will be correctly instrumented and executed.
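The direction test itself is a single static comparison; a sketch (the reverse-stub naming is assumed by analogy with the $forwstub names in figure A.4):

    /* Choose the stub that preserves the original branch direction. */
    const char *stub_kind(unsigned long branch_addr, unsigned long target_addr) {
        return target_addr > branch_addr ? "forward" : "reverse";
    }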

A.2.2 Saving and Restoring Machine State and Parameter Passing

After it has been determined that the load or store instruction targeted is not to be nullified, the state of the machine must be stored. This is done simply by copying the register set to the stack. All caller-save general purpose and floating point registers, as well as relevant special registers such as the shift amount register and space id registers, are copied. The stack pointer is then incremented to point to the new top of stack. Performance could be improved by saving only those registers currently in use, but doing the necessary dependency checking would probably greatly complicate the implementation.


initial code:

    ...
            ADDIL LR'ditherType-$global$,%r27
            COMIB,=,N 0,%r8,$0002004A       ; offset 0x2ac
            LDWX,S %r9(%r7),%r26
    ...

modified code:

    ...
            ADDIL LR'ditherType-$global$,%r27
    $n115   b,n $m115           ; COMIB,=,N 0,%r8,$0002004A
    $n116   b,n $m116           ; LDWX,S %r9(%r7),%r26
    ...

procedure calling stub:

    $m115   COMIB,=,N 0,%r8,$forwstub115    ; replace target field
            b,n .+4
            stw %r2,156(0,%r30)
            bl $saveall,%r2
            nop
            copy %r9,%r24
            copy %r7,%r25
            bl .+12,%r26
            ldo -3(%r26),%r26
            LDWX,S %r9(%r7),%r26
            bl ryo_wx,%r2
            ldo 320(%r30),%r30
            bl $restall,%r2
            ldo -320(%r30),%r30
            COMIB,=,N 0,%r8,$forwstub115    ; do branch and op in delay slot
            LDWX,S %r9(%r7),%r26
            b,n $n115+8
    $forwstub115
            b,n $0002004A

Figure A.4: Conditional branch instrumentation.

Next, the appropriate parameters must be copied to the correct registers for passing to the instrumentation routine. First, a BL instruction is executed such that the return address points to the targeted load/store instruction. This pointer is passed to the instrumentation routine so that the instruction in machine format is available for analysis. In most cases, the base address and offset are also passed as parameters, but this can vary for different types of memory operations.
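A hypothetical C-level shape for such an instrumentation routine is sketched below; the real parameter conventions vary by memory operation type, as noted above:

    #include <stdio.h>

    static unsigned long n_refs;

    /* `insn_ptr` points at the original instruction word, reached
       through the BL return-address trick; `base` and `disp` locate
       the memory access. */
    void ryo_word(const unsigned *insn_ptr, const char *base, long disp) {
        unsigned insn = *insn_ptr;        /* instruction in machine format */
        const char *ea = base + disp;     /* effective address */
        n_refs++;                         /* e.g. feed a cache simulator */
        if ((n_refs & 0xfffffUL) == 0)
            fprintf(stderr, "insn %08x ea %p\n", insn, (const void *)ea);
    }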

After the instrumentation routine is called, the machine state is restored in the opposite way from which it was saved, and the control flow is returned to the next proper instruction. In this way, when the program is executed, all instructions will be executed as normal, but instrumentation routines will also be called for all targeted instructions.


A.3 Overcoming Compiler Errors

The above implementation assumes that the .s assembly file is correct. Unfortunately, the HP version 9.75 C compiler does not always produce a correct .s file. The author has found two problems for which solutions are given below; however, it is likely that other compiler bugs exist. The first problem has been fixed in version 10.0 and higher compilers [Lee97]. If the instrumented code does not behave as expected, the first thing to check is whether it runs correctly without instrumentation, but compiled through an intermediate assembly language step. For the above example, one would see if the executable created with:

    cc -S -O test.c
    as -o test.o test.s
    cc test.o -o test_s

behaves the same as one compiled in the normal way. The GNU C compiler actually uses an intermediate .s file in program compilation, since it generates object code with the native assembler. For this reason, one would expect generation of a correct .s file when compiling with GCC; however, use of the non-native compiler may generate less efficient code.

A.3.1 Branch Targets

Occasionally, the compiler will lose track of branch targets and substitute "???" for the branch target field. If left uncorrected, this causes an error in the assembler. The tool fixtargets fixes this problem. Fixtargets requires the existence of a correct .o file and the flawed .s file. These can be created with:

    cc -S -c test.c

Next, type:

    fixtargets test


and the correct .o code will be disassembled to provide the correct branch targets for substitution into the .s file. Branches to a symbolic target plus an offset are handled such that the offset is always in decimal.

A.3.2 Indirect Procedure Calls

If a procedure name is used as a parameter passed to another procedure, incomplete assembly code is generated. If, for example, a procedure named key is the parameter to another procedure call, the compiler will generate:

    .WORD key

towards the end of the .s file. This must be altered so that it instead reads:

    .WORD P'key

A.4 Example Data

This section presents experimental data actually obtained with RYOFP, the version of RYO that instruments floating point instructions and was originally written at HP Labs. The example shows arithmetic data generated from a JPEG [Wal91] decoder. To generate this data, RYOFP was run on djpeg, the Independent JPEG Group's JPEG software decoder modified to use floating point for the DCT calculations. It was also modified to use the DE arithmetic enhancement for increased performance [ZL94]. Floating point operations were selected for instrumentation. The instrumentation library dynamically calculated the length of operands in bits for each floating point arithmetic operation encountered when executing the instrumented code on the image Lena, and wrote the longer of the two operands to an output file. Finally, a graph was constructed to show a histogram of operand lengths. The most interesting thing to note from the data is the large number of zero length operands. A compressed image tends to have many zero elements in each 8x8 macro block, and this is clearly indicated in the experimental data obtained.
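One plausible definition of the measured operand length, consistent with the zero length operands noted above, is the number of significand bits up to the last nonzero bit; the sketch below is a reconstruction of the idea, not the original library code:

    #include <stdint.h>
    #include <string.h>

    /* Length in bits of the significand of an IEEE double, counted
       from the implicit leading 1 to the last nonzero fraction bit;
       zero operands report length 0 (denormals are ignored here). */
    int operand_length_bits(double x) {
        uint64_t u;
        memcpy(&u, &x, sizeof u);
        uint64_t frac = u & ((1ULL << 52) - 1);   /* 52-bit fraction */
        if (x == 0.0)
            return 0;
        int len = 53;                             /* 1 + 52 fraction bits */
        while (len > 1 && (frac & 1) == 0) {
            frac >>= 1;
            len--;
        }
        return len;
    }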



Figure A.5: Operand length histogram.

There is also a noticeable hump at approximately 27 bits. This is an effect of the DE calculation methodology. In this case, operands are 27 bits long when one of the two operands packed into a single double precision data word equals zero.


Appendix B

MPEG1 and MPEG2 Miss Rate Data

B.1 mpeg play-hula

All entries are miss rates for application/movie mpeg play-hula. Line size and cache size are in bytes; the four right-hand columns give the miss rate at associativities 1, 2, 4, and 8.

Line size 16:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.173727     0.123089     0.0957745    0.0812528
2048         0.10327      0.0672075    0.0583218    0.053411
4096         0.0598037    0.0412208    0.03731      0.0364255
8192         0.0409572    0.0272722    0.0261475    0.0256683
16384        0.0284977    0.0221618    0.0201778    0.019812
32768        0.0230678    0.0190093    0.0184723    0.0183742
65536        0.0207588    0.0183038    0.0180217    0.0179907
131072       0.0162466    0.017166     0.0177115    0.0177952
262144       0.0125277    0.0108428    0.0108948    0.0106243
524288       0.00792647   0.00594057   0.00630757   0.00751395
1048576      0.00792593   0.00157595   0.00088785   0.000887917

Line size 32:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.222435     0.199197     0.17398      0.147824
2048         0.13031      0.078905     0.0661928    0.0509563
4096         0.0730507    0.0370752    0.0286085    0.0272042
8192         0.0366148    0.0190033    0.0157857    0.0151324
16384        0.022034     0.0128458    0.0109474    0.0106511
32768        0.0154609    0.0100974    0.00962988   0.00955588
65536        0.0125588    0.00948208   0.00923127   0.00920293
131072       0.00917025   0.00883663   0.00903965   0.0090659
262144       0.0067375    0.00560902   0.00565972   0.00560805
524288       0.00416393   0.00307735   0.00322865   0.0038255
1048576      0.00416308   0.00086545   0.000444467  0.0004445

Line size 64:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.33465      0.293885     0.267528     0.273152
2048         0.190422     0.16267      0.159533     0.140879
4096         0.108702     0.0553127    0.0405808    0.0372855
8192         0.0455612    0.0216375    0.0134686    0.0131259
16384        0.025339     0.00950583   0.00656767   0.00604455
32768        0.0151346    0.00604578   0.00523577   0.00517535
65536        0.0106414    0.00514088   0.00478702   0.00475285
131072       0.00656647   0.00462863   0.00463943   0.00463498
262144       0.00404335   0.00293547   0.00295845   0.0029798
524288       0.00230207   0.00159677   0.0016563    0.0019429
1048576      0.00230135   0.000468183  0.000222667  0.000222717

Line size 128:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.426183     0.367688     0.338053     0.319898
2048         0.328637     0.271227     0.252057     0.256913
4096         0.193955     0.144649     0.142501     0.14058
8192         0.0775492    0.0460755    0.0280903    0.0225227
16384        0.045588     0.0117395    0.0076803    0.0053182
32768        0.020463     0.00547812   0.00320302   0.00299907
65536        0.0127719    0.00321852   0.00253928   0.00248692
131072       0.00645965   0.0025462    0.00239278   0.00237035
262144       0.00310587   0.00156337   0.00154647   0.00158423
524288       0.00143582   0.000825717  0.00084805   0.000982933
1048576      0.00143513   0.000248133  0.000111683  0.000111717


B.2 mpeg play-easter

Miss rates for application/movie mpeg play-easter, in the same format as section B.1 (line and cache sizes in bytes; columns are associativities 1, 2, 4, and 8).

Line size 16:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.16497      0.118648     0.0963868    0.0865605
2048         0.0953762    0.0656227    0.0552283    0.0496558
4096         0.0497065    0.0357527    0.031474     0.0304098
8192         0.0346463    0.021328     0.0193057    0.0181097
16384        0.0245062    0.0175313    0.0157837    0.0156819
32768        0.0188722    0.0153944    0.0147521    0.0146136
65536        0.0152251    0.0144144    0.0143832    0.0143871
131072       0.0112757    0.0106018    0.0110248    0.011499
262144       0.00650997   0.00538833   0.00431623   0.00441248
524288       0.00531197   0.00131958   0.00102027   0.000937383
1048576      0.00531197   0.00131357   0.000930317  0.000930317

Line size 32:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.201212     0.165568     0.140379     0.123536
2048         0.116393     0.0763218    0.0611285    0.0535513
4096         0.0600482    0.0344328    0.0276647    0.0262318
8192         0.0332618    0.0164208    0.0130852    0.0114369
16384        0.0210482    0.0109575    0.00881257   0.00859655
32768        0.0133214    0.00846745   0.00780023   0.00767323
65536        0.00957458   0.00766898   0.00745207   0.00742252
131072       0.0066824    0.00564533   0.00581685   0.00604782
262144       0.0037632    0.00286083   0.00232857   0.00240052
524288       0.00286582   0.000711183  0.000524533  0.000470817
1048576      0.00286582   0.00070645   0.000465867  0.000465867

Line size 64:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.291505     0.24248      0.212377     0.19021
2048         0.169652     0.1323       0.114725     0.10355
4096         0.087009     0.0490873    0.0351385    0.0317628
8192         0.0418542    0.0194598    0.0144684    0.0139946
16384        0.0251108    0.00892775   0.006368     0.00533752
32768        0.0137695    0.00536372   0.00432873   0.00418003
65536        0.00872863   0.00436363   0.00391525   0.00386038
131072       0.00475045   0.0030871    0.00308973   0.00318955
262144       0.00254078   0.00154065   0.0012672    0.00131685
524288       0.00165993   0.000386733  0.000271467  0.000237367
1048576      0.00165993   0.000383167  0.000233467  0.000233467

Line size 128:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.397128     0.333203     0.293957     0.264602
2048         0.277078     0.225822     0.18157      0.178898
4096         0.152244     0.109121     0.0953535    0.0942937
8192         0.0813195    0.0369962    0.0239925    0.0192903
16384        0.0570378    0.0135737    0.00830055   0.00805632
32768        0.0273543    0.00665888   0.00300117   0.00244428
65536        0.011262     0.0037695    0.00214672   0.00203092
131072       0.0048689    0.00191477   0.00166272   0.00168248
262144       0.00235388   0.000854333  0.000695883  0.00072455
524288       0.001155     0.000210767  0.000141017  0.000119583
1048576      0.001155     0.00020775   0.000117033  0.000117033

B.3 mpeg2play-tennis

Miss rates for application/movie mpeg2play-tennis, in the same format as section B.1 (line and cache sizes in bytes; columns are associativities 1, 2, 4, and 8).

Line size 16:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.186968     0.132596     0.115162     0.106516
2048         0.123094     0.0951899    0.0785815    0.0739184
4096         0.0809824    0.0495559    0.0472664    0.0425461
8192         0.0447799    0.0305126    0.0247439    0.0237545
16384        0.0284645    0.0217187    0.0188995    0.0183738
32768        0.0218229    0.0179976    0.0173013    0.0172221
65536        0.0191369    0.0168032    0.0166025    0.0165768
131072       0.0177898    0.0164802    0.01638      0.0163643
262144       0.0170476    0.0163525    0.0163106    0.0163053
524288       0.0162386    0.0161048    0.0161866    0.0162634
1048576      0.0138219    0.0142154    0.0145329    0.014761

Line size 32:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.262466     0.150253     0.114163     0.0979899
2048         0.155245     0.0961399    0.0740676    0.0631944
4096         0.102585     0.045923     0.0442231    0.038067
8192         0.0565661    0.0235759    0.0181404    0.0168804
16384        0.0207827    0.0141375    0.0109206    0.0100877
32768        0.0142995    0.00976334   0.00916897   0.00907125
65536        0.0113745    0.00876725   0.00859606   0.00858411
131072       0.00992687   0.00850988   0.00843345   0.00842271
262144       0.00916686   0.00842631   0.0083945    0.00839148
524288       0.00855839   0.00828885   0.00832617   0.0083666
1048576      0.0072305    0.00731045   0.00746947   0.00756831

Line size 64:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.330335     0.190018     0.144659     0.123351
2048         0.189663     0.117953     0.08501      0.0623301
4096         0.124261     0.0534165    0.0520572    0.0386796
8192         0.0683933    0.022922     0.0181627    0.0168581
16384        0.0200005    0.0113234    0.00807598   0.00687686
32768        0.0121448    0.00573247   0.00507519   0.00488039
65536        0.00825181   0.00470246   0.00448456   0.00447931
131072       0.00631117   0.0044261    0.00434653   0.00433679
262144       0.00534344   0.0043538    0.00432084   0.00431925
524288       0.00473175   0.00426694   0.00428044   0.00430265
1048576      0.00386204   0.00372696   0.00380682   0.0038438

Line size 128:

Cache Size   Assoc 1      Assoc 2      Assoc 4      Assoc 8
1024         0.407086     0.258531     0.235089     0.237746
2048         0.237534     0.16048      0.121662     0.0981334
4096         0.15809      0.0709259    0.0689798    0.0456893
8192         0.0897267    0.0282121    0.0230844    0.0215451
16384        0.0293946    0.0137853    0.0094672    0.00897664
32768        0.0144705    0.00470252   0.00357486   0.00307048
65536        0.00837652   0.00280709   0.0024092    0.00240274
131072       0.00531847   0.00238745   0.00226435   0.00225344
262144       0.00382762   0.00228657   0.00224208   0.00224113
524288       0.00300868   0.00221826   0.00221754   0.0022293
1048576      0.00223146   0.00189864   0.00193272   0.00194487


Appendix C

Hardware Design Truth Tables


[Table C.1: for each DE mode and exponent difference (Ea - Eb, or Ea1 - Eb1 and Ea2 - Eb2 in DE mode, ranging up to 55), the shifted mantissa outputs O54-O0 and the guard and round bits G1, R1, G2, R2 are selected from the input mantissa bits i52-i0.]

Table C.1: Truth table for 4 way DE mantissa shifter.

[Table C.2: for each DE mode and exponent difference, the sticky bit S (or S1 and S2 in DE mode) is formed as the OR of the input mantissa bits shifted out below the round bit.]

Table C.2: Truth table for 4 way DE sticky bit logic (continued in table C.3).

Table C.3: Truth table for 4 way DE sticky bit logic (continued from table C.2). [Table body not recoverable from the extracted text.]
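The sticky bit of a floating point adder records whether any nonzero bits were discarded during alignment: it is the logical OR of the low-order mantissa bits shifted out. The minimal C sketch below shows this conventional scalar computation for orientation only; the helper name is hypothetical, and tables C.2 and C.3 tabulate the corresponding OR chains for the packed 4-way DE format.

#include <stdint.h>
#include <stdio.h>

/* Conventional scalar sticky bit: OR together the 'shift' low-order
 * mantissa bits that alignment discards. */
static int sticky_bit(uint64_t mantissa, unsigned shift)
{
    if (shift == 0)
        return 0;
    if (shift >= 64)
        return mantissa != 0;
    return (mantissa & ((UINT64_C(1) << shift) - 1)) != 0;
}

int main(void)
{
    uint64_t m = UINT64_C(0x1ABCDEF012340);   /* low four bits are zero */
    printf("sticky after shift 4:  %d\n", sticky_bit(m, 4));   /* prints 0 */
    printf("sticky after shift 12: %d\n", sticky_bit(m, 12));  /* prints 1 */
    return 0;
}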

Appendix D

Software Prefetch Miss Rate Figures

D.1 Series vs. Parallel Stream Cache Hint Generation

In this appendix, detailed data is presented for the software prefetching technique described in chapter 8. Data subsets are first presented in graphical format, and then the same data is presented exhaustively in tabular format.

In section D.1, the effect of using a parallel vs. a series stream cache to generate the hint file is examined. The hint file is generated from execution of the movie easter in a 32KB direct mapped main cache, and data is presented for display of the movie hula. Miss rates are presented as the fraction of misses eliminated in the graphs and as absolute miss rates in the tables. Relative execution time is calculated relative to an equivalent cache configuration with no prefetching, assuming a memory access penalty of 50 cycles; a relative execution time of less than one indicates a performance improvement for prefetching.

In section D.2, performance is investigated for the display of three movies: hula, easter, and bicycle. The hint file is generated by executing the movie easter in a 32KB direct mapped main cache with a series stream cache. Metrics are reported using the same assumptions as in section D.1.

In section D.3, relative execution time is investigated while varying the instruction mix and memory access parameters. The hint file is again generated by executing the movie easter in a 32KB direct mapped main cache with a series stream cache, and 90% of useful prefetches are inserted. Instruction mixes in which 100%, 50%, and 30% of instructions are loads and stores are reported, and cache miss penalties of 10, 25, 50, and 100 cycles are examined. As before, relative execution time is relative to a comparable configuration with no prefetching, and a value of less than one indicates a performance improvement for prefetching.
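To make the two reporting metrics concrete, the C sketch below computes both the fraction of misses eliminated and the relative execution time under a deliberately simple cycle model: one cycle per instruction, a fixed penalty per cache miss, and one extra instruction per inserted prefetch. The counts in main are illustrative placeholders and the exact accounting is an assumption; the trace-based simulations reported here are the authoritative definition.

#include <stdio.h>

/* Fraction of the baseline misses that prefetching removed. */
static double fraction_misses_eliminated(double misses_no_pf,
                                         double misses_with_pf)
{
    return (misses_no_pf - misses_with_pf) / misses_no_pf;
}

/* Relative execution time: prefetching trades extra prefetch
 * instructions for a lower miss count.  Below 1.0 means prefetching
 * improved performance. */
static double relative_exec_time(double instructions,
                                 double misses_no_pf,
                                 double misses_with_pf,
                                 double prefetches_inserted,
                                 double miss_penalty)
{
    double base_cycles = instructions + misses_no_pf * miss_penalty;
    double pf_cycles   = instructions + prefetches_inserted
                         + misses_with_pf * miss_penalty;
    return pf_cycles / base_cycles;
}

int main(void)
{
    /* Illustrative counts only, with the 50-cycle penalty used here. */
    printf("fraction of misses eliminated: %.2f\n",
           fraction_misses_eliminated(1.2e6, 0.4e6));
    printf("relative execution time:       %.2f\n",
           relative_exec_time(100e6, 1.2e6, 0.4e6, 0.9e6, 50.0));
    return 0;
}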

D.1.1 Graphs

Figure D.1: Performance comparison for series vs. parallel prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a 2KB cache and the bottom is an 8KB cache.

Figure D.2: Performance comparison for series vs. parallel prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache size is 32KB.

Figure D.3: Performance comparison for series vs. parallel prefetch insertion. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a cache size of 128KB and the bottom is 512KB.

Figure D.4: Performance comparison for series vs. parallel prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a cache size of 2KB and the bottom is 8KB.



   



Figure D.5: Performance comparison for series vs. parallel prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache size is 32KB.

Figure D.6: Performance comparison for series vs. parallel prefetch insertion. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a 128KB cache and the bottom is 512KB.

D.1.2 Table

Table: Absolute miss rate and relative execution time for series vs. parallel stream cache hint generation, covering cache sizes from 2KB to 1024KB, associativities of 1, 2, and 4, and 50% to 100% of prefetch instructions inserted. [The per-configuration values could not be recovered in tabular form from the extracted text.]

D.2 Adding Different Numbers of Prefetch Instructions
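The graphs and table in this section vary how many of the candidate prefetch instructions are actually inserted. As a rough picture of that knob, the hedged C sketch below sorts hypothetical candidate prefetch sites by estimated benefit and keeps only the top fraction; the struct, field names, and selection policy are illustrative guesses, since the real insertion decisions come from the stream cache hint file described in chapter 8.

#include <stdio.h>
#include <stdlib.h>

/* A hypothetical candidate prefetch site with an estimate of how many
 * cache misses inserting it would eliminate. */
struct candidate {
    unsigned long pc;        /* address of the load to prefetch for */
    double misses_saved;     /* estimated misses eliminated */
};

/* Sort most-beneficial first. */
static int by_benefit_desc(const void *a, const void *b)
{
    double x = ((const struct candidate *)a)->misses_saved;
    double y = ((const struct candidate *)b)->misses_saved;
    return (y > x) - (y < x);
}

/* Keep the most profitable 'percent' of the candidates. */
static size_t select_prefetches(struct candidate *c, size_t n, double percent)
{
    qsort(c, n, sizeof *c, by_benefit_desc);
    return (size_t)(n * percent / 100.0);
}

int main(void)
{
    struct candidate c[] = {
        { 0x1000, 120.0 }, { 0x1010, 5.0 }, { 0x1020, 60.0 },
        { 0x1030, 250.0 }, { 0x1040, 1.0 },
    };
    size_t keep = select_prefetches(c, 5, 60.0);   /* insert 60% */
    for (size_t i = 0; i < keep; i++)
        printf("insert prefetch before 0x%lx (saves about %.0f misses)\n",
               c[i].pc, c[i].misses_saved);
    return 0;
}

D.2.1 Graphs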

Figure D.7: Performance comparison for different numbers of inserted prefetch instructions. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a 2KB cache and the bottom is an 8KB cache.

Figure D.8: Performance comparison for different numbers of inserted prefetch instructions. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache size is 32KB.

Figure D.9: Performance comparison for different numbers of inserted prefetch instructions. All caches are direct mapped. The left side shows fraction of misses eliminated, the right shows relative execution times. The top is a 128KB cache and the bottom is a 512KB cache.

Figure D.10: Performance comparison for different numbers of inserted prefetch instructions. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache sizes are 2KB on the top and 8KB on the bottom.



Figure D.11: Performance comparison for different numbers of inserted prefetch instructions. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache size is 32KB.

Figure D.12: Performance comparison for different numbers of inserted prefetch instructions. All caches are 2-way associative. The left side shows fraction of misses eliminated, the right shows relative execution times. The cache sizes are 128KB on the top and 512KB on the bottom.

D.2.2 Table

Table: Absolute miss rate and relative execution time for the movies hula, easter, and bicycle, covering cache sizes from 2KB to 1024KB, associativities of 1, 2, and 4, and 50% to 100% of prefetch instructions inserted. [The per-configuration values could not be recovered in tabular form from the extracted text.]

D.3 Varying Instruction Mix and Memory Parameters
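Before the graphs, the sweep itself can be pictured with the same simple per-instruction cycle model used above: only the fraction of instructions that are loads and stores can miss, so the stall term scales with both that fraction and the miss penalty. The miss rates and prefetch overhead in this hedged C sketch are illustrative placeholders, not measured values from the figures and table that follow.

#include <stdio.h>

/* Per-instruction cycle model: one cycle base, plus miss stalls on the
 * fraction of instructions that reference memory, plus a small
 * prefetch-instruction overhead when prefetching is enabled. */
static double rel_time(double frac_mem, double penalty,
                       double mr_no_pf, double mr_with_pf,
                       double pf_overhead)
{
    double base    = 1.0 + frac_mem * mr_no_pf * penalty;
    double with_pf = 1.0 + pf_overhead + frac_mem * mr_with_pf * penalty;
    return with_pf / base;
}

int main(void)
{
    const double fracs[]     = { 1.00, 0.50, 0.30 };
    const double penalties[] = { 10.0, 25.0, 50.0, 100.0 };
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            printf("mem ops %3.0f%%, penalty %3.0f cycles: rel time %.2f\n",
                   100.0 * fracs[i], penalties[j],
                   rel_time(fracs[i], penalties[j], 0.013, 0.006, 0.005));
    return 0;
}

D.3.1 Graphs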

Figure D.13: Relative execution time for a direct mapped cache. The top uses a memory access penalty of 10 cycles and the bottom 25 cycles. The left side uses an instruction mix of 100% loads and stores and the right side 30% loads and stores. 90% of prefetches are inserted.

Figure D.14: Relative execution time for a direct mapped cache. The top uses a memory access penalty of 50 cycles and the bottom 100 cycles. The left side uses an instruction mix of 100% loads and stores and the right side 30% loads and stores. 90% of prefetches are inserted.

Figure D.15: Relative execution time for a 2-way associative cache. The top uses a memory access penalty of 10 cycles and the bottom 25 cycles. The left side uses an instruction mix of 100% loads and stores and the right side 30% loads and stores. 90% of prefetches are inserted.

Figure D.16: Relative execution time for a 2-way associative cache. The top uses a memory access penalty of 50 cycles and the bottom 100 cycles. The left side uses an instruction mix of 100% loads and stores and the right side 30% loads and stores. 90% of prefetches are inserted.

D.3.2 Table

Table: Relative execution time for the movies easter, hula, and bicycle, covering cache sizes from 2KB to 1024KB, associativities of 1, 2, and 4, memory miss penalties of 10, 25, 50, and 100 cycles, and instruction mixes with 100%, 50%, and 30% loads and stores. [The per-configuration values could not be recovered in tabular form from the extracted text.]

Bibliography [AOG92]

D.P. Anderson, Y. Osawa, and R. Govindan. A le system for continuous media. ACM Transactions on Computer Systems, 10(4):311{337, November 1992.

[BC91]

Jean-Loup Baer and Tien-Fu Chen. An e ective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing '91, pages 176{186, November 1991.

[BKL95]

V. Bhaskaran, Konstantinos Konstantinides, and Ruby Lee. Real-time MPEG-1 software decoding on HP workstations. In Proceedings of the SPIE - The International Society for Optical Engineering, volume 2419, pages 466{473, Feb 1995.

[Boo51]

A. D. Booth. A signed binary multiplication technique. Quarterly Journal of Mechanics and Applied Mathematics, 4:236{240, 1951.

[CB95]

Tien-Fu Chen and Jean-Loup Baer. E ective hardware-based data prefetching for high-performance processors. IEEE Transactions on Computers, 44:318{328, May 1995.

[CFKL95] P. Cao, E.W. Felten, A. Karlin, and K. Li. A study of integrated prefetching and caching strategies. In Proceeding of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), pages 188{197, May 1995. 255

256

BIBLIOGRAPHY

[Eck95]

Stefan Eckart. High performance software MPEG video player for PCs. In Proceedings of the SPIE - The International Society for Optical Engineering, volume 2419, pages 446{454, Feb 1995.

[EF95]

Stefan Eckart and Chad Fogg. ISO/IEC MPEG-2 software video codec. In Proceedings of the SPIE - The International Society for Optical Engineering, volume 2419, pages 100{109, Feb 1995.

[Fly93]

Michael J. Flynn. ee486, Stanford University. class notes, 1993.

[FP91]

John W. C. Fu and Janak H. Patel. Data prefetching in multiprocessor vector cache memories. In Proc. of the 18th Annual International Symposium on Computer Architecture, pages 54{63, May 1991.

[FPJ92]

J. Fu, J. Patel, and B. Janssens. Stride directed prefetching in scalar processors. In Proc. of the 25th International Symposium on Microarchitecture, pages 102{110, December 1992.

[FS75]

R. Floyd and L. Steinberg. An adaptive algorithm for spatial gray scale. In Society for Information Display 1975 Symposium Digest of Technical Papers, page 36, 1975.

[FvDFH90] James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes. Computer Graphics: Principles and Practice. Addison-Wesley Publishing Company, Reading, Massachusetts, 1990. [FZF96]

Steve T. Fu, Daniel F. Zucker, and Michael J. Flynn. Memory hierarchy synthesis of a multimedia embedded processor. In Proceedings of International Conference of Computer Design, October 1996.

[Gal91]

D. Le Gall. MPEG: a video compression standard for multimedia applications. Communications of the ACM, 34(4):46{58, April 1991.

[Jou90]

Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch bu ers. In Proc.

BIBLIOGRAPHY

257

of the 17th Annual International Symposium on Computer Architecture, pages 364{373, May 1990.

[Kan96]

Gerry Kane. PA-RISC 2.0 Architecture. Prentice Hall PTR, Upper Saddle River, New Jersey, 1996.

[KMT 95] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu, and G. Zyner. The visual instruction set (VIS) in UltraSPARC. In Compcon, pages 462{ 469, March 1995. +

[Lee89]

Ruby B. Lee. Precision architecture. Computer, 22(1):78{91, January 1989.

[Lee94]

Ruby B. Lee. Multimedia enhancements for PA-RISC processors. In Hot Chips VI, pages 183{192, Aug 1994.

[Lee95]

Ruby B. Lee. Accelerating multimedia with enhanced microprocessors. IEEE Micro, 15(2):22{32, April 1995.

[Lee96]

Ruby B. Lee. Subword parallelism with MAX-2. IEEE Micro, 16(4):51{ 59, August 1996.

[Lee97]

Ruby B. Lee. private communication, May 1997.

[LH96]

Ruby Lee and Jerry Huck. 64-bit and multimedia extensions in the PARISC 2.0 architecture. In Compcon96, pages 152{160. IEEE Computer Society, Feb 1996.

[LLM89]

Christoph Loeer, Adriaan Ligtenberg, and George S. Moschytz. Practical fast 1-D DCT algorithms with 11 multiplications. Proceedings International Conference on Acoustics, Speech, and Signal Processing, pages 988{991, 1989.

[LM97]

Ruby Lee and Larry McMahan. Mapping of application software to the multimedia instructions of general-purpose microprocessors. In Multimedia Hardware Architectures 1997, volume 3021. SPIE{The International Society for Optical Engineering, February 1997.

BIBLIOGRAPHY

258 [MHR90]

R.K. Montoye, E. Hokenek, and S.L. Runyon. Design of the IBM RISC system/6000 oating-point execution unit. IBM Journal of Research and Development, 34(1):59{70, Jan 1990.

[MLG92]

T. Mowry, M. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In SIGPLAN Notices, pages 62{73, September 1992.

[ODH 85] J. K. Ousterhout, H. DaCosta, D. Harrison, J. A. Kunze, M. Kupfer, and J. G. Thompson. A trace-driven analysis of the unix 4.2 bsd le system. In Proceedings of the 10th Symposium on Operating System Principles, pages 15{24, December 1985. +

[Pac92]

Hewlett Packard. PA-RISC 1.1 Architecture and Instruction Set Reference Manual. Hewlett Packard Company, 1992.

[PGG 95] R. Hugo Patterson, Garth A. Gibson, Eka Ginting, Deaniel Stodolsky, and Jim Zelenka. Informed prefetching and caching. In Proceeding of the 15th ACM Symposium on Operating System Principles, pages 79{95, December 1995. +

[PK94]

Subbarao Palacharla and R.E. Kessler. Evaluating stream bu ers as a secondary cache replacement. In Proc. of the 21st Annual International Symposium on Computer Architecture, pages 24{33, April 1994.

[Por89]

A.K. Porter eld. Software methods for improvement of cache performance on supercomputer applications. Technical Report COMP TR 8993, Rice University, May 1989.

[PSR93] K. Patel, B.C. Smith, and L.A. Rowe. Performance of a software MPEG video decoder. In Proceedings ACM Multimedia 93, pages 75–82, August 1993.

[PW96] Alex Peleg and Uri Weiser. MMX technology extension to the Intel architecture. IEEE Micro, 16(4):42–50, August 1996.

[PWW97] Alex Peleg, Sam Wilkie, and Uri Weiser. Intel MMX for multimedia PCs. Communications of the ACM, 40(1):25–38, January 1997.

[QF90] Nhon T. Quach and Michael J. Flynn. An improved algorithm for high-speed floating-point addition. Technical Report CSL-TR-90-442, Stanford University, Computer Systems Laboratory, August 1990.

[RF93] Kathy J. Richardson and Michael J. Flynn. Strategies to improve I/O cache performance. In Proceedings of the Twenty-Sixth Hawaii International Conference on System Sciences, pages 31–39, January 1993.

[RV91] P. Venkat Rangan and H.M. Vin. Designing file systems for digital video and audio. Operating Systems Review, 25(5):81–94, 1991.

[RW93] A. L. Narasimha Reddy and J. Wyllie. Disk scheduling in a multimedia I/O system. In Proceedings ACM Multimedia 93, pages 225–233, August 1993.

[Sch93] Eric Mark Schwarz. High-radix algorithms for high-order arithmetic operations. Technical Report CSL-TR-93-559, Stanford University, Computer Systems Laboratory, January 1993.

[SE94] A. Srivastava and A. Eustace. ATOM: A system for building customized program analysis tools. In Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation, pages 196–205, June 1994.

[SGH97] Vatsa Santhanam, Edward H. Gornish, and Wei-Chung Hsu. Data prefetching on the HP PA-8000. In Proceedings of the 24th Annual International Symposium on Computer Architecture, June 1997.

[Skl92] Ivan Sklenar. Prefetch unit for vector operations on scalar computers. ACM Computer Architecture News, 20:31–37, September 1992.

[Smi82] Alan Jay Smith. Cache memories. ACM Computing Surveys, 14:473–530, September 1982.

[Smi85] Alan Jay Smith. Disk cache–miss ratio analysis and design considerations. ACM Transactions on Computer Systems, 3(3):161–203, August 1985.

[Smi91] Michael D. Smith. Tracing with pixie. FTP document, Center for Integrated Systems, Stanford University, April 1991.

[ST67] N.S. Szabo and R.I. Tanaka. Residue Arithmetic and its Applications to Computer Technology. McGraw-Hill, New York, 1967.

[TONH96] Marc Tremblay, J. Michael O'Connor, Venkatesh Narayanan, and Liang He. VIS speeds new media processing. IEEE Micro, 16(4):10–20, August 1996.

[Var93] Dominic A. Varley. Practical experience of the limitations of gprof. Software: Practice and Experience, 23:461–463, April 1993.

[Wal91] Gregory K. Wallace. The JPEG still picture compression standard. Communications of the ACM, 34(4):30–44, April 1991.

[WF82] Shlomo Waser and Michael J. Flynn. Introduction to Arithmetic for Digital Systems Designers. Holt, Rinehart and Winston, 1982.

[ZFL95] Daniel F. Zucker, Michael J. Flynn, and Ruby B. Lee. A comparison of hardware prefetching techniques for multimedia benchmarks. Technical Report CSL-TR-95-683, Computer Systems Laboratory, Stanford University, December 1995.

[ZFL96a] Daniel F. Zucker, Michael J. Flynn, and Ruby B. Lee. A comparison of hardware prefetching techniques for multimedia benchmarks. In Proceedings of the International Conference on Multimedia Computing and Systems, pages 236–244, Hiroshima, Japan, June 1996.

[ZFL96b] Daniel F. Zucker, Michael J. Flynn, and Ruby B. Lee. Improving performance for software MPEG players. In Compcon96. IEEE Computer Society, February 1996.

[ZK95] Daniel F. Zucker and Alan H. Karp. RYO: a versatile instruction instrumentation tool for PA-RISC. Technical Report CSL-TR-95-658, Computer Systems Laboratory, Stanford University, January 1995.

[ZKK+95] Chang-Guo Zhou, Ihtisham Kabir, Leslie Kohn, Aman Jabbi, D. Rice, and Xiao-Ping Hu. MPEG video decoding with the UltraSPARC visual instruction set. In Compcon, pages 470–477, March 1995.

[ZL94] Daniel Zucker and Ruby Lee. Reuse of high precision arithmetic hardware to perform multiple concurrent low precision calculations. Technical Report CSL-TR-94-616, Computer Systems Laboratory, Stanford University, April 1994.

[ZLF97] Daniel F. Zucker, Ruby B. Lee, and Michael J. Flynn. Achieving subword parallelism by software reuse of the floating point data path. In Multimedia Hardware Architectures 1997, volume 3021, pages 51–64, San Jose, California, February 1997. SPIE–The International Society for Optical Engineering.