
Optimizing the Data Cache Performance of a Software MPEG-2 Video Decoder

Peter Soderquist
School of Electrical Engineering, Cornell University, Ithaca, New York 14853
E-mail: [email protected]

Miriam Leeser
Dept. of Electrical and Computer Engineering, Northeastern University, Boston, Massachusetts 02115
E-mail: [email protected]

To appear in ACM Multimedia '97, November 1997, Seattle, Washington. Copyright (c) 1997 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept., ACM Inc., fax +1 (212) 869-0481, or ([email protected]).

Abstract

Multimedia functionality has become an established component of core computer workloads. MPEG-2 video decoding represents a particularly important and computationally demanding application example. Instruction set extensions like Intel's MMX significantly reduce the computational challenges of this and other multimedia algorithms. However, memory subsystem deficiencies have now become the major barrier to increased performance, partly as a consequence of this improved CPU performance. Decoding MPEG-2 video data in software makes significant bandwidth demands on memory subsystems, which is seriously aggravated by cache inefficiencies. Conventional data caches generate many times more cache-memory traffic than required, at best double the minimum necessary to support decoding. Improving efficiency requires understanding the behavior of the decoder and composition of its data set. We provide an analysis of the memory and cache behavior of software MPEG-2 video decoding, and lay out a set of cache-oriented architectural enhancements which offer relief for the problem of excess cache-memory bandwidth. Our results show that cache-sensitive handling of different data types can reduce traffic by 50 percent or more.

1 Introduction

Multimedia computing has become a practical reality, and computer architectures are changing in response. Extensions for continuous media processing are a current or planned part of every market-leading instruction set architecture, as shown in Table 1. The primary feature of these extensions is SIMD-style processing of small data types. While contemporary microprocessor datapaths are 32 or 64 bits wide, multimedia data usually consists of 16-bit or 8-bit integers. Furthermore, multimedia algorithms commonly feature an abundance of data parallelism. Therefore, modifying wide datapaths to process multiple small data types in a single instruction promises significant performance improvements.

Architecture   ISA Extension
x86            MMX (MultiMedia eXtensions)
SPARC          VIS (Visual Instruction Set)
PA-RISC        MAX-2 (Multimedia Acceleration eXtensions)
Alpha          MVI (Motion Video Instruction)
MIPS           MDMX (MIPS Digital Media eXtensions)
PowerPC        VMX (Video and Multimedia eXtensions)*

Table 1: ISA multimedia extensions for popular architectures (* = proposed)

All of these extensions are targeted, at least partially, at MPEG-2 video decoding. Collectively, they mount a formidable assault on its computational complexity. Unfortunately they largely fail to address the significant challenge of MPEG processing for memory subsystems, which have become the primary performance bottleneck. The high data rate, large set sizes, and distinctive memory access patterns of MPEG exert a particular strain on caches. Manipulating standard parameters (cache size, associativity, and line size) fails to cost-effectively reduce excess memory traffic. While miss rate levels are acceptable, standard caches, even very large and highly associative ones, generate significant excess cache-memory traffic. Multimedia instruction extensions actually exacerbate this problem by enabling the CPU to consume and produce more data in fewer cycles.

This cache inefficiency is seriously limiting for desktop multitasking systems. A foregrounded video process can potentially cripple communications, print spooling, system management, network agent, and other background tasks. The video playback may also suffer from dropped frames, jitter, blocking, or other annoying artifacts. For low-power/cost systems and information appliances, cache inefficiency can have a direct cost impact, requiring the use of faster or higher capacity components than strictly necessary to achieve specified functionality at the required quality level. This can drive up system cost, increase power consumption, or even prevent implementation.

To explore the cache efficiency problem and its solutions, we examine the traffic and data cache behavior of an MPEG-2 decoder running Main Level streams on a general-purpose microprocessor. Trace-driven cache simulations of typical sequences reveal the memory behavior of the decoder. We show that, contrary to some predictions, there is a benefit to even relatively small, simple caches. However, even very large and complex caches generate significant excess traffic. An analysis of the data types utilized by the decoder and their patterns of access provides the basis for proposed architectural enhancements. We also show that even relatively simple measures, such as selective caching of specific data types, can dramatically improve efficiency, reducing cache-memory traffic by 50% or more for a wide range of cache sizes.

[Figure 1: Typical desktop machine with software MPEG-2 decoding traffic. Labeled components: CPU, cache, 2nd cache, system bus, main memory (input buffer, video window image), bus adapter, I/O bus, file system, frame buffer, display. Labeled flows: compressed video (DMA), uncompressed video (DMA).]

2 Background

Consider the problem of decoding MPEG-2 in software on a general-purpose computing platform. The machine type of primary interest is a typical desktop PC or workstation, as envisioned in Figure 1. The diagram also illustrates the typical flow of video data in the system. Imagine the user is viewing, for example, an entry in a multimedia encyclopedia. Compressed data is transferred from the file system (e.g. CD-ROM or hard disk) by direct memory access (DMA) and buffered in the main memory. The CPU reads the data through its cache hierarchy and writes the decoded, dithered RGB video, one frame at a time, to main memory where it is stored in an image of the displayed video window. Finally, another DMA transfer brings the video data to the frame buffer where it eventually appears on the display.

It is clear that there is a large amount of data, much of it time-sensitive, being transferred through the system. Most of this is concentrated on the main system bus. At the very least, there are two streams each of encoded and decoded video being concurrently transferred in and out of main memory, amounting to a minimum sustained load of 63 Mbytes/s. This ignores the bandwidth required by other applications running on the system and sharing interconnect, main memory, and peripherals. Any excess memory traffic generated by cache inefficiency will further exacerbate this situation. Unfortunately, our simulations indicate that standard caches generate at least twice the minimum required level of cache-memory traffic, typically many times more, amounting to a significant strain on system capacity. The result is some combination of dropped frames, reduced frame rate, and degradation of overall system performance.

3 MPEG Overview

Appreciating the challenges of supporting video decoding requires some understanding of the MPEG standard [4]. MPEG attacks both the spatial and temporal redundancy of video signals to achieve compression. Video data is broken down into 8 x 8 pixel blocks and passed through a discrete cosine transform (DCT). The resulting spatial frequency coefficients are quantized, run-length encoded, and then further compressed with an entropy coding algorithm.

To exploit temporal redundancy, MPEG encoding uses motion compensation with three different types of frames. I (intra) frames contain a complete image, compressed for spatial redundancy only. P (predicted) frames are built from 16 x 16 fragments known as macroblocks. These consist primarily of pixels from the closest previous I or P frame (the reference frame), translated as a group from their location in the source. This information is stored as a vector representing the translation, and a DCT-encoded difference term, requiring far fewer bits than the original image fragment. B (bidirectional) frames can use the closest two I or P pictures (one before and one after in temporal order) as reference frames. Information not present in reference frames is encoded spatially on a block-by-block basis. All of the data in P and B frames is also subject to run-length and entropy coding.
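The dependency rules above mean that frames cannot be decoded in display order, because each B frame needs its later reference frame first. The following is a minimal sketch of that reordering. The helper name `decode_order` and the string encoding of frame types are assumptions made for illustration, and this is not code from the decoder studied here. For the ten-frame sequence of Figure 2 (I B B P B B P B B I) it yields the order 1, 4, 2, 3, 7, 5, 6, 10, 8, 9.

```python
def decode_order(display_types):
    """Return 1-based display indices in an order that satisfies
    the I/P/B reference dependencies.

    display_types: string such as "IBBPBBPBBI", in display order.
    B frames are held back until the next I or P frame (their
    later reference) has been emitted.
    """
    order = []
    pending_b = []  # B frames waiting for their later reference frame
    for index, frame_type in enumerate(display_types, start=1):
        if frame_type == "B":
            pending_b.append(index)
        elif frame_type in ("I", "P"):
            order.append(index)      # the reference frame goes first
            order.extend(pending_b)  # then the B frames it unblocks
            pending_b = []
        else:
            raise ValueError(f"unknown frame type: {frame_type!r}")
    order.extend(pending_b)  # trailing B frames with no later reference
    return order


if __name__ == "__main__":
    print(decode_order("IBBPBBPBBI"))
```

The sketch only models ordering. It ignores details such as B frames at the start of a GOP that reference the previous GOP.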

[Figure 2: Typical sequence of MPEG frames, showing interframe dependencies. Displayed frame order, frames 1 to 10: I B B P B B P B B I.]

Figure 2 shows the interframe dependencies of the different frame types, superimposed on the displayed frame order. For decoding, these frames must be processed in the non-temporal order [1,4,2,3,7,5,6,10,8,9], which is a result of these dependencies. Interframe dependencies and the properties and sequence of frame types determine in critical ways the flow pattern of MPEG data and the nature of the hardware support required.

[Figure 3: MPEG bitstream structure.
Sequence:          Sequence header, GOP, GOP, ...
Group Of Pictures: GOP header, Picture, ..., Picture
Picture:           Picture header, Slice, ..., Slice
Slice:             Slice header, Macroblock, ..., Macroblock
Macroblock:        Macroblock header, block0, ..., block5]

The decoder reads the MPEG data as a stream of bits. Unique bit patterns, known as startcodes, mark the division between different sections of the data. The bitstream has a hierarchical structure, shown in simplified form in Figure 3. A Sequence (video clip) consists of groups of pictures (GOP's). A GOP contains at least one I frame and typically a number of dependent P and B frames; Figure 2 shows a possible GOP of 9 frames (pictures) followed by the I frame of the subsequent GOP. Pictures consist of collections of macroblocks called slices. The "headers" shown at each level contain parameters relevant to each bitstream type but no actual image data.

4 Related Work

Deficiencies in the MPEG-2 decoding performance of systems are being actively addressed. This section discusses several significant methods for performance enhancement under investigation by researchers both in industry and academia. Some of these are merely proposals, while others have already been implemented in hardware or are forthcoming. We explore the benefits and disadvantages of each approach, and assess its impact on the problem of excess cache-memory bandwidth.

4.1 Media Processors

A new breed of microprocessor, the so-called "media processor," has emerged in just the past few years. These chips combine programmability with specialized hardware to support multiple concurrent multimedia operations, including MPEG-2 decoding, and usually reside on peripheral cards in desktop systems. Their manufacturers, which include Chromatic Research, Fujitsu, Mitsubishi, Philips, and Samsung, claim that media processors are the best present and long-term solution for providing multimedia functionality in computer systems, not the host CPU. To their advantage, media processors yield reduced memory bandwidth requirements, since they deal in compressed data, and take a significant computational load off the host. They are also good at handling the real-time constraints of multimedia, while present-day operating systems and memory hierarchies are not. Finally, present-day multimedia functions consist of a limited set of operations, tied to international standards, an efficient target for highly specialized solutions.

However, performing multimedia functionality with the host processor is inherently less expensive than using extra specialized hardware. Memory systems are also improving to meet the needs of multimedia, as illustrated by Intel's AGP and Direct RDRAM initiatives. Operating systems are also likely to become more real-time oriented as they evolve. More fundamentally, multimedia data are a new first-order data type, on the same order as floating-point. These functions are becoming central to what computers do, not "peripheral" like mass-storage or printing. As applications evolve, media and other operations will become more and more interleaved; someday viewing a video clip will be as seamless as reading e-mail or editing a spreadsheet. Host multimedia processing best supports the construction of highly integrated, fully featured, yet portable applications. Hardware integration trends and the powerful vested interest of CPU manufacturers also favor this outcome. Nevertheless, parts of video processing are likely to remain in specialized hardware for quite some time, such as color space conversion, scaling, and dithering. These are straightforward to implement in hardware, and to a large extent not data-dependent and therefore not subject to pipeline hazards and other impediments. These are more akin to truly "peripheral" operations.

4.2 SIMD Instruction Extensions

Multimedia instruction extensions such as MMX, VIS, MAX-2, and the others in Table 1, achieve speedup primarily through performing SIMD-style operations on multimedia data types. These ISA enhancements are an essential step to improving microprocessor performance on multimedia applications, as well as ensuring the continued viability of host processing. However, these operations tend to increase rather than relieve the strain on memory bandwidth. By consuming more operands per unit time, multimedia instructions expose the underlying weakness of the cache, system bus, and main memory. The caches in the MMX-enhanced Pentium are double the size of those in the earlier Pentium for this very reason. Intel researchers even attribute the lower-than-expected MPEG speedup with MMX to memory and I/O system deficiencies [9].

4.3 Video/Graphics Data Buses

There have been proposals for implementing special interconnect to transfer image data from the host processor to the frame buffer, bypassing slower general-purpose I/O buses. The Intel Advanced Graphics Port (AGP) is the most prominent such effort, intended to provide a proposed direct link between main memory and the graphics subsystem. While primarily meant to support the high bandwidth of 3D graphics rendering on the host processor, it is a definite boon to video as well. This method provides another significant way to boost the capabilities of host multimedia processing relative to custom hardware solutions. However, it does nothing in and of itself to alleviate cache-memory traffic waste or reduce overall memory-system bandwidth needs. The undiminished requirement for extremely high-bandwidth memory is reflected in Intel's support of the Rambus architecture for future PC memory systems.

4.4 Prefetching

Hardware prefetching has been suggested as a remedy for inadequate MPEG performance [11]. Yet it is primarily promoted as a means to reduce miss rates and without much consideration of cache-memory traffic, which tends to increase with prefetching, especially the more aggressive schemes. Prefetching is essentially a form of latency hiding; for memory-bound problems, it merely exposes underlying bandwidth problems, and tends to generate excess memory traffic of its own. Selectively applied, prefetching may provide a boost to compute-intensive phases of MPEG and other multimedia algorithms. However, the emphasis of our research is not so much on getting values into the cache early, but keeping them around for as long as they are needed.

4.5 Our Approach: Cache-Oriented Enhancements

The methods discussed above largely bypass the issue of excess cache-memory traffic, while some of them even exacerbate the problem. None of them directly targets the issue of cache inefficiency. Even prefetching focuses on reducing miss rates at the possible expense of memory traffic. We suggest a different type of solution, driven by the internal dynamics of the decoder itself. Exploiting knowledge of individual data types (their sizes, access types, and access patterns) can lead to effective architectural solutions to cache inefficiency.

5 Experimental Setup

Solving the problem of excess cache-memory traffic requires first establishing its extent, and how critical system parameters affect this behavior. Our results come from trace-driven cache simulations of video stream decoding. The clips themselves are progressive MPEG-2 Main Level sequences, with a resolution of 704 x 480 pixels at 30 frames/s, and a compressed bitrate of 6 Mbits/s. These are comparable in perceptual quality to conventional analog broadcast (e.g. NTSC, PAL). Sequences are chosen as representative of different types of programming.

5.1 Machine and Software Environment

For experimental purposes we use a simplified machine model, limiting the microprocessor to a single level of caching. We further assume that MPEG-2 decoding is the only active process. The decoder is a version of mpeg2decode from the MPEG Software Simulation Group [1], running on a SuperSPARC processor under Solaris. This decoder has the advantage of portability and relative system independence, with the disadvantage that performance is suboptimal for any given architecture. Since it originates with the body responsible for MPEG, it is closely written to comply with that standard. Finally, it has the essential feature for research purposes of being available in source code form. We have chosen to treat the decoder implementation as a given, implementing minimal modification to support integration with our simulation tools.

Trace generation is performed by QPT2 (the Quick Profiler and Tracer) [6], a component of the Wisconsin Architectural Research Toolkit (WARTS). We have implemented the cache simulator, Pacino, to support the unique qualities of continuous media applications. These are considerably different from conventional benchmarks of the SPEC variety in their consumption and production of data, and interaction with the operating system.

5.2 Steady-State Simulation

To limit the considerable storage and CPU cycle requirements of trace-driven simulation, we restrict each run to one GOP, the largest logical bitstream unit below a sequence. Simulations run as if at steady state, i.e. preceded and followed by a large number of other GOP's, as if plucked from the middle of the sequence. This eliminates the side effects of program startup and termination. To achieve this, the results of running initialization code are not logged, the cache is primed with data to avoid the effects of cold start, and dirty lines are not flushed at the end of the GOP.

5.3 Color Space and Bandwidth

The simulation framework implements one significant performance optimization beyond the generic hardware model. Decoded video is sent to the frame buffer in 4:2:0 YCrCb form, the native representation of MPEG, rather than 4:4:4 RGB. YCrCb is a color representation with one luminance (grayscale) and two chrominance components per pixel, rather than one each of red, green, and blue. In 4:2:0 YCrCb, the chrominance components are subsampled in both dimensions. We assume that upsampling, color space conversion, and any dithering are done on the fly by the video display subsystem. These features are becoming common in workstations and budget PC video accelerators alike. According to prior work on MPEG-1 decoders, this processing can account for almost 25% of total execution time if performed in software [8]. In addition, the 4:2:0 representation of an image is only half the size of the 4:4:4 version, reducing the minimum sustained system bus load to 31.5 Mbytes/s. Finally, these conversion operations account for so many memory and instruction references that they make trace-driven simulation prohibitive, a sufficiently compelling reason to avoid them.

6 Cache Simulation Results

Trace-driven cache simulations clarify how data requests from the MPEG-2 decoder translate into the traffic seen by the memory. First of all, there is a distinct benefit to caching video decoder data, even using naive, generic schemes. It has occasionally been suggested that caches are critically inefficient for video data. Accordingly, several media processors dispense with data caches altogether in favor of SRAM banks (essentially large register files) managed by software [5, 3, 2]. Yet we have found that despite the large set sizes and the magnitude of data consumed and discarded, there is sufficient re-use of values for caching to significantly reduce the required memory bandwidth. For example, decoding one Group Of Pictures from a typical MPEG-2 stream generates 2.1 x 10^8 memory requests. A 16 Kbyte, direct-mapped cache generates 3.9 x 10^7 words of memory traffic. If we estimate conservatively that each word of cache data represents a memory request, then even a relatively small, simple cache keeps almost 85% of traffic off of the memory bus. In fact, caches typically load several words at a time, while many CPU memory requests are for data types smaller than a word. Therefore, data caches can make far more efficient use of available memory bandwidth than registers alone. Even very clever register management would have trouble competing with anything but the smallest caches, so even information appliance-type platforms might benefit from implementing an actual cache.

The remainder of this section examines in more detail how the basic parameters of a simple cache (cache size, set associativity, and line size) affect memory traffic as well as misses. All of the caches in our simulations implement a write-back policy on replacement. Not only is this the most popular write policy in actual implementations, but it typically generates far less memory traffic than write-through, the best alternative. Write-allocate, where a write miss causes the loading of the associated line, is also a common feature of all simulated caches. If we assume a "perfect" cache, the traffic required to support decoding is equal to the size of the encoded stream, which is read in from memory, plus the decoded video data, which is written back. This absolute minimum cache-memory traffic is used as a yardstick to compare the performance of different configurations in our simulations.

6.1 Cache Size

The size of a cache is its most significant design parameter, certainly from a cost standpoint. Because cache size usually increases or decreases by factors of two, the decision of how large a cache to implement in a system is pivotal. For Main Level MPEG-2 decoding, the cache-memory traffic as a function of cache size and associativity shows little variation between video sequences. Figure 4 shows a representative plot, assuming a standard line size of 32 bytes. This is superimposed on a surface showing the minimum possible cache-memory traffic for the sequence, consisting of the encoded and decoded video streams combined.

[Figure 4: Typical cache-memory traffic for MPEG-2 Main Level decoding, over minimum possible traffic level. Plot title: Cache-Memory Traffic (line size = 32 bytes). Axes: memory traffic [Mbytes/s] against cache size [bytes] (16K to 8M) and associativity (1 to 32).]

The most prominent feature of the traffic function is a large plateau, at a level of 6.3 times the minimum value. Except for direct-mapped caches, which produce a peak of 16 times the minimum, cache-memory traffic changes little with increasing cache size for most of the range. Traffic starts to roll off at 1 Mbyte, and plummets at 2 Mbytes. This reflects the 2 Mbyte size of the decoder data set. Larger caches show negligible improvement, with the additional space providing no extra benefit. However, the smallest measured value is still almost 2 times higher than the minimum possible value. These data imply that a 32 Kbyte cache is just as good (or bad) for MPEG-2 Main Level decoding as a 64 Kbyte or 128 Kbyte one, for example. Improving on cache-memory traffic requires a cache of 1 Mbyte or more, but even a cache of 2 Mbytes or more only brings traffic down to double the absolute minimum, leaving plenty of room for improvement.
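To give the multiples quoted above a rough absolute scale, the sketch below estimates the "perfect cache" minimum from the stream parameters given in Section 5 and scales it by the reported factors. It rests on assumptions that the text does not state explicitly: 8-bit samples (1.5 bytes per pixel in 4:2:0), one Mbyte taken as 10^6 bytes, and the minimum counted as one read of the encoded stream plus one write of the decoded video. The function names are hypothetical, and the authors' own baseline may have been computed differently.

```python
# Stream parameters quoted in Section 5 of the text.
WIDTH, HEIGHT = 704, 480      # pixels
FRAME_RATE = 30               # frames/s
ENCODED_BITRATE = 6_000_000   # bits/s

# Assumptions (not stated explicitly in the text).
BYTES_PER_PIXEL_420 = 1.5     # 8-bit samples with 4:2:0 subsampling
MBYTE = 1_000_000             # treat 1 Mbyte as 10**6 bytes


def minimum_traffic_mbytes_per_s():
    """Encoded input read once plus decoded 4:2:0 output written once."""
    encoded = ENCODED_BITRATE / 8
    decoded = WIDTH * HEIGHT * BYTES_PER_PIXEL_420 * FRAME_RATE
    return (encoded + decoded) / MBYTE


def traffic_at_multiple(multiple):
    """Scale the estimated minimum by a reported multiple."""
    return multiple * minimum_traffic_mbytes_per_s()


if __name__ == "__main__":
    print(f"estimated minimum: {minimum_traffic_mbytes_per_s():.2f} Mbytes/s")
    reported = [("plateau", 6.3), ("direct-mapped peak", 16), ("largest caches", 2)]
    for label, multiple in reported:
        print(f"{label}: about {traffic_at_multiple(multiple):.0f} Mbytes/s")
```

Under these assumptions the minimum comes to roughly 16 Mbytes/s, so the 6.3x plateau corresponds to about 100 Mbytes/s of cache-memory traffic.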

6.2 Associativity

Increasing set associativity is a popular method for getting higher performance out of smaller caches. The memory traffic effects of varying set associativity are also visible in Figure 4. Going from a direct-mapped cache to a 2-way set-associative one can reduce memory traffic by as much as 50% for small caches. Increasing associativity to 4 can squeeze out almost another 10% improvement over the direct-mapped case. Set sizes of greater than 4, however, show minimal benefit across all cache sizes. Since increasingly higher levels of associativity add considerably to cache cost, complexity and access time, such enhancements are not justified.

While memory traffic is the primary focus, we would prefer to not adversely affect miss rates. As a function of cache size and set associativity, miss rates show the same behavioral pattern as memory traffic. However, unlike for cache-memory traffic, the performance is more than satisfactory, with an average value below 0.5%.

6.3 Line Size

Line size, another fundamental parameter, is less costly to experiment with than cache size. Subblock placement can help decouple the size of cache lines and that of the memory bus. Unfortunately, there is contention between miss rate and memory traffic minimization. Low miss rates call for larger lines than the typical 32 bytes, as illustrated in Figure 5. Larger lines tend to provide superior spatial locality, but require more data to be read and possibly written back on a miss. For this reason, minimal memory traffic occurs with the smaller lines. This relationship is readily apparent in Figure 6, where for the smallest caches, the largest line sizes lead to cache-memory traffic almost 200 times the absolute minimum. Balancing the demands of miss rates and memory traffic requires further investigation. For the time being, maintaining the ubiquitous 32 byte line size seems sensible. The interests of memory traffic reduction argue against a switch to larger values.

6.4 Summary

It is clear that conventional cache techniques provide limited relief from excess cache-memory traffic. The simplest way to improve memory traffic and miss rates is to have a very large cache, but a large cache is not always feasible, and simulations show that improving MPEG-2 decoding performance requires very big caches. In any case, one would prefer to extract better performance from smaller, lower-cost resources, improving performance through increased efficiency rather than brute force.

[Figure 5: Typical miss rates for MPEG-2 Main Level decoding over different cache sizes. Plot title: Miss Rates (associativity = 2-way). Axes: miss rate [%] against line size [bytes] (16 to 256), one curve per cache size (16K to 4M).]

[Figure 6: Typical cache-memory traffic for MPEG-2 Main Level decoding over different cache sizes. Plot title: Cache-Memory Traffic (associativity = 2-way). Axes: memory traffic [Mbytes/s] against line size [bytes] (16 to 256), one curve per cache size (16K to 4M).]

7 Decoder Internal Analysis

This section explores the internal functioning of a software-based MPEG-2 video decoder, as a first step to explaining and rectifying its suboptimal use of cache resources. First, we dissect the different kinds of data used by the decoder, including compressed input data, image output data, and all of the intermediate types. Second, we examine how these data objects are used in the process of decoding, and how this affects the performance of the cache.

7.1 Data Set Composition

The data set of an MPEG-2 decoder (the information that must be available to the program in the course of its execution) consists of several different data types which serve distinct purposes. As a result, they are quite heterogeneous with respect to the types and patterns of access utilized by the decoder, as well as the amount of memory space required. The following classes account for the global and static local values in user space.

Input. The compressed MPEG-2 sequence; data are read in series from a fixed-size buffer and refreshed by system calls.

Output. Uncompressed picture data in YCrCb format, stored in a video window image buffer. This data type is write only: written but never read by the CPU. System calls transfer each completed picture into the frame buffer.

Tabular. Static, read only information used in the MPEG decoding process, such as various lookup tables.

Reference. The current frame and the previously decoded frames used to reconstruct it (up to two for a B frame), in YCrCb form.

Block. The DCT coefficient and pixel values for a single macroblock.

State. Values incidental to the settings and operation of the decoder, yet not part of the image data per se.

This partitioning of data types may not be explicit within the decoder source, but represents an abstraction at a higher level of the data needed for decoder operation. Most of these data types are both read and written. The exceptions are the tabular data, which are read only, and the output data, which is write only. Within a memory hierarchy, reading and writing are not symmetric operations. Read misses and write misses in caches have very different latencies. The fact that some MPEG decoder data types are only read or written can be exploited for performance optimization.

Data Type   Access Type   Size      Fraction of References
Input       read/write    2 KB      2.7%
Output      write only    500 KB    3.9%
Tabular     read only     5 KB      5.5%
Reference   read/write    1500 KB   23.7%
Block       read/write    1.5 KB    31.4%
State       read/write    0.5 KB    25.6%

Table 2: Summary of decoder data types, sizes, access types, and proportion of memory references accounted for

The essential properties of the different data types are summarized in Table 2. Note how in terms of size, the reference and output types completely dominate the others. With respect to caching, this means that their presence will tend to repeatedly expel the other types from the cache, except for very large data caches. It doesn't matter how many times the other values are re-used. Capacity limitations alone will ensure that reference and output data, when updated, will throw the other data types out. Ranking the data types in terms of the number of references rather than by size gives a very different picture. Block and state data account for by far the most memory requests. Reference data is next, but output data is in next to last place, accounting for 8 times fewer memory requests than block data. Note that the percentages don't add up to 100%; approximately 7.2% of data references are due to temporary variables and library functions.
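The contrast between the two rankings can be checked directly from the Table 2 figures. The sketch below is only a restatement of the table as data. The helper names are hypothetical, and the sizes are taken in KB exactly as listed.

```python
# Rows of Table 2: (data type, access type, size in KB, % of references)
TABLE_2 = [
    ("input", "read/write", 2.0, 2.7),
    ("output", "write only", 500.0, 3.9),
    ("tabular", "read only", 5.0, 5.5),
    ("reference", "read/write", 1500.0, 23.7),
    ("block", "read/write", 1.5, 31.4),
    ("state", "read/write", 0.5, 25.6),
]

SIZE_KB, REFERENCES = 2, 3  # column indices into each row


def rank_by(column):
    """Data type names sorted from largest to smallest in one column."""
    ordered = sorted(TABLE_2, key=lambda row: row[column], reverse=True)
    return [row[0] for row in ordered]


def share(names, column):
    """Percentage of a column's total contributed by the named types."""
    total = sum(row[column] for row in TABLE_2)
    part = sum(row[column] for row in TABLE_2 if row[0] in names)
    return 100.0 * part / total


def unaccounted_references():
    """Percentage of references not attributed to any listed data type."""
    return 100.0 - sum(row[REFERENCES] for row in TABLE_2)


if __name__ == "__main__":
    print("by size:      ", rank_by(SIZE_KB))
    print("by references:", rank_by(REFERENCES))
    big = share({"reference", "output"}, SIZE_KB)
    print(f"reference + output share of size: {big:.1f}%")
    print(f"unaccounted references: {unaccounted_references():.1f}%")
```

Reference and output data make up more than 99% of the listed bytes, while block and state data together account for 57% of all references. That mismatch is what makes the small types vulnerable to eviction.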

7.2 Decoder Program Flow

The hierarchical composition of the MPEG bitstream and its intrinsic sequentiality constrain the structure of the decoder program. Figure 7 shows the phases of operation for a single GOP from the perspective of data accesses. As the diagram makes evident, execution proceeds as a set of nested loops corresponding to the different data levels in the bitstream, as shown in Figure 3. The "initialization" and "termination" blocks represent the operations excluded by steady-state simulation. Table 3 shows which data types are accessed in the different phases of operation. Parsing header information and reading in block data involve operating on relatively small data types, and result in a small active cache set. However, during reconstruction, the decoder accesses portions of the reference frames and copies them into the frame currently under construction. The inverse DCT (IDCT) phase focuses on smaller data types, but merging the decoded pixels back into the new picture once again requires accessing large data types. Writing the completed frame requires traversing significant portions of reference and output data. Notice that the decoder accesses state data on a fairly continuous basis.

Each iteration through the macroblock loop accesses a new portion of one or two reference frames. Each new P or B frame repeats this traversal through its reference frame(s), recalling significant fractions of each frame back into the cache. During the "write frame" phase, the decoder spends most of its time copying the current frame from the reference space to the output space. For all but the largest caches, this has the effect of flushing out almost everything except video data for the next displayed picture. The other data types will have to be reloaded for each new picture, or possibly several times a picture for small caches, resulting in wasted cache-memory traffic.

Figure 7: Model of MPEG-2 software decoder showing phases of operation for a GOP. [Diagram labels: initialization; read headers; read block; reconstruct MB; IDCT; merge MB data; write frame; termination. Loop and region labels: blocks/MB; MB's/picture; pictures/GOP; steady state GOP.]

7.3 System and Audio Interaction

Our simulations assume that MPEG video decoding is the only active process. In reality, video is usually viewed in conjunction with audio data, which requires the concurrent execution of an MPEG audio codec and an MPEG system parser to de-multiplex and synchronize the two media. We expect that running these other processes will not dramatically affect our results, since the amount of data they operate on at any given time is relatively small. The net effect is that of having a slightly smaller cache, and our simulations show that memory traffic levels are fairly insensitive to small changes in cache size. Nevertheless, the presence of these other programs is a good argument for keeping the cache footprint of the video data as small as possible.

8 Cache-Oriented Enhancements

Our analysis of decoder behavior and data composition motivates a specific approach to performance optimization, where the goal is to improve cache efficiency for video decoding without adversely impacting other performance metrics or other applications. In particular, there are great potential benefits in treating different data types distinctly, guided by their different sizes, access types, and access patterns. For example, write-only values, like video output data, are clearly a waste of cache space. Likewise, while block and state data account for most of the memory requests, their relatively small size means that larger data types, like reference data, systematically crowd them out of the cache. Preventing blatant cache pollution and the predation of small data types with high re-use can yield considerable cache-memory traffic savings. To this end, we are evaluating the following traffic-reduction techniques.

Selective Caching. Exclude specific data from the cache.

Cache Locking. Reserve selected cache lines for particular data objects; lines are untouched by cache replacement while locked.

Scratch Memory. Implement a small portion of addressable system memory on the processor, not as cache, for storage of small, frequently-used data.

Data Reordering. Perform cache-conscious modifications of decoder memory accesses.

Cache Partitioning. Allocate different sections of the cache to different data types.

Many of these methods are relatively simple to implement, and most have precedents in other architectures since they have performance benefits beyond video decoding. Perhaps their efficacy for MPEG-2 processing might encourage their broader adoption in architectures. In any case, the main challenge is to successfully apply these various techniques in a manner that enhances decoder performance while avoiding excessive cost and complexity. The remainder of this section considers each method on its own.

Table 3: Data types accessed in each phase of decoder operation. Columns (data types): input, output, tabular, reference, block, state. Rows (phases of operation): read headers, read block, reconstruct MB, IDCT, merge MB data, write frame. [Check-mark positions are not recoverable; the rows carry 3, 4, 2, 3, 3, and 3 marks, respectively.]

8.1 Selective Caching

Data objects with no possibility of re-use steal cache space from other data, which leads to excess memory traffic when the excluded data are loaded in once again. Video output data is a perfect example, since it is written but never read by the CPU, and occupies a fairly large amount of space. Bypassing the cache entirely in favor of direct storage to main memory promises more efficient use.

In our simulations, we have found that excluding video output data alone from the cache reduces cache-memory traffic by up to 50%. Across the configurations considered, not caching video output yields a 25% reduction on average. Improvements in cache-memory traffic are global, yet the shape of the curve is not substantially modified. The plateau out to 512 Kbytes persists, as does the swift drop down at 2 Mbytes. Excluding output data is even more helpful for miss rates, which drop by a maximum of 85% from earlier levels, and 60% on average.

There are several distinct ways of implementing this feature. In the PA-RISC architecture, the load and store instructions can contain "locality hints" which signal that the data referred to is unlikely to be re-used. If supported by the particular implementation, the processor can elect to bypass the cache with the data [7]. The UltraSPARC has block load and store instructions, which move groups of data directly between registers and main memory at high speed [10]. This approach allows for potentially more efficient use of processor bandwidth. Finally, the PowerPC architecture provides for marking areas of memory as non-cacheable. Once marked, data is transparently and automatically transferred between registers and memory buffers. The cache hint and block load/store approaches require more explicit programming effort.
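The effect of selective caching (Section 8.1) can be illustrated with a toy model. The following is a minimal sketch, not the simulator used for the results above: it assumes a fully associative, write-back, write-allocate LRU cache, a 32-byte line, and a synthetic trace in which a small table is re-read before each burst of write-only output. All names and parameters (`LRUCache`, `run`, `OUT_BASE`, the sizes) are hypothetical.

```python
from collections import OrderedDict

LINE = 32            # assumed cache line size in bytes
WORD = 4             # assumed size of each access in bytes
TABLE_BYTES = 1024   # small, heavily re-used data (assumed)
OUT_BASE = 1 << 20   # start of the write-only output region (assumed)
OUT_PER_ITER = 2048  # output bytes written per iteration (assumed)


class LRUCache:
    """Fully associative write-back, write-allocate cache that counts
    the bytes moved between the cache (or CPU) and main memory."""

    def __init__(self, n_lines, bypass=lambda addr: False):
        self.n_lines = n_lines
        self.bypass = bypass          # predicate: True means do not cache
        self.lines = OrderedDict()    # line number -> dirty flag, LRU first
        self.traffic = 0

    def access(self, addr, write=False):
        if self.bypass(addr):
            self.traffic += WORD      # uncached access goes straight to memory
            return
        line = addr // LINE
        if line in self.lines:        # hit: update dirty flag and recency
            self.lines[line] = self.lines[line] or write
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.n_lines:
            _, dirty = self.lines.popitem(last=False)   # evict LRU line
            if dirty:
                self.traffic += LINE  # write back the dirty victim
        self.traffic += LINE          # line fill on a miss
        self.lines[line] = write

    def flush(self):
        """Write back every dirty line that is still resident."""
        self.traffic += LINE * sum(self.lines.values())
        self.lines.clear()


def run(cache, iterations=10):
    """Replay the synthetic trace and return total traffic in bytes."""
    out = OUT_BASE
    for _ in range(iterations):
        for addr in range(0, TABLE_BYTES, WORD):         # re-read the table
            cache.access(addr)
        for addr in range(out, out + OUT_PER_ITER, WORD):
            cache.access(addr, write=True)               # stream new output
        out += OUT_PER_ITER
    cache.flush()
    return cache.traffic


baseline = run(LRUCache(64))
selective = run(LRUCache(64, bypass=lambda addr: addr >= OUT_BASE))

if __name__ == "__main__":
    print("all data cached      :", baseline, "bytes")
    print("output bypasses cache:", selective, "bytes")
```

In this toy trace the saving comes from two sources: output lines are no longer filled before being overwritten, and the table is no longer evicted by the output stream. The size of the saving depends entirely on the assumed parameters and should not be read as a reproduction of the measured figures.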

8.2 Cache Locking

One way to protect small but frequently reused data types, like input, state, and tabular values, from victimization by raw video data is to lock the parts of the cache which contain the vulnerable data. While locked, these cache lines are invisible to the replacement algorithm, and the contents will not be thrown out only to be re-loaded when needed again. This is accomplished by special machine instructions which execute the locking and unlocking.

There are two basic variations on this technique. Static locking simply freezes the tag and contents of the affected line, allowing for the writing of values but not replacement; the line is associated with the same portion of main memory until unlocked. Dynamic locking is somewhat more flexible, treating locked lines as an extension of the register set, with special instructions to copy contents directly to and from main memory. This scheme is used in the Cyrix MediaGX processor to speed the software emulation of legacy PC hardware standards.
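Static locking as described in Section 8.2 can be modeled in a few lines. The sketch below is an illustration under assumptions, not a description of any real processor's locking instructions: it uses a fully associative LRU cache, a 32-byte line, and hypothetical `lock` and `unlock` methods that stand in for the special machine instructions.

```python
from collections import OrderedDict

LINE = 32  # assumed cache line size in bytes


class LockingCache:
    """Fully associative LRU cache whose replacement skips locked lines."""

    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.lines = OrderedDict()   # resident line numbers, LRU first
        self.locked = set()          # line numbers pinned in the cache

    def access(self, addr):
        """Return True on a hit, False on a miss."""
        line = addr // LINE
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        if len(self.lines) >= self.n_lines:
            # Choose the least recently used line that is not locked.
            victim = next((l for l in self.lines if l not in self.locked), None)
            if victim is None:
                return False         # every line is locked: serve uncached
            del self.lines[victim]
        self.lines[line] = True
        return False

    def lock(self, addr):
        """Load the line if necessary, then pin it (static locking)."""
        self.access(addr)
        line = addr // LINE
        if line in self.lines:
            self.locked.add(line)

    def unlock(self, addr):
        self.locked.discard(addr // LINE)


def table_misses(lock_table, rounds=5):
    """Count misses on a 4-line table that is re-read between bursts of
    streaming accesses large enough to sweep an 8-line cache."""
    cache = LockingCache(8)
    table = list(range(0, 4 * LINE, LINE))
    if lock_table:
        for addr in table:
            cache.lock(addr)
    misses = 0
    stream = 1 << 16                 # assumed base of the streaming data
    for _ in range(rounds):
        for addr in table:
            if not cache.access(addr):
                misses += 1
        for _ in range(16):          # streaming data, never re-used
            cache.access(stream)
            stream += LINE
    return misses


if __name__ == "__main__":
    print("table misses without locking:", table_misses(False))
    print("table misses with locking   :", table_misses(True))
```

The locked lines reduce the capacity left for everything else, which is the trade-off that distinguishes locking from a separate scratch memory.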

8.3 Scratch Memory

Another means of providing a safe haven for vulnerable data is to put them in a separate memory. Many processors today consist to a large extent not of logic but SRAM, in the form of caches. It is easy to imagine implementing a portion of the memory space itself on the die, not as a cache but as addressable memory. Data which are subject to being prematurely ejected from the cache could then be kept in this space. The components of the MPEG-2 data which would most benefit from this are relatively small, so only a few Kbytes would be required. This "scratch memory" would be relatively inexpensive to implement in hardware, smaller than most caches, and far simpler than any of them due to the lack of tags and logic for lookup, replacement, etc. Similar memories have actually been featured in Intel microcontrollers for quite some time, for storing Page 0 of the address space.

A scratch memory provides potentially very fast access to important data, since there is no possibility of a read or write miss, and the delay of searching tags can be eliminated. Intel microcontrollers even use special addressing modes to access Page 0 memory which are more concise and faster than those used for off-chip addresses. Unlike with cache locking, cache performance and capacity are not compromised since the scratch memory is off to the side. Of course, getting a hardware feature used in embedded systems, where operating systems and even compilers are optional, to work effectively in an interactive, multitasking environment will take careful consideration, but the potential benefits are considerable.

8.4 Data Reordering

This category of enhancement is a departure from our standard procedure of not tampering with the decoder itself. Our goal is not to improve computational efficiency, a task which we leave to the many others working on that problem. However, we suggest that modifications to how the decoder software manages and addresses memory can provide significant improvements in cache efficiency, some of which may require hardware support to properly implement.

For example, there is strictly speaking no essential need for the output data type. B frame data, considered separately, is never used by the decoder to construct other frames. The only time it is read is for copying to the output buffer space. I and P pictures need to remain available to serve as reference frames. Yet it turns out that in the natural flow of decoder data, by the time these frames are copied to the output space they are no longer needed. If pictures could be transferred directly by DMA from the reference frame space where they are stored, rather than routed through the output data, this would result in greater efficiency. It would eliminate all of the cycle-consuming copying of output data, and transform the B frame data into write-only values which could be efficiently excluded from the cache.

8.5 Cache Partitioning

It is possible to take the basic idea of cache locking in a slightly different direction. With cache partitioning, particular data types are relegated to specific sections of the cache, which continue to behave like cache. However, rather than locking in the physical address associated with a line, or treating cache lines as extensions of the register set, it is as if each data type has its own separate, smaller cache space for storing its values. This method provides very precise control over cache behavior, and an extremely powerful tool for the optimal management of cache resources. However, it requires a level of hardware complexity not demanded by the methods discussed previously. Also, the task of efficiently exploiting the available capabilities in software is considerably more challenging.

8.6 Summary

None of the methods proposed are mutually exclusive, although some of them, like cache locking and scratch memory, are arguably redundant if used in conjunction. Whether one of these techniques or some combination of them provides the most efficient solution remains to be seen. Also, each one raises issues of implementation cost and complexity, interaction with the operating system and other processes, and software interfacing and portability which need to be more thoroughly addressed. We will investigate these issues and perform a more detailed evaluation of these architectural enhancements in future work.

9 Conclusions

The effectiveness of computer memory subsystems, including caches, is currently a significant barrier to improved performance on multimedia applications. We have documented the memory traffic and data cache behavior of MPEG-2 video decoding on a general-purpose computer, and demonstrated that standard caches produce a significant excess of cache-memory traffic. While almost any cache is superior to none (even simple caches can reduce required memory bandwidth considerably), experimenting with basic cache parameters like cache size, associativity, and line size has a limited ability to reduce cache-memory traffic. The best value that can be achieved is double the absolute minimum traffic required. However, cache-oriented architectural enhancements, driven by an understanding of decoder behavior, can dramatically improve cache efficiency. These enhancements include selective caching, scratch memory, data reordering, and cache partitioning. All have in common an emphasis on treating different elements of the data distinctly, accommodating their unique properties: size, usage patterns, temporal locality, etc. This approach provides for optimal management of available cache resources. We expect that further refinement of these techniques will enable improved video decoding performance of general-purpose microprocessors, and encourage the broader proliferation of digital video platforms and applications.

10 Acknowledgments

This research was supported in part by NSF grant CCR9696196. Miriam Leeser is the recipient of an NSF Young Investigator Award. The research is also supported by generous donations from Sun Microsystems. The authors would like to thank Shantanu Tarafdar and Brian Smith for useful discussions.

References

[1] Stefan Eckart et al. ISO/IEC MPEG-2 software video codec. In Proc. Digital Video Compression: Algorithms and Technologies 1995, pages 100-109. SPIE, Jan. 1995.
[2] Peter N. Glaskowsky. Fujitsu aims media processor at DVD. Microprocessor Report, 10:11-13, Nov. 1996.
[3] Linley Gwennap. Mitsubishi designs DVD decoder. Microprocessor Report, 10:1,6-9, Dec. 1996.
[4] International Organization for Standardization. ISO/IEC JTC1/SC29/WG11/602 13818-2 Committee Draft (MPEG-2), Nov. 1993.
[5] Paul Kalapathy. Hardware-software interactions on MPACT. IEEE Micro, 17(2):20-26, Mar./Apr. 1997.
[6] James R. Larus. Efficient program tracing. IEEE Computer, 26(5):52-61, May 1993.
[7] Ruby B. Lee. Subword parallelism with MAX-2. IEEE Micro, 16(4):51-59, Aug. 1996.
[8] Ketan Patel et al. Performance of a software MPEG video decoder. In Proc. 1st ACM Intl. Conf. on Multimedia, pages 75-82. ACM, Aug. 1993.
[9] Alex Peleg et al. Intel MMX for multimedia PCs. Comm. of the ACM, 40(1):25-38, Jan. 1997.
[10] Marc Tremblay et al. VIS speeds new media processing. IEEE Micro, 16(4):10-20, Aug. 1996.
[11] Daniel F. Zucker et al. A comparison of hardware prefetching techniques for multimedia. Technical Report CSL-TR-95683, Stanford University Depts. of Electrical Engineering and Computer Science, Stanford, CA, Dec. 1995.