Performance-Power Analysis of H.265/HEVC and H.264/AVC Running on Multicore Cache Systems

Abu Asaduzzaman, Vidya R. Suryanarayana

Mizan Rahman

Dept. of Elec. Eng. and Computer Science, Wichita State University, Wichita, Kansas, USA. [email protected]

School of Computational Sci. & Eng., Georgia Institute of Technology, Atlanta, Georgia, USA. [email protected]

Abstract—The leading problem of adopting caches into multicore computing systems is twofold: caches worsen execution time predictability (which challenges support for real-time multimedia applications), and caches are power hungry (which challenges energy constraints). Recently published articles suggest that cache locking improves timing predictability. However, the increased cache activity caused by aggressive cache locking makes the system consume more energy and become less efficient. In this paper, we investigate the impact of multicore cache parameters and cache locking on performance and power consumption for real-time multimedia applications. We consider an Intel Xeon-like multicore architecture with a two-level cache memory hierarchy and use two popular multimedia applications: the recently introduced H.265/HEVC (for improved video quality and data compression ratio) and H.264/AVC (the network-friendly video coding standard). Experimental results suggest that cache optimization has the potential to improve multicore performance by reducing the cache miss rate by up to 36% and to save up to 33% in power consumption. It is also observed that H.265/HEVC has a significant performance advantage over H.264/AVC on multicore systems with smaller cache memories.

Keywords—Cache memory hierarchy, cache optimization, multicore architecture, multimedia applications, performance and power analysis

I. INTRODUCTION

The growing popularity of low-power consumer electronics challenges designers to add more features that support real-time video applications. The newly added functionality increases the complexity of such systems, and it becomes obvious that more computation power is required to support complex applications in real time. Increased processing implies more traffic between the CPU and main memory. CPU processing speed is increasing at a much faster rate than memory bandwidth, leading to a significant CPU–memory speed gap. A common practice for dealing with the memory bandwidth bottleneck is to use cache memory – a very fast and small but expensive memory. Cache improves system performance by reducing the effective memory access time [1, 2].

Multicore architecture is the new design trend for improving the performance/power ratio by running multiple jobs on multiple cores concurrently. Most contemporary processors have caches. The cache memory hierarchy normally has a level-1 cache (CL1), a level-2 cache (CL2), and main memory. In most modern processors, CL1 is split into instruction (I1) and data (D1) caches, and CL2 is a unified cache.
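The claim that cache reduces the effective memory access time can be made concrete with the standard average-memory-access-time (AMAT) model; the latencies and miss rates below are illustrative assumptions, not measurements from this paper.

```python
# Average Memory Access Time (AMAT) for a two-level cache hierarchy.
# All latencies (in cycles) and miss rates are illustrative assumptions.

def amat(t_l1, m_l1, t_l2, m_l2, t_mem):
    """AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory time)."""
    return t_l1 + m_l1 * (t_l2 + m_l2 * t_mem)

# Without CL2: every CL1 miss goes straight to main memory.
no_l2 = amat(1, 0.05, 0, 1.0, 100)     # 1 + 0.05 * 100 = 6.0 cycles
# With CL2: most CL1 misses are caught by the much faster CL2.
with_l2 = amat(1, 0.05, 10, 0.2, 100)  # 1 + 0.05 * (10 + 0.2 * 100) = 2.5 cycles

print(no_l2, with_l2)
```

Even with these rough numbers, adding a second cache level cuts the effective access time by more than half, which is why two-level hierarchies are standard despite their energy cost.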

978-1-4673-6361-7/13/$31.00 ©2013 IEEE


Multimedia streams are encoded before transmission to save bandwidth. Encoders compress input video streams, and encoded files must be decoded before they can be played back. There are dependencies among video frames during encoding and decoding [3, 4]. Because of these dependencies, the right selection of cache parameters may significantly improve performance and reduce energy consumption. Therefore, it is important to understand both cache parameters and multimedia applications in order to design effective multimedia systems.

Even though cache improves performance, it consumes additional energy and worsens execution time predictability due to its adaptive and dynamic nature [5, 6]. The additional energy requirement becomes crucial for low-power systems, especially when they are battery operated, and execution time is crucial for real-time multimedia systems. For systems supporting real-time video applications, there is no straightforward way to judge the trade-off between performance and power consumption [7]. In general, a simple architecture that can support multimedia applications using a minimum amount of energy is needed to design such electronic systems [8].

Recent studies show that cache locking may decrease overall execution time [5, 6] and that cache locking may improve performance up to a limit. However, aggressive cache locking may increase energy consumption and reduce performance. In this work, we explore the impact of multicore cache optimization on performance and power consumption for multimedia applications.

The outline of this paper is as follows. Related work is summarized in Section II. Section III briefly discusses the dependencies in video CODECs (enCOder and DECoder pairs). Section IV reviews important cache parameters and the cache locking technique. Experimental details are presented in Section V, and simulation results are discussed in Section VI.
Finally, we conclude our work in Section VII.

II. RELATED WORK

Motion Pictures Expert Group (MPEG) introduced MPEG-2 to provide compression support for TV-quality transmission of digital video [4]. MPEG-4 (Part-2) compression is an improvement over the MPEG-2 format (without compromising quality) that brings other advantages in bitrate (MPEG-2: 4 to 9 MB per second; MPEG-4: some kilobytes per second) and bandwidth (MPEG-2: up to 40 MB per second; MPEG-4: around 64 KB per second) [8]. The MPEG-4 (Part-2) standard is generally used for streaming media. MPEG and the ITU-T (International Telecommunication Union – Telecommunication Standardization Sector) Video Coding Experts Group jointly developed H.264 advanced video coding (AVC), which is more attractive for video network delivery [9]. The newly proposed H.265 high efficiency video coding (HEVC) is optimized for mobile devices; it is expected to improve video quality, double the data compression ratio compared to H.264, and support resolutions up to 8K UHD (7680 × 4320) [10, 11]. These articles ([4, 8-11]) show that there are dependencies among video frames. However, they do not show how the underlying architecture (such as the cache memory organization) impacts the performance of multimedia applications.

The performance of a general-purpose computing platform running an MPEG-2 application is studied in [4], where the impact of cache parameters on performance (in terms of miss rate and memory traffic) is evaluated. Experimental results show that cache may improve MPEG-2 performance on such a system. Cache modeling and optimization for an MPEG-4 video decoder is conducted in [1]. The target architecture includes a processing core to run the decoding algorithm and a memory hierarchy with two levels of cache, where both level-1 and level-2 caches are unified. Cache parameters are optimized to improve system performance; experiments show that performance can be improved by lowering the miss rates through optimizing the CL1 line size, CL1 associativity level, and CL2 cache size for the MPEG-4 decoder. The problem of improving memory hierarchy performance at the system level for multitasking data-intensive applications is also addressed. The method uses cache partitioning techniques to find a static task execution order for inter-task data cache misses; given the lack of freedom in reordering task execution, it improves performance by further optimizing the caches. The above-mentioned articles ([1, 4]) present different cache optimization techniques to improve performance. However, these articles do not address the tradeoff between performance and power consumption, which is important for present systems.

For multicore systems, power consumption is a crucial design factor due to the presence of multiple power-hungry caches. A victim buffer (between CL1 and CL2) is introduced to reduce power consumption in [12]. Instead of discarding victim blocks (from CL1) that might be referenced in the near future, the victim blocks are stored in the victim buffer. Experimental results show that a victim buffer can reduce energy by 43% on the PowerStone and MediaBench benchmarks. In [13], a system-level power-aware design flow is proposed in order to avoid failures after months of design time spent at the register transfer level and gate level. In [14], an analytical model for power estimation and average memory access time is presented. The articles in [12-14] discuss how overall energy consumption can be reduced by optimizing various cache parameters, but they do not cover performance analysis. Moreover, these articles do not address how execution time (a crucial design factor for real-time systems) and power consumption due to H.265/HEVC can be decreased.
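The victim-buffer scheme of [12] described above can be sketched as follows; the buffer capacity, FIFO eviction policy, and block granularity here are simplifying assumptions for illustration, not details taken from [12].

```python
from collections import OrderedDict

class VictimBuffer:
    """Tiny fully-associative buffer holding blocks evicted from CL1.

    A block evicted from CL1 is parked here instead of being discarded;
    a later CL1 miss first probes the buffer before going to CL2.
    (Capacity and FIFO replacement are illustrative assumptions.)
    """
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.blocks = OrderedDict()          # block address -> data

    def park(self, addr, data):
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # FIFO: drop the oldest victim
        self.blocks[addr] = data

    def probe(self, addr):
        # A hit in the buffer avoids a (more expensive) CL2 access.
        return self.blocks.pop(addr, None)

vb = VictimBuffer(capacity=2)
vb.park(0x100, "blk A")
vb.park(0x200, "blk B")
vb.park(0x300, "blk C")          # capacity 2: block 0x100 is dropped
print(vb.probe(0x100))           # None  -> must go to CL2
print(vb.probe(0x200))           # blk B -> CL2 access avoided
```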


It has been proposed that important cache contents be statically locked in the cache so that memory access time and cache-related preemption delay are predictable. In [5], the impact of cache locking on execution time predictability and performance is studied. Experimental results show that cache locking improves both predictability and performance when the appropriate cache parameters are used and the right cache blocks are locked; predictability can be further improved by sacrificing performance. Various cache locking algorithms are also used. The major shortcoming of these studies is that no analysis is done to show how power consumption is impacted. In [3], a method is given to rapidly find the L1 cache miss rate of an application; an energy model and an execution time model are also developed to find the best cache configuration for a given embedded application. However, this work does not offer any methodology to improve performance and decrease power consumption.

III. DEPENDENCIES IN VIDEO CODECS

This work considers two state-of-the-art multimedia encoder-decoder pairs, H.264/AVC and H.265/HEVC. The encoders compress input video streams, and the decoders decode the compressed video data. There are dependencies among video frames during encoding and decoding [3, 4]. Therefore, the multicore cache memory hierarchy may significantly impact the performance and power consumption of multimedia applications.

A. H.264/AVC CODEC
H.264/AVC (a.k.a. MPEG-4 Part-10) promises to significantly outperform both of its parents, H.263 and MPEG-4 (a.k.a. MPEG-4 Part-2), by providing high-quality, low bit-rate streaming video. The H.264/AVC CODEC includes two dataflow paths: a "Forward" path (encoder, left to right) and a "Reconstruction" path (decoder, right to left) [9]. H.264/AVC encoding performance can be improved by using the cache memory efficiently. A frame may be processed in Intra or Inter mode. In Intra mode, predicted frames are formed from samples in the current frame that have previously been encoded, decoded, and reconstructed. In Inter mode, predicted frames are formed by motion-compensated prediction from one or more reference frames [9]. Therefore, the performance and power consumption of the H.264/AVC encoder are influenced by the presence of the cache (cache parameters and cache locking).

In H.264/AVC, the decoder receives an encoded bit-stream from the Network Abstraction Layer. The data elements are entropy decoded and reordered to produce a set of quantized coefficients. Using the header information from the bit-stream, the decoder creates a prediction macroblock identical to the original prediction formed in the encoder. As a result, cache parameters and cache locking also influence the performance and power consumption of the H.264/AVC decoder.

B. H.265/HEVC CODEC
H.265/HEVC is the new video compression standard, a successor to H.264/MPEG-4 AVC, currently under joint development by the ISO/IEC MPEG and the ITU-T Video Coding Experts Group (VCEG) [10].

The HEVC video coding layer uses the same "hybrid" approach used in all modern video standards, starting from H.261, in that it uses inter-/intra-picture prediction and 2D transform coding [10]. HEVC replaces the macroblocks used in previous standards with a new coding scheme that uses larger block structures of up to 64×64 pixels and can better sub-partition the picture into variable-sized structures. Improved context-adaptive binary arithmetic coding is the only entropy coding method allowed in HEVC. HEVC specifies 33 directional modes for intra prediction, compared to the 8 directional modes of H.264/AVC. HEVC uses half-sample or quarter-sample precision with a 7-tap or 8-tap filter, and it allows two motion vector modes: advanced motion vector prediction and merge mode. HEVC specifies four transform unit sizes (4×4, 8×8, 16×16, and 32×32) to code the prediction residual, and two loop filters that are applied in order, with the de-blocking filter applied first and the sample adaptive offset filter applied afterwards. H.265/HEVC works on larger blocks than H.264/AVC (64×64 versus 16×16) and can split a single frame into multiple tiles, so multicore processors can spread decoding across parallel subtasks.

IV. CACHE OPTIMIZATION

Cache optimization has the potential to decrease cache misses. In this work, we explore the impact of cache parameters and cache locking on performance and power consumption for the H.264/AVC and H.265/HEVC CODECs running on multicore systems. Processing cores use the cache memory hierarchy to read unprocessed video data from main memory and to write processed video data back into main memory. Therefore, the selection of cache parameters and the amount of cache locked is extremely important for achieving optimal performance and power consumption.

A. Cache Parameters
Most contemporary processors have on-chip CL1 and off-chip CL2. Cache parameters include the cache size, the block size or cache line size (say, Sb), the associativity level, and the number of sets. For a cache with S sets and W ways of associativity, the total number of blocks is B = S * W and the total cache size is B * Sb. A cache can be direct mapped (W = 1 and S = B), set-associative (1 < W < B), or fully-associative (W = B and S = 1). Caches take a significant amount of additional energy to operate and make the system less predictable due to their dynamic behavior [5, 6]. So adding more caches to a system is problematic, especially when the system is battery operated and runs real-time applications. In this work, we examine the impact of cache parameters on miss rate and total power consumption.

Cache Size: Studies show that for MPEG-2, increasing the cache size offers performance improvement; after a point, however, the improvement is not significant [4].

Line Size: Larger line sizes tend to provide lower miss rates, but require more data to be read (and possibly written back) on a cache miss. For video data, too large a line size may introduce cache pollution and decrease performance [5].

Associativity Level: Studies indicate that higher associativity may improve the performance/power ratio by reducing conflict misses. For video data, an aggressive increase in associativity level may not improve overall performance due to the increased complexity [5].

B. Cache Locking
Cache locking is a technique to hold certain memory blocks inside the cache for the entire execution time. The memory blocks may be pre-selected or randomly chosen and are pre-loaded in order to improve hit rates. Once such a block is loaded into the cache, the replacement algorithm excludes it from being removed until execution is completed. Figure 1 illustrates how randomly selected memory blocks can be preloaded into the cache: first all Way-0 locations are preloaded, then Way-1, and so on. In this example, locking Way-0 locks 4 slots out of 8, i.e., 50% of the cache. Studies show that cache locking improves predictability for computation-intensive multimedia applications [6].
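The geometry relations above (B = S * W, total size = B * Sb) can be sketched as a small helper; the 32 KB / 64-byte / 8-way example values are illustrative assumptions, not parameters from the experiments.

```python
# Cache geometry per Section IV-A: total blocks B = S * W, cache size = B * Sb.
def cache_geometry(size_bytes, block_bytes, ways):
    blocks = size_bytes // block_bytes          # B = size / Sb
    sets = blocks // ways                       # S = B / W
    if ways == 1:
        mapping = "direct mapped"               # W = 1, S = B
    elif ways == blocks:
        mapping = "fully associative"           # W = B, S = 1
    else:
        mapping = "set associative"             # 1 < W < B
    return blocks, sets, mapping

# Example: a 32 KB cache with 64-byte lines, 8-way (illustrative values).
print(cache_geometry(32 * 1024, 64, 8))   # (512, 64, 'set associative')
```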


Fig. 1 Cache preloading and locking in a 2-way set-associative cache organization
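The preloading-and-locking mechanism of Figure 1 can be sketched for one cache set as follows; this is a minimal model assuming random replacement among the unlocked ways (matching the assumption stated in Section V), not the simulator used in the paper.

```python
import random

class LockableSet:
    """One set of a 2-way set-associative cache with way locking (cf. Fig. 1).

    A locked way is preloaded and excluded from replacement until the run
    ends; victims are chosen randomly among the unlocked ways, matching the
    random replacement policy assumed in the paper's experiments.
    """
    def __init__(self, ways=2):
        self.tags = [None] * ways
        self.locked = [False] * ways

    def preload_and_lock(self, way, tag):
        self.tags[way] = tag
        self.locked[way] = True                 # replacement may never evict it

    def access(self, tag):
        if tag in self.tags:
            return True                         # hit (locked or not)
        victims = [w for w, lk in enumerate(self.locked) if not lk]
        if victims:                             # replace only an unlocked way
            self.tags[random.choice(victims)] = tag
        return False                            # miss

s = LockableSet()
s.preload_and_lock(0, "hot-block")              # Way-0 locked: 50% of this set
s.access("x"); s.access("y")                    # evictions touch only Way-1
print(s.access("hot-block"))                    # True: locked block never evicted
```

Locking Way-0 here halves the effective capacity of the set, which is exactly why the paper warns that aggressive locking can raise the miss rate.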

Aggressive random cache locking may decrease performance by increasing cache misses due to the reduction in effective cache size. If the memory blocks are wisely selected, cache misses should decrease. Recently published work shows that, for smaller caches, performance increases significantly with an increase in the locked cache size [1, 5, 6].

V. EXPERIMENTAL DETAILS

In this study, we model a multicore architecture and run simulation programs using H.264/AVC and H.265/HEVC workloads. The architecture has a two-level cache memory hierarchy. We optimize the line size, associativity level, CL2 size, and the number of cores, and we implement CL2 cache locking in an eight-core system. In the following subsections, we discuss the simulated architecture and related topics.

A. Simulated Architecture
We simulate a Xeon-like quad-core architecture with shared CL2, as illustrated in Figure 2. Each core of the simulated multicore architecture has a private CL1 (split into I1 and D1), and the shared CL2 is a unified cache.

B. Simulation Tools
We use the Cachegrind [15] and VisualSim [16] simulation tools. Cachegrind is used to simulate the level-1 (I1 and D1) and level-2 caches. Using JM-RS (96) [17] with Cachegrind, we characterize the H.264/AVC encoding/decoding algorithms, and using the "isovideo" software package [18], we characterize the H.265/HEVC encoding/decoding algorithms. From the Cachegrind results, we create workloads of the multimedia applications to drive the VisualSim simulation programs. Using VisualSim, we obtain the power consumption and mean delay of the multicore system.

C. Workload
We use two representative real-time applications, H.264/AVC and H.265/HEVC, in this study. We characterize the applications using the Cachegrind package and create workloads that capture all possible scenarios the target architecture will experience. Video sequences in the uncompressed YUV4MPEG format used by the "mjpegtools project" are selected to generate the workloads and run the simulation programs. The I1, D1, and CL2 read and write references due to the CODECs are obtained. Tables I and II show the references due to the H.264/AVC encoder.

Fig. 2 Important components of the simulated multicore architecture

Each core processes (encodes or decodes) video streams from the main memory. Each core reads the raw/encoded data from (and writes the encoded/decoded data into) the main memory through the shared CL2 and its private CL1. Task partitioning and cache locking decisions are made at the CPU level (by the master core).

TABLE I. I1, D1, AND CL2 REFERENCES FOR H.264/AVC ENCODER DUE TO YUV VIDEO SEQUENCES

YUV Video Sequences (Frames) | File Size (MB) | I1 Refs (kilo) | D1 Refs (kilo) | I1 % | D1 % | CL2 Refs (kilo) | CL1 Miss Ratio (%)
Akiyo (300)   | 11.1 | 118,183 | 47,239  | 71.4 | 28.6 | 5,271  | 3.1
Paris (1065)  | 37.4 | 294,128 | 129,750 | 69.4 | 30.6 | 27,183 | 6.1
News (300)    | 11.1 | 132,718 | 53,858  | 71.1 | 28.9 | 6,837  | 3.5
Foreman (300) | 11.1 | 144,074 | 60,036  | 70.6 | 29.4 | 8,325  | 3.9
Suzie (150)   | 4.8  | 97,745  | 34,599  | 73.9 | 26.1 | 2,293  | 1.7

TABLE II. D1 AND CL2 READ/WRITE REFERENCES FOR H.264/AVC ENCODER DUE TO YUV VIDEO SEQUENCES

YUV Video Sequences (Frames) | File Size (MB) | D1 Read (kilo) | D1 Write (kilo) | D1 R% | D1 W% | CL2 Read (kilo) | CL2 Write (kilo) | CL2 R% | CL2 W%
Akiyo (300)   | 11.1 | 36,941  | 10,298 | 78.2 | 21.8 | 3,568  | 1,703 | 67.7 | 32.3
Paris (1065)  | 37.4 | 102,762 | 26,988 | 79.2 | 20.8 | 19,150 | 8,603 | 69.0 | 31.0
News (300)    | 11.1 | 41,902  | 11,956 | 77.8 | 22.2 | 4,629  | 2,208 | 67.1 | 32.9
Foreman (300) | 11.1 | 47,128  | 12,908 | 78.5 | 21.5 | 5,636  | 2,689 | 68.8 | 31.2
Suzie (150)   | 4.8  | 25,569  | 9,030  | 74.9 | 25.1 | 1,551  | 741   | 65.9 | 34.1

Using Cachegrind, we obtain the total number of references for the I1, D1 (read and write), and level-2 (read and write) caches using different cache parameters. Table III shows the workload for the H.264/AVC and H.265/HEVC encoders.

TABLE III. AVERAGE CACHE REFERENCES FOR H.264 AND H.265 ENCODERS

Standard   | Codec      | CL1 I1% | CL1 D1% | D1 R% | D1 W% | CL2 R% | CL2 W%
H.264/AVC  | JM RS (96) | 73      | 27      | 79    | 21    | 70     | 30
H.265/HEVC | "isovideo" | 82      | 18      | 86    | 14    | 79     | 21
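As a quick consistency check on Table I (not part of the original experiments), the reported CL2 references should roughly equal the total CL1 references times the CL1 miss ratio, since CL1 misses are forwarded to CL2:

```python
# Consistency check on Table I: CL2 references approximate
# (I1 + D1 references) * CL1 miss ratio, since CL1 misses go to CL2.
rows = {  # sequence: (I1 refs K, D1 refs K, CL2 refs K, reported miss ratio %)
    "Akiyo": (118_183, 47_239, 5_271, 3.1),
    "Suzie": (97_745, 34_599, 2_293, 1.7),
}
for name, (i1, d1, cl2, reported) in rows.items():
    implied = 100 * cl2 / (i1 + d1)   # miss ratio implied by CL2 traffic
    # e.g. Akiyo: implied ~3.19% vs reported 3.1% (rounding in the table)
    print(f"{name}: implied {implied:.2f}% vs reported {reported}%")
```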
D. Assumptions
• All cores are identical. CL1 is private and split into I1 and D1 caches; CL2 is a shared, unified cache.
• Line sizes from 16 to 256 bytes, associativity levels from 2- to 16-way, level-2 cache sizes from 128 KB to 4 MB, and cache locking capacities from 0% (no locking) to 50% are used.
• Only cache locking at CL2 is considered. Memory blocks are selected randomly for cache locking.
• 16-way set-associative cache mapping and way cache locking are considered. Therefore, the minimum portion of the cache that can be locked is 1/16 (i.e., 6.25%).
• A random cache block replacement policy and a write-back memory update strategy are used.
• The dedicated bus that connects CL1 and CL2 introduces negligible delay compared to the delay introduced by the system bus, which connects CL2 and main memory.

E. Power Consumption
The total power consumption is calculated using Equation (1). For a system with X components and Y tasks, the total power consumption can be expressed as:

Pt(total) = Σ(i=1..X) Σ(j=1..Y) (Pij(active) + Pij(idle))    (1)
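Equation (1) can be instantiated as a short sketch using the per-operation active-power figures the paper assigns to the processor components (CPU 25.92, I1 19.44, D1 11.52, bus and others 15.12 units/operation); the idle power value and the operation counts below are illustrative assumptions.

```python
# Total power per Equation (1): Pt = sum over components i and tasks j of
# (Pij(active) + Pij(idle)).  Active units/operation follow the paper;
# idle units and per-task operation counts are illustrative assumptions.
ACTIVE_UNITS = {"CPU": 25.92, "I1": 19.44, "D1": 11.52, "bus+other": 15.12}

def total_power(ops_per_task, idle_units=1.0):
    """ops_per_task maps each task j to {component i: operation count}."""
    total = 0.0
    for ops in ops_per_task.values():           # j = 1 .. Y
        for comp, n in ops.items():             # i = 1 .. X
            total += ACTIVE_UNITS[comp] * n + idle_units
    return total

tasks = {"encode": {"CPU": 100, "I1": 70, "D1": 30},
         "decode": {"CPU": 60, "I1": 45, "D1": 15}}
print(total_power(tasks))   # ~6907.2 units with these illustrative counts
```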

In our experiments, we consider the CPU, CL1, buses, CL2, and main memory (MM) for power consumption. The power consumed by the cache memory subsystem depends on cache size and activity (such as cache misses) due to the applications being executed. Power consumption is distributed among the processor components as follows: CPU 25.92 units/operation, I1 19.44 units/operation, D1 11.52 units/operation, and bus and other components 15.12 units/operation [19]. Power consumption by CL2 and MM is assumed to be proportional to size and the number of cache hits.

VI. RESULTS AND DISCUSSION

In this work, we examine the impact of cache parameters and cache locking on power consumption and performance for the CODECs of the H.265/HEVC and H.264/AVC multimedia applications. We obtain the cache miss rates (fewer cache misses mean better performance) and power consumption for various cache sizes, locked cache sizes, and numbers of cores.

A. Impact of CL1 Cache Size
If the cache size increases, more memory blocks can be stored in the cache; therefore, the miss ratio decreases. Figure 3 illustrates the impact of level-1 cache size on the H.264/AVC and H.265/HEVC video encoders. As the CL1 cache size increases, the CL1 miss ratio decreases for all applications. Experimental results show that the miss ratio of the H.265/HEVC encoder is always smaller than that of the H.264/AVC encoder (see CL1 size 4 KB in Figure 3). For large cache sizes (say, 16 KB or larger), some applications do not exhibit any cache misses; this is because those applications fit entirely in the I1 cache.

Fig. 3 CL1 miss ratio vs. CL1 cache size for encoders

B. Impact of Cache Locking
To examine which application (H.264/AVC or H.265/HEVC) takes better advantage of multiple cores and cache optimization, we compare the percentage reduction in miss ratio and power consumption due to CL2 cache locking on an 8-core system. As illustrated in Figure 4, the H.265/HEVC workload reduces the miss ratio more than the H.264/AVC workload does for all locked cache sizes.


Fig. 4 Impact of cache locking at shared CL2 on miss ratio for H.264/AVC and H.265/HEVC workloads

Similarly, the H.265/HEVC workload reduces power consumption more than the H.264/AVC workload does (see Figure 5). A 25% locked cache size shows the maximum energy saving because the most misses are avoided at 25% locking.

Fig. 5 Impact of cache locking at shared CL2 on power consumption for H.264/AVC and H.265/HEVC workloads

C. Impact of Multiple Cores
Finally, we discuss the impact of multiple cores (2 to 32) on performance and power consumption while running the multimedia applications. Simulation results show that the mean delay decreases but the power consumption increases as the number of cores increases (see Figures 6 and 7). For the given cache parameters and workloads, eight cores is the optimal choice when both performance and power consumption are considered. It is also noted that the impact of cache locking on mean delay is significant for a small number of cores (Figure 6), whereas the impact of cache locking on power consumption is significant for a large number of cores (Figure 7).
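One common way to make an "optimal core count" claim concrete, when delay falls but power rises, is to minimize an energy-delay style metric; the delay and power values below are illustrative assumptions chosen only to show the shape of such a tradeoff, not the paper's simulation results.

```python
# Picking a core count when mean delay falls but power rises, by minimizing
# the power-delay product (an energy-delay-product variant is also common).
# All numbers are illustrative assumptions, not measured results.
delay = {2: 20.0, 4: 11.0, 8: 6.0, 16: 5.0, 32: 4.8}   # mean delay (a.u.)
power = {2: 1.0, 4: 1.6, 8: 2.6, 16: 5.0, 32: 10.0}    # power (a.u.)

pdp = {n: power[n] * delay[n] for n in delay}
best = min(pdp, key=pdp.get)
print(best)   # the knee of this illustrative tradeoff
```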

Fig. 6 Impact of adding cores on mean delay

Fig. 7 Impact of adding cores on power consumption

VII. CONCLUSIONS

In order to run computation-intensive multimedia applications like H.265/HEVC and H.264/AVC, high processing speed is required. Cache improves processing speed by bridging the speed gap between the slow main memory and the fast CPU. However, cache requires additional power to operate. As multicore systems have many power-hungry cache memories, it is important to analyze the performance-power tradeoff for such complex systems. In this paper, we investigate the impact of multicore cache on performance and power consumption for the H.265/HEVC and H.264/AVC applications. We simulate a Xeon-like quad-core architecture with two levels of cache memory organization. According to the experimental results, cache size has a significant impact on performance for various multimedia applications (see Figure 3). It is observed that the miss rate and power consumption can be decreased by up to 30% and 21%, respectively (see Figures 4 and 5). It is noticed that eight cores are the optimal choice in this experiment (see Figures 6 and 7). Simulation results also reveal that the H.265/HEVC encoder has a performance advantage over the H.264/AVC encoder for smaller caches; this is because the H.265/HEVC encoder produces smaller executable files than the H.264/AVC encoder. As an extension to this work, it would be interesting to explore the impact of optimized cache locking versus CUDA/GPU technology on the performance/power ratio of multimedia applications.

REFERENCES

[1] Asaduzzaman A (2013) A Power-Aware Cache Organization Effective for Real-Time Distributed and Embedded Systems. Journal of Computers (JCP), Vol. 8, No. 1, pp. 49-60.
[2] Certner O, Li Z, et al (2008) A Practical Approach for Reconciling High and Predictable Performance in Non-Regular Parallel. Design, Automation and Test in Europe (DATE'08).
[3] Janapsatya A, Ignjatovic A, Parameswaran S (2006) Finding Optimal L1 Cache Configuration for Embedded Systems. In Proceedings of the 2006 Conference on Asia South Pacific Design Automation, pp. 796-801.
[4] Soderquist P, Leeser M (1997) Optimizing the Data Cache Performance of a Software MPEG-2 Video Decoder. ACM Multimedia 97 – Electronic Proceedings, Seattle, WA.
[5] Asaduzzaman A, Sibai FN (2009) Impact of Level-2 Cache Sharing on the Performance and Power Requirements of Multicore Embedded Systems. MICPRO Journal, Vol. 33, No. 5-6, pp. 388-397.
[6] Puaut I (2006) Cache Analysis Vs Static Cache Locking for Schedulability Analysis in Multitasking Real-Time Systems. (Accessed: 5/15/2013) http://citeseer.ist.psu.edu/534615.html
[7] Wolf W, et al (2003) Memory System Optimization of Embedded Software. Proceedings of the IEEE, Vol. 91, No. 1, pp. 165-182.
[8] Chase J, Pretty C (2002) Efficient Algorithms for MPEG-4 Video Decoding. TechOnLine, University of Canterbury, New Zealand.
[9] Richardson I. H.264 / MPEG-4 Part 10: Overview. (Accessed: 5/15/2013) www.vcodex.com
[10] Sullivan GJ, Ohm JR, Han WJ, Wiegand T (2012) Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Transactions on Circuits and Systems for Video Technology.
[11] Panayides A, Antoniou Z, Pattichis MS, et al (2012) High Efficiency Video Coding for Ultrasound Video Communication in M-Health Systems. In the 34th Annual International Conference of the IEEE EMBS, San Diego, CA, pp. 2170-2173.
[12] Zhang C, Vahid F, Lysecky R (2004) A Self-Tuning Cache Architecture for Embedded Systems. ACM Transactions on Embedded Computing Systems, Vol. 3, No. 2, pp. 407-425.
[13] Nebel W (2004) System-Level Power Optimization. IEEE/DSD-04.
[14] Ahmed RE (2006) Energy-Aware Cache Coherence Protocol for Chip-Multiprocessors. In the Canadian Conference on Electrical and Computer Engineering (CCECE'06), pp. 82-85.
[15] Cachegrind – a cache profiler from Valgrind. (Accessed: 5/15/2013) http://valgrind.kde.org/index.html
[16] VisualSim – system-level simulator. (Accessed: 5/15/2013) www.mirabilisdesign.com
[17] JM-RS (96) – H.264/AVC Reference Software. (Accessed: 5/15/2013) http://iphome.hhi.de/suehring/tml/download/
[18] isovideo software package. (Accessed: 5/15/2013) http://www.isovideo.com/deinterlacing_before_compression_long.php; http://trace.eas.asu.edu/yuv/
[19] Tang W, Gupta R, Nicolau A (2002) Power Savings in Embedded Processors through Decode Filter Cache. In Proceedings of the 2002 Design, Automation and Test in Europe Conference and Exhibition (DATE'02), pp. 1-6.