2015 International Conference on Identification, Information, and Knowledge in the Internet of Things

Towards a Load-Aware Scheduling Framework for Realtime Video Cloud

Weishan Zhang 1, Pengcheng Duan 1, Qinghua Lu 1,2
1 Department of Software Engineering, China University of Petroleum, No.66 Changjiang West Road, Qingdao, China 266580
2 NICTA, 13 Garden St, Eveleigh, NSW 2015, Australia
zhangws@upc.edu.cn, chaseandblack@gmail.com, dr.qinghua.lu@gmail.com

978-1-4673-8637-1/15 $31.00 © 2015 IEEE. DOI 10.1109/IIKI.2015.8

Abstract

Many video applications, such as traffic jam detection and criminal tracking, require quick responses for video processing, and hence rely on a realtime supporting framework. Compared with CPU processing, GPU acceleration can achieve high performance. However, in the context of Cloud Computing, GPU-based jobs consume fewer CPU resources yet occupy much more memory than CPU-based jobs, which matters especially when bottlenecks occur in CPUs or memory. In this paper, we propose a load-aware pluggable cloud framework for real-time video processing in which load-aware CPU-GPU switching is conducted at run time to alleviate the potential imbalance. We have evaluated the framework on its performance, reusability, pluggability and scalability to show its effectiveness.

1 Introduction

Video cameras are now pervasive in our surroundings, in ATMs, office buildings, elevators and schools, and hence video data are growing at an unprecedented scale. According to the International Data Corporation's report "The Digital Universe in 2020," half of global big data were surveillance video data in 2012, and the percentage was expected to reach 65 percent by 2015. Video data manipulation mainly includes compression, storage, transmission and intelligent analysis. Many new applications are emerging from intelligent video analysis, such as prediction of traffic congestion, accident vehicle tracking and abnormal behavior detection. One example is terrorism prevention: the 2014 terrorist attack at the Kunming railway station in China and the recent similar attacks in Paris, France caused great concern for public security. This raises a new challenge: intelligent video processing that can detect such situations in real time.

The emergence of "Big Data" frameworks such as Hadoop1 and Storm2 makes it possible to process large volumes of video data more efficiently than classical approaches. For example, Baidu3 has successfully applied the Hadoop ecosystem to business analysis, data mining and web clustering. Based on the MapReduce programming model [4], Hadoop can handle massive amounts of data. Duan [5] proposed food image recognition using MapReduce to process a large-scale image dataset, and Kim [10] designed a video transcoding system on Hadoop. However, Hadoop is designed for batch processing of offline datasets. Apache Spark [20] can build low-latency applications, but it mainly focuses on iterative computing and repeated queries on RDDs (Resilient Distributed Datasets). Yet many video processing applications, such as accident vehicle tracking and terrorist incident anticipation, require real-time responses. The CUDA architecture, built on the GPU's high processing capability, has recently been accelerating real-time processing. Many CPU-intensive algorithms benefit from it, gaining severalfold speedups: [9, 7, 16] refine MOG (Mixture of Gaussians) computations on GPU for ASR (automatic speech recognition), and [19, 14, 2] subtract backgrounds from image sequences on GPU. But the GPU itself has limitations, for example limited built-in memory. Although a GPU-based job consumes less CPU and gains a speedup, it occupies much more RAM than the corresponding CPU-based job. If all video jobs run on GPU, or all on CPU, resources may become imbalanced: the CPU may be very busy while RAM is underutilized, or RAM may run out while the CPU is not fully used. Therefore, in this paper we propose a load-aware pluggable cloud framework for real-time video processing. Based on Apache Storm4, we design a pluggable GPU-CPU

1 http://hadoop.apache.org/
2 http://storm.apache.org/
3 http://www.baidu.com/
4 http://storm.apache.org


switching mechanism. Since video processing usually involves intensive computation, the framework provides two sets of pluggable APIs for video analysis, one implemented on GPU and one on CPU, to fit different types of jobs.

The remainder of the paper is organized as follows: Section 2 presents a high-level overview of the proposed framework; Section 3 gives details of the load-aware switching; Section 4 evaluates the proposed framework in terms of performance, reusability, interoperability, pluggability, and scalability. Related work and conclusions end the paper.

2 Framework Overview

From a number of experiments, we found that GPU-based jobs consume fewer CPU resources yet occupy much more RAM than CPU-based jobs, which motivates the load-aware switching mechanism. The component-connector view is shown in Figure 1. Storm Cluster serves as the core infrastructure. Video Sensor is the video stream generator (not necessarily a camera) that continuously pulls/pushes video streams into Streaming Server. Streaming Server provides a set of unified interfaces so that we do not need to know the details of the front-end video sensors. LoadChecker achieves load awareness in order to switch tasks between CPU and GPU: it checks the available memory and CPU capacity against the payloads of a Storm worker, informs Storm Cluster which type of processor should be used according to the load-aware algorithm, and then instructs Metric Monitor to measure the performance of the Storm cluster. Video Processor provides two sets of interfaces for video processing, one implemented on CPU and the other on GPU. A Storm component (called a Spout or Bolt) composes one of the interfaces to do computation-intensive video tasks such as dynamic background modeling and target tracking.

Figure 1. Component and Connector view of the proposed framework

3 Load-aware Switching

From the perspective of maximizing resource utilization, when a topology is submitted to Storm we have to consider the available hardware resources. Though the GPU relieves the CPU from high-density computing, it adds considerable RAM overhead. In general, if the CPU is the bottleneck yet enough RAM is available, it is reasonable to choose the GPU-based APIs; otherwise the CPU is a better choice. Based on the off-line scheduling, we propose a load-aware switching method. It automatically switches between GPU APIs and CPU APIs by analyzing the load of the cluster node to which a worker of a topology is to be assigned. Assume there are I nodes in a Storm cluster, each with a total CPU TC_i, a used CPU UC_i, a total RAM TR_i and a used RAM UR_i. Also suppose there are J topologies that need assigning. For the k-th worker W_{j,k} of the j-th topology TOPO_j, we check the resources the worker requires: the CPU usage grc_{j,k} and the RAM usage grr_{j,k} when GPU APIs are used, and correspondingly crc_{j,k} and crr_{j,k} when CPU APIs are used. In general, fully saturated RAM or CPU causes a drop in performance. To avoid that, we set a threshold on CPU usage with a parameter α and on RAM usage with a parameter β. Taking sufficient GPU, PCIe and graphics memory as given, we formalize this method in Algorithm 1.

Algorithm 1 Load-aware Switching Algorithm
Input:
  N = {n_i}: set of nodes
  TC_i, UC_i: total CPU and used CPU of node i
  TR_i, UR_i: total RAM and used RAM of node i
  AC, AR: available CPU and RAM of one node
  Topo = {t_j}: set of topologies
  W_j = {w_{j,k}}: workers that need assigning from the j-th topology
  grc_{j,k}, grr_{j,k}: CPU and RAM required for w_{j,k} when GPU APIs are used
  crc_{j,k}, crr_{j,k}: CPU and RAM required for w_{j,k} when CPU APIs are used
Begin:
  for all t_j ∈ Topo do
    for all w_{j,k} ∈ W_j do
      for all n_i ∈ N do
        AC_i ← (αTC_i − UC_i)
        AR_i ← (βTR_i − UR_i)
        if crc_{j,k} < AC_i and grr_{j,k} < AR_i then
          AC ← AC_i, AR ← AR_i
          break
        end if
      end for
      if Min{AC/grc_{j,k}, AR/grr_{j,k}} < Min{AC/crc_{j,k}, AR/crr_{j,k}} then
        switch to CPU
      else
        switch to GPU
      end if
    end for
  end for
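The selection rule of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the node dictionary layout, the `choose_processor` name, and the default threshold values are all hypothetical.

```python
# Minimal sketch of the load-aware switching rule (Algorithm 1).
# Node and worker payload structures are hypothetical illustrations.

def choose_processor(nodes, grc, grr, crc, crr, alpha=0.8, beta=0.8):
    """Pick 'CPU' or 'GPU' APIs for one worker.

    nodes: list of dicts with total/used CPU and RAM of each cluster node
    grc, grr: CPU and RAM the worker needs with GPU APIs
    crc, crr: CPU and RAM the worker needs with CPU APIs
    alpha, beta: thresholds that keep some CPU/RAM headroom free
    """
    ac = ar = None
    for n in nodes:
        ac_i = alpha * n["total_cpu"] - n["used_cpu"]   # available CPU
        ar_i = beta * n["total_ram"] - n["used_ram"]    # available RAM
        # A node is feasible if it covers the smaller CPU need (CPU APIs)
        # and the larger RAM need (GPU APIs).
        if crc < ac_i and grr < ar_i:
            ac, ar = ac_i, ar_i
            break
    if ac is None:
        return None  # no node can host this worker
    # Headroom ratios: how many such workers would still fit each way.
    gpu_fit = min(ac / grc, ar / grr)
    cpu_fit = min(ac / crc, ar / crr)
    return "CPU" if gpu_fit < cpu_fit else "GPU"
```

With a CPU-tight but RAM-rich node the rule picks the GPU APIs, and with a RAM-tight but CPU-rich node it picks the CPU APIs, matching the intuition stated above.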

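The pluggable two-API design of Video Processor can be sketched as a strategy pattern: a bolt composes either a CPU-backed or a GPU-backed implementation behind one interface, as decided by the load-aware switch. All class and method names here are illustrative assumptions, not the paper's actual code, and the processor bodies are placeholders.

```python
# Sketch of the pluggable two-API design: a bolt composes either a
# CPU-backed or a GPU-backed processor behind one common interface.
# All names are illustrative; real implementations would wrap the
# actual CPU and CUDA video-analysis routines.

class FrameProcessor:
    """Common interface shared by the CPU and GPU API sets."""
    def subtract_background(self, frame):
        raise NotImplementedError

class CpuFrameProcessor(FrameProcessor):
    def subtract_background(self, frame):
        # Placeholder for a CPU mixture-of-Gaussians implementation.
        return ("cpu", frame)

class GpuFrameProcessor(FrameProcessor):
    def subtract_background(self, frame):
        # Placeholder for a CUDA-accelerated implementation.
        return ("gpu", frame)

def make_processor(decision):
    """Plug in the API set chosen by the load-aware switch."""
    return GpuFrameProcessor() if decision == "GPU" else CpuFrameProcessor()

class Mog2Bolt:
    """Simplified stand-in for a Storm bolt composing one API set."""
    def __init__(self, decision):
        self.processor = make_processor(decision)

    def execute(self, frame):
        return self.processor.subtract_background(frame)
```

Because the bolt only depends on the `FrameProcessor` interface, switching the plugged-in API set does not change the topology code itself.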

4 Evaluation

In order to evaluate the proposed load-aware framework, a number of evaluations were conducted, covering performance, reusability, interoperability, pluggability and scalability. The framework is evaluated on 5 cluster nodes (one runs Nimbus, one is the Zookeeper coordinator, and the other 3 are supervisors), 30 IP cameras and one streaming media server. The test topology consists of 3 components with the grouping specifications and parallelism hints shown in Figure 2. Each topology has 5 executors (or tasks, in the Storm default configuration). To concentrate on the evaluation of load-aware switching and, at the same time, conform to the off-line scheduling, we set the number of workers of each topology to one. The FrameSpout receives video streams and decodes them into frames. The MOG2Bolt models the background and subtracts the foreground of each frame using a mixture of Gaussians. The MorphBolt applies morphological operations, including dilations and erosions, to the binary masks produced by the MOG2Bolt.

Figure 2. Test topology

We check how the traffic (measured as PTPF, Processing Time Per Frame) behaves as the number of topologies submitted to the framework increases. We separately test the traffic behavior in three cases: CPU-only, GPU-only, and the mixture case (with load-aware switching). Because the load-aware switching is based on a balance between CPU and RAM, we adjust the CPU computing capability and the RAM capacity to conduct a comprehensive analysis. By switching between GPU and CPU for efficient resource utilization, the load-aware switching takes effect after a topology is scheduled. We first check the payloads that a test topology takes. For a single-CPU-core configuration, the payloads of one topology are shown in Table 1. Note that CPU utilization is usually lower when there are no moving objects in the camera scenes.

Table 1. Payloads of a Test Topology
Item    GPU APIs     CPU APIs
CPU     10% ~ 12%    3% ~ 5%
RAM     700M         120M

4.1 Performance

Figure 3 shows the results of evaluating the 3 processor selection strategies. The horizontal axis of each subfigure is the average number of topologies per node/supervisor; it shows how many topologies one supervisor hosts on average. The vertical axis is the Average PTPF, which measures how much time, on average, a frame takes from entering the FrameSpout to leaving the MorphBolt. As the number of topologies submitted to the cluster increases, each selection strategy shows its own strength in different configurations. With adequate memory and computing capacity (2 cores, 16 GB), there is no significant difference among the three strategies. Computing capacity becomes the bottleneck for the CPU-only strategy in the "1 core, 16 GB" case: the processing time per frame increases gradually from the 6th topology on, indicating that the CPU cannot handle more than five topologies. A corresponding result can be seen in the "2 cores, 8 GB" case. When a constraint that challenges both memory and computing capacity is imposed, in the "1 core, 8 GB" case, both the GPU-only and the CPU-only strategies expose their drawbacks; each falls short of handling more than five topologies. We conclude that the mixture strategy, which uses the load-aware switching method, shows its strength especially when there is a deficiency of computing capacity or memory.

Figure 3. Traffic Performance of Processor Selection Strategies

4.2 Scalability

The framework is built on Storm, which makes it inherently scalable.

• Hardware scaling out: it is easy to add more nodes to the framework to achieve higher performance, making the framework scalable.

• Hardware scaling up: as Figure 3 shows, when the hardware (RAM and GPU) in the framework is scaled up, the framework can handle a growing number of topologies.

• Software scaling out: to achieve topology-level scaling out, we choose "one topology, one video sensor" instead of "one topology, multiple video sensors". The main reason is that, in the "one topology, multiple video sensors" case, increasing or decreasing the number of devices requires killing the single topology and submitting it again. In the "one topology, one video sensor" case, we can submit a topology without affecting the normal running of other devices. This effectively illustrates the good software scaling-out ability of the framework. Also, at the worker level, the number of workers in a topology is easy to configure.

• Software scaling up: we can configure the buffer sizes among threads and processes.

4.3 Reusability and Interoperability

Reusability and interoperability were considered during development and testing.

• The Protocol Provider component in Figure 1 is implemented based on the AOP subject/observer protocol to decouple Metrics Monitor and Storm Cluster. They can communicate without knowing the details of each other. This makes Metrics Monitor a reusable asset that can be reused by any component that conforms to the protocol.

• The Streaming Server component in Figure 1 is added as middleware to decouple Video Sensor and Storm Cluster. Storm Cluster only talks to Streaming Server. This achieves hot scale-out of Video Sensor, so interoperability is achieved and new video sensors can be added at run time.

4.4 Pluggability

Storm itself is able to build pluggable components, so a framework built upon Storm is born with pluggability. Besides, we extended the pluggability during framework development and testing.

• Pluggable scheduling: each scheduling algorithm is wrapped in an isolated scheduler. To use a specific scheduler, we just plug it into the framework, and submitted topologies are then scheduled according to its algorithm.

• Pluggable GPU-CPU switching: automatic pluggability is considered in the CPU-GPU switching. The framework dynamically plugs the needed CPU APIs or GPU APIs into the running tasks after the topology is submitted.

• Pluggable components (Bolts/Spouts): based on Storm, the framework provides components through which jobs plug themselves into the framework. To run a job on the framework, we just let the job implement Storm components, and it can then be plugged in. Different video processing algorithms can be encapsulated in the bolts.
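The Average PTPF used above can be computed from per-frame timestamps taken at the FrameSpout entry and the MorphBolt exit. A minimal sketch, assuming hypothetical timestamp dictionaries keyed by frame id (the record format is not specified in the paper):

```python
# Sketch of the Average PTPF (Processing Time Per Frame) metric:
# the mean of (MorphBolt exit time - FrameSpout entry time) over all
# frames that completed the topology. The dict-based record format
# is an assumption made for illustration.

def average_ptpf(entries, exits):
    """entries/exits: dicts mapping frame id -> timestamp in seconds."""
    done = [fid for fid in entries if fid in exits]
    if not done:
        return 0.0  # no frame has traversed the whole topology yet
    return sum(exits[f] - entries[f] for f in done) / len(done)
```

Frames still in flight (present in `entries` but not yet in `exits`) are simply excluded from the average.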

5 Related Work

Video data processing, including compression, storage, transmission, analysis and recognition, has been researched for a long time. Some papers [13, 8, 10] propose to encode and decode large video data with the MapReduce paradigm for high performance, and show competitive results compared to non-cluster environments. Such previous work mainly concentrates on video storage and compression, seldom involves intelligent video processing such as video tracking and target recognition in a cloud environment, and does not address the task-switching issue we target. [5, 17] do propose high-level understanding of images on Hadoop; however, none of this work focuses on real-time processing of video data.

There is some work on CPU-GPU collaboration. Takizawa [15] presents SPRAT, a runtime environment that dynamically selects an appropriate processor to improve energy efficiency. Breß [3] proposes heuristics that optimize database query processing for response time and throughput based on load-aware inter-co-processor parallelism. Lang [11] develops an energy-aware model that predicts how to distribute data to processing units for the most energy-efficient execution. These works focus on CPU-GPU collaborative computing, and in this paper we likewise propose a processor switching method to maximize resource usage. The difference is that they do not target real-time distributed computing, and their purpose is generic heterogeneous collaboration rather than video-based intelligent analysis.

Another part of the related work is the analysis and optimization of Storm. Network bandwidth is a vital bottleneck for real-time distributed frameworks. Lutz [12] presents NaaStorm, an implementation of Storm modeled on NaaS (Network as a Service). It provides top-level support for controlling the locality of computations in Storm, making the network infrastructure more transparent to tenants. Storm's default scheduler distributes the execution of topology components evenly over the available machines in a round-robin fashion, but disregards the network cost incurred as tuples traverse the topology. Aniello [1] proposes a new scheduler (i.e., assignment of tasks to machines) for Storm that monitors system performance and reschedules the deployment at run time. Fischer [6] and Xu [18] also propose schedulers that detect network traffic and reschedule the execution of a topology. All of these methods improve performance over Storm's default strategy. They give us hints on optimizing Storm, but our target is different: real-time video processing on a stream processing cloud.

6 Conclusions

In this paper we propose a Storm-based framework that aims to process large-scale real-time video streams with GPU-CPU collaboration in a flexible, pluggable, switchable manner. To solve the resource imbalance caused by GPU-CPU collaboration, we propose a load-aware switching algorithm that automatically chooses processors based on the payload of a job and the resource utilization of the corresponding cluster nodes. Evaluations show that the proposed framework is effective in terms of performance, reusability and interoperability, scalability, and pluggability. The load-aware switching algorithm shows its merits when bottlenecks occur in CPU or RAM.

Acknowledgements

The research is supported by the National Natural Science Foundation of China (Grant No. 61402533), the Natural Science Foundation of Shandong Province (Grant No. ZR2014FM038), and the "Key Technologies Development Plan of Qingdao Technical Economic Development Area". Weishan Zhang has also been supported by the start-up funds for "Academic Top-Notch Professors in China University of Petroleum".

References

[1] Leonardo Aniello, Roberto Baldoni, and Leonardo Querzoni. Adaptive online scheduling in Storm. In Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems, pages 207–218. ACM, 2013.
[2] Shoaib Azmat, Linda Wills, and Scott Wills. Accelerating adaptive background modeling on low-power integrated GPUs. In Parallel Processing Workshops (ICPPW), 2012 41st International Conference on, pages 568–573. IEEE, 2012.
[3] Sebastian Breß, Norbert Siegmund, Max Heimel, Michael Saecker, Tobias Lauer, Ladjel Bellatreche, and Gunter Saake. Load-aware inter-co-processor parallelism in database query processing. Data & Knowledge Engineering, 93:60–79, 2014.
[4] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] Pengcheng Duan, Wenshan Wang, Weishan Zhang, Faming Gong, Peiying Zhang, and Yuan Rao. Food image recognition using pervasive cloud computing. In Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing, pages 1631–1637. IEEE, 2013.
[6] Lorenz Fischer, Thomas Scharrenbach, and Abraham Bernstein. Network-aware workload scheduling for scalable linked data stream processing. In International Semantic Web Conference (Posters & Demos), pages 281–284, 2013.
[7] Wang Fuqiu, Wei-Qiang Zhang, and Liu Jia. GPU accelerated GMM supervectors for speaker and language recognition. In Signal Processing (ICSP), 2012 IEEE 11th International Conference on, volume 1, pages 536–539. IEEE, 2012.
[8] Adriana Garcia, Hari Kalva, and Borko Furht. A study of transcoding on cloud environments for video content delivery. In Proceedings of the 2010 ACM Multimedia Workshop on Mobile Cloud Media Computing, pages 13–18. ACM, 2010.
[9] Kshitij Gupta and John D. Owens. Three-layer optimizations for fast GMM computations on GPU-like parallel processors. In Automatic Speech Recognition & Understanding (ASRU), 2009 IEEE Workshop on, pages 146–151. IEEE, 2009.
[10] Myoungjin Kim, Yun Cui, Seungho Han, and Hanku Lee. Towards efficient design and implementation of a Hadoop-based distributed video transcoding system in cloud computing environment. International Journal of Multimedia and Ubiquitous Engineering, 8(2):213–224, 2013.
[11] Jens Lang and Gudula Rünger. An execution time and energy model for an energy-aware execution of a conjugate gradient method with CPU/GPU collaboration. Journal of Parallel and Distributed Computing, 74(9):2884–2897, 2014.
[12] Clemens Lutz. Enhancing the performance of Twitter Storm with in-network processing. 2012.
[13] Rafael Pereira, Marcello Azambuja, Karin Breitman, and Markus Endler. An architecture for distributed high performance video processing in the cloud. In Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, pages 482–489. IEEE, 2010.
[14] David Schreiber and Michael Rauter. GPU-based non-parametric background subtraction for a practical surveillance system. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 870–877. IEEE, 2009.
[15] Hiroyuki Takizawa, Katsuto Sato, and Hiroaki Kobayashi. SPRAT: runtime processor selection for energy-aware computing. In Cluster Computing, 2008 IEEE International Conference on, pages 386–393. IEEE, 2008.
[16] Jan Vanek, Jan Trmal, Josef V. Psutka, and Josef Psutka. Full covariance Gaussian mixture models evaluation on GPU. In Signal Processing and Information Technology (ISSPIT), 2012 IEEE International Symposium on, pages 203–207. IEEE, 2012.
[17] Wenshan Wang, Pengcheng Duan, Weishan Zhang, Faming Gong, Peiying Zhang, and Yuan Rao. Towards a pervasive cloud computing based food image recognition. In Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing, pages 2243–2244. IEEE, 2013.
[18] Jielong Xu, Zhenhua Chen, Jian Tang, and Sen Su. T-Storm: traffic-aware online scheduling in Storm. In Distributed Computing Systems (ICDCS), 2014 IEEE 34th International Conference on, pages 535–544. IEEE, 2014.
[19] Yi Yang and Wenjie Chen. Parallel algorithm for moving foreground detection in dynamic background. In Computational Intelligence and Design (ISCID), 2012 Fifth International Symposium on, volume 2, pages 442–445. IEEE, 2012.
[20] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pages 10–10, 2010.