INVITED PAPER

Flexible Transport of 3-D Video Over Networks

The authors of this paper look at the future of video transport and see developments such as stereoscopic video, streaming of multiview video, and view-selective streaming.

By C. Göktuğ Gürler, Member IEEE, Burak Görkemli, Member IEEE, Görkem Saygılı, and A. Murat Tekalp, Fellow IEEE

ABSTRACT | Three-dimensional (3-D) video is the next natural step in the evolution of digital media technologies. Recent 3-D autostereoscopic displays can display multiview video with up to 200 views. While it is possible to broadcast 3-D stereo video (two views) over digital TV platforms today, streaming over the Internet Protocol (IP) provides a more flexible approach for distributing stereo and free-view 3-D media to home and mobile users with different connection bandwidths and different 3-D displays. Here, flexible transport refers to rate-scalable, resolution-scalable, and view-scalable transport over different channels, including digital video broadcasting (DVB) and/or IP. In this paper, we first briefly review the state of the art in 3-D video formats, coding methods for different transport options and video formats, IP streaming protocols, and streaming architectures. We then look beyond the state of the art in 3-D video transport research, including asymmetric stereoscopic video streaming, adaptive and peer-to-peer (P2P) streaming of multiview video, view-selective streaming, and future directions in broadcast of 3-D media over IP and jointly over DVB and IP.

KEYWORDS | Adaptive multiview video streaming; broadcast over combined digital video broadcasting (DVB) and IP platforms; free-view 3-D video; peer-to-peer (P2P) multiview video streaming; stereoscopic video

Manuscript received April 8, 2010; revised October 8, 2010; accepted December 3, 2010. Date of current version March 18, 2011. This work was supported in part by the European FP7 Project DIOMEDES. The work of A. Murat Tekalp was also supported by the Turkish Academy of Sciences (TUBA). The authors are with the Department of Computer Engineering, Koç University, Istanbul 34450, Turkey (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Digital Object Identifier: 10.1109/JPROC.2010.2100010

I. INTRODUCTION

With the wide availability of low-cost stereo cameras, 3-D displays, and broadband communication options, 3-D media is destined to move from the movie theater to home and mobile platforms. In the near term, popular 3-D media will most likely be in the form of stereoscopic and multiview video with associated spatial audio. Transmission of 3-D media, via broadcast or on demand, to end users with varying 3-D display terminals (e.g., TV, laptop, and mobile devices) and bandwidths is one of the biggest challenges in bringing 3-D media to home and mobile devices. There are two main platforms for 3-D video delivery: digital television (DTV) platforms and the Internet Protocol (IP) platform, as depicted in Fig. 1. Some broadcasters have already started 3DTV broadcasts using a DTV platform. For example, digital video broadcasting (DVB) is a suite of open standards for DTV, which has already been used to broadcast stereo video using frame-compatible formats. However, DTV platforms are not well suited to transmit multiview content with a variable number of views to accommodate different 3-D display technologies. We note that transport of 3-D video over DTV platforms falls outside the scope of this paper and has been covered in [1] and [2]. On the other hand, the IP platform, with applications such as IPTV and WebTV, provides a more flexible channel to transmit as many views as required by the user display terminal and at a quality level allowed by the bandwidth of each user. This paper focuses on the flexible transport of 3-D video over networks, including IP, DVB, or a combination of them, with the ability to adapt the number of views and/or the spatial, temporal, and quality resolutions.

Fig. 1. Platforms for 3-D media transport.

The International Telecommunication Union (ITU) defines IPTV as multimedia services delivered over IP-based managed networks that provide the required level of quality of service (QoS) and experience, security, interactivity, and reliability [3]. On the other hand, WebTV services are offered over Internet connections that support best effort delivery with no QoS guarantees, making them accessible anytime, anywhere, as opposed to IPTV, which is limited by the service provider's infrastructure. Unlike traditional broadcast, IP services are offered at varying speeds and costs over a variety of physical infrastructures, such as fixed or wireless telecommunications networks. Furthermore, it is possible to provide a variety of service architectures, such as server–client (unicast) or peer-to-peer (multicast), using different transport protocol options, such as HTTP/TCP or RTP/UDP, over the IP platform. Hence, 3-D video encoding methods that offer functionalities such as rate scalability, resolution scalability, view scalability, view selectivity, and packet-loss resilience, without a significant sacrifice in encoding efficiency, become a key requirement in order to take full advantage of the flexibility that the IP platform provides. In order to provide the best end-to-end quality of user experience, 3-D video encoding methods and transport mechanisms must be jointly optimized, considering the available network rate, the end-user display terminal, and possibly the human perception of stereoscopy. A more recent research direction is to consider a combination of DVB and IP platforms to deliver multiview video (MVV) in order to provide a free-view TV/video experience [4]. The DVB channel provides a dedicated platform that can be used for transmitting stereoscopic media in frame-compatible format [2] wrapped in an MPEG-2 transport stream, but it is constrained by the physical channel bandwidth, which does not allow transmitting MVV. Transport of 3-D video over MPEG-2 transport stream has been discussed in detail in [1]. The IP platform is more flexible in terms of bandwidth but is not reliable. Server–client or peer-to-peer (P2P) streaming over IP can be used standalone or to supplement DVB (to deliver additional views) in order to provide a free-view 3-D experience. The organization of this paper is as follows. Section II reviews the state of the art in 3-D video formats, 3-D video coding methods, and IP streaming protocols and architectures. Sections III and IV look at the state of the art in adaptive streaming techniques for 3-D video over IP, where Section III introduces adaptive asymmetric stereoscopic video transmission over IP, and Section IV summarizes adaptive streaming methods for multiview video. Finally, Section V discusses future research directions in free-view 3-D video broadcast and draws conclusions.

II. THE STATE OF THE ART

A. Three-Dimensional Video Formats
Current 3-D video formats can be classified as stereoscopic and multiview, as depicted in Fig. 2. Common stereo video formats are frame-compatible and full-resolution (sequential) formats. There are also depth-based representations, which are often preferred for efficient transmission of multiview video as the number of views increases.

Fig. 2. Three-dimensional video formats and coding options for fixed-rate and rate-adaptive streaming.

Frame-compatible stereo video formats have been developed to provide 3DTV services over the existing digital TV broadcast infrastructures [2]. They employ pixel subsampling in order to keep the frame size and rate the same as those of 2-D video. Common subsampling patterns include side by side, top and bottom, line interleaved, and checkerboard. The side-by-side format applies horizontal subsampling to the left and right views, reducing horizontal resolution by 50%; the subsampled frames are then put together side by side. Likewise, the top-and-bottom format vertically subsamples the left and right views and stitches them over–under. In the line-interleaved format, the left and right views are again subsampled vertically, but put together in an interleaved fashion. The checkerboard format subsamples the left and right views in an offset grid pattern and multiplexes them into a single frame in a checkerboard layout. Among these formats, side by side and top and bottom are selected as mandatory for broadcast by the latest HDMI specification, version 1.4a [5]. Frame packing, which is the mandatory format for movie and game content in HDMI specification version 1.4a, stores frames of the left and right views sequentially, without any change in resolution. This format, which supports full HD stereo video, requires, in the worst case, twice the bandwidth of monocular video. The extra bandwidth requirement may be kept around 50% if the multiview video coding (MVC) standard is used, which has been selected by the Blu-ray Disc Association as the coding format for 3-D video.

An alternative stereo video format is view plus depth, where a single view and its associated depth map are transmitted to render a stereo pair at the decoder side. It was proposed by the European project ATTEST [6] to develop a backwards-compatible 3DTV service using a layered bit stream with an MPEG-2 coded monocular view in the base layer and encoded depth information in a supplementary layer. MPEG has specified a container format for view-plus-depth data in ISO/IEC 23002-3 "Representation of auxiliary video and supplemental information," also called MPEG-C Part 3 [7], [8]. It has later been proposed to extend this format to multiview video plus depth (MVD), where N views and N depth maps are used to generate M views at the decoder, with N ≤ M [9]. Each frame of the depth map conveys the distance of corresponding video pixels from the camera. The depth values are scaled and represented with 8 bits, where higher values represent points that are closer to the camera. Therefore, the depth map can be regarded as a gray-scale video, which can be compressed very efficiently using state-of-the-art codecs, due to its smooth and less structured nature. Typically, a single depth map requires 15%-20% of the bit rate necessary to encode the original video [10]. Furthermore, an MVC codec can be utilized to exploit the inter-view correlations between depth maps for the MVD representation [11]. The depth map information needs to be accurately captured/computed, encoded, and transmitted in order to render intermediate views accurately using the received reference view and depth map.

A difficulty with the view-plus-depth format is the generation of accurate depth maps. Although there are cameras that can generate disparity maps, they typically offer limited performance. Algorithms for depth and disparity estimation have been studied extensively in the computer vision literature; they either utilize disparity estimation from rectified images or perform color-based image segmentation [11]. Another difficulty with the view-plus-depth format is the appearance of some regions in the rendered views that are occluded in the available views. These disocclusion regions may be concealed by smoothing the original depth map data to avoid the appearance of holes, as in the ATTEST project [6]. Also, it is possible to use multiple view-plus-depth data to prevent disocclusions [12]. An extension of view plus depth, which allows better modeling of occlusions, is layered depth video (LDV). LDV provides multiple depth values for each pixel in a video frame. The number of depth values depends on the number of surfaces in the line of sight for that particular pixel, allowing better rendering of intermediate (synthetic) views [13].
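To make the two representations above concrete, the following NumPy sketch packs a stereo pair into the side-by-side frame-compatible format and quantizes a depth map to 8 bits with higher values for closer points. The function names and the inverse-depth scaling convention are illustrative assumptions, not text from a standard; real encoders also low-pass filter before subsampling.

```python
import numpy as np

def pack_side_by_side(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Side-by-side frame-compatible packing: each view keeps every other
    column, so the packed frame has the same size as a 2-D frame."""
    left_half = left[:, ::2]    # 2:1 horizontal subsampling
    right_half = right[:, ::2]
    return np.concatenate([left_half, right_half], axis=1)

def quantize_depth(z: np.ndarray, z_near: float, z_far: float) -> np.ndarray:
    """8-bit depth map, assuming the common inverse-depth scaling so that
    nearer points get larger values (cf. [7], [8])."""
    v = (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return np.round(255.0 * np.clip(v, 0.0, 1.0)).astype(np.uint8)
```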

B. Three-Dimensional Video Coding
The method of choice for 3-D video encoding should depend on the transport option and the raw video format. For example, for transmission of stereo video over fixed-bandwidth broadcast channels, a nonscalable monocular video codec, such as H.264/AVC, can be used to encode stereo video in one of the frame-compatible formats. This paper focuses on adaptive streaming of stereo and MVV in sequential or multiview-plus-depth formats, where we have two main options. 1) Simulcast encoding: encode each view and/or depth map independently using a scalable or nonscalable monocular video codec, which enables streaming each view over a separate channel; clients can then request as many views as their 3-D displays require without worrying about inter-view dependencies. 2) Dependent encoding: encode views using MVC to decrease the overall bit rate by exploiting the inter-view redundancies. We note that, in this case, special inter-view prediction structures must be employed to enable view-scalable and view-selective adaptive streaming. It is also possible to exploit features of the human visual system (HVS) to achieve more efficient compression by degrading the quality of one of the views without introducing noticeable artifacts. This approach is known as asymmetric coding. We review some common encoding options for adaptive streaming of 3-D video in more detail below.

1) Simulcast View Coding Using SVC: Simulcast coding using the SVC standard refers to producing scalable 3-D video, where each view is encoded independently. Here, two approaches can be followed for scalability: either all views can be coded scalable, or some views can be coded scalable using SVC and others can be coded nonscalable using H.264/AVC. The latter approach has been employed in asymmetric encoding, as described in Section III. SVC, which is an annex of the advanced video coding (AVC) standard, provides spatial, temporal, and quality scalability. SVC provides temporal scalability through the use of hierarchical prediction structures, whereas spatial and quality scalability are supported by multilayer coding [14]. Quality scalability is supported in two modes: coarse-grained scalability (CGS) and medium-grained scalability (MGS). CGS, also called layer-based scalability, is based on the multilayer concept of SVC, meaning that rate adaptation must be performed on a complete-layer basis. The MGS concept, however, allows any enhancement layer network abstraction layer (NAL) unit, defined in [14], to be discarded from a quality scalable bit stream in decreasing quality_id order, providing packet-based scalability [15], as sketched below. It is also possible to fragment an MGS layer into multiple sublayers by grouping zigzag-scanned transform coefficients, and in this way increase the number of rate adaptation points. The rate-distortion (RD) performances of different MGS fragmentation configurations are compared in [16].
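A minimal sketch of the packet-based MGS adaptation just described: enhancement NAL units are discarded in decreasing quality_id order until the stream fits a rate budget. The NalUnit record and the whole-stream granularity are simplifying assumptions; a real extractor operates per access unit and preserves decoding order.

```python
from dataclasses import dataclass

@dataclass
class NalUnit:
    size_bits: int
    quality_id: int  # 0 = base quality; larger values = MGS enhancement

def extract_to_rate(units: list[NalUnit], budget_bits: int) -> list[NalUnit]:
    """Keep NAL units whose quality_id is at or below the largest ceiling
    that still fits the budget; the base layer is never dropped."""
    top = max(u.quality_id for u in units)
    for q_max in range(top, 0, -1):
        kept = [u for u in units if u.quality_id <= q_max]
        if sum(u.size_bits for u in kept) <= budget_bits:
            return kept
    return [u for u in units if u.quality_id == 0]  # base layer only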

2) Multiview Extension of H.264/AVC: MVC aims to offer high compression efficiency for MVV by exploiting inter-view redundancies. A detailed description of MVC can be found in [17]. It is based on the high profile of H.264/AVC, and features hierarchical B-pictures and flexible prediction structures [18], [19]. At one extreme, each frame can be predicted only from frames of the same view, which is simulcast coding. At the other extreme, frame prediction spans all views, which is called full prediction, at the cost of a complex dependency hierarchy. In [20], a simplified prediction scheme is proposed that restricts inter-view prediction to anchor pictures only, and still achieves similar RD performance. An illustration of prediction structures for a video with five views is depicted in Fig. 3. In MVC, it is important to perform proper illumination compensation, either by preprocessing or by weighted inter-view prediction within the coding loop. Also, large disparity or different camera calibration among views may adversely affect the performance of MVC. Although there has been some work on scalable MVC, these proposals utilize either a subset of the scalability options or MVC prediction schemes [21]-[23]. The current implementation of the reference MVC software (JMVC 7.0) offers only temporal and view scalability, but no quality or resolution scalability. The effect of scalability options on the subjective quality of MVV is a current research area, and the results are very likely to depend on the content and/or the 3-D display.

Fig. 3. Prediction structures for asymmetric MVC encoding. (a) Full prediction scheme. (b) Simplified prediction scheme.

3) Multiview-Plus-Depth Coding: In this option, selected views and associated depth maps can be either simulcast or dependently encoded using nonscalable or scalable codecs. It is also possible to exploit correlations between the texture video and the associated depth maps. For example, in [24], SVC is employed to compress texture videos and associated depth maps jointly, where up to 0.97-dB gain is achieved for the coded depth maps, compared with the simulcast scheme.

4) Asymmetric Stereoscopic Video Coding: Naturally, stereoscopic video requires higher bit rates than monocular video. Another method to decrease the overall transmission rate is to exploit the human visual system, which is known to tolerate the lack of high-frequency components in one of the views [25]-[31]. Hence, one of the views may be presented at a lower quality without degrading the 3-D video perception. This is similar to what is done with monocular video, in which the chrominance channels can be represented using fewer bits than the luminance, because the human eye is less sensitive to changes in color. In asymmetric MVC coding, where alternating views are coded at high and low quality, the inter-view dependencies should be carefully constructed. Fig. 3 depicts a scheme in which the views are predicted only from high-quality views in order to achieve better prediction. In Section III-A, we summarize current research on stereo video coding with spatial versus PSNR asymmetry, and the extent to which the PSNR of one view can be reduced without perceptual degradation (just noticeable degradation), along with some results.

C. Transport Protocols
Being the de facto reliable transport protocol of the Internet, the Transmission Control Protocol (TCP) is the first that comes to mind for sending data over IP. However, TCP may be unsuitable for streaming live video with a strict end-to-end delay constraint, due to its lack of control over delay and its rapidly changing transmission rate. On the other hand, TCP is the easiest choice for streaming stored media: its built-in congestion control, reliable transmission, and firewall friendliness make it the most used transport protocol for streaming stored media over the Internet. Popular video distribution sites, such as YouTube, Vimeo, and Metacafe, use HTTP over TCP to stream video to clients. Moreover, it has been shown in [32] that using TCP for streaming video provides good performance when the available network bandwidth is about twice the maximum video rate, with a few seconds of pre-roll delay. An alternative to streaming video over TCP is UDP, which does not provide TCP's built-in congestion control and reliable, in-order packet delivery, leaving their implementation to the application layer. Since congestion control is crucial for the stability of the Internet, it should be implemented by applications using UDP, which is not a straightforward task. Moreover, UDP is not firewall friendly, due to its connectionless nature. For these reasons, UDP is less popular than TCP for streaming video over the Internet, although it is used by media streaming servers such as Windows Media Server. On the other hand, videoconferencing systems such as Skype and Vidyo utilize UDP for media delivery; however, they base their failover scenarios on TCP. The datagram congestion control protocol (DCCP) [33] is a newer transport protocol implementing bidirectional unicast connections of congestion-controlled, unreliable datagrams, and it accommodates a choice of modular congestion control mechanisms, selected at connection startup. DCCP is designed for applications such as streaming media that prefer to avoid TCP, because of the arbitrarily long delays that reliable in-order delivery and congestion control can introduce, but do not want to implement the complex congestion control mechanism that is absent in UDP. In [34], it has been shown that DCCP outperforms TCP under congestion in a video streaming scenario. Moreover, the performance of streaming video over DCCP in heterogeneous networks has been compared with UDP and the stream control transmission protocol (SCTP); it is concluded that DCCP achieves better results than SCTP and UDP [35].

Fig. 4. Streaming protocol stacks.

Real-time transport protocol (RTP) is an application layer protocol enabling end-to-end delivery of media services [36]. RTP defines a packetization format that identifies the payload type, orders data packets, and provides timestamps to be used in media playout. RTP typically runs on top of UDP and may easily be used with DCCP or SCTP, but a framing mechanism is required when it is used over TCP, as defined in [37]. RTP is usually used together with the real-time transport control protocol (RTCP), which monitors transmission statistics and QoS information. These transport protocols, shown in Fig. 4, can be adopted for 3-D video streaming with little or no change at all. When a 3-D multicast scenario is considered, the views that compose the video are usually transmitted over separate multicast channels, so that clients can subscribe to as many channels as they want, depending on their download capacity or display characteristics. For 3-D unicast, multiplexing all views onto a single connection may utilize the available network better in case a TCP-compatible congestion control scheme is used and the views are encoded at unequal rates. This is because TCP-compatible congestion control schemes tend to divide the available bandwidth equally among connections sharing the same bottleneck link. When unequal rates are allocated to views that are sent over separate network connections, views demanding lower rates will be overprovisioned, while the ones with high bit-rate requirements will not get the network share they need. It should be noted that each video packet should carry a view identifier, as implemented by MVC [17], so that the receiver can distinguish the packets of one view from another when a single connection is used. If multiplexing views onto a single connection to overcome this fairness issue is not an option, then each view with a high bit rate may be split over multiple connections for fairness, as in [38] and [39].
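The view-identifier requirement above can be illustrated with a tiny demultiplexing sketch. The 4-byte header layout here is purely hypothetical (MVC actually signals the view id inside the NAL unit header); it only shows how a receiver separates views multiplexed on one connection.

```python
import struct
from collections import defaultdict

# Hypothetical application-layer header: view_id (1 B), reserved (1 B),
# sequence number (2 B), network byte order. Not an RTP or MVC format.
def pack(view_id: int, seq: int, payload: bytes) -> bytes:
    return struct.pack("!BBH", view_id, 0, seq) + payload

def demux(packets: list[bytes]) -> dict[int, list[bytes]]:
    """Split the packets of a single connection into per-view streams."""
    streams = defaultdict(list)
    for pkt in packets:
        view_id, _, _seq = struct.unpack("!BBH", pkt[:4])
        streams[view_id].append(pkt[4:])
    return streams
```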

D. Adaptive Streaming
For adaptive streaming, a mechanism should exist to estimate the network conditions, so that the video rate can be adapted accordingly in order to optimize the received video quality. This estimation can be performed by requesting receiver buffer occupancy status information to prevent buffer underflow/overflow [40], or by combining the receiver buffer status with bandwidth estimation [41]. A virtual network buffer between the sender and the receiver is employed together with end-to-end delay constraints to adapt the transmitted video in [42], while the same virtual network buffer algorithm is also utilized in [43] to implement source rate control and congestion control jointly. Packets may also be sent depending on their RD values, as in [44] and [45]. In case DCCP is used with the TCP-friendly rate control (TFRC) congestion control method selected, the TFRC rate calculated by DCCP can be utilized by the sender to estimate the available network rate [34], [46]. When the video is streamed over TCP, an average of the transmission rate can be used to determine the available network bandwidth [34]. How to adapt the video rate to the available bandwidth depends on the encoding characteristics of the views. One or more views can be encoded multiple times at varying bit rates, where the sender can switch between these streams according to the network conditions [42]. Alternatively, in HTTP live streaming, the client selects from a number of streams containing the same material encoded at a variety of data rates in order to adapt to the available network rate [47]. A more elegant solution is encoding views once with multiple layers using SVC and switching between these layers. Another video adaptation scheme is real-time encoding with source rate control [48]. Even SVC encoding can be performed in real time, as proposed in [49]. Recent developments in adaptive HTTP streaming using SVC are discussed in [50]. However, real-time encoding of MVV is difficult due to its high computational requirements as the number of views grows.
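Whichever estimator is used (TFRC rate, TCP throughput average, or buffer status), the adaptation step itself can be as simple as the following sketch, which picks the highest operating point of a stream-switching or SVC layer ladder that fits the estimated rate; the ladder values are illustrative, not from the paper.

```python
# Pick the highest operating point whose cumulative rate fits the
# estimated network rate (e.g., the TFRC rate reported by DCCP [34]).
# ladder_kbps must be sorted in ascending order.
def select_operating_point(ladder_kbps: list[int], est_rate_kbps: float) -> int:
    best = 0  # index 0 is the base layer / lowest-rate stream
    for i, rate in enumerate(ladder_kbps):
        if rate <= est_rate_kbps:
            best = i
    return best

# Example: base layer plus two enhancement layers.
assert select_operating_point([400, 900, 1500], 1100.0) == 1
```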

E. P2P Streaming
The server–client unicast streaming model is not scalable by nature; that is, it is difficult to serve an increasing number of clients without expanding the bandwidth capacity or creating a large content distribution network (CDN). The most important advantage of P2P solutions over the traditional server–client architecture is scalable media distribution. These solutions aim to reduce the bandwidth requirement of the server by utilizing the network capacity of the clients, now called peers. In theory, it is possible to originate only a single copy from a source and duplicate the packets along the path to different peers at the network layer. This could have been the best solution to the scalability problem, but unfortunately multicasting at the network layer has not been widely deployed. Current P2P solutions use overlay networks, in which the data are redirected to another peer by the application at the edge platforms, and multiple copies of the data traverse the IP network. It is evident that relying on peers that may leave the network or stop data transmission at any time has its drawbacks, but there are already successful P2P video applications that have managed to solve such issues. It is possible to examine these solutions under two extremes: tree-based (structured) and mesh-based (unstructured) solutions. Fig. 5 presents some key features of these approaches, which we discuss in more detail in the following.

Fig. 5. Strengths and weaknesses of P2P approaches.

1) Tree-Based Approach: Tree-based solutions provide an efficient transport mechanism to deliver content from the server at the top of the tree to peers that are connected to each other in parent–child fashion. Data flow starts immediately once a peer joins a slot in the tree, since there is no peer search phase. Moreover, data are pushed from the server to peers, allowing significantly lower latency in data dissemination compared to mesh-based approaches. These features make tree-based solutions suitable for time-critical applications such as video broadcasting. However, the rigid structure that is required for high efficiency introduces some challenges. The major problem with tree-based solutions is ungraceful peer exit, which leads to the starvation of the exiting peer's descendants. Besides on-the-fly tree reconstruction, there are two solutions to this problem in the literature: using multiple parents [51], [52] or using multiple trees [53]-[56]. In the first solution, each peer has one or more backup parents from which to request content if the current parent leaves the network. Determining a pool of candidate parents during tree construction and then using periodic heartbeat messages to update the list has been suggested in [52]. The second solution is based on building multiple trees such that whenever a peer leaves the tree, its descendants may continue to receive content from an alternative path. In SplitStream, trees are formed such that a peer that is an interior node in one of the trees becomes a leaf node in the rest of them [53]. Similarly, in Stanford P2P multicast (SPPM), complementary trees are formed such that path diversity is guaranteed [54]-[56]. Replicating the content to feed multiple trees introduces redundancy within the network and decreases the overall efficiency of the solution.

With multiple description coding (MDC) [57], it is possible to generate self-decodable bit streams (descriptions) with less redundancy. Using these descriptions increases the efficiency of video transmission [53], [58]-[60] and provides resilience against packet losses. A peer should receive at least one of the descriptions to seamlessly decode the content, and can receive additional descriptions that enhance the video quality. One possible way of obtaining scalable multiple descriptions is to use the base and enhancement layers generated by SVC. We note that widely used open or commercial P2P applications/services that rely on purely tree-based solutions do not exist. The major reason for this is the lack of sufficient upload capacity at the peers due to asymmetric Internet connections. Each peer is expected to feed multiple peers in a tree-based solution, which is difficult to realize when the peer upload capacity does not match the video rate. Consequently, peers cannot branch to multiple peers; on the contrary, multiple peers need to jointly serve another peer, as depicted in Fig. 6. The resulting architecture becomes the opposite of what was intended. In order to address this problem, researchers at the University of California, Berkeley have suggested the use of helper peers to assist peers that lack upload capacity [61], [62]. However, in the absence of proper incentive mechanisms, it is still difficult to implement a scalable tree-based P2P video distribution solution.

2) Mesh-Based Approach: In mesh-based solutions, data are distributed over an unstructured network in which each peer can connect to multiple peers. These connections work both ways, so there is no parent–child relation in the network. While this increased connectivity alleviates the problem of ungraceful peer exit, building multiple connections dynamically requires a certain amount of time, which we call the initiation interval. A peer cannot fully utilize its resources until the peer search is over. Therefore, mesh-based solutions are more suitable for applications that can tolerate some initiation interval.

Fig. 6. Tree formation when upload capacity of peers is half of the media bit rate.
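The inversion in Fig. 6 is simple arithmetic; the sketch below computes how many parents each peer needs when its upload capacity is only a fraction of the media rate (the rates used are illustrative).

```python
import math

# When a peer's upload capacity is below the media bit rate, it cannot
# feed even one child alone; instead, several parents must jointly serve
# each peer, inverting the intended tree (cf. Fig. 6).
def parents_needed(video_kbps: float, upload_kbps: float) -> int:
    return math.ceil(video_kbps / upload_kbps)

print(parents_needed(2000.0, 1000.0))  # upload = half the rate -> 2 parents
```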


BitTorrent is a popular protocol that adopts the mesh-based P2P approach in order to distribute equally sized pieces of a file [63]. The protocol starts by finding peers via the tracking servers listed in the ".torrent" metadata file. The file also includes the hash information of the pieces for data integrity. With its rarest-first policy, the BitTorrent protocol aims to increase the availability of each piece. Moreover, it has a choking subroutine to eliminate free riders, a tit-for-tat policy to provide incentives for uploading, and pipelining to saturate the client's downloading capacity [64]. While these approaches have proven successful for file sharing, video streaming requires some modifications to the BitTorrent protocol. In the literature, there are highly cited works that aim to increase the effectiveness of BitTorrent for VoD applications. For instance, BiToS [65] modifies the piece selection policy to make it time sensitive, whereas BASS [66] uses the BitTorrent protocol to assist a streaming server. Perhaps the most successful adaptation of the BitTorrent protocol for video streaming is Tribler [67]. The major advantage of Tribler is the social networking among peers for more effective content discovery and content sharing [68]. Social networking is also used during peer search, which shortens the initiation period. In addition to the BitTorrent architecture, Tribler features a more advanced incentive mechanism called give-to-get [69], Merkle hashes to provide security [70], and cooperative download, in which friends share their upload capacity to increase overall efficiency [67]. Currently, the European project P2P-Next supports Tribler and also NextShare [71], which is another platform for P2P video sharing using the BitTorrent architecture [72].
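To illustrate the kind of time-sensitive piece selection mentioned above, here is a minimal picker in the spirit of BiToS (not its exact policy): pieces in a window near the playback point are preferred, with rarest-first as the tie-breaker; the window size is an assumption.

```python
import random

def pick_piece(missing: set[int], availability: dict[int, int],
               playhead: int, window: int = 16) -> int | None:
    """Prefer pieces needed soon for playback; among candidates,
    download the rarest piece first, as in BitTorrent."""
    urgent = [p for p in missing if playhead <= p < playhead + window]
    pool = urgent if urgent else sorted(missing)
    if not pool:
        return None
    rarest = min(availability.get(p, 0) for p in pool)
    return random.choice([p for p in pool if availability.get(p, 0) == rarest])
```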

III. ADAPTIVE STEREOSCOPIC VIDEO STREAMING

One of the most important differences between adaptive monocular and stereoscopic video streaming is the increased flexibility in adapting the stereo video source rate to the available network rate. While in monocular video it is only possible to discard MGS layers sequentially in decreasing quality_id order, in stereoscopic video it is possible to discard MGS layers from only one of the views (asymmetrically) or from both views (symmetrically or asymmetrically). We note that streaming of asymmetrically coded video is different from asymmetric rate scaling of scalable but symmetrically coded streams, which may be performed only during periods of congestion for rate adaptation purposes. We discuss both options in the following. However, we first discuss the determination of the just noticeable asymmetric degradation level for stereo video coding. We then present an analysis of the tradeoff between encoding efficiency and scalability, comparing several encoding schemes for asymmetric streaming.


A. Determination of Just Noticeable Degradation for Asymmetric Stereo Coding
It is well known that decreasing the quality of one of the views may yield perceptually unnoticeable degradation in 3-D viewing. It is possible to achieve asymmetry by scaling the quality of one of the views (the secondary view) in the spatial, signal-to-noise ratio (SNR), or temporal dimension. However, which method should be used, and what level of asymmetry can be reached before observers start noticing visible degradations, must be determined. Stelmach et al. [26], [27] suggest using low-pass filtering, which corresponds to scaling in the spatial dimension, to achieve asymmetry. In [28], it was concluded that the subjective quality of SNR-asymmetric stereo video relates to the average of the qualities of the left and right views, whereas the subjective quality of resolution-asymmetric stereo video is closer to that of the higher resolution view. More recent studies argue that the resolution-asymmetry method results in coarse changes in the perceived video quality, and suggest using SNR asymmetry in order to be able to tune the level of asymmetry more finely [29], [30]. It is also claimed that, in the case of SNR asymmetry, artifacts are unnoticeable in 3-D if the PSNR of the secondary (lower quality) view is higher than a threshold, provided that the reference (higher quality) view is coded at a high enough PSNR. This claim is supported by subjective tests using two different display systems: a polarized projection system that is very similar to the ones in 3-D theaters, and an autostereoscopic display that is more likely to appear as a mobile appliance. The latter is equipped with a parallax barrier that blocks light in certain directions and achieves glasses-free stereoscopy at the cost of decreased effective spatial resolution. The viewing distance for the autostereoscopic display is set to its sweet-spot location, and the viewing distance for the projector is determined to match the same distance/screen-size ratio. An interactive test has been conducted where both views are first displayed at 38 dB. Throughout the test, assessors increase the quantization parameter (to decrease quality) of the secondary view gradually, down to 25 dB, and after each decrease they compare the quality of the current video against the initial video, where both views are at 38 dB. The test continues as long as the assessor cannot notice any quality degradation and ends when the distortion becomes noticeable. The results show that the "just noticeable" threshold PSNR is 33 dB for the polarized projection display and 31.5 dB for the parallax barrier display (see Table 1). One possible cause for this difference is that the polarized projector displays views at full resolution, whereas on the parallax barrier display the effective resolution is halved, which may conceal artifacts up to a higher degree of asymmetry. Interestingly, the threshold value is not content dependent. In [30], it is stated that the asymmetry in the SNR domain becomes more noticeable when the PSNR drops below this threshold value. Conclusions presented in another study [31] also agree with this observation.

Table 1 Visibility Threshold for Asymmetric Coding

Display system                         Threshold PSNR (secondary view)
Polarized projection                   33 dB
Parallax barrier (autostereoscopic)    31.5 dB

However, entertainment-quality streaming services commonly operate above this threshold, so we can state that, for 3DTV-over-IP applications, SNR-asymmetric coding or rate scaling may be useful for achieving lower overall transmission bit rates.
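A small worked sketch of how these thresholds could drive an encoder configuration: the secondary-view target PSNR is clamped to the display-dependent visibility threshold from Table 1. The helper function and its keys are hypothetical.

```python
# Visibility thresholds reported above (Table 1), in dB.
THRESHOLD_DB = {"polarized_projection": 33.0, "parallax_barrier": 31.5}

def secondary_view_target_psnr(display: str, desired_db: float) -> float:
    """Never let the secondary view drop below the just-noticeable
    threshold for the given display type."""
    return max(desired_db, THRESHOLD_DB[display])

print(secondary_view_target_psnr("parallax_barrier", 30.0))  # -> 31.5
```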

B. Asymmetric Encoding for Adaptive Streaming
1) Asymmetric Coding at a Fixed Rate Using MVC: Several research groups have proposed asymmetric coding using MVC to exploit the HVS and increase compression efficiency beyond what is possible with inter-view prediction alone. Fehn et al. [73] propose using additional downsampling steps in the encoding process to achieve asymmetry in spatial resolution. For the backward-compatible reference view, the authors use the same procedure defined in the MVC standard. For the dependent secondary view, the frames of the first view are downscaled and then used as references in inter-view prediction. This way, spatially reduced frames of the second view can be predicted from the primary view at full resolution. However, this is not part of the current MVC standard. For temporal-asymmetric coding, Anil et al. [74] proposed a lightweight frame-skipping method for the secondary view to decrease the overall bit rate. The authors propose a postprocessing procedure in which the NAL units of the odd-numbered frames in the second view are replaced with special NAL units signaling skip mode for each macroblock. When such a frame is decoded, each macroblock is copied from the reference frame, resulting in frame repetition. If these odd-numbered frames were discarded without signaling skip mode, error concealment would have to be implemented at the decoder to obtain the same result. Implementing SNR asymmetry is straightforward compared to the other types of asymmetry, and does not require any modifications to the current MVC. Since the encoding quality of a view depends on the quantization parameter used, utilizing different quantization parameters for the left and right views results in asymmetry in the SNR domain.

2) Scalable Asymmetric Coding Using SVC: Although there have been proposals for spatial and quality scalable MVC [21]-[23], the current MVC standard only supports the temporal and view scalability provided by hierarchical prediction structures. However, it is possible to obtain spatially and/or quality scalable right and left views if they are simulcast coded using the SVC standard. One benefit of simulcast coding is that it generates independently decodable bit streams, easing the synchronization between views. In cases where inter-view redundancy is difficult to exploit due to factors such as camera orientation, calibration problems, and lighting differences, simulcast coding may achieve compression efficiency comparable to that of MVC. There are two encoding options for achieving scalable asymmetric stereoscopic video bit streams when simulcast coding is used: encoding both views using SVC, or encoding one view with SVC and the other with H.264/AVC.

Fig. 7. Scalable coding options for stereoscopic case. (a) Two-view scalability, PSNR range of views. (b) One-view scalability, PSNR range of views.

Fig. 7(a) presents a possible layer configuration when both views are encoded using SVC. In this scheme, maximum adaptation capability (the largest range of adaptation) is achieved at the cost of decreased compression efficiency. Hence, it is possible to scale the video to lower bit rates, but the maximum PSNR will be reduced due to the scalability overhead. Although the figure shows a symmetric quality distribution, asymmetry can be obtained through unequal rate allocation during extraction. Hence, one of the views may be extracted at the highest possible rate, while the other is kept at the base layer quality or higher, if the bit budget allows. Fig. 7(b) depicts the quality levels of the bit streams when one of the views is encoded using H.264/AVC instead of SVC. Comparable adaptation capability is still possible if the bit-rate saving from using H.264/AVC is used to augment the enhancement layer of the scalable bit stream. In this scheme, the HVS is exploited both when the video is transmitted at the maximum quality and when it is transmitted at the minimum quality. When the available bandwidth is more than the maximum rate of the video, the quality of the scalable bit stream dominates the perceived quality. On the other hand, when the bandwidth is scarce, the nonscalable bit stream becomes the high-quality pair of the asymmetry.

Table 2 RD Values for Different Stereoscopic Encoding Options

Table 2 presents encoding results comparing the encoding options (one-view scalable versus two-view scalable). The results using MVC are also presented for benchmark purposes. It is seen that for computer-generated content (Adile), MVC significantly outperforms scalable coding, whereas for an actual recorded scene (Flower), both options have comparable encoding performance. Of the two scalable options, the one-view scalable option provides better visual performance, as it avoids the scalability overhead for one view and also takes advantage of asymmetric coding. For instance, the Flower scene is encoded at about 36 dB for the left view and 33 dB for the right view, and in [30] it is suggested that viewers perceive this 3-D scene as if both views were at about 36 dB. For the case when both views are scalable, the pair is encoded at about 34 dB, and the same study claims that viewers can notice the difference.
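A sketch of the unequal extraction described for Fig. 7(b): the H.264/AVC view has a fixed rate, and the SVC view absorbs the remaining budget, never dropping below its base layer. The rate numbers and function are illustrative assumptions.

```python
def extract_asymmetric(budget_kbps: float, avc_view_kbps: float,
                       svc_points_kbps: list[float]) -> float:
    """Return the extraction rate for the SVC-coded view: the highest
    operating point that fits the budget left over after the nonscalable
    (H.264/AVC) view, with the base layer as the floor."""
    remaining = budget_kbps - avc_view_kbps
    feasible = [r for r in sorted(svc_points_kbps) if r <= remaining]
    return feasible[-1] if feasible else min(svc_points_kbps)

# e.g., 3 Mb/s total, 1.8 Mb/s AVC view, SVC points at 0.6/0.9/1.4 Mb/s
print(extract_asymmetric(3000, 1800, [600, 900, 1400]))  # -> 900
```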

C. Adaptive Asymmetric Stereo Video Streaming
In conventional 2-D video streaming, adapting the video rate to the available network bandwidth plays a crucial role in maximizing the received video quality. In 3-D video streaming, content adaptation becomes even more critical because: 1) the bandwidth requirement is higher, so the available bandwidth should be utilized effectively; and 2) how to distribute the available network rate between the views is an open issue; while asymmetric rate distribution seems to be a promising solution, the best adaptation strategy has yet to be determined. In order to adapt the video rate to the network rate, at least one of the views should be scalable coded. Alternatively, at least one of the views should be coded at multiple rates. The available network rate may then be distributed either equally among the views, or one of the views may be transmitted at a higher rate than the others. Tests performed with asymmetric coding using short clips reveal that asymmetric distribution of the available network rate among the views results in video with higher perceived quality [75]. Here, it should be noted that all views should be sent over the same network connection if a TCP-compatible congestion control scheme is to be used during transport. Otherwise, the available network rate will not be distributed among the views asymmetrically, as intended, because TCP-compatible congestion control schemes tend to divide the available network rate equally among flows sharing the same bottleneck link.

IV. ADAPTIVE MULTIVIEW VIDEO STREAMING

Free-view video is foreseen as the next big step in 3-D video technology beyond stereoscopy. Multiview video is required to provide free-view functionality, which enables viewers to see a 3-D scene from slightly different viewing angles as they move/turn their heads. The free-view experience becomes more realistic as the number of views used to sample the viewing cone increases. Clearly, the bandwidth requirement to transmit MVV with a large number of views also increases, and may not be feasible with today's technology. A straightforward approach to lessen the bandwidth problem is to extend the concept of asymmetric coding to MVV streaming, especially for a relatively small number of views, as discussed in Section IV-A. A more efficient (in terms of bandwidth consumption) and flexible (in terms of number of views) approach for MVV transport is streaming the MVD representation. The details of this approach, which features view scalability, are discussed in Section IV-B. Finally, view-selective encoding and interactive streaming of multiview video, which requires computer vision methods for real-time head/gaze tracking, can be used to limit the number of views transmitted. This approach is briefly described in Section IV-C.

A. Asymmetric Streaming of Multiview Video
A straightforward extension of asymmetric streaming from stereoscopy to multiview video at a reasonable bit rate may be possible for displays with a limited number of views, e.g., five views, by alternating views encoded at high and low quality in sequential view order. It is indeed possible to exploit the HVS this way, because multiview displays actually present stereoscopy, with different views shown at different positions. Rate allocation among views can also be performed depending on the user's viewing angle. A contribution factor is proposed in [76] for differentiating the importance of each view in the rendered 3-D picture, where the factor is used to manage the spatial quality of the views in network-adaptive streaming. In [77], the authors present a multiview streaming framework that uses currently available streaming protocols with minor modifications, where the views are streamed using separate RTSP sessions, and a client may choose to initiate only the required number of sessions. Moreover, if MVC encoding is utilized, the server informs the client about the inter-view dependencies during the handshaking process, so that the required views can be requested.
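As an illustration of viewing-angle-dependent rate allocation, the sketch below weights each view by its angular distance from the user's gaze and splits the budget proportionally. The exponential weight is an assumption for illustration, not the contribution factor defined in [76].

```python
import math

def allocate_rates(view_angles_deg: list[float], gaze_deg: float,
                   total_kbps: float, falloff_deg: float = 10.0) -> list[float]:
    """Give more rate to views near the current viewing angle; views far
    from the gaze direction still get a nonzero share."""
    weights = [math.exp(-abs(a - gaze_deg) / falloff_deg)
               for a in view_angles_deg]
    total_w = sum(weights)
    return [total_kbps * w / total_w for w in weights]

print(allocate_rates([-20, -10, 0, 10, 20], gaze_deg=0.0, total_kbps=10000))
```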

B. Multiview Video Streaming Using MVD Representation
As the number of views in MVV increases beyond, say, five to six views, asymmetric coding and direct application of MVC become inadequate to sufficiently compress the multiview content. In such cases, MVD coding, discussed in Sections II-A and II-B, can be utilized, since it features view and temporal scalability if MVC is used, and spatial and quality scalability if simulcast SVC is used, for adaptive streaming. Here, a variable number of views and their associated depth maps, depending on the user's 3-D display terminal and the available bandwidth, are transmitted, and intermediate views are rendered at the user terminal using depth-based image rendering (DBIR) methods. For example, in order to drive a 45-view display, typically five views and their associated depth maps can be streamed using about 30 Mb/s, and the remaining views are rendered at the receiver side. The coding rate of the views and the corresponding depth maps can be modified during streaming to adapt to the dynamic network conditions [78]. The depth map in general requires about 15%-20% of the video bit rate to produce acceptable results [10]. Merkle et al. [11] applied the temporal and inter-view prediction structure of MVC to multiview depth data, where further improvement in encoding efficiency is observed with respect to simulcast coding of depth data.
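The 30-Mb/s figure above is consistent with simple arithmetic under an assumed per-view texture rate; the numbers in this check are illustrative, with the depth overhead taken from the 15%-20% range cited from [10].

```python
# Back-of-the-envelope check of the MVD bandwidth example: five texture
# views plus depth maps at ~20% of the texture rate [10].
views = 5
texture_rate_mbps = 5.0   # assumed rate per coded texture view
depth_overhead = 0.20     # depth costs 15%-20% of the texture rate
total_mbps = views * texture_rate_mbps * (1 + depth_overhead)
print(f"{total_mbps:.0f} Mb/s")  # -> 30 Mb/s, matching the text
```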

C. Selective Streaming of Multiview Video
Selective streaming is a method to reduce the bandwidth requirements of MVV for a single-user head-tracking 3-D display, in which only a subset of the views is streamed, depending on the user's viewing angle. To select which views should be streamed, the viewer's current head position is tracked and a prediction of future head positions is computed. Head-tracking displays are reviewed in [79]. In order to conceal prediction errors, low-quality versions of the other views may also be streamed [80]. However, the selective streaming method suffers from fast head movements. Also, the delay in stream switching, which is determined by the frequency of switching frames, may degrade the perceived quality. This delay can be decreased by increasing the frequency of switching frames, which in turn may decrease the encoding efficiency. A more bandwidth-efficient multiview frame coding structure can also be utilized [81], instead of simply inserting I-frames for view switching.

Fig. 8. Architecture of project DIOMEDES.

V. FUTURE RESEARCH DIRECTIONS AND CONCLUSIONS

A. P2P-Assisted Multiview Video Broadcast
Developing a solution that scales in transmission rate, number of views, and number of users served for the distribution of multiview video is critical for the successful deployment of free-view TV and video services over the Internet. The classical server–client architecture does not seem feasible, since the transmission bandwidth increases almost linearly with the number of transmitted views, even with advanced coding algorithms. Moreover, the server–client architecture does not scale well with an increasing number of users. The European project DIOMEDES [4] proposes a scalable architecture that utilizes the upload capacity of peers to assist the distribution of up to 200 views and associated 3-D audio (see Fig. 9). The DIOMEDES architecture, shown in Figs. 8 and 10, aims to benefit from the current DTV infrastructure as well as the projected increase in the capacity of Internet connections in the intermediate term. The DVB-T signal provides stereoscopic 3-D media as a baseline, and is assisted by P2P distribution of the remaining MVV views over IP to enable an immersive free-view TV experience [81].

Fig. 9. Three-dimensional multiview video and spatial audio display at the University of Surrey, U.K. (Courtesy of Dr. Stewart Worrall.)

The DIOMEDES P2P architecture is based on the BitTorrent solution discussed in Section II-E, in which the data, split into multiple substreams (chunks), are distributed over a P2P network. However, for multimedia distribution, the chunk-picking policy is modified to operate in a timely fashion. Moreover, for adaptive P2P video streaming, the video stream is split into base layer chunks and enhancement layer chunks. This property is defined in the metadata file. The major difference between P2P MVV streaming and P2P monoscopic video streaming is the view scalability option in P2P MVV, where an adaptation engine decides whether to drop some views or decrease their qualities, based on the available network rate and the user display terminal. The DIOMEDES architecture consists of three modules: the 3-D content server, the master peers, and the 3-D media streaming server, as depicted in Fig. 10. The 3-D content server encodes two stereo views using H.264 in a frame-compatible format and encapsulates them in MPEG-TS for broadcast over traditional DVB. The rest of the views are simulcast encoded using SVC, encapsulated with MPEG-TS and RTP headers, and distributed using P2P networking over the open Internet.

Fig. 10. The architecture for MVV broadcast.

In P2P delivery, the master peers are responsible for seeding to a subgroup of the other peers in the swarm, whose number is determined by the planned level of participation of the server in the data distribution. In order to initiate the streaming session, a peer first contacts the 3-D media streaming server, which authenticates the peer and forwards the list of active peers in the video session. Through this authentication process, the streaming server can regulate the data dissemination and enforce content management policies. Moreover, the streaming server distributes the hash map for the content to ensure the integrity of the distributed content. This architecture is flexible enough to support a wide range of displays. For instance, a user with an N-view display can either subscribe to N - 2 video streaming sessions (one for each additional view) or to a lower number of video-plus-depth transmissions and perform rendering of the missing views using the associated depth maps. In the meantime, a user with a stereoscopic display would receive content from the DVB broadcast channel only. Naturally, peers that use both DVB and IP channels should synchronize the received signals. The MPEG-TS header is used to synchronize streams with the DVB signal, whereas the RTP header is primarily intended to track packet-loss events. A representative client model is depicted in Fig. 8. In this model, the main control module invokes the streaming session by signaling the P2P unit, which is responsible for both authentication and data transmission. The incoming data from the DVB-T and IP channels are first synchronized and then forwarded to the corresponding audio/video module, which decodes, renders, and displays the content. For rate adaptation purposes, a peer may unsubscribe from some of the streaming sessions (for some views) in case of insufficient bandwidth, and resubscribe later if bandwidth becomes available again.

B. Conclusion
We reviewed three adaptive streaming solutions for the distribution of 3-D media. The first, asymmetric streaming, can be utilized for displays with a limited number of views, such as five views or fewer. We note that the visual experiments on asymmetric coding have so far been conducted on short video clips, and some experts claim that viewing asymmetrically coded stereo video over longer periods may cause eye fatigue, which needs to be studied further. If this is the case, asymmetric streaming only during periods of congestion may be more desirable. The second and third methods, streaming using MVD and selective streaming, respectively, are intended for displays that support more views, such as 5-200 views. Selective streaming requires tracking the viewer's head position; hence, it is applicable in the case of a single user with a head-tracking 3-D display [79]. Thus, adaptive streaming using the MVD representation seems better suited for general purpose multiview video applications with more than five views. Broadcast of stereoscopic 3-D media over digital TV platforms has already started. However, these platforms cannot provide sufficient bandwidth to broadcast multiview video due to physical channel limitations. Hence, we foresee that, in the medium term, multiview video services will be developed using the second method, and that these services will be deployed over the IP platform using various architectures, including server–client and P2P. In particular, DIOMEDES [4] addresses robust and flexible distribution of multiview TV/video services using a combination of DVB and IP, where stereo DVB broadcast will be complemented by P2P streaming of the remaining views and the corresponding depth maps to provide a free-view TV experience. Streaming of holographic 3-D video over IP is projected in the long term, since dynamic holographic display technology and compression methods for such data sources are not yet mature.

REFERENCES

[1] T. Schierl and S. Narasimhan, "Transport and storage systems for 3-D video using MPEG-2 systems, RTP, and ISO file format," Proc. IEEE, vol. 99, no. 4, Apr. 2011, DOI: 10.1109/JPROC.2010.2091370.
[2] DVB BlueBook A151, Commercial Requirements for DVB 3D-TV, Jul. 2010.

[3] Requirements for the Support of IPTV Services, ITU-T Recommendation Y.1901, 2009.
[4] DIOMEDES. [Online]. Available: http://www.diomedes-project.eu
[5] HDMI Specification 1.4a, HDMI Licensing, LLC, Mar. 2010.
[6] ATTEST. [Online]. Available: http://www.hitech-projects.com/euprojects/attest/
[7] Text of ISO/IEC FDIS 23002-3 Representation of Auxiliary Video and Supplemental Information, ISO/IEC JTC1/SC29/WG11, Marrakech, Morocco, Doc. N8768, Jan. 2007.
[8] Text of ISO/IEC 13818-1:2003/FDAM2 Carriage of Auxiliary Data, ISO/IEC JTC1/SC29/WG11, Marrakech, Morocco, Doc. N8799, Jan. 2007.

[9] P. Merkle, A. Smolic, K. Müller, and T. Wiegand, "Multi-view video plus depth representation and coding," in Proc. IEEE Int. Conf. Image Process., San Antonio, TX, Sep. 2007, pp. I-201–I-204.
[10] A. Smolic, K. Muller, N. Stefanoski, J. Osterman, A. Gotchev, G. B. Akar, G. Triantafyllidis, and A. Koz, "Coding algorithms for 3DTV: A survey," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1606–1621, Nov. 2007.
[11] P. Merkle, K. Müller, A. Smolic, and T. Wiegand, "Efficient compression of multiview depth data based on MVC," in Proc. 3DTV Conf., Kos, Greece, May 2007, DOI: 10.1109/3DTV.2007.4379460.
[12] P. Kauff, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, A. Smolic, and R. Tanger, "Depth map creation and image based rendering for advanced 3DTV services providing interoperability and scalability," Signal Process.: Image Commun., vol. 22, no. 2 (Special Issue on 3DTV), Feb. 2007, DOI: 10.1016/j.image.2006.11.013.
[13] J. Shade, S. Gortler, L. He, and R. Szeliski, "Layered depth images," in Proc. ACM SIGGRAPH, Orlando, FL, 1998, pp. 231–242.
[14] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1103–1120, Sep. 2007.
[15] H. Kirchhoffer, D. Marpe, H. Schwarz, and T. Wiegand, "A low-complexity approach for increasing the granularity of packet-based fidelity scalability in scalable video coding," in Proc. Picture Coding Symp., Nov. 2007.
[16] B. Gorkemli, Y. Sadi, and A. M. Tekalp, "Effects of MGS fragmentation, slice mode and extraction strategies on the performance of SVC with medium-grained scalability," in Proc. IEEE Int. Conf. Image Process., Hong Kong, Sep. 2010, pp. 4201–4204.
[17] A. Vetro, T. Wiegand, and G. Sullivan, "Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard," Proc. IEEE, vol. 99, no. 4, Apr. 2011, DOI: 10.1109/JPROC.2010.2098830.
[18] H. Schwarz, D. Marpe, and T. Wiegand, "Analysis of hierarchical B-pictures and MCTF," in Proc. IEEE Int. Conf. Multimedia Expo, Toronto, ON, Canada, Jul. 2006, pp. 1929–1932.
[19] Y. Chen, Y.-K. Wang, K. Ugur, M. Hannuksela, J. Lainema, and M. Gabbouj, "The emerging MVC standard for 3D video services," EURASIP J. Adv. Signal Process., vol. 2009, 2009, DOI: 10.1155/2009/786015.
[20] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, "Efficient prediction structures for multiview video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1461–1473, Nov. 2007.
[21] M. Drose, C. Clemens, and T. Sikora, "Extending single-view scalable video coding to multi-view based on H.264/AVC," in Proc. IEEE Int. Conf. Image Process., Atlanta, GA, Oct. 2006, pp. 2977–2980.
[22] N. Ozbek and A. M. Tekalp, "Scalable multi-view video coding for interactive 3DTV," in Proc. IEEE Int. Conf. Multimedia Expo, Toronto, ON, Canada, Jul. 2006, pp. 213–216.
[23] J. Garbas, U. Fecker, T. Tröger, and A. Kaup, "4D scalable multi-view video coding using disparity compensated view filtering and motion compensated temporal filtering," in Proc. Int. Workshop Multimedia Signal Process., Oct. 2006, pp. 54–58.


[24] S. Tao, Y. Chen, M. M. Hannuksela, Y.-K. Wang, M. Gabbouj, and H. Li, "Joint texture and depth map video coding based on the scalable extension of H.264/AVC," in Proc. IEEE Int. Symp. Circuits Syst., 2009, pp. 2253–2256.
[25] L. M. J. Meesters, W. A. IJsselsteijn, and P. J. H. Seuntiens, "A survey of perceptual evaluations and requirements of three-dimensional TV," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 3, pp. 381–391, Mar. 2004.
[26] L. B. Stelmach, W. J. Tam, D. V. Meegan, A. Vincent, and P. Corriveau, "Human perception of mismatched stereoscopic 3D inputs," in Proc. IEEE Int. Conf. Image Process., Vancouver, BC, Canada, Sep. 2000, vol. 1, pp. 5–8.
[27] L. B. Stelmach, W. J. Tam, D. V. Meegan, and A. Vincent, "Stereo image quality: Effects of mixed spatio-temporal resolution," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 2, pp. 188–193, Mar. 2000.
[28] W. J. Tam, "Image and depth quality of asymmetrically coded stereoscopic video for 3D-TV," Joint Video Team Doc. JVT-W094, Apr. 2007.
[29] G. Saygili, G. Gurler, and A. M. Tekalp, "3D display-dependent quality evaluation and rate allocation using scalable video coding," in Proc. IEEE Int. Conf. Image Process., Cairo, Egypt, Nov. 2009, pp. 717–720.
[30] G. Saygili, G. Gurler, and A. M. Tekalp, "Quality assessment of asymmetric stereo video coding," in Proc. IEEE Int. Conf. Image Process., Hong Kong, Sep. 2010, pp. 4009–4012.
[31] P. Aflaki, M. M. Hannuksela, J. Hakkinen, P. Lindroos, and M. Gabbouj, "Subjective study on compressed asymmetric stereoscopic video," in Proc. IEEE Int. Conf. Image Process., Hong Kong, Sep. 2010, pp. 4021–4024.
[32] B. Wang, J. F. Kurose, P. J. Shenoy, and D. F. Towsley, "Multimedia streaming via TCP: An analytic performance study," in Proc. ACM Multimedia, Oct. 2004, pp. 908–915.
[33] E. Kohler, M. Handley, and S. Floyd, "Datagram congestion control protocol (DCCP)," RFC 4340, Mar. 2006.
[34] B. Gorkemli and A. M. Tekalp, "Adaptation strategies for streaming SVC video," in Proc. IEEE Int. Conf. Image Process., Hong Kong, Sep. 2010, pp. 2913–2916.
[35] Y. B. Zikria, S. A. Malik, H. Ahmed, S. Nosheen, N. Z. Azeemi, and S. A. Khan, "Video transport over heterogeneous networks using SCTP and DCCP," in Proc. Int. Multi Topic Conf., 2008, pp. 180–190.
[36] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications," RFC 3550, Jul. 2003.
[37] J. Lazzaro, "Framing RTP and RTCP packets over connection-oriented transport," RFC 4571, Jul. 2006.
[38] J. Crowcroft and P. Oechslin, "Differentiated end to end Internet services using a weighted proportional fair sharing TCP," ACM Comput. Commun. Rev., vol. 28, no. 3, Jul. 1998, DOI: 10.1145/293927.293930.
[39] D. Damjanovic and M. Welzl, "MulTFRC: Providing weighted fairness for multimedia applications (and others too!)," ACM Comput. Commun. Rev., vol. 39, no. 3, Jul. 2009.
[40] S. Lee and K. Chung, "Buffer-driven adaptive video streaming with TCP-friendliness," Comput. Commun., vol. 31, no. 10, pp. 2621–2630, 2008.
[41] D. T. Nguyen and J. Ostermann, "Congestion control for scalable video streaming using the scalability extension of H.264/AVC," IEEE J. Sel. Topics Signal Process., vol. 1, no. 2, pp. 246–253, Aug. 2007.


[42] B. Xie and W. Zeng, "Rate distortion optimized dynamic bitstream switching for scalable video streaming," in Proc. IEEE Int. Conf. Multimedia Expo, Taipei, Taiwan, Jun. 2004, vol. 2, pp. 1327–1330.
[43] P. Zhu, W. Zeng, and C. Li, "Joint design of source rate control and QoS-aware congestion control for video streaming over the Internet," IEEE Trans. Multimedia, vol. 9, no. 2, pp. 366–376, Feb. 2007.
[44] B. Girod, M. Kalman, Y. J. Liang, and R. Zhang, "Advances in channel-adaptive video streaming," in Proc. IEEE Int. Conf. Image Process., Rochester, NY, Sep. 2002, vol. 1, pp. 9–12.
[45] J. Chakareski, J. G. Apostolopoulos, S. Wee, W. Tan, and B. Girod, "Rate-distortion hint tracks for adaptive video streaming," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 10, pp. 1257–1269, Oct. 2005.
[46] N. Ozbek, B. Gorkemli, A. M. Tekalp, and T. Tunali, "Adaptive streaming of scalable stereoscopic video over DCCP," in Proc. IEEE Int. Conf. Image Process., San Antonio, TX, Sep. 2007, vol. 6, pp. 489–492.
[47] R. Pantos, "HTTP live streaming," Internet-Draft, draft-pantos-http-live-streaming-02, Oct. 2009.
[48] J. Vieron and C. Guillemot, "Real-time constrained TCP-compatible rate control for video over the Internet," IEEE Trans. Multimedia, vol. 6, no. 3, pp. 634–646, Aug. 2004.
[49] M. Wien, R. Cazoulat, A. Graffunder, A. Hutter, and P. Amon, "Real-time system for adaptive video streaming based on SVC," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 9, pp. 1227–1237, Sep. 2007.
[50] T. Schierl, Y. Sanchez de la Fuente, R. Globisch, C. Hellge, and T. Wiegand, "Priority-based media delivery using SVC with RTP and HTTP streaming," Springer Multimedia Tools Appl., 2010, DOI: 10.1007/s11042-010-0572-5.
[51] J. H. Jeon, S. C. Son, and J. S. Nam, "Overlay multicast tree recovery scheme using a proactive approach," Comput. Commun., vol. 31, pp. 3163–3168, 2008.
[52] M. Fesci, E. T. Tunali, and A. M. Tekalp, "Bandwidth-aware multiple multicast tree formation for P2P scalable video streaming using hierarchical clusters," in Proc. IEEE Int. Conf. Image Process., Cairo, Egypt, Nov. 2009, pp. 945–948.
[53] M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh, "SplitStream: High-bandwidth content distribution in cooperative environments," in Peer-to-Peer Systems II, ser. Lecture Notes in Computer Science, vol. 2735. Berlin, Germany: Springer-Verlag, Feb. 2003, pp. 292–303, DOI: 10.1007/978-3-540-45172-3_27.
[54] P. Baccichet, J. Noh, E. Setton, and B. Girod, "Content-aware P2P video streaming with low latency," in Proc. IEEE Int. Conf. Multimedia Expo, Beijing, China, Jul. 2007, pp. 400–403.
[55] J. Noh, P. Baccichet, F. Hartung, A. Mavlankar, and B. Girod, "Stanford peer-to-peer multicast (SPPM): Overview and recent extensions," in Proc. Picture Coding Symp., May 2009, DOI: 10.1109/PCS.2009.5167392.
[56] P. Baccichet, T. Schierl, T. Wiegand, and B. Girod, "Low-delay peer-to-peer streaming using scalable video coding," in Proc. Int. Packet Video Workshop, Nov. 2007, pp. 173–181.
[57] V. K. Goyal, "Multiple description coding: Compression meets the network," IEEE Signal Process. Mag., vol. 18, no. 5, pp. 74–93, Sep. 2001.
[58] E. Setton, P. Baccichet, and B. Girod, "Peer-to-peer live multicast: A video perspective," Proc. IEEE, vol. 96, no. 1, pp. 25–38, Jan. 2008.
[59] V. N. Padmanabhan, H. J. Wang, and P. A. Chou, "Resilient peer-to-peer streaming," in Proc. IEEE Int. Conf. Network Protocols, 2003, pp. 16–27.
[60] D. Jurca, J. Chakareski, J. P. Wagner, and P. Frossard, "Enabling adaptive video streaming in P2P systems," IEEE Commun. Mag., vol. 45, no. 6, pp. 108–114, Jun. 2007.
[61] J. Wang and K. Ramchandran, "Enhancing peer-to-peer live multicast quality using helpers," in Proc. IEEE Int. Conf. Image Process., San Diego, CA, Oct. 2008, pp. 2300–2303.
[62] H. Zhang and K. Ramchandran, "A reliable decentralized peer-to-peer video-on-demand system using helpers," in Proc. Picture Coding Symp., May 2009, DOI: 10.1109/PCS.2009.5167390.
[63] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips, "The BitTorrent P2P file-sharing system: Measurements and analysis," in Proc. 4th Int. Workshop Peer-to-Peer Syst., 2005, pp. 205–216.
[64] B. Cohen, "Incentives build robustness in BitTorrent," in Proc. Workshop Econom. P2P Syst., Jun. 2003.
[65] A. Vlavianos, M. Iliofotou, and M. Faloutsos, "BiToS: Enhancing BitTorrent for supporting streaming applications," in Proc. Global Internet Workshop/IEEE Int. Conf. Comput. Commun., Apr. 2006, DOI: 10.1109/INFOCOM.2006.43.
[66] C. Dana, D. Li, D. Harrison, and C. Chuah, "BASS: BitTorrent assisted streaming system for video-on-demand," in Proc. Int. Workshop Multimedia Signal Process., 2005, DOI: 10.1109/MMSP.2005.248586.
[67] Tribler. [Online]. Available: http://www.tribler.org/
[68] J. Pouwelse, P. Garbacki, J. Wang, A. Bakker, J. Yang, A. Iosup, D. Epema, M. Reinders, M. van Steen, and H. Sips, "Tribler: A social-based peer-to-peer system," in Proc. 5th Int. Workshop Peer-to-Peer Syst., 2006.
[69] J. Mol, J. Pouwelse, M. Meulpolder, D. Epema, and H. Sips, "Give-to-get: Free-riding-resilient video-on-demand in P2P systems," in Proc. SPIE: Multimedia Comput. Network Conf., vol. 6818, Jan. 2008, DOI: 10.1117/12.774909.
[70] R. C. Merkle, "A digital signature based on a conventional encryption function," in Proc. Conf. Theory Appl. Cryptographic Techn. Adv. Cryptology, Santa Barbara, CA, Aug. 1987, pp. 369–378.
[71] NextShare. [Online]. Available: http://www.livinglab.eu
[72] P2P-Next. [Online]. Available: http://www.p2p-next.org
[73] C. Fehn, P. Kauff, S. Cho, H. Kwon, N. Hur, and J. Kim, "Asymmetric coding of stereoscopic video for transmission over T-DMB," in Proc. 3DTV Conf., Kos, Greece, May 2007, DOI: 10.1109/3DTV.2007.4379449.
[74] A. Aksay, C. Bilen, E. Kurutepe, T. Ozcelebi, G. B. Akar, M. R. Civanlar, and A. M. Tekalp, "Temporal and spatial scaling for stereoscopic video compression," in Proc. 14th Eur. Signal Process. Conf., Florence, Italy, Sep. 2006.

[75] G. Gurler, K. Bagci, and A. M. Tekalp, "Adaptive stereoscopic 3D video streaming," in Proc. IEEE Int. Conf. Image Process., Hong Kong, Sep. 2010, pp. 2409–2412.
[76] Z. Yang, B. Yu, K. Nahrstedt, and R. Bajcsy, "A multi-stream adaptation framework for bandwidth management in 3D tele-immersion," in Proc. Int. Workshop Netw. Oper. Syst. Support Digit. Audio Video, May 2006, DOI: 10.1145/1378191.1378209.
[77] E. Kurutepe, A. Aksay, C. Bilen, C. G. Gurler, T. Sikora, G. B. Akar, and A. M. Tekalp, "A standards-based, flexible, end-to-end multi-view video streaming architecture," in Proc. Int. Packet Video Workshop, Lausanne, Switzerland, Nov. 2007, pp. 302–307.
[78] G. Petrovic, L. Do, S. Zinger, and P. H. N. de With, "Virtual view adaptation for 3D multiview video streaming," in Proc. SPIE: Electron. Imag., Stereoscopic Displays Appl., vol. 7524, Jan. 2010, DOI: 10.1117/12.840230.
[79] H. Urey, K. V. Chellappan, E. Erden, and P. Surman, "State of the art in stereoscopic and autostereoscopic displays," Proc. IEEE, vol. 99, no. 4, Apr. 2011, DOI: 10.1109/JPROC.2010.2098351.
[80] E. Kurutepe, M. R. Civanlar, and A. M. Tekalp, "Client-driven selective streaming of multiview video for interactive 3DTV," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1558–1565, Nov. 2007.
[81] G. Cheung, A. Ortega, and N.-M. Cheung, "Bandwidth-efficient interactive multiview live video streaming using redundant frame structures," in Proc. Annu. Summit Conf. Asia-Pacific Signal Inf. Process. Assoc., Oct. 2009.

ABOUT THE AUTHORS

C. Göktuğ Gürler (Member, IEEE), photograph and biography not available at the time of publication.

Burak Görkemli (Member, IEEE), photograph and biography not available at the time of publication.

Görkem Saygılı, photograph and biography not available at the time of publication.

A. Murat Tekalp (Fellow, IEEE) received double major B.S. degrees in electrical engineering and mathematics from Boğaziçi University, Istanbul, Turkey, in 1980 and the M.S. and Ph.D. degrees in electrical, computer, and systems engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1982 and 1984, respectively.

After working briefly at Eastman Kodak Research, he joined the University of Rochester, Rochester, NY, as an Assistant Professor in 1987, where he was promoted to Distinguished University Professor. He joined Koç University, Istanbul, Turkey, in 2001, where he is currently the Dean of Engineering. He authored the book Digital Video Processing (Englewood Cliffs, NJ: Prentice-Hall, 1995). He holds eight U.S. patents.

Prof. Tekalp is a member of the Turkish Academy of Sciences (TUBA) and a member of Academia Europaea. He was elected a Distinguished Lecturer by the IEEE Signal Processing Society in 1998 and received the TÜBİTAK Science Award in 2004. He was a member of the IEEE Signal Processing Society Technical Committee on Image and Multidimensional Signal Processing from 1990 to 1999, and chaired it during January 1996–December 1997. He was the Editor-in-Chief of the EURASIP journal Signal Processing: Image Communication, published by Elsevier (1999–2010). Formerly, he served as an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING (1990–1992) and the IEEE TRANSACTIONS ON IMAGE PROCESSING (1994–1996). He was also on the editorial boards of the IEEE Signal Processing Magazine (2006–2009) and the Academic Press journal Visual Communication and Image Representation (1995–2002). He was appointed the Special Sessions Chair for the 1995 IEEE International Conference on Image Processing, the Technical Program Co-Chair for the 2000 IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey, the General Chair of the 2002 IEEE International Conference on Image Processing (ICIP), Rochester, NY, and the Technical Program Co-Chair of the 2005 European Signal Processing Conference (EUSIPCO), Antalya, Turkey. He is the founder and first Chairman of the Rochester Chapter of the IEEE Signal Processing Society, and was elected Chair of the Rochester Section of IEEE for 1994–1995. He is a member of the Advanced Grants panel of the European Research Council and a project evaluator and referee for the European Commission, for which he has also been appointed a National Expert.
