IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 16, NO. 5, MAY 2007


Rate-Distortion Optimal Video Transport Over IP Allowing Packets With Bit Errors

Oztan Harmanci, Student Member, IEEE, and A. Murat Tekalp, Fellow, IEEE

Abstract: We propose new models and methods for rate-distortion (RD) optimal video delivery over IP, when packets with bit errors are also delivered. In particular, we propose RD optimal methods for slicing and unequal error protection (UEP) of packets over IP allowing transmission of packets with bit errors. The proposed framework can be employed in a classical independent-layer transport model for optimal slicing, as well as in a cross-layer transport model for optimal slicing and UEP, where the forward error correction (FEC) coding is performed at the link layer, but the application controls the FEC code rate with the constraint that a given IP packet is subject to constant channel protection. The proposed method uses a novel dynamic programming approach to determine the optimal slicing and UEP configuration for each video frame in a practical manner that is compliant with the AVC/H.264 standard. We also propose new rate and distortion estimation techniques at the encoder side in order to efficiently evaluate the objective function for a slice configuration. The cross-layer formulation option effectively determines which regions of a frame should be protected better; hence, it can be considered as a spatial UEP scheme. We demonstrate, by means of experimental results, that each component of the proposed system provides significant gains, up to 2.0 dB, compared to competitive methods.

Index Terms: Error resilience, optimal slices, rate-distortion optimization, resynchronization markers, unequal error protection, video transport.

I. INTRODUCTION

The protocol stacks employed in current IP networks, which were conceived for data and/or voice delivery, do not deliver packets with bit errors. The link and physical layers usually fragment IP packets, and the fragments that are received with bit errors are discarded. Any packet with missing fragments is treated as completely lost. Therefore, these networks are typically modeled as packet-loss channels, where bit errors are transformed into packet losses. On the other hand, it is generally agreed that video transport can benefit from delivery of packets that contain bit errors. Hence, new protocols that allow erroneous packet delivery have been proposed. For example, UDP-Lite [1] allows delivery of packets with bit errors to the application by performing the checksum only for the sensitive parts of a packet. This paper addresses rate-distortion optimal video transport over IP networks, where packets with bit errors are also delivered, which may be especially promising for wireless video communications.

In video transport over lossy channels, some degree of error resilience can be attained by using such tools as slicing and resynchronization markers. A slice is a group of macroblocks (MBs). In the case of lossy channels, it is generally desirable that each slice is independently decodable. A resynchronization marker (RM) is a special uniquely decodable codeword that typically marks the beginning of a new slice; that is, each slice is prefixed by an RM and encoded such that the decoder can achieve resynchronization at the start of each slice even if some previous slices are lost. Therefore, frequent slicing increases error resilience by confining the impact of an error to a shorter slice. However, there is a tradeoff between error resilience and compression efficiency: 1) in order to allow independent decodability, prediction from macroblocks that are outside the slice is disabled, thus reducing encoding efficiency; 2) in H.264 context-based adaptive variable-length coding (CAVLC) and context-based adaptive binary arithmetic coding (CABAC) [2], a new slice resets the context, thus causing further efficiency loss; and 3) each slice begins with a header, which can consume a significant amount of the bandwidth at low bit rates.

The slices are packetized for transport over IP networks. It is common practice to align packet borders with slice borders, which is also known as application layer framing (ALF) [3]. In a pure packet-loss network, there is no point in including more than one slice in a packet; hence, the common practice is to have one slice per packet.

Manuscript received February 14, 2006; revised December 1, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sohail Dianat. O. Harmanci is with the Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627-0126 USA. A. M. Tekalp is with the College of Engineering, Koc University, Istanbul, Turkey. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2007.891792
When the network and link layers are capable of delivering packets with bit errors, we can include multiple slices in a packet, and slices that contain any errors may be discarded, while others can be decoded. Therefore, strategies for slicing and packetization, i.e., the number and composition of slices within a video frame, should be reconsidered.

There is little published research on the rate-distortion (RD) optimization of the number and composition of slices within a frame. The mainstream research uses experimental selection of the slice size. In these works, the slice size is fixed either to a number of macroblocks or to a number of bits [4], [5]. An RD optimal approach has been presented in [6], where a “begin a slice” decision is made while encoding an MB. However, the loss in encoding efficiency of the following MBs is not considered at the time of decision making. Other slicing methods exist that determine the slice locations as inferred from another optimization problem. For example, in [7], the authors use a trellis-based estimation to optimize macroblock coding decisions and per-MB FEC by taking all of the MBs in the frame into account. Then, the decided FEC rates are used to determine where to insert the slices; that is, a different FEC rate marks the beginning of a new slice. Hence, slicing is inferred from MB FEC rates.

After slicing and packetization, the network layer passes the packets to the link layer, which fragments and transmits the fragments (also called frames) over the channel. Due to low signal-to-noise ratios, wireless links are capable of applying FEC at various rates. Typically, a link layer adjusts the FEC rate such that the application layer experiences a reasonable loss rate. There are methods in the literature that propose optimizing joint source channel coding of the video transmission at the application layer based on the actual bit-error rate observed over the link [8]–[11]. Although application layer FEC allows more flexibility, it introduces redundancy, since some form of channel coding is already being employed at the physical layer independent of the application layer (e.g., [12]). Therefore, it is more desirable that an interface exists between the application layer and link layer, which allows the application layer to select the channel code (FEC) rate and signal it to the link and physical layers.

When the application layer can control the FEC rates, it has been shown that protecting certain parts of the bitstream more strongly than other parts yields superior results in video communications compared to equal protection. This is known as UEP. Various methods have been proposed for UEP using scalable coding [9] with interpacket FEC to recover lost packets [13], fine granular scalable coding [14], [15], data partitioning [16], [17], and other prioritization techniques [10], [18]. After the FEC rate is decided and during packetization, there is an RTP/UDP/IP packet overhead of 40 bytes (12/8/20 bytes, respectively) for each packet [19], [20].

Fig. 1. Overview of the video transport model 1. Resynchronization markers are shown in gray and protocol headers are shown in black.

1057-7149/$25.00 © 2007 IEEE
Since the number of packets per frame is equal to the number of unique FEC rates used, the more FEC rates are used, the more bits are spent on headers and control information. Therefore, there is a tradeoff between the freedom in selecting different FEC rates for each slice and the packetization overhead due to protocol headers. In this paper, we propose to perform UEP spatially, i.e., certain parts of a given frame will be transmitted with higher protection. The problem is then to determine which MBs (i.e., slices) to protect and at what rate, while taking the packetization overhead into account.

In this paper, we address the following rate-distortion optimization problems: 1) determine the number and location of resynchronization markers, and 2) if application-controlled FEC is available, choose the FEC rate for each slice and determine packetization (i.e., UEP) over an IP network which transmits packets with bit errors. Since we aim to optimize resynchronization marker insertion and FEC rate, the scope of this work does not cover losses due to congestion. However, the proposed method can be combined with a packet-loss resilient video streaming method such as [21].

In the next section, we introduce two video transport models, Model 1 and Model 2. In Section III, we present a new rate-distortion optimal slicing method for transport model 1, and in Section IV, we extend this method to a cross-layer rate-distortion optimal packetization and UEP method for transport model 2. Section V discusses implementation of the proposed systems including video encoding. Sections VI and VII present experimental results and concluding remarks, respectively.

II. COMMUNICATION MODEL

This section reviews the video transport models and protocols, as well as the decoder model used in this work.

A. Video Transport Model

In this paper, we develop RD optimal video communication systems for two different transport models. While both models allow transmission of packets with bit errors, the first model assumes standard link layer controlled FEC, whereas the second is a cross-layer model which assumes that the application can control the rate of FEC performed at the link layer.

1) Model 1: The first model assumes that erroneous packets and fragments are delivered by the network and link layers, respectively, e.g., using the UDP-Lite protocol at the network layer. In this model, the link layer controls the FEC rate such that a reasonably low loss rate is experienced at the application layer at the expense of a reduced data transmission rate. An overview of this model is shown in Fig. 1. Under this model, each frame is sent as a single packet, and RTP/UDP/IP headers are appended to each packet, which causes an overhead of 40 bytes per packet. However, since the number of packets is fixed (i.e., equal to the number of frames), the bitrate spent on the header transmission is also fixed.
Then, the optimization problem simplifies to determining the number, location, and length of slices for each frame/packet, which is addressed in Section III.

2) Model 2: This model also assumes that erroneous packets and fragments are delivered by the network and link layers, respectively. However, we now assume a cross-layer framework, where the rate of FEC, implemented at the link layer, is controlled by the application. The link layer can apply FEC possibly at a different rate for each IP packet as requested by the application. The link layer either directly communicates the available code rates and residual fragment loss rates to the application, or provides the means for the application to compute them. An overview of this model is shown in Fig. 2.

Fig. 2. Overview of the video transport model 2. Resynchronization markers are shown in gray and protocol headers are shown in black.

Under Model 2, the application may generate multiple packets for each frame such that each packet contains the slices that will be assigned the same FEC rate. That is, the slices in a packet do not have to be consecutive, but are placed according to their desired FEC rates. For example, in Fig. 2, slices 1 and 3 are in one packet, and slices 2, 4, and 5 are in another packet. Each packet is sent through the link layer with the FEC code rate that is selected by the application layer. The RTP/UDP/IP protocol header overhead becomes important for this model, because the number of packets may vary. Robust header compression (RoHC) can be used to reduce the header size from 40 bytes to 2–4 bytes [22]. However, in the proposed system, the amount of overhead is parameterized in order not to lose generality.

Determining the optimal slicing configuration is also a problem for this model. Furthermore, there are two additional optimization problems: 1) determining the number of packets, by optimizing the protocol header tradeoff, and 2) determining the UEP rates for each slice.

For both models, the decoder receives packets that possibly contain failed link layer fragments. It decodes only the slices whose fragments are all received error free. The decoder performs error detection by searching for invalid codewords and out-of-bounds syntax elements, or by using side information from the network and link layers. Error recovery is performed by seeking to the next valid resynchronization marker. For example, for the second model, in Fig. 2, slice 1 and slice 4 each have at least one lost fragment and are not decodable. Clearly, slice 2 is decodable, and since the decoder can successfully find the resynchronization markers of slices 3 and 5, these slices are also decodable.

B. Decoder Model

We assume that the decoder concealment method is known at the encoder.
For simplicity, we assume that the decoder conceals lost regions by copying from the previous frame. When there is loss in the channel, the distortion experienced at the decoder must be estimated by the encoder. Although ROPE [23] is an accurate algorithm for estimating expected distortion at the pixel level, it is a computationally demanding method. We use a simpler method proposed in [24], [25]. This method estimates the distortion as

$$E[D_i] = (1 - p_i)\, D_i + K\, p_i\, D_i^{c} \tag{1}$$

where $E[D_i]$ is the expected decoder-side distortion for macroblock $MB_i$, $p_i$ is the probability that macroblock $MB_i$ will be lost, $D_i^{c}$ is the distortion in case of concealment, and $D_i$ is the distortion upon successful reception and decoding of $MB_i$. The computation of $p_i$ is discussed in Section III-A [see (4)]. We use intrarefresh to recover from error propagation, and we include $K$ to model the effect of error propagation within the GOP. For example, the value of $K$ can be set equal to the number of frames until the next intrarefresh.

III. RATE DISTORTION OPTIMAL SLICING OVER BIT-ERROR CHANNELS

In this section, we address the problem of determining the location and number of resynchronization markers in a frame of video, in a rate-distortion optimal manner under transport model 1.

Suppose a frame of video contains $N$ macroblocks. Let all possible slicing configurations for a frame be denoted by $C^k = \{l_1^k, \ldots, l_{S_k}^k\}$, $k = 1, \ldots, 2^{N-1}$, where $l_s^k$ is the length of the $s$th slice in terms of number of macroblocks, and $S_k$ is the number of slices in the $k$th configuration, such that $\sum_{s=1}^{S_k} l_s^k = N$ and $l_s^k \geq 1$. The number of bits is denoted by $R_f(C^k)$, $R_s(C^k)$, and $R_i(C^k)$, respectively, for the frame, the $s$th slice, and the $i$th MB when configuration $C^k$ is used. Similarly, $D_f(C^k)$, $D_s(C^k)$, and $D_i(C^k)$ denote distortion at the frame, slice, and MB level, respectively, in terms of sum of squared differences (SSD). $\Theta_i$ denotes the coding decisions such as motion vectors, reference frame indices, encoding mode, and residuals for $MB_i$. $s(i)$ denotes the index of the slice in which $MB_i$ resides when $C^k$ is used (i.e., the MB index to slice index mapping). Clearly, $R_i$ and $D_i$ are functions of $C^k$ and $\Theta_i$. We drop $C^k$ or $\Theta_i$, where suitable, for ease of notation.

An optimal slicing algorithm should compute an objective function for all possible slice configurations and select the configuration that minimizes the objective function. However, the number of possible configurations is too large to perform an exhaustive search.
Hence, in the following, we first propose a computationally simplified method to evaluate the objective function, and then we propose a dynamic programming method to reduce and manage the number of possible configurations. The proposed method takes the MB dependencies in the encoding into account, so that the tradeoff between coding efficiency loss and increased error resilience can be studied accurately.
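To make the size of this search space concrete, the following minimal sketch enumerates every slicing configuration of a small frame. It treats a configuration as an ordered list of positive slice lengths that sum to the number of MBs. The function name `slicing_configurations` is hypothetical and only serves this illustration. The count doubles with every additional MB, so a frame of N macroblocks has 2^(N-1) configurations.

```python
def slicing_configurations(n):
    """Yield every ordered list of positive slice lengths that sums to n."""
    if n == 0:
        yield []
        return
    for first in range(1, n + 1):
        # Fix the length of the first slice, then slice the remaining MBs.
        for rest in slicing_configurations(n - first):
            yield [first] + rest


configs = list(slicing_configurations(5))
print(len(configs))      # number of configurations for a 5-MB frame
print(2 ** (99 - 1))     # same count for a QCIF frame (11 x 9 = 99 MBs)
```

Even for a QCIF frame the count is far beyond what can be enumerated, which is why the rest of this section avoids an exhaustive search.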

A. Estimating the Objective Function for a Configuration

Lagrangian formulation of RD optimization is common in video encoding [26], where distortion is minimized subject to a bitrate constraint. We follow a similar approach for RD optimal slicing, and define our objective function as

$$J(C^k) = E[D_f(C^k)] + \lambda R_f(C^k) \tag{2}$$

where $E[D_f(C^k)] = \sum_{i=1}^{N} E[D_i(C^k)]$ is the expected frame distortion, $E[D_i(C^k)]$ is the expected distortion of $MB_i$ when configuration $C^k$ is used, and $\lambda$ is the Lagrange multiplier, which is chosen as proposed in [27]. Since we assume that there is no error propagation, using (1), we obtain

$$J(C^k) = \sum_{i=1}^{N} \left[ (1 - p_i)\, D_i + p_i\, D_i^{c} \right] + \lambda R_f(C^k) \tag{3}$$

where $p_i$ is the loss probability of the $i$th MB, $D_i$ is its distortion when it is received and decoded, and $D_i^{c}$ is its distortion when it is concealed. Assuming, for the sake of simplicity, previous frame copy as the concealment method, $D_i^{c}$ can easily be calculated using the previous frame. According to transport model 1, an MB is lost only if the slice it resides in is lost, which happens if at least one of the fragments in the slice is in error. Then, $p_i$ is given by

$$p_i = 1 - (1 - p_F)^{\lceil R_{s(i)} / L_F \rceil} \tag{4}$$

where $p_F$ is the loss probability of a fragment, $L_F$ is the length of a fragment in terms of bits, and $R_{s(i)}$ is the number of bits in the slice $s(i)$ that contains $MB_i$. $L_F$ is the actual load of the fragment and does not include the parity bits. Since channel coding is used, $p_F$ and $L_F$ depend on the FEC rate at the link layer. For transport model 1, the link layer controls the FEC; hence, the link layer tells the application layer $p_F$ and $L_F$, or it provides the bit-error rate, code rate, and fragment length to the application layer for indirect calculation. We assume that $p_F$ and $L_F$ are fixed for the duration of a frame, which is 100 ms for a 10-Hz video.

Computation of (4) is not easy. There are two significant issues. 1) It is impossible to estimate $p_i$ from (4) without first encoding all MBs in slice $s(i)$, since $R_{s(i)}$ depends on the MBs that have not been encoded yet; and 2) in order to truly take into account the effects of predictive coding and context adaptivity, $R_{s(i)}$ can only be exactly calculated by encoding all of the MBs in slice $s(i)$. This requires that, for each frame, every possible slice that could be generated should be encoded once to determine the corresponding rate and distortion. This is not a computationally feasible solution; hence, we propose the following method to estimate $R_i$ and $D_i$ for a given slice configuration $C^k$ for any $k$ by approximating the effect of MB dependencies, which in turn affect the context and prediction.

Predictive and context adaptive coding uses prediction from neighboring MBs to reduce the bitrate; hence, $R_i$ depends not only on $\Theta_i$, but also on $\Theta_j$, such that $MB_j$ is a preceding scan line order neighbor of $MB_i$. Furthermore, since RD optimal coding takes into account the bitrate, $\Theta_i$ also depends on $\Theta_j$. This dependency is iterative; that is, $\Theta_i$ may depend on the very first MB in slice $s(i)$. Therefore, determining the jointly optimal $\Theta$ is not computationally feasible; here, we adopt the popular Lagrange-based optimization method (see footnote 1)

$$\Theta_i^{*} = \arg\min_{\Theta_i} \left\{ E[D_i(\Theta_i)] + \lambda R_i(\Theta_i) \right\} \tag{5}$$

Footnote 1: Although this formula suggests a joint search over the parameter space, we, in fact, use an iterative optimization; first, find optimal MVs and reference frames using Lagrange minimization, then find optimal residuals by transform and quantization, and then find optimal modes using Lagrange minimization again.

The aforementioned MB dependency structure is shown in Fig. 3. The recursive nature of the dependencies makes this a very complex problem. To ease the complexity, we make the following assumption. Let $\mathcal{N}_i$ denote the neighboring macroblock set that is used in prediction and context for $MB_i$. This set is shown in gray in Fig. 3 for the black shaded MB. Then, we assume that

$$R_i = R_i(\Theta_i, \Theta_{\mathcal{N}_i}), \qquad D_i = D_i(\Theta_i, \Theta_{\mathcal{N}_i}) \tag{6}$$

where $\Theta_{\mathcal{N}_i} = \{\Theta_j : MB_j \in \mathcal{N}_i\}$. The reason for this assumption is that, for an $MB_j \notin \mathcal{N}_i$, $MB_i$ depends on $MB_j$ only indirectly, through $\mathcal{N}_i$. Hence, we assume that after the second prediction (i.e., neighborhood of neighborhood), the direct dependency is negligible. This is similar to approximating a high order Markov chain with a first order one. Therefore, if $\Theta_{\mathcal{N}_i}$ is given, $R_i$ and $D_i$ can be estimated.

Fig. 3. Scan line order dependencies utilized during the encoding of MBs. When the MB shown in black is being encoded, the gray shaded macroblocks are used for prediction and context adaptation.

Next, we discuss how $\Theta_{\mathcal{N}_i}$ can be efficiently estimated for all MBs depending on the RM position. Since dependencies stop at slice boundaries, there are five possible dependency topologies based on the RM position, which are shown in Fig. 4. Types II and V occur most frequently, at the top row and at the body of a slice, respectively. Type I occurs only once per slice, but since it uses no prediction, it is important to estimate it accurately. Types III and IV can occur at most once for a slice; hence, they have minimal impact on accuracy. We classify these dependency topologies in three groups depending on the number of elements in the prediction set $\mathcal{N}_i$: 1) no prediction, $|\mathcal{N}_i| = 0$ (type I); 2) prediction from a single MB, $|\mathcal{N}_i| = 1$ (type II); and 3) prediction from multiple MBs, $|\mathcal{N}_i| > 1$ (types III, IV, and V).

Fig. 4. Predictive coding and context adaptation is not allowed beyond resynchronization markers (RMs). Possible five cases are shown in this figure. Special cases for MBs at frame boundaries are not shown.

Based on these topologies and using (6), we propose the following estimation method. Encode the frame three times, each with a special slicing configuration, such that the dependencies for types I, II, and V shown in Fig. 4 are estimated. These configurations, $C^{I}$, $C^{II}$, and $C^{V}$, are $\{1, 1, \ldots, 1\}$ (every MB is its own slice), $\{W, W, \ldots, W\}$ (every MB row is a slice), and $\{N\}$ (the whole frame is one slice), resulting in the encoded frames $f^{I}$, $f^{II}$, and $f^{V}$, respectively, where $W$ denotes the number of MBs in a row of the picture. These three results are then used to estimate $R_i(C^k)$ and $D_i(C^k)$ based on $s(i)$ for any $k$ as follows:

$$R_i(C^k) \approx \begin{cases} R_i(C^{I}), & i = b_{s(i)} \\ R_i(C^{II}), & b_{s(i)} < i < b_{s(i)} + W \\ R_i(C^{V}), & \text{otherwise} \end{cases} \qquad D_i(C^k) \approx \begin{cases} D_i(C^{I}), & i = b_{s(i)} \\ D_i(C^{II}), & b_{s(i)} < i < b_{s(i)} + W \\ D_i(C^{V}), & \text{otherwise} \end{cases} \tag{7}$$

where $b_s$ denotes the index of the first MB in the $s$th slice; that is, $b_s = 1 + \sum_{j=1}^{s-1} l_j^k$. Hence, the results $f^{I}$, $f^{II}$, and $f^{V}$ serve as a lookup table for the rate and distortion of MBs based on the relative location of the MB with respect to the RM. An example is shown in the bottom pane of Fig. 5. The first macroblock of the slice does not depend on any other macroblock. The next row depends only on the macroblocks that have a left neighbor, and the following macroblocks depend on multiple neighbors. These correspond to the type I, II, and V topologies, respectively. We can estimate the rates and distortions from the encoded pictures shown in the upper pane. Now we can estimate (4), and finally the objective function, (3).

Fig. 5. Example of macroblock lookup process from frames $f^{I}$, $f^{II}$, and $f^{V}$ using (7) based on the slice location.

We tested the accuracy of the proposed estimation technique of (7) by selecting 30 000 slices of random length and random locations from the Foreman, Carphone, and News sequences. Fig. 6 shows the cumulative distribution of the error percentage. The $x$ axis is the error percentage, where the error is defined relative to the actual value, $|\hat{X} - X| / X$, for both distortion (SSD) and bitrate. The $y$ axis is the probability that the error does not exceed the value on the $x$ axis.
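As a rough illustration of the estimation procedure in this subsection, the sketch below evaluates the cost of one candidate slice from three per-MB lookup tables and a fragment loss probability. Everything in it is an assumption made for illustration: the function names, the argument layout, the use of a ceiling to count the fragments spanned by a slice, and the zero-based MB indexing are hypothetical and are not taken from the paper.

```python
import math


def slice_loss_probability(slice_bits, p_frag, frag_bits):
    """Probability that at least one fragment carrying the slice is lost.

    Assumption: the slice spans ceil(slice_bits / frag_bits) fragments,
    each lost independently with probability p_frag.
    """
    n_frag = math.ceil(slice_bits / frag_bits)
    return 1.0 - (1.0 - p_frag) ** n_frag


def lookup(table_i, table_ii, table_v, start, i, mbs_per_row):
    """Pick the per-MB estimate by position relative to the slice start."""
    if i == start:                   # first MB of the slice: no prediction
        return table_i[i]
    if i < start + mbs_per_row:      # rest of the first row: left neighbour only
        return table_ii[i]
    return table_v[i]                # body of the slice: full neighbourhood


def slice_cost(start, length, rate, dist, conceal, p_frag, frag_bits,
               mbs_per_row, lam):
    """Estimated Lagrangian cost of the slice covering MBs start..start+length-1.

    rate and dist are 3-tuples of per-MB tables (type I, II, and V encodings);
    conceal holds the per-MB concealment distortion.
    """
    mbs = range(start, start + length)
    bits = sum(lookup(*rate, start, i, mbs_per_row) for i in mbs)
    d_ok = sum(lookup(*dist, start, i, mbs_per_row) for i in mbs)
    d_lost = sum(conceal[i] for i in mbs)
    p = slice_loss_probability(bits, p_frag, frag_bits)
    return (1.0 - p) * d_ok + p * d_lost + lam * bits
```

Because the three tables are filled by only three encoding passes per frame, the cost of any candidate slice can then be obtained by table lookups and sums rather than by re-encoding.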

The “Types V and II” method uses $f^{V}$ and $f^{II}$ and categorizes each MB as either type V or type II depending on the location of the RM. The “Only Type V” method uses only $f^{V}$ and categorizes each MB as type V. Estimating distortion is, in general, easier because of the fixed quantization level; therefore, we focus on the bitrate plot in the left pane. The estimation error of the proposed method exceeds 2% with a probability of only 0.05, compared to a probability of 0.2 for the “Types V and II” method. As expected, the “Only Type V” method has very poor accuracy.

Finally, since each slice is independently decodable and concealable, we note that the objective function has the property of additive combination of the costs of slices. This has an important consequence for a dynamic programming solution, which we describe in the next section.

B. Search via Dynamic Programming

Even with the simplified estimation of the objective function, the exhaustive search is infeasible because it requires determining and comparing the cost functions of all of the possible configurations. The total number of possible configurations for a frame with $N$ macroblocks is $2^{N-1}$ (see Appendix). Hence, for a QCIF frame, there are $2^{98}$ possible configurations. In this section, we propose a dynamic programming method to manage the number of configurations. We use a specially generated $N$-ary tree and a dynamic pruning based method. For a given frame, the slices are independently decodable from each other, and the objective function grows additively with each additional slice. We call these the additivity and independence properties of the objective function. Therefore

$$J(C^k) = \sum_{s=1}^{S_k} J_s(C^k) \tag{8}$$

where $J$ denotes the objective function, and $J_s$ denotes the slice cost function that is calculated for slice $s$ using configuration $C^k$

$$J_s(C^k) = E[D_s(C^k)] + \lambda R_s(C^k) \tag{9}$$

where $E[D_s(C^k)]$ denotes the expected decoded distortion of the $s$th slice under configuration $C^k$

$$R_s(C^k) = \sum_{i:\, s(i) = s} R_i(C^k) \tag{10}$$

$$E[D_s(C^k)] = (1 - p_s) \sum_{i:\, s(i) = s} D_i(C^k) + p_s \sum_{i:\, s(i) = s} D_i^{c} \tag{11}$$

and $p_s$ is the common loss probability of the MBs in slice $s$.

Fig. 6. Cumulative distribution of bitrate and distortion estimation errors.
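Because the cost in (8) is a sum of independent slice costs, the optimal configuration can also be found with a textbook prefix recurrence. The sketch below is that standard recurrence, not the tree build-and-prune procedure of this paper; it is included only to show why additivity makes the search tractable. `slice_cost(start, length)` is a hypothetical callback returning the cost of one slice, and the toy cost in the usage lines is invented for illustration.

```python
def optimal_slicing(n_mbs, slice_cost):
    """Return (best_cost, slice_lengths) minimizing the sum of slice costs.

    best[n] is the minimum cost of slicing the first n macroblocks; the last
    slice of that prefix starts at MB index m and has length n - m.
    """
    best = [0.0] + [float("inf")] * n_mbs
    back = [0] * (n_mbs + 1)
    for n in range(1, n_mbs + 1):
        for m in range(n):
            cost = best[m] + slice_cost(m, n - m)
            if cost < best[n]:
                best[n], back[n] = cost, m
    lengths, n = [], n_mbs
    while n > 0:                     # walk the back-pointers to recover slices
        lengths.append(n - back[n])
        n = back[n]
    return best[n_mbs], lengths[::-1]


# Toy cost: a fixed per-slice overhead of 3 plus a penalty growing with length.
toy = lambda start, length: 3.0 + length ** 2
print(optimal_slicing(5, toy))
```

This recurrence evaluates one slice cost per (start, length) pair, i.e., N(N+1)/2 evaluations for N macroblocks, which is of the same quadratic order as the evaluation count reported for the proposed tree search.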

Equation (7) is used to estimate (10) and (11). Note that (11) is true, because $p_i$ is the same for all of the MBs in a given slice.

The proposed search algorithm is based on a novel dynamic tree building method. The tree's top node starts with the configuration $\{N\}$ and it is at the first level, meaning that it is composed of only one slice. The configurations at the second level are composed of two slices, those at the third level of three, and so on. A node with configuration $\{l_1, \ldots, l_m\}$ at the $m$th level generates new nodes at level $m+1$ by the following rule: for each $j = 1, \ldots, l_m - 1$, generate $\{l_1, \ldots, l_{m-1}, j, l_m - j\}$. All are added to the tree as the children nodes of $\{l_1, \ldots, l_m\}$.

The tree built in this manner covers all possible slice configurations. Next, we describe how to prune the nodes while the tree is being built. We again start by inserting the configuration $\{N\}$ into the tree. Then, the following steps are applied iteratively until all of the nodes are either expanded or pruned.

1) Find the unpruned and unexpanded node $C = \{l_1, \ldots, l_m\}$ that has the biggest last slice, $l_m$.

2) Spawn children nodes using $C$ such that $C'_j = \{l_1, \ldots, l_{m-1}, j, l_m - j\}$, $j = 1, \ldots, l_m - 1$.

3) For each $C'_j$, search the tree for an unexpanded and unpruned node $C''$ with equal last slice, i.e., such that the last slice of $C''$ has length $l_m - j$. If $J(C'_j) < J(C'')$, then $C''$ is pruned; otherwise, $C'_j$ is pruned.

This method always results in $1 + N(N-1)/2$ evaluations of $J$ for a frame, and the outcome is the same as that of the exhaustive search. This can only be achieved by not expanding a node if there can be potentially matching unexpanded and unpruned nodes in the tree. This means a node is not expanded when there is a node whose last slice is bigger than this node's last slice, because the children of the node with the bigger last slice potentially match this node. This is handled in the algorithm by expanding the node with the biggest last slice at every iteration (Step 1).

The pruning in Step 3 is based on the independence and additivity properties of the objective function. Consider comparing two configurations whose last slices have the same length. Let $C' = \{l'_1, \ldots, l'_{m'-1}, l\}$ and $C'' = \{l''_1, \ldots, l''_{m''-1}, l\}$, where $l$ is the common length of the last slice. The optimal subpartitioning of the last slice will be the same for both $C'$ and $C''$. This is true because the best configuration that can be achieved by using the same slice lengths except subpartitioning only the last slice is $\{l'_1, \ldots, l'_{m'-1}, P^{*}\}$ for $C'$, and for $C''$ it is $\{l''_1, \ldots, l''_{m''-1}, P^{*}\}$, where $P^{*}$ denotes the optimal partition of the common last slice. Therefore, if $J(C') < J(C'')$, then because of the independence and additivity properties, it must be true that $J(\{l'_1, \ldots, l'_{m'-1}, P^{*}\}) < J(\{l''_1, \ldots, l''_{m''-1}, P^{*}\})$ (or vice versa). Subpartitioning the last slice is equivalent to iteratively spawning nodes from the configuration using the last slice (i.e., creating a subtree that uses the node as the parent node). Therefore, there is no point in keeping the worse performer and its children nodes in the tree.

Fig. 7 shows an example tree build and prune process for $N = 5$ with fictitious evaluations. We first start by inserting {5}, then apply the algorithm described above.

1) Biggest: {5}.
a) Spawn: {1,4}, {3,2}, {2,3}, {4,1}.
b) (No match).

2) Biggest: {1,4}.
a) Spawn: {1,1,3}, {1,2,2}, {1,3,1}.
b) $J(\{1,1,3\}) < J(\{2,3\})$, prune {2,3}.
c) $J(\{3,2\}) < J(\{1,2,2\})$, prune {1,2,2}.

3) Biggest: {1,1,3}.
a) Spawn: {1,1,1,2}, {1,1,2,1}.
b) $J(\{3,2\}) < J(\{1,1,1,2\})$, prune {1,1,1,2}.

4) Biggest: {3,2}.
a) Spawn: {3,1,1}.
b) (No match).

Fig. 7. Slice configuration tree with five levels ($N = 5$) that covers all possible slices. Depending on the cost function evaluations, the nodes over gray background are pruned and the objective function is not calculated for those nodes. The savings grow exponentially according to the formulae $2^{N-1}$ versus $1 + N(N-1)/2$. Note that $N$ is also the maximum level of the tree.

The gray shaded nodes are the children nodes that would have been spawned if there was no pruning. The complete tree has $2^{N-1}$ nodes, for all of which the objective function should be evaluated. The proposed algorithm starts with generating one root node and $N - 1$ children nodes. Then, at each iteration, one node is pruned and one node is expanded to generate new children nodes. At the $k$th iteration, $N - 1 - k$ new nodes are generated, with $k = 1, \ldots, N - 2$. Summing them all, we obtain the total number of nodes, which is equal to the required number of objective function evaluations

$$1 + \sum_{k=0}^{N-2} (N - 1 - k) = 1 + \frac{N(N-1)}{2} \tag{12}$$

IV. CROSS-LAYER RATE-DISTORTION OPTIMIZATION OF SLICING AND UNEQUAL FEC RATES

In this section, we extend the proposed dynamic programming method to the case of transport model 2, where the FEC rate is controlled by the application layer, which allows UEP in a cross-layer formulation. We reiterate that the FEC is actually performed at the link layer, while its rate is controlled by the application. Within this model, a frame can consist of multiple packets, and multiple slices can be placed in a single packet in different orders. We address the problem of optimization of packetization and selection of unequal FEC rates for each packet where the link layer bandwidth is limited.

At first, packetization and FEC rate optimization may seem unrelated. However, from the link layer perspective, the same FEC rate is applied to each IP packet. The decision of which MB goes into which packet is actually the decision of the error protection rate for that MB. Also, protocol headers from RTP/UDP and IP are appended to each packet. Therefore, the packetization freedom is limited, and there is a tradeoff between packetization, FEC freedom, and header overhead. We make two modifications to the proposed dynamic programming method: 1) we change the way the tree nodes are expanded, and 2) we define a new objective function.

First of all, notice that multiple slices can be transmitted as one packet; hence, all of the MBs in one slice have to go into the same packet. Hence, we introduce a new parameter for each slice, $c_s^k$, which denotes the FEC rate for the $s$th slice in the $k$th configuration. $M$ denotes the number of possible code rates. With the FEC rate parameter for each slice, the packetization for a given configuration is actually defined: every slice with the same FEC rate is put into the same packet. This will change the reception ordering at the decoder, but since a header is appended to each slice for independent decodability, from the decoder's point of view this is similar to handling out-of-order IP packets. Every different FEC rate in the configuration will cause an additional $H$ bits of protocol overhead. Therefore, for configuration $C^k$, the total bitrate is calculated according to the formula

$$R_f(C^k) = \sum_{s=1}^{S_k} \frac{R_s(C^k)}{c_s^k} + H \sum_{m=1}^{M} \delta_m \tag{13}$$

where $R_s(C^k)$ is calculated using (10) and $\delta_m$ is calculated according to

$$\delta_m = \begin{cases} 1, & \text{if the } m\text{th code rate is assigned to at least one slice} \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

If RoHC is used, $H$ corresponds to 2–4 bytes [22]; otherwise, $H$ corresponds to 40 bytes. Equation (13) includes the effect of protocol headers and the bitrate increase due to parity data; hence, we can optimize the tradeoff between packetization overhead and FEC rate freedom. Then we perform the following changes to the algorithm described in Section III-B.

1) The objective function is still (2), and distortion is calculated according to (11). However, bitrate is calculated according to the new method using (13).

2) Modify the MB loss probability, $p_i$, found by (4), because $p_F$ and $L_F$ are functions of the code rate of the slice in transport model 2

(15)

3) Finally, modify the node expansion algorithm: after a node spawns its children nodes, for each generated child node we apply all possible FEC rate combinations to the last two slices and evaluate the new objective function using (13) and (11). Since only the last two slices are newly generated, it is enough to perform the FEC search only over these two slices. The FEC rates that result in the smallest objective function are used as the FEC rates for the corresponding slices. This method still propagates only one node after expansion, possibly with different FEC rates.

In the end, every slice is associated with an optimal FEC rate. Since all slices with the same FEC rate are transmitted in a single packet, this algorithm effectively determines the network layer packetization by optimizing UEP while taking into account the protocol header overhead.

V. IMPLEMENTATION OF THE PROPOSED SYSTEM

There are dependent optimization problems within the proposed framework. For example, evaluation of (5) requires the


loss probability of the MB to be known beforehand, but the loss probabilities depend on the code rate that will be decided later in the algorithm. Therefore, we choose to optimize macroblock-level coding and slice- and packet-level coding separately and iteratively. First, we use an initial estimated loss rate and perform mode and MV optimization. The initial estimates for the loss rates are equal to those used for the same MB in the previous frame. After mode and MV optimization, we perform resynchronization marker insertion, packetization, and unequal FEC rate optimization. Finally, we fix the packetization, perform mode and MV optimization again, and generate the final encoded frame.

As we mentioned, for a given frame, the initial state of the optimization iterations comes from the previous frame: it is simply the loss probability of each MB as calculated for the previous frame. If the initial loss probabilities are too different from the final ones and there are no extra iterations, the quality of an MB can degrade significantly. This is best explained with the following example. Suppose that, in the previous frame, an MB was sent through a very high loss channel (i.e., with no or low error protection). When this is taken as the initial loss probability during slice and UEP optimization for the current frame, the MB will be encoded in SKIP mode when the estimation results are being calculated, no matter how bad the SKIP mode performance is compared to intra- or intermodes, because the expected distortion will be nearly independent of the selected mode. Depending on the neighboring MBs, the slice optimizer can decide to send this MB together with the neighboring MBs around it in a well FEC-protected slice. Since it is protected well, it will be received by the decoder and decoded successfully, resulting in obviously poor quality.
In order to challenge the above observations and to address the question of the number of required iterations, we devised a setup that removes the initial condition problem. When the estimation pictures are being calculated, we did not use the previous frame's per-MB loss probabilities; instead, we uniformly partitioned the loss probability space ([0,1]) into 11 levels: $\{0, 0.1, \ldots, 1.0\}$. At each level, we created three estimation pictures, making a total of 33 pictures. Then, when the bit length and expected distortion of a slice for a given code rate are being estimated, we perform a search over the 11 levels; at each level, we first estimate the number of bits and then estimate the loss probability. We then compare this probability with the loss probability assumed during generation of the corresponding estimation pictures, and we use the estimate that gives the best loss probability match.

We compared the proposed system, which uses the previous frame's loss probabilities as initial guesses, with this experimental system and saw that the experimental system provides little or no performance gain, mostly because two consecutive frames are highly correlated. Therefore, we use the system with the initial guesses coming from the previous frame, and we perform another mode and MV optimization after the slice boundaries and FEC rates are decided. Also, for stability, the initial values are not allowed to exceed a limit. The flow diagram for slicing and encoding a frame is given in Fig. 8. The loss probabilities for each macroblock are initialized to zero for the first frame.
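The level-matching search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `estimate_bits` and `loss_from_bits` are hypothetical callables standing in for the paper's bit-length and loss-probability estimators, and the absolute difference is an assumed measure of "best match".

```python
from typing import Callable, Sequence, Tuple

# 11 uniformly spaced assumed loss probabilities on [0, 1], as in the text.
LEVELS = [i / 10 for i in range(11)]


def best_matching_level(
    estimate_bits: Callable[[float], float],   # hypothetical: bits of the slice when encoded assuming this loss level
    loss_from_bits: Callable[[float], float],  # hypothetical: loss probability implied by that slice length
    levels: Sequence[float] = LEVELS,
) -> Tuple[float, float, float]:
    """Return (assumed_level, estimated_bits, implied_loss) with the smallest mismatch."""
    best = None
    for assumed in levels:
        bits = estimate_bits(assumed)      # first estimate the number of bits
        implied = loss_from_bits(bits)     # then estimate the loss probability
        mismatch = abs(implied - assumed)  # assumed matching criterion
        if best is None or mismatch < best[0]:
            best = (mismatch, assumed, bits, implied)
    return best[1], best[2], best[3]
```

The search is linear in the number of levels, so its cost is dominated by the two estimators rather than by the matching itself.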


Fig. 8. Flow diagram of determining slices and encoding a frame.
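The per-frame procedure described in Section V (initial loss estimates from the previous frame, mode and MV optimization, slicing and FEC optimization, then a second mode and MV optimization) can be sketched as below. This is only an outline under stated assumptions: `optimize_modes` and `optimize_slices_and_fec` are hypothetical callables, and the default `max_initial_loss` of 0.5 is a placeholder, since the actual limit is not recoverable from the text.

```python
from typing import Callable, List, Tuple


def encode_frame(
    prev_loss: List[float],
    optimize_modes: Callable[[List[float]], object],           # hypothetical mode/MV optimizer
    optimize_slices_and_fec: Callable[[object], List[float]],  # hypothetical slicing + FEC optimizer
    max_initial_loss: float = 0.5,                             # placeholder limit, not from the paper
) -> Tuple[object, List[float]]:
    # Initial per-MB loss estimates come from the previous frame, clamped for stability.
    init_loss = [min(p, max_initial_loss) for p in prev_loss]
    # First pass: mode and MV optimization with the initial estimates.
    first_pass = optimize_modes(init_loss)
    # Slicing, packetization, and FEC rates; yields updated per-MB loss probabilities.
    new_loss = optimize_slices_and_fec(first_pass)
    # Second pass: packetization is fixed, modes and MVs are optimized again.
    final = optimize_modes(new_loss)
    return final, new_loss
```

For the first frame, `prev_loss` would be a list of zeros, matching the initialization stated in the text; `new_loss` is carried over as `prev_loss` for the next frame.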

The proposed dynamic programming based search method takes about 8 ms on a 2.8-GHz Pentium-4 processor for a QCIF-sized frame. The bottleneck of the proposed method is the computation of the estimation pictures. The straightforward approach would be to encode each of these pictures independently. However, by allowing a slight loss of accuracy, we can avoid a separate motion vector search for each picture, and even the transformation/quantization of residuals, by reusing the MVs and the transformed/quantized residuals. For example, the motion vectors calculated for one picture can be reused during the calculation of the others. Clearly, this decreases the accuracy of the estimation, but the computational complexity will be close to that of encoding only one frame. The results presented in the next section are obtained using this complexity-limited implementation.

VI. RESULTS

We used the H.264 baseline profile for our tests, although the proposed system can be applied to other standards such as MPEG-2, H.263, and MPEG-4. The GOP structure is I-P-P-...-P with ten frames. We used 100 frames of the standard sequences Football, Foreman, Carphone, and News at QCIF resolution, assuming a 10-Hz frame rate. Sequences are temporally subsampled by dropping two frames out of every three frames. Each sequence represents a different motion complexity level: Football is very high, Foreman is high, Carphone is intermediate, and News is a low-motion sequence. Hence, we capture the behavior of the proposed system under various input characteristics.

The link layer is simulated using a modified version of the software provided by [28]. The modifications include Reed–Solomon channel coding at three rates, {5/12, 7/12, 10/12} (12/12 is also included for no channel coding), and allowing bit errors to be passed to the application layer. The link layer fragment size is 48 Bytes, and there are no retransmissions. The channel bitrate is set to 128 kbps. We study bit error rates up to 2%.
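As a rough illustration of what the listed code rates mean for the source budget, the sketch below computes the information bitrate and the average per-frame byte budget at each rate. It rests on simplifying assumptions that are not stated in the text: the 128 kbps figure is taken as the gross link rate including parity, 1 kbps is taken as 1000 bit/s, the budget is split evenly across frames at 10 Hz, and protocol headers are ignored.

```python
from fractions import Fraction

CHANNEL_KBPS = 128   # gross link rate (assumed to include parity)
FRAME_RATE_HZ = 10   # frame rate used in the experiments
CODE_RATES = [Fraction(5, 12), Fraction(7, 12), Fraction(10, 12), Fraction(12, 12)]


def payload_budget(rate: Fraction):
    """Return (information kbps, average information bytes per frame) for a code rate."""
    kbps = CHANNEL_KBPS * rate                         # fraction of the link left after parity
    bytes_per_frame = kbps * 1000 / 8 / FRAME_RATE_HZ  # bits/s -> bytes per frame
    return float(kbps), float(bytes_per_frame)


if __name__ == "__main__":
    for r in CODE_RATES:
        kbps, nbytes = payload_budget(r)
        print(f"rate {r.numerator}/{r.denominator}: {kbps:.2f} kbps, {nbytes:.1f} bytes/frame")
```

Under these assumptions the strongest code (5/12) leaves well under half of the per-frame budget available to the uncoded case, which is the cost that unequal protection tries to spend only where it matters.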
For a given experiment we used fixed channel conditions, but, as discussed, the proposed method can operate under time-varying channel conditions without any modification. All experiments are repeated 50 times, and the final average sequence pSNR is calculated over all frames of a sequence and all experiments (100 and 50, respectively, in this case) from the per-frame value of each frame in each experiment.

A. Results for Transport Model 1

In this set of experiments, we fixed the link layer FEC rate to 10/12 and compared the proposed adaptive slicing algorithm in


Fig. 9. Adaptive slicing algorithm: News and Foreman sequences shown on top and bottom panes, respectively. Two consecutive frames are shown on the left and middle, and the slice configuration found by the proposed algorithm at medium fragment loss rate (0.02) is shown on the right.



Section III to the fixed MB per slice algorithm at various bit-error rates. Fig. 9 shows an example slicing configuration found by the proposed algorithm when there is a high residual loss rate (i.e., high loss probability even after channel coding). Notice that the Foreman frame is generally partitioned into smaller slices, whereas the News frame is adaptively partitioned into slices only where the loss would have high impact and be hard to conceal.

Fig. 10 shows the resultant average pSNR plots under various channel bit-error rates. It compares the proposed system with various fixed MB per slice methods: 5, 11, 33, and 99 MB per slice. In practice, one can perform these experiments for every loss rate beforehand and determine the optimal number of MBs for the fixed MB per slice method. There are two obvious problems with this approach: 1) if the video is not available in advance (i.e., it is captured in real time), there is no way to find the exhaustive optimum, and 2) if the channel conditions are dynamic, the optimum changes with time. For experimental purposes, we pregenerated these results to determine the best possible performance of the fixed MB per slice approach under fixed channel conditions. At low loss rates, the proposed algorithm performs as well as the maximum achievable by the fixed MB per slice methods. At higher loss rates, the proposed algorithm outperforms the maximum achievable by the fixed MB per slice methods by up to 0.7 dB for News and 0.5 dB for Carphone. We also observe that the News sequence has the highest gains compared to Foreman and Football, the latter having the lowest. This is expected, since the proposed method benefits from the varying local motion dynamics of the News sequence.

We also performed experiments to compare the low-complexity and high-complexity estimation methods discussed in Section V. The results show that the change is negligible and that both methods perform nearly the same. For example, at 0.4% BER, for the Foreman sequence, the high- and low-complexity

methods resulted in 32.57 and 32.54 dB average sequence pSNR, respectively.

B. Results for Transport Model 2

In this set of experiments, we compare link layer controlled FEC with application layer controlled FEC using the proposed UEP optimization method. As stated earlier, we do not perform FEC at the application layer; we only tell the link layer at what rate it should apply FEC. We performed experiments for link layer controlled FEC rates of {5/12, 7/12, 10/12} and compared them with the proposed UEP method. The results are shown in Fig. 11. We see that, especially for the News sequence at high loss rates, the proposed system is up to 2 dB better. For the Carphone sequence, the maximum difference of 0.6 dB occurs at around 1.8% BER. For the Foreman sequence, the peak of 0.5 dB occurs at around 1.6% BER. On the other hand, we observe little improvement for the Football sequence. These results agree with expectations: the proposed UEP method is able to utilize the locally varying dynamics of a video within a frame. The especially high gain for News supports this. The low-motion (in fact, no-motion) parts of the video are less protected, and the bits conserved by doing so are effectively utilized for higher impact regions. On the other hand, the Foreman and Football sequences present similar characteristics throughout each frame; there is constant camera motion and motion of the actor or actors. Therefore, UEP does not benefit them much.

There is an interesting observation we made during these experiments: for some MBs, the encoder distortion is higher than the decoder distortion, and the final sequence pSNR of the encoder may turn out to be less than the decoder pSNR. We see that the main reason for the encoder's low quality is the effect of using (5) for


Fig. 10. Comparison of proposed adaptive slicing and fixed MB per slice method.

MB mode decision. Since the encoder knows that the packet will be lost, it encodes at the smallest bitrate possible, which is SKIP mode. But that may result in a very bad decision; hence, the encoder quality is lowered. However, after the channel process, which is the dropping of that packet, the quality would have been better. This can be viewed as achieving the goal of joint source-channel optimization, since the encoder distortion should actually be measured after the channel process. Overall, controllable code rates effectively enrich the encoder's decision space by including the FEC rates in the optimization framework. The encoder makes better decisions for the lossy channel when this is combined with the known channel model.

C. Effect of Header Compression on UEP

When RoHC is not used, the proposed UEP method has to spend extra bits on headers compared to EEP, which may be significant. In this set of experiments, we study this tradeoff by disabling RoHC, setting the header length to 40 Bytes per IP packet, and repeating the experiments for Transport Model 2. The results show a similar trend among different videos; hence,

we only present the results for the Foreman sequence in Table I. We see that, although the UEP method is slightly more affected, especially at higher loss rates, UEP still performs better.
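To give a sense of scale for the header cost without RoHC, the sketch below estimates the share of the per-frame channel budget taken by uncompressed 40-Byte headers as the number of packets per frame grows. It is a back-of-the-envelope illustration under assumptions not made in the text: the budget is the gross 128 kbps split evenly over 10 frames per second (with 1 kbps taken as 1000 bit/s), and the parity added on top of the headers is ignored.

```python
def header_overhead(
    num_packets: int,
    header_bytes: int = 40,    # uncompressed header length per IP packet, from the text
    channel_kbps: float = 128,  # channel bitrate, from the text
    frame_rate_hz: float = 10,  # frame rate, from the text
) -> float:
    """Fraction of the average per-frame channel budget spent on packet headers."""
    frame_budget_bytes = channel_kbps * 1000 / 8 / frame_rate_hz  # 1600 bytes here
    return num_packets * header_bytes / frame_budget_bytes


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"{n} packet(s) per frame: {100 * header_overhead(n):.1f}% of the frame budget")
```

Each additional packet per frame costs a fixed share of the budget under these assumptions, which is why grouping slices with the same FEC rate into one packet limits the penalty of UEP relative to EEP.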

VII. CONCLUSION

The proposed RD optimal slicing and UEP system is suitable for use in practical real-time applications, thanks to the novel dynamic programming solution and performance estimation techniques presented. The following gains in terms of average sequence pSNR have been observed:
• RD optimal slicing when packets with bit errors are delivered from the link and network layers, and with standard link layer controlled FEC provides up to 0.7-dB gain, depending on the video content;
• UEP using application layer controllable FEC at the link layer provides up to 2-dB gain, depending on the video content.

Although fixed channel models have been used in our experiments, the proposed methods only rely on the current BER of the channel and, therefore, are also suitable for use with time-varying channels.


Fig. 11. Comparison of the link layer controlled (LLC) channel coding and the proposed UEP method.

TABLE I EFFECT OF ROHC ON THE UEP RESULTS FOR THE FOREMAN SEQUENCE

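APPENDIX

We calculate the total number of possible configurations using the ordinary generating function in combinatorial theory [29, p. 295]. Let $G(x)$ denote the generating function

$$G(x) = x + x^2 + x^3 + \cdots = \frac{x}{1-x}. \qquad (16)$$

Then, the number of configurations with $k$ slices for an $N$-macroblock frame is given by the coefficient of $x^N$ in

$$G(x)^k = \frac{x^k}{(1-x)^k}. \qquad (17)$$

The coefficients of (17) are the binomial coefficients; hence, the coefficient of $x^N$ is $\binom{N-1}{k-1}$. Then the total number of configurations is

$$\sum_{k=1}^{N} \binom{N-1}{k-1} = 2^{N-1}.$$

REFERENCES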

[1] L.-A. Larzon, M. Degermark, and S. Pink, “UDP-Lite for Real Time Multimedia Applications,” Tech. Rep. HPL-IRI-1999-00, HP Labs, 1999.
[2] D. Marpe, G. Blattermann, G. Heising, and T. Wiegand, “Video compression using context-based adaptive arithmetic coding,” presented at the Int. Conf. Image Processing, Oct. 2001.
[3] D. Clark and D. Tennenhouse, “Architectural considerations for a new generation of protocols,” presented at the ACM SIGCOMM, 1990.
[4] I. E. G. Richardson and M. J. Riley, “Varying slice size to improve error tolerance of MPEG video,” Proc. SPIE, vol. 2668, pp. 365–371, 1996.
[5] R. Talluri, “Error resilient video coding in the ISO MPEG-4 standard,” IEEE Commun. Mag., vol. 1, no. 1, pp. 112–119, Jun. 1998.
[6] G. Cote, S. Shirani, and F. Kossentini, “Optimal mode selection and synchronization for robust video communications over error-prone networks,” IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp. 952–965, Jun. 2000.
[7] E. Masala, H. Yang, K. Rose, and J. C. De Martin, “Rate-distortion optimized slicing, packetization and coding for error resilient video transmission,” in Proc. IEEE Data Compression Conf., 2004, pp. 182–191.


[8] T. Wiegand, N. Farber, K. Stuhlmuller, and B. Girod, “Error-resilient video transmission using long-term memory motion-compensated prediction,” IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp. 1050–1062, 2000.
[9] U. Horn, K. Stuhlmuller, M. Link, and B. Girod, “Robust internet video transmission based on scalable coding and unequal error protection,” Image Commun., vol. 15, no. 1–2, pp. 77–94, Sep. 1999.
[10] P. A. Chou and Z. Miao, “Rate-distortion optimized streaming of packetized media,” Tech. Rep. MSR-TR-2001-35, Microsoft Research, Feb. 2001.
[11] J. Chakareski and P. A. Chou, “Application layer error correction coding for rate-distortion optimized streaming to wireless clients,” IEEE Trans. Commun., vol. 52, no. 10, pp. 1675–1687, Oct. 2004.
[12] Physical Layer Standard for cdma2000 Spread Spectrum Systems, TIA/EIA/IS-2000.2-C, May 2002.
[13] “An RTP payload format for generic forward error correction,” RFC: 2733, Dec. 1999.
[14] M. van der Schaar and H. Radha, “Unequal packet loss resilience for fine-granular-scalability video,” IEEE Trans. Multimedia, vol. 3, no. 4, pp. 381–393, Dec. 2001.
[15] M. van der Schaar and J. Meehan, “Robust transmission of MPEG-4 scalable video over 4G wireless networks,” presented at the Int. Conf. Image Processing, Sep. 2002.
[16] M. Budagavi, W. R. Heinzelman, J. Webb, and R. Talluri, “Wireless MPEG-4 video communication on DSP chips,” IEEE Signal Process. Mag., vol. 17, no. 1, pp. 36–53, Jan. 2000.
[17] O. Harmanci and A. M. Tekalp, “Optimization of H264 for low delay video communications over lossy channels,” presented at the Int. Conf. Image Processing, Oct. 2004.
[18] J.-G. Kim, J. Kim, J. Shin, and C.-C. Jay Kuo, “Coordinated packet-level protection with a corruption model for robust video transmission,” presented at the Conf. Visual Communications and Image Processing, Jan. 2001.
[19] “RTP: A transport protocol for real-time applications,” RFC: 1889, Jan. 1996.
[20] “UDP: User datagram protocol,” RFC: 768, Aug. 1980.
[21] O. Harmanci and A. M. Tekalp, “Stochastic frame buffers for rate-distortion optimized loss resilient video communications,” presented at the Int. Conf. Image Processing, Sep. 2005.
[22] “RoHC: Robust header compression,” RFC: 3095, Jul. 2001.
[23] R. Zhang, S. L. Regunathan, and K. Rose, “Video coding with optimal inter/intra-mode switching for packet loss resilience,” IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp. 966–976, Jun. 2000.
[24] G. Cote and F. Kossentini, “Optimal intra coding of blocks for robust video communication over the internet,” Signal Process.: Image Commun., vol. 15, pp. 25–34, Sep. 1999.
[25] S. Wenger and G. Cote, “Using RFC2429 and H.263+ at low to medium bitrates for low latency applications,” in Proc. Packet Video Workshop, Apr. 1999.
[26] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video compression,” IEEE Commun. Mag., vol. 15, no. 6, pp. 74–90, Nov. 1998.
[27] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, “Rate-constrained coder control and comparison of video coding standards,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688–703, Jul. 2003.
[28] G. Roth, R. Sjöberg, G. Liebl, T. Stockhammer, V. Varsa, and M. Karczewicz, Common Test Conditions for RTP/IP Over 3GPP/3GPP2, ITU-T SG16 Doc. VCEG-M77, 2001.
[29] E. A. Bender and S. G. Williamson, Foundations of Combinatorics With Applications. New York: Dover, Feb. 2006.


Oztan Harmanci (S’02) received the B.S. degree in electrical and electronics engineering from Bilkent University, Ankara, Turkey, in 2000, the M.S. degree in electrical and computer engineering from Georgia Institute of Technology, Atlanta, in 2001, and the Ph.D. degree in electrical and computer engineering from the University of Rochester, Rochester, NY, in 2006. He was with Microsoft Research, Redmond, WA, during the summers of 2004 and 2005. His research interests lie in the areas of image and video processing and watermarking, specifically error-resilient video streaming over lossy networks and video watermarking and streaming for forensic purposes.

A. Murat Tekalp (S’80–M’84–SM’91–F’03) received the M.S. and Ph.D. degrees in electrical, computer, and systems engineering from Rensselaer Polytechnic Institute (RPI), Troy, NY, in 1982 and 1984, respectively. He was with Eastman Kodak Company, Rochester, NY, from December 1984 to June 1987, and with the University of Rochester, Rochester, NY, from July 1987 to June 2005, where he was promoted to Distinguished University Professor. Since June 2001, he has been a Professor at Koç University, Istanbul, Turkey. His research interests are in the area of digital image and video processing, including video compression and streaming, motion-compensated video filtering for high-resolution, video segmentation, content-based video analysis and summarization, 3-D video processing and compression, multicamera surveillance video processing, and protection of digital content. He authored the book Digital Video Processing (Prentice-Hall, 1995) and holds seven U.S. patents. His group contributed technology to the ISO/IEC MPEG-4 and MPEG-7 standards. Dr. Tekalp was named Distinguished Lecturer by the IEEE Signal Processing Society in 1998, and awarded a Fulbright Senior Scholarship in 1999. He received the TUBITAK Science Award (highest scientific award in Turkey) in 2004. He chaired the IEEE Signal Processing Society Technical Committee on Image and Multidimensional Signal Processing (January 1996 to December 1997). He served as an Associate Editor for the IEEE TRANSACTIONS ON SIGNAL PROCESSING (1990 to 1992) and the IEEE TRANSACTIONS ON IMAGE PROCESSING (1994 to 1996), and the Kluwer journal Multidimensional Systems and Signal Processing (1994 to 2002). He was an Area Editor for the Academic Press Journal Graphical Models and Image Processing (1995 to 1998). He was also on the Editorial Board of the Academic Press journal Visual Communication and Image Representation (1995 to 2002).
He was appointed as the Special Sessions Chair for the 1995 IEEE International Conference on Image Processing, the Technical Program Co-Chair for IEEE ICASSP 2000 in Istanbul, the General Chair of IEEE International Conference on Image Processing (ICIP) in Rochester in 2002, and Technical Program Co-Chair of EUSIPCO 2005 in Antalya, Turkey. He is the Founder and First Chairman of the Rochester Chapter of the IEEE Signal Processing Society. He was elected as the Chair of the Rochester Section of IEEE for 1994 to 1995. At present, he is the Editor-in-Chief of the EURASIP journal Signal Processing: Image Communication (Elsevier). He is serving as the Chairman of the Electronics and Informatics Group of the Turkish Science and Technology Foundation (TUBITAK) and as an independent expert to review projects for the European Commission.