Hidden Markov Modeling for network communication channels

Sandrine Vaton
ENST Bretagne, BP 832, 29285 Brest, France
[email protected]

Kavé Salamatian
Laboratoire LIP6-CNRS UMR 7606, Université Pierre et Marie Curie, 8 rue du Capitaine Scott, 75015 Paris, France
[email protected]

ABSTRACT


In this paper we perform the statistical analysis of an Internet communication channel. Our study is based on a Hidden Markov Model (HMM). The channel switches between different states; to each state corresponds the probability that a packet sent by the transmitter will be lost. The transition between the different states of the channel is governed by a Markov chain; this Markov chain is not observed directly, but the received packet flow provides some probabilistic information about the current state of the channel, as well as some information about the parameters of the model. In this paper we detail some useful algorithms for the estimation of the channel parameters, and for making inference about the state of the channel. We discuss the relevance of the Markov model of the channel; we also discuss how many states are required to pertinently model a real communication channel.


Keywords: Hidden Markov Model, Internet modelling, active measurement, Expectation-Maximization, network state estimation.

1. INTRODUCTION

Network state is not a concrete and well defined notion. In fact, network state is an abstract variable, representative of the effect of all concurrent flows on one application flow. As applications have no direct access to information on router loads and characteristics, the network state is a hidden variable that can be perceived by an application only through its effects on its data flow. Packets can thus be viewed as probes that give incomplete and delayed information about the path they have crossed. Active IP performance measurement tools generate such probe traffic to estimate the overall performance of an Internet path. Adaptive applications have to evaluate the network conditions from the information they can gather from the received packet flows, and adapt to them.

For example, in TCP, the network state is a binary value (congested or not congested) and is estimated by monitoring packet losses: a single loss observed at time T is interpreted by the TCP congestion avoidance mechanism as a congested state of the network at time T. Even in such a case, the network state at time T is a hidden variable estimated by a delayed observation of a single packet loss.

Another domain where modeling is extremely important is the field of adaptive multimedia network applications. Network heterogeneity is a fact of life in today's Internet. Indeed, the current Internet provides users with only a single class of best effort service, which does not promise anything in terms of guaranteed performance. Measurements show persistent problems with multimedia quality caused by congestion in the network, and thus by the impact of traffic in the network on any application stream. This impact is felt through high loss rates, varying delay, etc. The decentralized nature of the Internet makes it very likely that it will continue to be as unpredictable in the future.

The unpredictable nature of network fluctuations makes it difficult for applications to determine in advance the network performance between a pair of Internet hosts. So, in order to always provide the best possible quality, applications have evolved to adapt dynamically to their current network environment. The TCP congestion avoidance mechanism is a good example of such an adaptive mechanism, which aims to adapt the rate of transmission to the state of the network.

Recently, much research effort has been spent on the performance analysis of IP networks. These efforts resulted in the IPPM group of the IETF defining different end-to-end performance metrics [10]. However, these metrics are meaningless without any modeling to translate the measured metric fluctuations into network states. The model can be seen here as an "Occam's razor", describing in a compressed and concise manner an IPPM metrics trace gathered on an Internet path. This paper presents the EM (Expectation-Maximization) algorithm as a generic approach for making inferences about unseen parameters based on observed IPPM metrics.



Two kinds of information can be extracted from the received packet flow: the packet loss process and the packet delay. Due to non-synchronized clocks between receivers and senders, the reliable measurement of packet delay is difficult. Strict synchronization of two entities connected by a varying delay link can prove to be impossible without access to an external universal time reference such as that provided by GPS (Global Positioning System) [2]. In [9], complex mechanisms that converge asymptotically to the synchronization of two clocks are developed. But GPS acquisition cards have not been widely deployed in the Internet, making delay measurements unreliable. Moreover, delay is a continuous variable, making state estimation based on it even more complex.


In this paper, we present an analysis of the end-to-end loss process and use it to make inferences about the state of the network as seen by the application.

We will develop a two-step network state estimation procedure: first, a model calibration step that chooses a number of states and calibrates an HMM for the loss process, and a second step that uses this HMM to estimate the actual (or past) state of the network by observing the sequence of lost packets. The paper is organized as follows. We will first introduce our underlying model and HMMs. After that, we will define an estimator for the number of states of an HMM. In section 4, we will describe the EM (Expectation-Maximization) algorithm in the context of the calibration of HMMs for network channel modeling. Next, we will investigate state sequence inference using observed lost packet sequences. We will continue by illustrating the concepts developed in the paper with real traces collected from the Internet. A final discussion and some conclusions are presented in section 7.

2. MODELING OF NETWORK CHANNELS

A clear explanation of the underlying network model as seen by the application is essential in order to remove ambiguities. In this work, the network is modelled as a valve that can be passing or blocking at any time. The state (passing or blocking) of the valve at time $t$ is $S(t)$. Our application generates packet $i$ at time $T_i$ and samples the state of the valve $S(T_i')$ at time $T_i'$, where $T_i' - T_i$ represents the delay needed to reach the bottleneck where packets are lost. We will suppose in what follows that the sampling time is slid back to the sending time, $T_i' = T_i$. Indeed, a more complete analysis would require knowledge of the distribution of the sampling delay $T_i' - T_i$.

The state estimation process tries to estimate statistical characteristics of the open/close process governing the valve, based on the observed packet loss process, which is the sampled process $S(T_i)$. We make the essential hypothesis that the open/close process has reached a stable and stationary distribution. This stationary distribution is, of course, a function of the competing Internet traffic and of the traffic generated by our application. However, active probing IP performance measurement tools try to maintain low traffic to avoid disturbing the open/close stationary distribution. In the general case, we suppose that a loss trace contains $K$ samples $\{S(T_i)\}$, $i = 1, \ldots, K$. Under ergodic and stationary hypotheses for the open/close stochastic process, it is possible to estimate the statistical characteristics of $S(t)$ from the samples. For example, the temporal mean $\hat{p} = \frac{1}{K} \sum_{i=1}^{K} S(T_i)$ is an unbiased estimator of the open/close process mean $p = E\{S(T)\}$, and the variance of this estimator is $\mathrm{var}\{\hat{p}\} = \frac{1}{K^2} \sum_{i,j=1}^{K} R(T_i - T_j)$, where $R(\tau) = E\{(S(T + \tau) - p)(S(T) - p)\}$ is the autocorrelation of the open/close process.
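As a concrete illustration, the following sketch evaluates this variance formula numerically. The function name, the probe schedule, and the exponentially decaying autocovariance are hypothetical choices for illustration, not taken from the paper.

    import numpy as np

    def probe_mean_variance(times, R):
        # var{p_hat} = (1/K^2) * sum_{i,j} R(T_i - T_j) for probes sent at 'times'
        times = np.asarray(times, dtype=float)
        K = len(times)
        return R(times[:, None] - times[None, :]).sum() / K ** 2

    # Toy autocovariance of the open/close process (assumed, for illustration):
    p, rho = 0.1, 0.9
    R = lambda tau: p * (1 - p) * rho ** np.abs(tau)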

Actually, active probing techniques attempt to extend inferences made on the loss process seen by the probing flow to other competing flows. The previous remark helps determine conditions for this extension. All competing flows are governed by the same open/close process; therefore, under the stationarity hypothesis for the open/close process, the temporal mean estimators of all flows will indeed converge to the same value. However, the variance of this estimator will largely depend on the autocorrelation function of the open/close process and on the dynamics of the particular flow. For example, TCP flows that send a bunch of closely spaced packets on the window opening will experience a higher estimate variance than competing UDP flows that send packets more regularly. This remark does not mean that a UDP flow will see lower loss rates than a TCP flow; it only says that if a UDP and a TCP flow are competing, the TCP flow may see a larger fluctuation of its loss rate than the more regularly spaced UDP flow.
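Reusing probe_mean_variance and the toy autocovariance R from the sketch above, one can compare regularly spaced (UDP-like) probes with bursty (TCP-like) probes of the same count; the specific schedules below are illustrative assumptions only.

    # 100 probes each: evenly spaced vs. sent in bursts of 5 back-to-back packets
    regular = np.arange(100) * 10                               # one probe every 10 units
    bursty = (np.arange(100) // 5) * 50 + (np.arange(100) % 5)  # bursts every 50 units
    print(probe_mean_variance(regular, R))  # lower variance
    print(probe_mean_variance(bursty, R))   # higher variance, same mean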

If delay information is available, one can extend the above model to take it into account. In this case, the network can be modeled as a valve governed by an open/close process, together with a delay element. The state estimation procedure will then try to estimate the joint open/close and delay process from the observed measures. However, recent empirical studies have shown that delays and loss rates are statistically independent [9, 1]. This is mainly due to the fact that losses and delays do not occur at the same location in the network. Based on this empirical observation, the joint estimation problem can be split into two independent estimation problems, one for the loss process and the other for the delay. This separation is in fact used in the TCP context: congestion detection, which is based on the loss process, is independent of delay estimation, which is based on exponential smoothing of the RTT. In this paper we only consider the loss process modeling and estimation problem. Delay modeling can use the same approach, but it is more complex as the delay is a continuous variable.

Previous empirical measurements have shown that the distribution of the number of consecutive lost packets is approximately geometric, or, rather, that the head of the distribution is geometric, and that the tail includes a few events which appear not to have any specific structure (and which might contribute significantly to the overall loss rate, since a single event in the tail indicates a loss burst with a large number of lost packets) [13, 3, 1, 15, 16]. Unfortunately, this result says little about the characteristics of the loss process, because it only applies to the marginal distribution of the process and says nothing about its correlation structure.

Most previous works have tried to model the loss sequence by a non-hidden Markov model. [16] proposes a classical non-hidden Markov model of the Internet channel with $2^k$ states, with $k$ in the range $[0, 6]$. In [16], the observed loss value at time $t$, $Y(t)$, depends on $(Y(t-1), Y(t-2), \ldots, Y(t-k))$. This leads, from a loss sequence of length $T$, to a Markov chain model with $2^k$ states and a finite memory (at most $k$ packets) in the observed loss process, which can be estimated with a complexity on the order of $O(T)$. Unlike these models, HMMs (Hidden Markov Models) exhibit infinite dependences in the observed process, even with only 2 states. This strongly reduces the number of states needed to describe a given loss process. We will show further that in almost all cases studied fewer than 4 states are sufficient, where [16] used up to $2^6$ states. Moreover, a non-hidden model cannot be used for state estimation and is thus only descriptive. The smaller number of states of HMMs should be traded off against the higher complexity induced by the estimation of the hidden parameters. The complexity of HMM estimation is on the order of $O(K^2 T)$, where $K$ is the number of states of the HMM.
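The counting estimator for such a $2^k$-state non-hidden model is simple enough to sketch. The function below (a hypothetical name, assuming a 0/1 loss trace) estimates $\mathrm{Prob}\{Y(t) = 1 \mid Y(t-1), \ldots, Y(t-k)\}$ in one $O(T)$ pass; it is an illustration of the model of [16], not that paper's code.

    import numpy as np

    def fit_kth_order_markov(x, k):
        # counts[s, b]: occurrences of outcome b after context s, where s
        # encodes the previous k outcomes as a k-bit integer
        counts = np.zeros((2 ** k, 2))
        ctx, mask = 0, (1 << k) - 1
        for t, b in enumerate(x):
            if t >= k:
                counts[ctx, int(b)] += 1
            ctx = ((ctx << 1) | int(b)) & mask   # rolling k-bit context
        totals = counts.sum(axis=1)
        # loss probability per context; contexts never observed are NaN
        return np.where(totals > 0, counts[:, 1] / np.maximum(totals, 1), np.nan)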

In this paper, we study more sophisticated models based on HMMs. HMMs have proved to be useful in a number of applications. One can cite different examples of their application in the field of telecommunications, including speech recognition [11], traffic characterization [14], source coding, channel coding, and equalization.

We suppose in what follows that the open/close process in our model of the network follows a Markov chain. This is a very weak hypothesis, as a Markov chain of arbitrarily large order can be used to model an extremely wide range of processes [5]. We consider an HMM for the channel between a transmitter and a receiver on the Internet. The intuition behind our approach is that at regular sampling times, the open/close process passes through different (hidden) states which are reflected in the observable loss rate fluctuations.

The loss process $X = (X_t)_{t=1}^{T}$ is defined by $X_t = 0$ if the $t$-th packet reaches its destination and $X_t = 1$ if the packet is lost. The channel 'switches' between $K$ different states following a Markov chain $Y = \{Y_t\}$ with state space $S = \{1, 2, \ldots, K\}$ and stochastic transition matrix $\Gamma = (\Gamma_{ij})_{i,j=1}^{K}$, where $\Gamma_{ij} = \mathrm{Prob}\{Y_{t+1} = j \mid Y_t = i\}$. The Markov chain is homogeneous and ergodic, and the state distribution converges to a stationary distribution $\pi$ which is the solution of the following set of equations: $\pi \cdot \Gamma = \pi$ and $\pi \cdot [1, 1, \ldots, 1]^T = 1$.
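A minimal sketch for obtaining $\pi$ from a given transition matrix; the function name and the least-squares formulation are implementation choices, not from the paper.

    import numpy as np

    def stationary_distribution(Gamma):
        # solve pi @ Gamma = pi subject to sum(pi) = 1, as a least-squares system
        K = Gamma.shape[0]
        A = np.vstack([Gamma.T - np.eye(K), np.ones(K)])
        b = np.zeros(K + 1)
        b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return pi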

In each of these $K$ states, the channel is uniformly blocking or passing. This defines a probability that a packet is lost at time $t$ while the channel is in state $i$ ($1 \le i \le K$); these probabilities are grouped in the observation matrix $P = (p_i)_{i=1}^{K}$, where $p_i = \mathrm{Prob}\{X_t = 1 \mid Y_t = i\}$. In what follows we denote a sub-vector $(x_i, x_{i+1}, \ldots, x_j)$, $i < j$, by $x_i^j$.

To model the network channel by an HMM, we need a procedure for estimating the number of states ($K$), the transition matrix ($\Gamma$) and the observation matrix ($P$) of the Markov chain. Assuming that the number of states is $K$, we denote the set of parameters by $\theta = (\Gamma, P)$. We will study these estimation problems in the following sections.
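For experimentation it is convenient to draw synthetic loss traces from a channel with known parameters. The generator below is a sketch under the assumption of a uniformly drawn initial state; it is not part of the paper.

    import numpy as np

    def simulate_channel(Gamma, p, T, seed=0):
        # sample a state path Y and a 0/1 loss trace X from the HMM (Gamma, P)
        rng = np.random.default_rng(seed)
        K = Gamma.shape[0]
        y = np.zeros(T, dtype=int)
        x = np.zeros(T, dtype=int)
        y[0] = rng.integers(K)                      # assumed uniform initial state
        for t in range(T):
            if t > 0:
                y[t] = rng.choice(K, p=Gamma[y[t - 1]])
            x[t] = rng.random() < p[y[t]]           # packet t lost with prob p_i
        return y, x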

3. ESTIMATION OF THE NUMBER OF STATES OF A HIDDEN MARKOV MODEL

In order to choose a correct number of states for the HMM, we need a consistent estimator, based on the empirically observed loss process $X$. In this paper, we use an estimator developed in [18], which is based on the notions of entropy and data compression. The entropy of a discrete random variable $X$ is defined as:


$$H(X) = -\sum_x \mathrm{Prob}\{X = x\} \log \mathrm{Prob}\{X = x\}.$$

The entropy of a stationary stochastic process $X = \{X_n\}$ is defined as:

$$H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$$

where $H(X_1, \ldots, X_n)$ is the entropy of the joint random variable $(X_1, \ldots, X_n)$. We know from the AEP (Asymptotic Equipartition Property) [4] that if the stochastic process $X$ is finite-valued, stationary and ergodic, then

$$-\frac{1}{n} \log \mathrm{Prob}\{X = x\} \to H(X) \quad \text{with probability } 1.$$

On the other hand, the compression theorem states that the normalized optimal code length $l_n(x)$ of a compression scheme converges toward the entropy of the stochastic process $X$:


$$\frac{1}{n} l_n(x) \to H(X).$$

Based on the preceding preliminaries, we define the following estimator of the number of states $K^*$ of an HMM:

$$K^* = \arg\min_j \left\{ j \;:\; \left| -\frac{1}{n} \log \max_\theta \mathrm{Prob}_{j,\theta}\{X = x\} - H(X) \right| < \epsilon_n \right\} \qquad (1)$$

where $\max_\theta \mathrm{Prob}_{j,\theta}\{X = x\}$ is the probability of observing the output sequence $x$ for an HMM with $j$ states under the parameter $\theta$ that maximizes the likelihood of observing the output sequence $x$, and $\epsilon_n$ is an arbitrary sequence converging towards 0.


The intuition behind this estimator is that a good guess for the number of states will result in an optimal code length $l_n$ close to the entropy of the stochastic process. This estimator has been shown in [18] to be consistent and asymptotically optimal.


Two main problems remain with the above estimator: how to choose the parameter $\theta$ that maximizes the likelihood of the observed sequence $x$, and how to estimate the entropy of the stochastic process. The first problem will be addressed in the following sections. The second problem is resolved using a universal compression encoding such as Lempel-Ziv [17] or an arithmetic coding [8]. The main idea is that if the observed sequence $x$ is sufficiently long, then the mean code length of an encoding obtained by a universal compression scheme will converge toward the entropy of the stochastic process.



Using this observation, and assuming that we have sufficient data for the mean code length to converge, the estimator is rewritten as:


$$K^* = \arg\min_j \left\{ j \;:\; \left| -\frac{1}{n} \log \max_\theta \mathrm{Prob}_{j,\theta}\{X = x\} - \frac{1}{n} l_n(x) \right| < \epsilon_n \right\} \qquad (2)$$

where $l_n(x)$ represents the length of the encoding of the output sequence $x$.





However, the convergence of the mean code length can be slow, and a long sequence may be needed to attain a stable entropy estimate. Classical results in the theory of Lempel-Ziv coding show that the estimation error made by replacing the entropy by the mean code length of a Lempel-Ziv encoding is of order $O\left(\frac{\log \log n}{\log n}\right)$ [17], whereas this error is of order $O\left(\frac{\log n}{n}\right)$ for arithmetic coding [8].
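In practice the entropy term of Eq. (2) can be approximated with any off-the-shelf universal compressor. The sketch below uses zlib's DEFLATE (an LZ77-based coder) as a stand-in for the Lempel-Ziv coder of [17]; per the convergence remarks above, it gives only a rough, upward-biased estimate on short traces.

    import zlib
    import numpy as np

    def entropy_estimate(x):
        # bits per symbol of a 0/1 trace, estimated by universal compression
        bits = np.asarray(x, dtype=np.uint8)
        packed = np.packbits(bits).tobytes()          # 8 symbols per byte
        return 8 * len(zlib.compress(packed, 9)) / len(bits)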


Another estimator of the number of states of an HMM is proposed in [6]. In contrast to the estimator of Eq. (1), which needs an estimate of the likelihood, the estimator proposed in [6] does not. However, in our case, we already obtain this estimate as a byproduct of the estimation of the HMM parameters. This estimation is described in the following section.
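Putting Eq. (2) together requires the compression-based entropy estimate above and a maximum-likelihood fit for each candidate number of states. The sketch below assumes the fit_hmm_em routine sketched in section 4; the threshold value is an arbitrary illustration, not from the paper.

    import numpy as np

    def estimate_num_states(x, j_max=8, eps=0.05):
        # smallest j whose best HMM log-likelihood rate matches the
        # compression-based entropy estimate to within eps, cf. Eq. (2)
        n = len(x)
        h_hat = entropy_estimate(x)                      # bits per symbol
        for j in range(1, j_max + 1):
            *_, loglik = fit_hmm_em(np.asarray(x), j)    # section 4 sketch
            rate = -loglik / (n * np.log(2.0))           # nats -> bits per symbol
            if abs(rate - h_hat) < eps:
                return j
        return j_max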

4. INFERENCE ABOUT THE PARAMETERS OF THE CHANNEL: THE EM ALGORITHM

As the states of the Markov chain are not observable, we must estimate the set of parameters $\theta$ using the information hidden in the loss process. The estimation criterion is the Maximum Likelihood criterion: it chooses the parameters $\theta$ that maximize the probability of seeing the observed loss process $X$ given that the HMM follows these parameters. The estimation procedure then reduces to an optimization problem with a simple cost function.

The Expectation-Maximization (EM) algorithm [11] is a valuable approach for maximum likelihood parameter estimation in mixture models and in various types of Markov modulated models (Markov Modulated Poisson Processes, Markov Arrival Processes, etc.).

It has a number of advantages, especially its stability: each iteration of the EM algorithm increases the likelihood of the model, which ensures the convergence of the algorithm to a local, but not necessarily global, extremum of the likelihood. Another major advantage of the EM algorithm is its numerical complexity. A direct computation of the likelihood would require $K^T$ terms, where $K$ is the number of different states of the underlying process and $T$ is the number of observations. In contrast, the numerical complexity of the EM algorithm is of the order of $K^2 T$.

The EM algorithm involves iteratively maximizing with respect to $\theta$ a function $Q(\theta, \theta_k) = E\{L(X, Y; \theta) \mid X = x; \theta_k\}$. In this expression, $Y$ is the unobserved Markov state sequence, $X$ is the vector of observations (a probabilistic function of $Y$), and $L(X, Y; \theta)$ denotes the log-likelihood of the 'complete data' $Z = (X, Y)$ when $\theta$ is the parameter of the model. The expectation involved in the computation of $Q(\theta, \theta_k)$ is the expectation given that $X = x$ and that $\theta_k$ is the parameter of the model.

The dependence structure of the HMM (see Fig. 1) is such that the complete log-likelihood $L(X, Y; \theta)$ can be split into two terms: $L(X, Y; \theta) = L(Y; \theta) + L(X \mid Y; \theta)$.

Each iteration of the EM algorithm can consequently be decomposed into two steps:

1. Step E (Expectation): compute $Q(\theta, \theta_k) = E\{L(X, Y; \theta) \mid X = x, \theta_k\}$.

2. Step M (Maximization): maximize $Q(\theta, \theta_k)$ with respect to $\theta$: $\theta_{k+1} = \arg\max_\theta Q(\theta, \theta_k)$.

The maximization involved in the M step is analytical and does not require intensive computation; the integration involved in the E step requires the computation of a non-linear filter, which is based on the Forward-Backward (or Baum-Welch) algorithm [11]. The application of the general EM framework to the estimation of the HMM model for the loss process is non-trivial; for the sake of completeness, we present the full derivation. To simplify the notation, we use the following conventions:

$$l\{x, p\} \triangleq 1\{x = 1\}\, p + 1\{x = 0\}\, (1 - p), \qquad \log l\{x, p\} \triangleq 1\{x = 1\} \log p + 1\{x = 0\} \log(1 - p).$$

Denote by $\phi_t(i,j)$ and $\gamma_t(i)$ the a posteriori probabilities given the parameter $\theta_k$ and the observation $x$:

$$\phi_t(i,j) = E\{1\{Y_t = i, Y_{t+1} = j\} \mid X = x; \theta_k\}, \qquad \gamma_t(i) = E\{1\{Y_t = i\} \mid X = x; \theta_k\}.$$

With these notations we have:

$$L(Y; \theta) = \sum_{i,j=1}^{K} \sum_{t=1}^{T} \log \Gamma_{ij}\, \phi_t(i,j), \qquad L(X \mid Y; \theta) = \sum_{i=1}^{K} \sum_{t=1}^{T} \gamma_t(i) \log l\{X_t, p_i\}.$$

Maximizing $L(Y; \theta)$ under the constraint that $\sum_{j=1}^{K} \Gamma_{ij} = 1$ results in a new estimate $\Gamma(k+1)$ such that:

$$\Gamma_{ij}(k+1) = \frac{\sum_{t=1}^{T} \phi_t(i,j)}{\sum_{t=1}^{T} \sum_{j'=1}^{K} \phi_t(i,j')} \qquad (3)$$

Maximizing $L(X \mid Y; \theta)$ results in a new estimate $P(k+1)$ such that:

$$p_i(k+1) = \frac{\sum_{t=1}^{T} \gamma_t(i)\, 1\{X_t = 1\}}{\sum_{t=1}^{T} \gamma_t(i)}. \qquad (4)$$


Figure 1: Dependence structure of the Hidden Markov Model.

The computation of the a posteriori probabilities $\phi_t(i,j)$ and $\gamma_t(i)$ is performed by means of a Forward-Backward algorithm. Let us introduce the 'forward' filter

$$\alpha_t(i) = \mathrm{Prob}\{X_1^t, Y_t = i \mid \theta_k\}$$

and the 'backward' filter

$$\beta_t(i) = \mathrm{Prob}\{X_{t+1}^T \mid Y_t = i, \theta_k\}.$$

It results from the weak Markov property that $\alpha$ and $\beta$ can be computed recursively:

$$\alpha_{t+1}(i) = l\{x_{t+1}, p_i(k)\} \sum_{j=1}^{K} \Gamma_{ji}(k)\, \alpha_t(j) \qquad (5)$$

$$\beta_t(i) = \sum_{j=1}^{K} \Gamma_{ij}(k)\, \beta_{t+1}(j)\, l\{x_{t+1}, p_j(k)\} \qquad (6)$$

$\alpha(\cdot)$ is computed in the 'forward' direction, $\alpha_{t+1}(\cdot) = F(\alpha_t(\cdot), x_{t+1}, \theta)$, and $\beta(\cdot)$ is computed in the 'backward' direction, $\beta_t(\cdot) = G(\beta_{t+1}(\cdot), x_{t+1}, \theta)$.

Figure 2: Summary of the estimation of the Hidden Markov Model parameters.

$\phi$ and $\gamma$ can be deduced straightforwardly from the $\alpha$ and $\beta$ filters as follows:

$$\phi_t(i,j) \propto \alpha_t(i)\, \Gamma_{ij}(k)\, \beta_{t+1}(j)\, l\{X_{t+1}, p_j(k)\} \qquad (7)$$

$$\gamma_t(i) \propto \alpha_t(i)\, \beta_t(i) \qquad (8)$$

where $\propto$ denotes equality up to a multiplicative factor such that $\sum_{i,j=1}^{K} \phi_t(i,j) = 1$ and $\sum_{i=1}^{K} \gamma_t(i) = 1$. Fig. 2 illustrates the successive steps of the HMM parameter estimation process.
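The complete procedure (forward-backward filtering, Eqs. (5)-(8), followed by the re-estimations (3)-(4)) is compact enough to sketch end-to-end. The implementation below is a minimal illustration, not the authors' code: it assumes a uniform initial state distribution, applies the standard per-step scaling to avoid numerical underflow (a device not discussed in the paper), and returns the log-likelihood as a byproduct, as used in section 3.

    import numpy as np

    def fit_hmm_em(x, K, n_iter=100, seed=0):
        # EM (Baum-Welch) for a K-state HMM with 0/1 (loss) emissions.
        # Returns (Gamma, p, gamma, loglik); gamma[t, i] = Prob{Y_t = i | X = x}.
        rng = np.random.default_rng(seed)
        x = np.asarray(x)
        T = len(x)
        Gamma = rng.random((K, K))
        Gamma /= Gamma.sum(axis=1, keepdims=True)
        p = rng.random(K)
        loglik = -np.inf
        for _ in range(n_iter):
            # emission likelihoods l{x_t, p_i} of the convention above
            B = np.where(x[:, None] == 1, p[None, :], 1.0 - p[None, :])
            # scaled forward pass, Eq. (5); c[t] are the scaling factors
            alpha = np.zeros((T, K))
            c = np.zeros(T)
            alpha[0] = B[0] / K                     # assumed uniform initial law
            c[0] = alpha[0].sum()
            alpha[0] /= c[0]
            for t in range(1, T):
                alpha[t] = B[t] * (alpha[t - 1] @ Gamma)
                c[t] = alpha[t].sum()
                alpha[t] /= c[t]
            loglik = np.log(c).sum()
            # scaled backward pass, Eq. (6)
            beta = np.ones((T, K))
            for t in range(T - 2, -1, -1):
                beta[t] = (Gamma @ (B[t + 1] * beta[t + 1])) / c[t + 1]
            # a posteriori probabilities, Eqs. (7) and (8)
            gamma = alpha * beta
            gamma /= gamma.sum(axis=1, keepdims=True)
            phi = (alpha[:-1, :, None] * Gamma[None, :, :]
                   * (B[1:] * beta[1:])[:, None, :] / c[1:, None, None])
            # M step, re-estimations (3) and (4)
            Gamma = phi.sum(axis=0)
            Gamma /= Gamma.sum(axis=1, keepdims=True)
            p = (gamma * (x[:, None] == 1)).sum(axis=0) / gamma.sum(axis=0)
            p = np.clip(p, 1e-6, 1 - 1e-6)          # keep emissions away from 0/1
        return Gamma, p, gamma, loglik

On a synthetic trace drawn with the simulate_channel sketch of section 2, the fitted $p_i$ and $\Gamma$ typically recover the generating values up to a permutation of the state labels.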

5. INFERENCE ABOUT THE STATE OF THE CHANNEL

The next step involves inference about the state of the channel at each step, based on the observed output sequence $x$ and an a priori HMM for the channel. Two classical estimators can be used: the MPM (Marginal Posterior Mode) estimator, which selects the most probable state at each time instant, and the MAP (Maximum A Posteriori) estimator which, based on the Viterbi algorithm, estimates the most probable state sequence $\hat{Y}_0^T$ using the observed loss process $X_0^T$.

5.1 The Marginal Posterior Mode

The Forward-Backward algorithm produces as a byproduct the a posteriori marginal distribution $\gamma_t(i) = \mathrm{Prob}\{Y_t = i \mid X\}$, which leads to the MPM estimate, the most probable state at time $t$ given the observed sequence $X$. This estimate is the maximizer over $i$ of $\gamma_t(i)$:

$$\hat{Y}_t^{MPM} = \arg\max_i \gamma_t(i).$$
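Given the posterior marginals returned by the fit_hmm_em sketch above, the MPM state sequence is a one-liner (illustrative, with hypothetical names):

    def mpm_states(gamma):
        # most probable state at each time t, from the marginals of Eq. (8)
        return gamma.argmax(axis=1)

    # Example: Gamma, p, gamma, _ = fit_hmm_em(x, K=2); states = mpm_states(gamma)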