A Method for Building Predictive HSMMs in Interactive Environments

Víctor Rodríguez-Fernández, Antonio Gonzalez-Pardo, David Camacho

Universidad Autónoma de Madrid (UAM), 28049 Madrid, Spain. Email: [email protected]

Abstract—The study of user behavior based on his/her interactions with a system is widely extended over several fields of research. Often, it is useful to have an underlying model to generate behavioral predictions, allowing the system to automatically adapt to the user and to detect deviations from an expected behavior. In this work, we develop a general method to create, select and validate a Hidden Semi-Markov Model (HSMM) to predict behavior in interactive environments, based on previously seen interactions. The method is completely data-driven, unrestricted by any prior knowledge of the model structure, and easy to automate once some parameters have been adjusted. To test the proposed method, a multi-UAV mission simulator has been used, obtaining a model able to perform adequate predictions in terms of quality and time.

I. INTRODUCTION

When dealing with interactive environments, such as learning systems, web applications, games and simulation tools, the information given by user interactions can help to recognize and extract hidden information about the general use of the system. As a result, it becomes possible to develop techniques that learn models of user behavioral patterns from previously seen interactions. These models can be used later for creating adaptive experiences in the environment or, what is more relevant for some safety-critical systems, for predicting future interactions and detecting abnormal deviations from the expected behavior.

Traditionally, models of human behavior were studied by applied psychologists, focused on the theoretical aspects of human decision making [1]. Other researchers have attempted to derive computational models of human cognitive processes in order to emulate such processes in computer programs and to predict the consequences of high workload and time pressure [2]. Recent approaches for modeling user interactions are exclusively data-driven, and rely on pattern recognition techniques to predict future behaviors from user interface events. Some of the most popular modeling techniques in this field are tree-based models [3], Bayesian networks [4] and Markov-based models [5].

In this paper, we focus on creating predictive models of user behavior in interactive environments through the use of Hidden Semi-Markov Models (HSMMs). This type of model is powerful because it is capable of exploiting the temporal dimension of a time series. Nevertheless, these models require


the definition of several parameters that need to be adjusted to work properly, which is why they are typically used only when some prior knowledge about the resulting model structure is available [6], [7]. In this work, however, we make no assumption about the topology of the HSMM we are looking for, so we must search for the best model blindly. Based on the work of [8], [9], we have developed a complete methodology to create, select and validate an optimal HSMM automatically, given a dataset composed of timed interactions from an interactive environment. Unlike the aforementioned works, where the main goal is to study the applicability of HSMMs to a specific problem domain, here the focus is set on the generality of the method and the optimal tuning of the parameters involved in the process. No restrictions are imposed on the type of environment used, provided that it logs user actions and that the range of possible interactions is not too broad. To test this method, we make use of an interactive simulation environment in which the user must supervise the success of a surveillance mission performed by a group of Unmanned Aerial Vehicles (UAVs), while avoiding the possible incidents that may occur during the course of the mission [10]. This simulator has already been used in previous works to build simple behavioral models [11].

The rest of the paper is structured as follows: Section II provides an overview of HSMMs, giving some background on their utility and usage. In Section III we detail the methodology proposed for the model building process, from the data preprocessing to the model validation and acceptance. Section IV applies the proposed methodology to a multi-UAV simulation environment and, finally, Section V concludes the work with some discussion and future research lines.

II. BASICS ON HIDDEN SEMI-MARKOV MODELS

Hidden Markov Models (HMMs) are stochastic models mainly used for the modeling and prediction of sequences of symbols, and time series in general. They are characterized by a set of N discrete (hidden) states $S = \{S_1, \ldots, S_N\}$, which can be interpreted as phases in a cognitive process, each producing typical behaviors [12]. Traditional HMMs are based on Discrete-Time Markov Chains (DTMCs), where the input time series is divided into equidistant time steps. The term Markov in a DTMC pertains to the time dependence between


the consecutive states $s_t$, which follow a Markov process. This means that the current state $s_t$ only depends on the previous state $s_{t-1}$ and not on earlier states. The transition probabilities between the states of this chain are denoted by a square matrix A, with entries:

$$a_{ij} := P(s_{t+1} = S_j \mid s_t = S_i), \quad 1 \le i, j \le N \qquad (1)$$

This is a stochastic process, so $\sum_{j=1}^{N} a_{ij} = 1$ for all $1 \le i \le N$. As in any Markov chain, we also need to specify the set of initial state probabilities, $\Pi$, defined as:

$$\Pi_i := P(s_1 = S_i), \quad 1 \le i \le N \qquad (2)$$
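As a quick illustration of Eqs. 1 and 2 (a minimal sketch of our own, not part of the original paper, with made-up values), a state path can be sampled from a DTMC given only $\Pi$ and A:

```python
# Minimal sketch: sampling a DTMC state path from (Pi, A) as defined in
# Eqs. 1 and 2. The values below are illustrative, not taken from the paper.
import numpy as np

def sample_dtmc_path(Pi, A, T, seed=0):
    rng = np.random.default_rng(seed)
    path = [rng.choice(len(Pi), p=Pi)]                   # s_1 ~ Pi
    for _ in range(T - 1):
        path.append(rng.choice(len(Pi), p=A[path[-1]]))  # s_{t+1} ~ A[s_t]
    return path

Pi = np.array([0.6, 0.4])                  # initial probabilities, sum to 1
A = np.array([[0.7, 0.3], [0.2, 0.8]])     # rows sum to 1
print(sample_dtmc_path(Pi, A, T=10))
```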

The term hidden in an HMM indicates that the underlying states $s_t$ cannot be observed directly during the process. Instead, what we see is the emission of each state. Although the observations that an HMM emits can be either continuous or discrete, in this paper, as it is focused on human interactions, we only work with the discrete case. Let $O = \{O_1, \ldots, O_V\}$ be the set of all the V possible observation symbols in our data domain (also called the model dictionary). The emission function, b, of a given state $S_i$ is defined as a probability distribution over the set O, i.e.:

$$b_i(v) = P(o_t = O_v \mid s_t = S_i), \quad 1 \le v \le V, \quad 1 \le i \le N \qquad (3)$$

which defines the probability of emitting $O_v$ at any time step t in which the process is located in state $S_i$. Since the system can emit only one of the V possible observation symbols in each state at each time step, the function $b_i(v)$ is constrained to $\sum_{v=1}^{V} b_i(v) = 1$ for all $1 \le i \le N$. Gathering together the emission probabilities of each state into an N × V matrix, we obtain the so-called emission matrix B.

A Hidden Semi-Markov Model (HSMM) is an extension of a classical HMM in which the duration of each state is modeled explicitly by a probability distribution. Each state has a variable duration and a number of observations produced while in that state. Therefore, the self-transition probabilities of matrix A lose their meaning, and thus $a_{ii} = 0$ for all $1 \le i \le N$. The underlying DTMC is now a semi-Markov chain, since the probability of moving from state $S_i$ to state $S_j$ does not only depend on the value of the previous state, but also on the time spent in it. This improvement makes HSMMs suitable for a wider range of applications, among which speech recognition stands out, being the application for which Ferguson first defined these models [13].

Modeling state duration also makes HSMMs much more complex than traditional HMMs. The original approach proposed by Ferguson was to set a maximum state duration, M, for every state and then fit the state duration as a multinomial distribution, so that, given a state $S_i$, we have $d_i = \mathrm{Mult}(d_i^1, \ldots, d_i^M)$, where $d_i^u$ is the probability of staying u time units in state $S_i$. Thus, $D = \{d_i^u\}$ is the M × N matrix containing the duration distributions of each state. Modeling duration this way becomes inefficient when M is large, and also requires more training data due to the large

number of free parameters needed [14]. To overcome this problem, several methods have been successfully developed for estimating parametric distributions from the values of D, as is the case of the exponential family, which includes the Gamma and Gaussian distributions, among others [15].

In sum, to model a problem through the use of an HSMM we must define the set of parameters:

$$\lambda := \{A, B, \Pi, D\} \qquad (4)$$

Hence, the number of parameters of an N-state HSMM with a V-symbol dictionary and M as maximum state duration is, in the case of using parametric state durations:

$$|\lambda| = (N^2 - N) + (N \times V) + N + (|\theta_\tau| \times N), \qquad (5)$$

where each term of the equation accounts for a parameter of Eq. 4. $|\theta_\tau|$ is the size of the parameterization θ for a given state duration type τ. As an example, setting τ to "Gaussian" we would have $\theta_\tau = \{\mu, \sigma\}$ and $|\theta_\tau| = 2$. Note that $|\theta_\tau| \ll M$, hence the usefulness of applying parameterizations.

Three main computational issues need to be addressed with HSMMs:

1) Sequence Recognition: It consists of computing the probability that a given T-length observation sequence $o = o_1 o_2 \ldots o_T$ is produced by a specific model λ (see Eq. 4). This allows us to decide whether some sequence belongs to some typical pattern or not. This probability, written as $P(o \mid \lambda)$, is called the sequence (log-)likelihood and can be computed by the so-called forward-backward algorithm, which was introduced by Rabiner in [16] and optimized for HSMMs by Guédon in [17].

2) Sequence Decoding: It consists of determining, given a sequence of observation symbols o and a model λ, which corresponding sequence of hidden states $s = s_1 s_2 \ldots s_L$ is most likely to have produced it. This problem is addressed by applying the popular Viterbi algorithm, also known as decoding, in a version adapted to HSMMs [18].

3) Model Training: In the majority of applications, the set of parameters λ of Eq. 4 cannot be inferred analytically but needs to be estimated from recorded sample data. That process is commonly known as training, and the first method to address it for classical HMMs was the Baum-Welch algorithm [16]. In brief, it is a form of Expectation-Maximization (EM) which tries to maximize the likelihood of a set of observation sequences $o^{(1)} \ldots o^{(K)}$ being produced by a model λ. Formally, this algorithm computes the optimal model $\hat\lambda$ as follows:

$$\hat\lambda = \arg\max_{\lambda} \sum_{k=1}^{K} \log P(o^{(k)} \mid \lambda) \qquad (6)$$

Convergence to a local optimum is proven in [16], with complexity $O(N^2 V)$ for an N-state HMM with a V-sized dictionary. For HSMMs, Guédon extended this algorithm with complexity $O((N \times V)(N + M))$, where M is the maximum duration allowed for any state [17].
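As a concrete reading of Eqs. 4 and 5, the following minimal sketch (ours, not the authors' code) gathers the parameter set λ and counts its free parameters under a parametric duration family:

```python
# Minimal sketch of the HSMM parameter set of Eq. 4 and the free-parameter
# count of Eq. 5; a parametric duration family with |theta_tau| parameters
# per state is assumed.
from dataclasses import dataclass
import numpy as np

@dataclass
class HSMMParams:
    A: np.ndarray    # N x N transition matrix, zero diagonal, rows sum to 1
    B: np.ndarray    # N x V emission matrix, rows sum to 1
    Pi: np.ndarray   # N initial-state probabilities
    D: list          # per-state duration parameters (|theta_tau| values each)

def param_count(N, V, theta_size):
    """|lambda| = (N^2 - N) + (N * V) + N + (|theta_tau| * N)  (Eq. 5)."""
    return (N**2 - N) + (N * V) + N + (theta_size * N)

# e.g., a 9-state model over a 10-symbol dictionary with Gaussian durations:
print(param_count(N=9, V=10, theta_size=2))  # 189 free parameters
```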



III. DESCRIPTION OF THE METHOD

This section describes in detail the whole process developed to create, select, and validate an HSMM containing the behavioral patterns from a set of interaction logs extracted from an interactive environment. Three main features characterize this process:

• Unrestricted: We do not impose any specific form on the resulting model (number of states, maximum state duration, parameterization...), except that we only work with parametric-duration HSMMs, for reasons of efficiency.
• Unsupervised: Since the underlying model parameter estimation is based on an EM algorithm, the model can be built from unlabeled data.
• Automatable: Once some parameters in the process have been tuned, we can execute it from start to finish, obtaining a valid HSMM, ready to be used.

A. Data Preprocessing

The first issue to overcome when modeling with HSMMs is adapting the input data, which in our case consists of a set of interaction logs, by mapping it into sequences of observation symbols that an HSMM can process. From here on we assume that the environment used has a dictionary I listing all the possible user interactions. Any logged interaction is usually characterized by an identifier (contained in dictionary I), some parameters referring to the object with which the user is interacting, and a timestamp marking the moment when the interaction was registered.

Although HSMMs model time information, they do not consider time as a continuous variable, but as a discrete one. For that reason, every log must be sampled into equidistant time steps, where each one contains, at most, one interaction. If two or more interactions fall into the same time step, only the first will remain, and the others will be moved to consecutive intervals. One important parameter to tune when sampling the input data is the Time Step Resolution (TSR), which measures the size, in milliseconds, of every time step. Normally, to avoid collisions between close interactions, this value is set to half the minimum distance between two interactions in the logs [8]. However, for some applications where it is common to spend a long time without interacting, a low TSR may lead to too many empty steps, which decreases the efficiency of the HSMM building. Thus, setting the TSR to half the median or the mean of the time differences in the interaction logs gives better results in terms of accuracy and efficiency.

Once the input logs are sampled, we must decide what to do with the empty time steps. An interesting approach consists of replicating one symbol occurrence along all empty steps until the next occurrence. This only makes sense if every symbol represents something extensible over time. Since dictionary I is based on instant interactions, we must map it to a new dictionary O = f(I) abstracting the concepts and situations involved in any interaction in the environment.
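The following minimal sketch (our illustration, not the authors' implementation) shows one way these preprocessing steps, sampling with a given TSR, collision shifting, identifier mapping and symbol replication, could be wired together:

```python
# Illustrative preprocessing sketch: sample timestamped interactions into
# TSR-sized steps, push colliding interactions to the next free step, map
# identifiers through f: I -> O, and replicate symbols over empty steps.
def preprocess(log, tsr_ms, f, total_ms):
    """log: list of (timestamp_ms, interaction_id) pairs; f maps an
    interaction identifier to an observation symbol of dictionary O."""
    n_steps = total_ms // tsr_ms + 1
    steps = [None] * n_steps
    for ts, interaction in sorted(log):
        i = ts // tsr_ms
        while i < n_steps and steps[i] is not None:   # collision: shift forward
            i += 1
        if i < n_steps:
            steps[i] = f(interaction)
    sequence, last = [], None
    for symbol in steps:                               # symbol replication
        last = symbol if symbol is not None else last
        if last is not None:
            sequence.append(last)
    return sequence
```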


At the end of this process, the initial K simulation interaction logs will be transformed into K observation sequences within a V-sized dictionary O, suitable for HSMM modeling (see Figure 1 (1)).

B. Learning one single HSMM

As mentioned in Section II, training a single HSMM makes use of a form of Expectation-Maximization (EM) algorithm, which is proven to converge to a local optimum given an initial model specification. Since in the context of this work we have no previous information about how our resulting model will look, we must search the whole parameter space to approach a globally optimal model. Let λ be an N-state HSMM with a V-sized dictionary, τ-type state durations and M as maximum state duration. To fit and optimize λ to a given dataset, we execute the EM-based training algorithm from [17] L times using different parameter initializations, and keep the result with maximum (log-)likelihood on our training data, thus obtaining the optimized model $\hat\lambda$. Formally, we have

$$\hat\lambda = \arg\max_{\{\lambda_l \mid 1 \le l \le L\}} \sum_{k=1}^{K} \log P(o^{(k)} \mid \lambda_l) \qquad (7)$$

where K is the number of observation sequences in the dataset and $\lambda_l$ is the model obtained by executing the HSMM training algorithm the l-th time. Parameter initialization for each training execution is performed pseudorandomly, ensuring that the constraints for each parameter are fulfilled (e.g., stochasticity of matrices A and B) (see Figure 1 (2)).
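A minimal sketch of this multi-restart strategy is shown below (our illustration; `init_fn`, `fit_hsmm` and `loglik` are stand-ins for a pseudorandom initializer, the EM training routine of [17], and the forward likelihood, none of which are implemented here):

```python
# Sketch of the multi-restart training of Eq. 7. The three callables are
# assumed: init_fn(rng) builds a random valid initialization, fit_hsmm runs
# EM from it, and loglik(model, seq) returns log P(o | model).
import numpy as np

def train_best_hsmm(sequences, init_fn, fit_hsmm, loglik, L, seed=0):
    rng = np.random.default_rng(seed)
    best_model, best_ll = None, -np.inf
    for _ in range(L):
        model = fit_hsmm(sequences, init_fn(rng))
        ll = sum(loglik(model, seq) for seq in sequences)   # Eq. 7 objective
        if ll > best_ll:
            best_model, best_ll = model, ll
    return best_model, best_ll
```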

C. Model Selection

In the parameter estimation algorithm used to learn a single HSMM, the number of hidden states, N, the maximum state duration, M, and the type of parameterization used to model the state duration distributions (τ) must be known in advance, which is usual in the context of some applications [6]. However, in this work we do not rely on any prior knowledge to optimize the HSMMs, so we must train models across the entire search space of the parameters [N, M, τ] and select which of them fits our dataset best (see Figure 1 (3)).

To perform model selection, we make use of a popular statistical metric, the Bayesian Information Criterion (BIC), which tries to balance the likelihood of a model and its complexity [19]. Denote by $\hat\lambda^{(N,M,\tau)}$ an HSMM with assumed values of N, M, and τ, optimized following the process discussed in the previous section. We can compute the BIC value for that model as

$$\mathrm{BIC}(\hat\lambda^{(N,M,\tau)}) = -2 \sum_{k=1}^{K} \log P(o^{(k)} \mid \hat\lambda^{(N,M,\tau)}) + |\hat\lambda^{(N,M,\tau)}| \log N$$

where $|\hat\lambda^{(N,M,\tau)}|$ is the size of the model, given by the number of parameters of $\hat\lambda^{(N,M,\tau)}$ (see Eq. 5). Note that, as we are using parametric state durations, the only critical parameter needed to determine $|\hat\lambda^{(N,M,\tau)}|$ is the number of states of the model.


[Figure 1 (flow diagram): (1) Data preprocessing: interaction logs and the interactions dictionary I are sampled (TSR), symbols are replicated, and the mapping O = f(I) yields observation sequences over a V-sized dictionary O. (2) HSMM learning: parameter initialization and EM training. (3) Model selection: BIC-based selection over [Nmin, Nmax], [Mmin, Mmax] and the BTT, producing the selected HSMM. (4) Model validation and prediction generation: prediction length, prediction evaluation and acceptance.]

Figure 1. General scheme of the proposed methodology for the creation, selection and validation of HSMMs from a set of simulation interaction logs.

Generally, the lower the BIC score, the better the model in terms of accuracy and simplicity, which reduces model selection to a problem of minimizing the BIC function. Thus, the values of N, M, and τ can be optimized by computing

$$(\hat N, \hat M, \hat\tau) = \arg\min_{\substack{N_{min} \le N \le N_{max} \\ M_{min} \le M \le M_{max} \\ \tau \in \Theta}} \mathrm{BIC}(\hat\lambda^{(N,M,\tau)}) \qquad (8)$$

where $[N_{min}, N_{max}]$, $[M_{min}, M_{max}]$ and Θ bound the possible values over which N, M and τ can iterate, respectively.

However, looking only at the minimum BIC value to obtain $(\hat N, \hat M, \hat\tau)$ may sometimes lead us to ignore simpler models (with fewer than $\hat N$ states) that get BIC scores close to the minimum. In this work, we avoid this problem by defining a 2-step model selection (a sketch is given after this list):

1) We compute the set of possible candidates for best model, Λ, by selecting those which obtain BIC values close to the minimum:

$$\mathrm{DIFF}(\hat\lambda^{(N,M,\tau)}) = \mathrm{BIC}(\hat\lambda^{(N,M,\tau)}) - \mathrm{BIC}(\hat\lambda^{(\hat N, \hat M, \hat\tau)})$$
$$\Lambda = \left\{ \hat\lambda^{(N,M,\tau)} \mid \mathrm{DIFF}(\hat\lambda^{(N,M,\tau)}) \le BTT \right\}$$

Here, $N \in [N_{min}, N_{max}]$, $M \in [M_{min}, M_{max}]$ and $\tau \in \Theta$. The BIC Tolerance Threshold (BTT) denotes the tolerance for admitting a model as pre-selected.

2) The final selection consists of simply choosing from Λ the model with the fewest states. If there is more than one, we select the one with the lowest BIC.
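The following sketch (ours; `train` and `bic` stand in for the multi-restart training and the BIC computation described above, and are not implemented here) summarizes the grid search and the 2-step selection:

```python
# Illustrative sketch of the 2-step model selection over the grid [N, M, tau].
# `train(sequences, N, M, tau)` and `bic(model, sequences)` are assumed.
from itertools import product

def select_model(sequences, N_range, M_range, taus, btt, train, bic):
    scored = {}
    for N, M, tau in product(N_range, M_range, taus):
        model = train(sequences, N, M, tau)
        scored[(N, M, tau)] = (model, bic(model, sequences))
    best_bic = min(b for _, b in scored.values())
    # Step 1: pre-select every model whose BIC is within BTT of the minimum.
    candidates = {k: v for k, v in scored.items() if v[1] - best_bic <= btt}
    # Step 2: fewest states wins; ties are broken by the lowest BIC.
    chosen = min(candidates, key=lambda k: (k[0], candidates[k][1]))
    return candidates[chosen][0], chosen
```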

D. Model Validation

To evaluate the predictive performance of the selected HSMM after the model selection procedure, we implemented an online HSMM-based prediction method, along with a prediction score measure, in order to validate its usefulness in predicting new events in a real interaction-based system (see Figure 1 (4)).

1) HSMM-based prediction method: As we intend to use our models in real-time applications, the prediction generation method must be part of a continuous process, where the predictions vary depending on the new observations reaching the system. Specifically, we use a sliding-window strategy in which, given a current time step t and a fixed Prediction Length (PL) value (measured in time steps), our model generates a PL-length prediction in the time window [t + 1, t + PL], based on all the past observations $o_1 \ldots o_t$ acquired by the system in the interval [1, t]. Every single prediction for a given time step t is determined by an estimate of the current state of the model at that moment. Based on this, to compute a PL-length prediction window, we must perform two main steps:

1) Estimation of the current state at time t, i.e., $\bar s_t$.
2) Prediction of the next states for every time step of our PL-length time window, i.e., computation of $[\bar s_{t+1}, \ldots, \bar s_{t+PL}]$.

The estimation of $\bar s_t$ is done via the Viterbi algorithm (see Section II).



[Figure 2: a timeline of state predictions over time steps within a PL = 8 window, marking the last transition time LTT(t), the adjusted duration d(t), and the non-adjusted duration d(t').]

Figure 2. Graphical representation of the sliding-window prediction method used for generating predictions. Here, Prediction Length (PL) value is fixed to 8 time steps.

Given a sequence of past observations $o_1 o_2 \ldots o_t$, we get the most likely state path $\bar s_1 \bar s_2 \ldots \bar s_t$ that may have produced that sequence. The last element of that path, $\bar s_t$, represents the estimate of the current state.

The strategy to generate state predictions from t + 1 to t + PL consists of estimating the expected duration of the current state, $d(\bar s_t)$, iteratively. Generally, we can achieve this by computing the average duration of the current state, i.e.:

$$d(\bar s_t) = \mu^{d}_{S_i} \qquad (9)$$

assuming $\bar s_t = S_i$, where $1 \le i \le N$. Of course, this duration average is computed according to the duration type that our model uses. One might go ahead with this value of $d(\bar s_t)$ and assign, for each time step $\hat t$ in the interval $[t + 1, t + d(\bar s_t)]$, the prediction $\bar s_{\hat t} = \bar s_t = S_i$. However, it may occur that, at time step t, the process has already spent some time in the estimated state $\bar s_t$, so the expected duration value should be decremented. To obtain the moment (time step) at which the process last entered state $\bar s_t = S_i$, named the Last Transition Time (LTT), we compute

$$LTT(t) = \max \left\{ \hat t \in [0, t] \mid \bar s_{\hat t} = \bar s_t \right\} \qquad (10)$$

and then readjust the value of Eq. 9 by

$$d(\bar s_t) = \mu^{d}_{S_i} - (t - LTT(t)). \qquad (11)$$

Now we are able to assign the predictions $\bar s_{t+1} \bar s_{t+2} \ldots \bar s_{t+d(\bar s_t)}$ by replicating the value of $\bar s_t = S_i$. If every time step in [t, t + PL] has been assigned a prediction, we can move on to the next prediction window, restarting the whole process. Otherwise, when $t + d(\bar s_t) < t + PL$, we continue our prediction method by updating the current time $t = t + d(\bar s_t)$ and the estimate of the new current state $\bar s_t$ as

$$\bar s_t = \arg\max_{1 \le i \le N} a_{\bar s_t, i}, \qquad (12)$$

i.e., we find the most likely transition from $\bar s_t$ in matrix A of the model. Then we compute Eq. 9 for the new $\bar s_t$ and assign more predictions. This process is repeated iteratively until the whole prediction window [t, t + PL] has been assigned. The formalization of this sliding-window prediction method is shown in Algorithm 1, and Figure 2 shows a graphical representation of the process.


Algorithm 1 HSMM sliding-window prediction generation

Input: $o = o_1 o_2 \ldots o_T$, a test observation sequence of length T; PL, the Prediction Length (window width); λ, the selected HSMM.
Output: T-length sequence of predicted states $\bar s = \bar s_1 \bar s_2 \ldots \bar s_T$.

function PREDICTIONS(o, PL, λ)
    s̄ ← ()
    windows ← split(o, PL)
    for w in windows do
        t ← w.startTime
        s̄_t ← last(Viterbi(o_1 … o_{w.startTime}))
        while t < (w.startTime + PL) do
            if t = w.startTime then
                Calculate LTT(t)                     ▷ Eq. 10
                d(s̄_t) ← μ^d_{s̄_t} − (t − LTT(t))    ▷ Eq. 11
            else
                d(s̄_t) ← μ^d_{s̄_t}                   ▷ Eq. 9
            for i ← 1 to d(s̄_t) do
                s̄_{t+i} ← s̄_t
            t ← t + d(s̄_t)
            Calculate s̄_t                            ▷ Eq. 12
    return s̄
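A Python rendering of Algorithm 1 is sketched below (our illustration; `viterbi`, `mean_dur` and the transition matrix `A` are assumed to come from the selected model rather than being implemented here):

```python
# Sketch of Algorithm 1. viterbi(prefix) returns the most likely state path
# for a prefix of observations, mean_dur(state) the expected state duration
# (Eq. 9), and A the model's transition matrix.
import numpy as np

def sliding_window_predictions(obs, PL, viterbi, mean_dur, A):
    T = len(obs)
    pred = [None] * T                         # 0-indexed, unlike the paper
    for start in range(0, T, PL):
        path = viterbi(obs[:start + 1])       # state estimates up to time t
        t, s = start, path[-1]
        while t < min(start + PL, T):
            if t == start:
                ltt = t
                while ltt > 0 and path[ltt - 1] == s:            # Eq. 10
                    ltt -= 1
                d = max(1, int(round(mean_dur(s))) - (t - ltt))  # Eq. 11
            else:
                d = max(1, int(round(mean_dur(s))))              # Eq. 9
            pred[t:t + d] = [s] * len(pred[t:t + d])   # replicate the state
            t += d
            s = int(np.argmax(A[s]))          # most likely transition (Eq. 12)
    return pred
```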

2) Predictions Evaluation: Once Algorithm 1 has been executed for a given test observation sequence $o = o_1 o_2 \ldots o_T$, we obtain its corresponding predicted state estimation sequence $\bar s = \bar s_1 \bar s_2 \ldots \bar s_T$. Since this work is focused on unsupervised learning, test sequences are not labelled with the real state sequence that produced them. Thus, in order to score our predictions, we must develop an observation-oriented evaluation. Let $\bar o_t$ be the V-length vector containing the predicted probability of each observation symbol at time t (recall that V is the size of the model dictionary). Given a state prediction $\bar s_t = S_i$, $1 \le i \le N$, $\bar o_t$ is trivially calculated as the emission distribution of the state $\bar s_t$, i.e., the i-th row of the emission matrix B: $\bar o_t = (b_{\bar s_t, 1}, \ldots, b_{\bar s_t, V})$. In order to evaluate the model, we measure the quality of the observation predictions $\bar o_t$ against the real observed values $o_t$. Three aspects have been considered to develop a robust quality measure:

1) Comparing only the observed value at time t against the prediction at that exact time is too severe. An observation at time t should be scored against the predictions around t. To address this issue, for each $o_t \in o$, we choose the prediction $\hat{\bar o}_t \in \{\bar o_{t-\sigma(\bar s_t)}, \ldots, \bar o_{t+\sigma(\bar s_t)}\}$ that maximizes the probability of $o_t$, i.e.:

$$\hat{\bar o}_t = \arg\max_{\bar o \in \{\bar o_{t-\sigma(\bar s_t)}, \ldots, \bar o_{t+\sigma(\bar s_t)}\}} \bar o[o_t], \quad o_t \in O \qquad (13)$$

The boundaries within which we choose the best observation prediction at time t are given by $\sigma(\bar s_t)$, which is the standard deviation of the duration at state $\bar s_t$, calculated according to the state duration type of our model (denoted τ in this work).

2) The best scores are assigned when the observed value $o_t$ is the most probable in the prediction $\hat{\bar o}_t$, i.e., when $\hat{\bar o}_t[o_t] = \max \hat{\bar o}_t$. Otherwise, penalties are applied to the score, in proportion to how far we get from the highest observation probability. Based on this, we define our Prediction Score (PS) function as:

$$PS(t) = e^{-(\max \hat{\bar o}_t - \hat{\bar o}_t[o_t])} \qquad (14)$$

The closer this score function gets to 1, the more accurate the prediction. Since we use an exponential function, high scores are only assigned when the probability of the observed value in a prediction is very close to the maximum possible.

3) Correctly predicting a very common interaction should be scored lower than correctly predicting a rare event. This can be achieved by weighting the score of each observed value $o_t$ by its general frequency (F) in the whole dataset. Thus, the score for a given test sequence $o = o_1 o_2 \ldots o_T$ will be:

$$\mathrm{Score}(o) = \sum_{t=1}^{T} PS(t)\,\frac{1}{F(o_t)} \qquad (15)$$

Now that we have Eq. 15 to measure the quality of predictions for a given test sequence, we can validate our HSMM by testing it with different test sequences (cross-validation is recommended) and analyzing the resulting scores. One critical parameter for the prediction generation is the length of the prediction window, PL. If our model is able to perform valid predictions for a substantial value of PL, it will be considered accepted. A score is considered valid if Eq. 15 scores above a validation threshold (VT), defined as

$$VT = e^{-\sigma(\{F(o_v)\})}, \quad o_v \in O \qquad (16)$$

where $\sigma(\{F(o_v)\})$ is the standard deviation of the vector containing the frequency of each observation symbol in our dataset. This means that we expect better scores (on average) for models with a balanced dictionary.
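A minimal sketch of the scoring of Eqs. 13-15 follows (our illustration; `B` is the emission matrix, `sigma(state)` the duration standard deviation and `freq[o]` the dataset frequency of each symbol):

```python
# Illustrative scoring sketch for Eqs. 13-15. obs holds observed symbol
# indices; pred_states holds the predicted state for every time step.
import numpy as np

def sequence_score(obs, pred_states, B, sigma, freq):
    total = 0.0
    for t, o in enumerate(obs):
        w = int(round(sigma(pred_states[t])))
        lo, hi = max(0, t - w), min(len(obs) - 1, t + w)
        # Eq. 13: within +/- sigma steps, pick the prediction (a row of B)
        # that gives the observed symbol the highest probability.
        best = max((B[pred_states[u]] for u in range(lo, hi + 1)),
                   key=lambda row: row[o])
        total += np.exp(-(best.max() - best[o])) / freq[o]   # Eqs. 14 and 15
    return total
```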

[Figure 3: screenshot of a multi-UAV mission in Drone Watch And Rescue (DWR). Interested readers can try it at http://savier.ii.uam.es:8888/#/training/login or check [10] for more information.]

Table I. Interactions dictionary (I) in DWR.

| Interaction Name | ID | Description |
| Select UAV | 0 | Allows a user to monitor and control a specific UAV |
| Set UAV Speed | 1 | Change the speed of a selected UAV |
| Set Simulation Speed | 2 | Users can "accelerate" time during a simulation |
| Change UAV Path | 3 | Add/change/remove waypoints of any UAV |
| Select Waypoint Table | 4 | Waypoints can be edited in a waypoints table |
| Set Control Mode | 5 | Control modes in DWR manage how a user can change a UAV path (Monitor mode, Add waypoints, Manual mode) |

IV. EXPERIMENTATION

In this section, the method proposed above is applied to model the interactions extracted from a multi-UAV simulation environment. The simulator, the experimental setup and the results are described below.

A. Drone Watch & Rescue

The simulation environment used as a testbed for this work has been designed following several criteria of accessibility and usability. It is known as Drone Watch And Rescue (DWR), available at http://savier.ii.uam.es:8888/#/training/login, and its complete description can be found in [10]. DWR gamifies the concept of a multi-UAV mission (see Figure 3), challenging the operator to capture all mission targets while consuming the minimum amount of resources and, at the same time, avoiding the possible incidents that may occur during a mission (e.g., danger areas, sensor breakdowns). To avoid these incidents, an operator in DWR can perform multiple interactions to modify both the UAVs in the mission and the waypoints composing their mission plan. Table I describes the set of interactions that can be performed in DWR, along with their identifiers, used for the analysis carried out in this paper.

B. Fitting an HSMM to model behavior in DWR

1) Dataset: In this experiment, the simulation environment (DWR) was tested using Computer Engineering students of the Autonomous University of Madrid (AUM), all of them inexperienced in this type of system. All users received a brief tutorial before using the simulator, explaining the mission objectives and the basic controls. After that, they were asked to plan and monitor a test mission with 3 UAVs designed for this experiment (see Figure 3). The dataset resulting from this experiment comprises 87 distinct simulations, executed by 40 users.




Table II. Operational dictionary (O) in DWR, resulting from mapping the original interactions dictionary (see Table I) to an operand/operation dictionary suitable for symbol replication. Cells give Obs. ID (Obs. Freq).

| Operand \ Operation | Monitoring | Waypoint Handling | Replanning |
| All UAVs | 9 (0.01) | - | - |
| UAV 3 | 6 (0.1) | 7 (0.03) | 8 (0.07) |
| UAV 2 | 3 (0.19) | 4 (0.08) | 5 (0.1) |
| UAV 1 | 0 (0.22) | 1 (0.08) | 2 (0.1) |

Simulations aborted before 20 seconds or presenting fewer than 10 interactions were removed from the dataset. From the 87 simulations composing our initial dataset, only 53 were considered useful for this experiment (K = 53).

2) Experimental Setup: To run the method proposed in this work, some configuration parameters need to be adjusted, as shown in Figure 1. First of all, a Time Step Resolution (TSR) value must be fixed to sample the input simulation logs into equidistant time series. Recall that increasing this value reduces the timing accuracy of each event in the sequence, while decreasing it reduces the efficiency of learning the models. A good balance between accuracy and efficiency for this dataset is found by setting TSR = 1000 ms, which is half the median of the time differences in the whole interactions dataset.

To allow symbol replication on the input data and fill the empty time steps, we carried out a mapping from the original interactions dictionary, composed of the set of identifiers shown in Table I, to an extended one, where each observation symbol represents a general operation ("Monitoring", "Waypoint Handling" or "Replanning") applied to a specific operand in the simulator, i.e., one of the three UAVs taking part in the test mission. Not only the interaction identifier is used to perform the dictionary mapping, but also some interaction parameters, such as the UAV affected by the interaction. Although the complete mapping function is not detailed in this work for space reasons (a hypothetical sketch is given below), Table II shows the result of this mapping graphically, displaying each mapped observation symbol inside a matrix where each column represents an "operation" in the simulator and each row an "operand" or object. The resulting dictionary, O, features a total of 10 observation symbols; thus, following the notation used in this work, we have V = 10.

Regarding the model selection, we look for models with a number of states between 2 (Nmin) and 10 (Nmax), since we want our model to be simple and interpretable. The Maximum State Duration values go from 30 time steps (30 seconds, given that TSR = 1000 ms) to 120, in 30-second steps, so M = {30, 60, 90, 120}. As state duration types, we consider Gaussian, Log-norm and Poisson distributions, which are commonly used when modelling time between events. The BIC Tolerance Threshold (BTT) used in the 2-step model selection is set to 500.
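The following hypothetical sketch illustrates how an operand/operation pair could be encoded into the symbol IDs of Table II; it is our own reading of the table, not the authors' actual mapping function:

```python
# Hypothetical encoding consistent with Table II (assumption, not the paper's
# actual mapping): UAV 1/2/3 occupy symbol blocks 0-2, 3-5 and 6-8, ordered
# as Monitoring, Waypoint Handling, Replanning; "all UAVs" monitoring is 9.
OPERATION_OFFSET = {"Monitoring": 0, "Waypoint Handling": 1, "Replanning": 2}

def to_symbol(operation, uav=None):
    """uav in {1, 2, 3}, or None for an interaction affecting all UAVs."""
    if uav is None:
        return 9                       # monitoring over all UAVs
    return 3 * (uav - 1) + OPERATION_OFFSET[operation]

print(to_symbol("Waypoint Handling", uav=2))   # -> 4, as in Table II
```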


Finally, to perform the model validation, a Leave-One-Out Cross-Validation (LOOCV) is carried out, so that the selected model is trained and tested against every single observation sequence in the dataset. To explore the maximum Prediction Length for which the model performance is acceptable, we compare the Prediction Score for different PL values, from 2 to the average length of an observation sequence in our dataset, which is 183, so PL ∈ [2, 183]. Taking into account Eq. 16 and the observation frequencies shown in Table II, the validation threshold (VT) to accept the model is set to VT = 0.76.

3) Experimental Results: Table III shows the results of the model selection process. As can be seen, increasing the value of M barely alters the BIC value of the models, which indicates that state durations are generally short. In fact, regarding duration types (τ), the best values are found for the Log-norm distribution, which is well suited to fitting short-duration events. Following the two-step model selection of the proposed methodology, the best HSMM selected to fit this dataset has the values N = 9, M = 90, τ = Log-norm. More details about the selected model (transition matrix, state durations...) are omitted here for space reasons.

Finally, the results of the model validation are shown in Figure 4. As expected, the prediction performance of the model decreases as we generate predictions of increasing length. That performance drop is steeper when Prediction Score values are higher, because the score metric is exponential. According to the threshold criteria defined in Section III-D, the maximum prediction length we can use while ensuring good performance is PL = 15 time steps, which equals 15 seconds (TSR = 1000 ms). This can be seen graphically as the crossing point between the two lines in Figure 4.
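As a small illustration of this validation loop (our sketch; `loocv_scores(pl)` stands in for running Algorithm 1 and Eq. 15 over every held-out sequence):

```python
# Illustrative sketch of the PL sweep: average the LOOCV scores for each
# prediction length and keep the largest PL whose average stays above VT.
def sweep_prediction_length(pl_values, loocv_scores, vt):
    means, accepted = {}, []
    for pl in pl_values:
        scores = loocv_scores(pl)              # assumed LOOCV routine
        means[pl] = sum(scores) / len(scores)
        if means[pl] >= vt:
            accepted.append(pl)
    return means, (max(accepted) if accepted else None)
```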

[Figure 4: Prediction Score (PS), from 0.00 to 1.00, plotted against Prediction Length (PL), from 0 to 150.]

Figure 4. Model validation comparing the prediction length with the prediction score. The dashed line represents the Validation Threshold (VT).


Table III. BIC comparison among different values of N (number of states), M (maximum state duration), and τ (state duration type). The lower the BIC, the better the model. All values are ×10^4. Cells marked with * pass the first step of the model selection process; the cell marked with ** represents the selected model.

| N | Gaussian M=30 | M=60 | M=90 | M=120 | Log-norm M=30 | M=60 | M=90 | M=120 | Poisson M=30 | M=60 | M=90 | M=120 |
| 2 | 3.21 | 2.84 | 2.99 | 2.81 | 2.87 | 3.06 | 2.99 | 2.90 | 3.68 | 3.88 | 3.49 | 3.67 |
| 3 | 2.40 | 2.49 | 2.29 | 2.47 | 2.61 | 2.26 | 2.42 | 2.21 | 2.98 | 2.70 | 2.98 | 2.71 |
| 4 | 2.24 | 2.18 | 2.19 | 2.21 | 2.21 | 2.13 | 2.11 | 2.05 | 2.50 | 2.51 | 2.55 | 2.51 |
| 5 | 1.93 | 1.91 | 1.95 | 1.96 | 1.88 | 1.77 | 1.76 | 1.73 | 2.16 | 2.23 | 2.24 | 2.24 |
| 6 | 1.75 | 1.74 | 1.80 | 1.80 | 1.66 | 1.60 | 1.55 | 1.56 | 2.04 | 2.05 | 2.05 | 2.07 |
| 7 | 1.55 | 1.59 | 1.65 | 1.66 | 1.47 | 1.40 | 1.34 | 1.39 | 1.89 | 1.83 | 1.86 | 1.86 |
| 8 | 1.42 | 1.45 | 1.52 | 1.53 | 1.33 | 1.27 | 1.23 | 1.25 | 1.72 | 1.74 | 1.70 | 1.73 |
| 9 | 1.37 | 1.41 | 1.46 | 1.49 | 1.25 | 1.20 | **1.17** | *1.18* | 1.59 | 1.57 | 1.55 | 1.57 |
| 10 | 1.36 | 1.40 | 1.45 | 1.47 | 1.23 | *1.13* | *1.16* | *1.15* | 1.50 | 1.47 | 1.48 | 1.50 |

Since DWR is a fast-interaction environment and the mean duration of the simulations stored in the dataset is around 183 seconds, generating 15-second predictions with good performance means that the model can correctly predict almost 10% of a simulation's duration, which is, by far, sufficient to develop online adaptive experiences in the simulation, and also to implement systems that detect abnormal behavior.

V. CONCLUSIONS AND FUTURE WORK

This paper has presented a complete method to build valid HSMMs that generate behavioral predictions in interactive environments. We have tested this methodology using a UAV simulator, obtaining a model able to achieve high prediction scores for a reasonable prediction length. As future work, we intend to carry out a deeper study of how the parameter tuning affects the results of this method, especially the size, type and frequency of the interactions dictionary used. Also, we will improve the prediction generation method used to test the models, so that each prediction takes into account more than one state. Finally, a comparison between classical HMMs, HSMMs and other predictive models will be carried out for different environments in order to discover the best modeling technique depending on the features of the environment used.

ACKNOWLEDGMENTS

This work is supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund FEDER (TIN2014-56494-C4-4-P), Comunidad Autónoma de Madrid under project CIBERDINE S2013/ICE-3095, and Savier, an Airbus Defence & Space project (FUAM-076914 and FUAM-076915). The authors would like to acknowledge the support obtained from Airbus Defence & Space, especially from the Savier Open Innovation project members: José Insenser, Gemma Blasco and Juan Antonio Henríquez.

REFERENCES

[1] J.-P. Barthélemy and E. Mullet, "Choice basis: A model for multi-attribute preference," British Journal of Mathematical and Statistical Psychology, vol. 39, no. 1, pp. 106–124, 1986.
[2] W. D. Gray, B. E. John, and M. E. Atwood, "The precis of project ernestine or an overview of a validation of goms," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1992, pp. 307–312.
[3] Ş. Gündüz and M. T. Özsu, "A web page prediction model based on click-stream tree representation of user behavior," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003, pp. 535–540.
[4] L. Hawarah, S. Ploix, and M. Jacomino, "User behavior prediction in energy consumption in housing using bayesian networks," in Artificial Intelligence and Soft Computing. Springer, 2010, pp. 372–379.


[5] M. Awad, I. Khalil et al., "Prediction of user's web-browsing behavior: Application of markov model," Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, vol. 42, no. 4, pp. 1131–1142, 2012.
[6] J. O'Connell, F. A. Tøgersen, N. C. Friggens, P. Løvendahl, and S. Højsgaard, "Combining cattle activity and progesterone measurements using hidden semi-markov models," Journal of Agricultural, Biological, and Environmental Statistics, vol. 16, no. 1, pp. 1–16, 2011.
[7] F. Salfner and M. Malek, "Using hidden semi-markov models for effective online failure prediction," in Reliable Distributed Systems, 2007. SRDS 2007. 26th IEEE International Symposium on. IEEE, 2007, pp. 161–174.
[8] Y. Boussemart and M. L. Cummings, "Predictive models of human supervisory control behavioral patterns using hidden semi-markov models," Engineering Applications of Artificial Intelligence, vol. 24, no. 7, pp. 1252–1262, 2011.
[9] F. Cartella, J. Lemeire, L. Dimiccoli, and H. Sahli, "Hidden semi-markov models for predictive maintenance," Mathematical Problems in Engineering, 2015.
[10] V. Rodriguez-Fernandez, H. D. Menendez, and D. Camacho, "Design and development of a lightweight multi-uav simulator," in Cybernetics (CYBCONF), 2015 IEEE 2nd International Conference on. IEEE, 2015, pp. 255–260.
[11] V. Rodríguez-Fernández, A. Gonzalez-Pardo, and D. Camacho, "Modeling the behavior of unskilled users in a multi-uav simulation environment," in Intelligent Data Engineering and Automated Learning–IDEAL 2015. Springer, 2015, pp. 441–448.
[12] I. Visser, "Seven things to remember about hidden markov models: A tutorial on markovian models for time series," Journal of Mathematical Psychology, vol. 55, no. 6, pp. 403–415, 2011.
[13] J. D. Ferguson, "Variable duration models for speech," in Proceedings of the Symposium on the Application of HMMs to Text and Speech, 1980, pp. 143–179.
[14] T. V. Duong, H. H. Bui, D. Q. Phung, and S. Venkatesh, "Activity recognition and abnormality detection with the switching hidden semi-markov model," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 838–845.
[15] C. D. Mitchell and L. H. Jamieson, "Modeling duration in a hidden markov model with the exponential family," in Acoustics, Speech, and Signal Processing, 1993. ICASSP-93., 1993 IEEE International Conference on, vol. 2. IEEE, 1993, pp. 331–334.
[16] L. R. Rabiner, "A tutorial on hidden markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[17] Y. Guédon, "Estimating hidden semi-markov chains from discrete sequences," Journal of Computational and Graphical Statistics, vol. 12, no. 3, pp. 604–639, 2003.
[18] D. Burshtein, "Robust parametric modeling of durations in hidden markov models," Speech and Audio Processing, IEEE Transactions on, vol. 4, no. 3, pp. 240–242, 1996.
[19] K. P. Burnham and D. R. Anderson, "Multimodel inference understanding aic and bic in model selection," Sociological Methods & Research, vol. 33, no. 2, pp. 261–304, 2004.

