JMLR: Workshop and Conference Proceedings 1: 1–6

NIPS 2008 workshop on causality

Results on the PASCAL PROMO challenge

Ivan Markovsky
im@ecs.soton.ac.uk
School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK

Editors: Isabelle Guyon, Dominik Janzing, and Bernhard Schölkopf

Abstract

A solution to the PASCAL PROMO challenge "Simple causal effects in time series" is presented. The data is modeled as a sum of a constant-plus-sine term (the autonomous part) and a term that is a linear function of a small number of inputs. The problem of identifying such a model from the data is nonconvex in the frequency and phase parameters of the sine term and combinatorial in the number of inputs. The proposed method is suboptimal and exploits several heuristics. First, the problem is split into two phases: 1) identification of the autonomous part and 2) identification of the input-dependent part. Second, a local optimization method is used to solve the problem in the first phase. Third, $\ell_1$ regularization is used in order to find a sparse solution in the second phase.

Keywords: system identification, sparse approximation, $\ell_1$ regularization.

1. The proposed model and the corresponding identification problem

The given data in the PASCAL challenge, Causality Workbench team (2008), is in the form of two simulated vector time series $u_d(1), \ldots, u_d(T) \in \mathbb{R}^m$ and $y_d(1), \ldots, y_d(T) \in \mathbb{R}^p$, where $u_d$ is referred to as an input (cause) and $y_d$ is referred to as an output (effect). The subscript "d" denotes "data" and is used in this paper to distinguish the given time series $(u_d, y_d)$ from a general trajectory $(u, y)$. According to the problem specification, each output $y_j$ has a corresponding baseline component, denoted by $y_{\mathrm{bl},j}$, that is periodic and slowly changing, and a second component that is determined by a small number (between 1 and 50) of inputs. In order to capture the mean value and the (slow) periodicity in the data, we model the baseline as an offset-plus-sine

$$y_{\mathrm{bl},j}(t; b_j, c_j, \omega_j, \phi_j) = b_j + c_j \sin(\omega_j t + \phi_j), \tag{1}$$

where $b_j$, $c_j$, $\omega_j$, and $\phi_j$ are parameters of the model. We model the component that is determined by the inputs as a linear function of them. The matrix $A$ below corresponds to what is called the influence matrix in the problem description. With this choice, the overall model is an affine function of the inputs

$$y_j(t) = y_{\mathrm{bl},j}(t; b_j, c_j, \omega_j, \phi_j) + a_j^\top u(t), \tag{2}$$

where $a_j^\top$ denotes the $j$th row of $A$.

The prior knowledge that only a few inputs affect each output translates to the constraint that the parameter $A \in \mathbb{R}^{p \times m}$ is a sparse matrix, i.e., a matrix with many zero elements. In system theoretic terms, the input-output relation (2) is a static model that decouples into $p$ independent multi-input single-output submodels. The overall model, however, is not static, because the baseline is a response of an autonomous dynamical system (of order 3). It is linear in the parameters $b_j$, $c_j$, $A$, and nonlinear in $\omega_j$ and $\phi_j$. Define

$$Y := \begin{bmatrix} y(1) & \cdots & y(T) \end{bmatrix}, \quad Y_{\mathrm{bl}} := \begin{bmatrix} y_{\mathrm{bl}}(1) & \cdots & y_{\mathrm{bl}}(T) \end{bmatrix}, \quad \text{and} \quad U := \begin{bmatrix} u(1) & \cdots & u(T) \end{bmatrix}.$$

(":=" stands for "the left-hand side is defined to be equal to the right-hand side.") With this notation, the model (2) is written more compactly as the matrix equation

$$Y = Y_{\mathrm{bl}}(b, c, \omega, \phi) + AU, \tag{3}$$

where $b := (b_1, \ldots, b_p)$ is the vector of offsets, $c := (c_1, \ldots, c_p)$ the vector of amplitudes, $\omega := (\omega_1, \ldots, \omega_p)$ the vector of frequencies, and $\phi := (\phi_1, \ldots, \phi_p)$ the vector of phases. We define the matrices $Y_d$ and $U_d$, constructed from the given data similarly to $Y$ and $U$. An identification problem corresponding to (3) is

$$\begin{array}{ll}
\underset{b,\, c,\, \omega \,\in\, \mathbb{R}^p,\ \phi \,\in\, [-\pi, \pi]^p,\ A \,\in\, \mathbb{R}^{p \times m}}{\text{minimize}} & \|Y_d - Y_{\mathrm{bl}}(b, c, \omega, \phi) - AU_d\| \\[1ex]
\text{subject to} & \text{each row of } A \text{ has at most 50 nonzero elements.}
\end{array} \tag{4}$$

Problem (4) is not completely specified, because the norm used to measure the approximation error is not specified. Independent of the choice of norm, however, problem (4) is a nonconvex optimization problem, because the parameters $\omega$ and $\phi$ enter nonlinearly in the model (3), and the sparsity constraint is combinatorial. In addition, the size of the data $(u_d, y_d)$ in the PASCAL challenge makes (4) a medium-size optimization problem, for which computing a global solution is not feasible. In what follows, we describe heuristics for finding a suboptimal solution of problem (4).

• First, we solve (4) over $b$, $c$, $\omega$, and $\phi$, assuming that $A = 0$, i.e., we fit the data by a constant-plus-sine model without considering the effect of the inputs.

• Second, we apply the $\ell_1$ heuristic for finding a sparse solution $A$ to problem (4), with the parameters $\omega$ and $\phi$ of the $Y_{\mathrm{bl}}$ term fixed to the values computed in the first step.
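To make the model concrete, the following is a minimal sketch (in Python/NumPy) that constructs a trajectory of model (3); the dimensions and parameter values are illustrative, not taken from the challenge data.

```python
import numpy as np

# Illustrative dimensions: p outputs, m inputs, T samples (not the challenge data).
p, m, T = 3, 5, 1000
t = np.arange(1, T + 1)

rng = np.random.default_rng(0)
b = rng.standard_normal(p)                  # offsets b_j
c = rng.standard_normal(p)                  # amplitudes c_j
omega = np.full(p, 12 * np.pi / T)          # frequencies omega_j
phi = rng.uniform(-np.pi, np.pi, p)         # phases phi_j

A = np.zeros((p, m))                        # sparse influence matrix
A[:, :2] = rng.standard_normal((p, 2))      # only the first 2 inputs act

U = rng.standard_normal((m, T))             # inputs, stacked as rows
Ybl = b[:, None] + c[:, None] * np.sin(np.outer(omega, t) + phi[:, None])
Y = Ybl + A @ U                             # model (3): Y = Ybl(b,c,omega,phi) + A U
```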

2. Identification of the autonomous term

In this section, we solve the problem

$$\underset{b,\, c,\, \omega \,\in\, \mathbb{R}^p,\ \phi \,\in\, [-\pi, \pi]^p}{\text{minimize}} \quad \|Y_d - Y_{\mathrm{bl}}(b, c, \omega, \phi)\|_F. \tag{5}$$

Here we have chosen the fitting criterion to be the Frobenius norm $\|E\|_F := \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} e_{ij}^2}$ of the fitting error $E := Y - Y_{\mathrm{bl}}$. Note that problem (5) decouples into $p$ independent problems

$$\underset{b_j,\, c_j,\, \omega_j \,\in\, \mathbb{R},\ \phi_j \,\in\, [-\pi, \pi]}{\text{minimize}} \quad \|y_{d,j} - y_{\mathrm{bl},j}(b_j, c_j, \omega_j, \phi_j)\|_2, \tag{6}$$

where $y_{d,j}$ is the $j$th row of $Y_d$ and $y_{\mathrm{bl},j}$ is the $j$th row of $Y_{\mathrm{bl}}$. A generalization of this problem to a sum of $n$ sinusoids is well studied in signal processing, where it is known as harmonic retrieval and line spectral estimation. For an outdated but nicely written overview of the relevant literature, we refer the reader to Kay and Marple (1981). For our purposes, it suffices to say that the methods split into suboptimal heuristics (subspace-type methods) and methods based on local optimization (maximum likelihood-type methods).

For the problem at hand, we use a method from the local optimization family, because this approach allows us to fix the frequency to a predefined value, chosen from empirical observation of the data. For $j = 1, \ldots, p$, the frequency $\omega_j$ seems to be either $12\pi/T$ (one year period) or $6\pi/T$ (half year period). We solve two problems, with $\omega_j = 12\pi/T$ and $\omega_j = 6\pi/T$, and choose the fit that leads to the smaller value of the cost function. This eliminates the variable $\omega_j$ from (6).

The method that we use for solving (6) is the variable projection method, see Golub and Pereyra (2003). For a given $\phi_j$, (6) is a linear least squares problem in $b_j$ and $c_j$, so that its minimum can be found analytically. The minimum value is a function of $\phi_j$, and minimization of this function over $\phi_j$ is a problem equivalent to (6). The advantage of solving the equivalent problem is that it has one optimization variable, while the original problem has three optimization variables. This makes the approach based on the variable projection method more robust than a direct optimization approach. Two examples of fitting outputs $y_{d,j}$ by an offset-plus-sine function are shown in Figure 1.


Figure 1: Examples of offset-plus-sine fits obtained by solving problem (6). Solid line — $y_{d,j}$ (left: $j = 3$, right: $j = 4$); dashed line — (locally) optimal fit $y_{\mathrm{bl},j}^*$.
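As an illustration of the variable projection computation, the following is a minimal sketch (Python with NumPy and SciPy; the function name and data layout are illustrative, and this is not the author's code) of solving (6) for one output with the frequency fixed.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_offset_plus_sine(y, omega):
    """Solve (6) for one output with omega fixed, by variable projection:
    for a given phi, (b, c) follow from linear least squares; the outer
    problem is a one-dimensional minimization over phi."""
    T = len(y)
    t = np.arange(1, T + 1)

    def inner_ls(phi):
        # Regressors for the linearly entering parameters b and c.
        X = np.column_stack([np.ones(T), np.sin(omega * t + phi)])
        bc = np.linalg.lstsq(X, y, rcond=None)[0]
        return np.linalg.norm(y - X @ bc), bc

    # Projected problem: a single optimization variable (the phase phi).
    res = minimize_scalar(lambda phi: inner_ls(phi)[0],
                          bounds=(-np.pi, np.pi), method="bounded")
    cost, (b, c) = inner_ls(res.x)
    return b, c, res.x, cost

# Steps 3-7 of Algorithm 1: try both candidate frequencies for a row y of Yd
# and keep the fit with the smaller cost, e.g.:
#   fits = [(fit_offset_plus_sine(y, w), w) for w in (6*np.pi/T, 12*np.pi/T)]
#   (b, c, phi, cost), omega = min(fits, key=lambda fw: fw[0][3])
```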

3. Identification of the term involving the inputs

Let $\phi^*$, $\omega^*$ be a (locally) optimal solution of (5). The problem we solve in this section is

$$\begin{array}{ll}
\underset{b,\, c \,\in\, \mathbb{R}^p,\ A \,\in\, \mathbb{R}^{p \times m}}{\text{minimize}} & \|Y_d - Y_{\mathrm{bl}}(b, c, \phi^*, \omega^*) - AU_d\|_F \\[1ex]
\text{subject to} & \text{each row of } A \text{ has at most 50 nonzero elements.}
\end{array} \tag{7}$$

Note that the difference between the original problem (4) and problem (7) is that the $\phi$ and $\omega$ parameters are fixed to the values $\phi^*$ and $\omega^*$ computed in advance (from problem (5)). Obviously, this step of fixing optimization variables leads to suboptimality; however, it simplifies the problem and is an intuitively justifiable heuristic.

The main difficulty in solving problem (7) is the sparsity constraint. The approach we adopt is based on the $\ell_1$ heuristic for producing sparse solutions, also known in the statistical literature as the lasso method. Note that, similarly to the situation in Section 2, problem (7) is row-wise separable:

$$\begin{array}{ll}
\underset{b_j,\, c_j \,\in\, \mathbb{R},\ a_j \,\in\, \mathbb{R}^m}{\text{minimize}} & \|y_{d,j} - y_{\mathrm{bl},j}(b_j, c_j, \phi_j^*, \omega_j^*) - a_j^\top U_d\|_2 \\[1ex]
\text{subject to} & a_j \text{ has at most 50 nonzero elements.}
\end{array} \tag{8}$$


We propose the following heuristic for finding a suboptimal solution of (8):

$$\begin{array}{ll}
\underset{b_j,\, c_j \,\in\, \mathbb{R},\ a_j \,\in\, \mathbb{R}^m}{\text{minimize}} & \|y_{d,j} - y_{\mathrm{bl},j}(b_j, c_j, \phi_j^*, \omega_j^*) - a_j^\top U_d\|_2 \\[1ex]
\text{subject to} & \|a_j\|_1 \leq \gamma_j,
\end{array} \tag{9}$$

where $\gamma_j > 0$ is a regularization parameter that controls the sparsity vs. accuracy trade-off. The advantage of using (9) over the alternative approach of regularization by adding a term $\lambda \|a_j\|_1$ to the cost function is that $\gamma_j$ has a direct interpretation as a bound on the solution size. This allows us to propose a simple and computationally inexpensive way of choosing $\gamma_j$ automatically. Suppose that we aim to compute a sparse solution with at most $k$ nonzero elements. We choose $k$ rows of $U_d$ (e.g., the first $k$) and let $U_d'$ be the matrix of the chosen rows stacked under each other. Then we take $\gamma_j := \|(y_{d,j} - y_{\mathrm{bl},j}) U_d'^{+}\|_1$, where $U_d'^{+}$ is the pseudo-inverse of $U_d'$. Obviously, this choice is heuristic; however, it has the justification that $(y_{d,j} - y_{\mathrm{bl},j}) U_d'^{+}$ is a particular solution with $k$ nonzero elements, so that the optimal solution should have $\ell_1$ norm less than $\|(y_{d,j} - y_{\mathrm{bl},j}) U_d'^{+}\|_1$.

We use CVX, a package for specifying and solving convex programs (Grant and Boyd, 2008a,b), in order to translate problem (9) into a standard convex optimization problem and solve it by existing optimization solvers. The computational engines used by CVX are SDPT3 and SeDuMi.

Figure 2 shows examples of fits obtained with model (2), identified by solving problems (6) and (9). The model corresponding to the left plot involves 24 inputs and the model corresponding to the right plot involves 25 inputs. Comparing the results in Figure 1 with those in Figure 2, it is visible how some features of the data that are not captured by the autonomous part of the model are fitted by the part involving the inputs.


Figure 2: Examples of fits obtained with the overall model (2), identified by solving problems (6) and (9). Solid line — $y_{d,j}$ (left: $j = 3$, right: $j = 4$); dashed line — suboptimal fit $y_j^*$.
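A minimal sketch of this step is given below, using CVXPY as a stand-in for the CVX/MATLAB setup used in the paper; the function and variable names are illustrative, and the choice of $\gamma_j$ follows the heuristic above with $k$ taken as the first $k$ rows of $U_d$.

```python
import numpy as np
import cvxpy as cp

def solve_l1_relaxation(y, y_bl, U, omega, phi, k=10):
    """Sketch of problem (9) for one output: y is y_dj, y_bl is the baseline
    fit from Section 2, U is (the reduced) Ud, and omega, phi are the fixed
    values omega_j*, phi_j*."""
    m, T = U.shape
    t = np.arange(1, T + 1)
    s = np.sin(omega * t + phi)      # fixed baseline shape; b, c re-fitted below

    # Heuristic gamma_j: l1 norm of a particular solution supported on the
    # first k inputs (step 8 of Algorithm 1).
    gamma = np.linalg.norm((y - y_bl) @ np.linalg.pinv(U[:k, :]), 1)

    b, c, a = cp.Variable(), cp.Variable(), cp.Variable(m)
    residual = y - (b * np.ones(T) + c * s) - a @ U
    prob = cp.Problem(cp.Minimize(cp.norm(residual, 2)),
                      [cp.norm(a, 1) <= gamma])
    prob.solve()
    return a.value, b.value, c.value, gamma
```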

4. Nonuniqueness of the solution

For a given baseline $Y_{\mathrm{bl}}$, model (3) is an overdetermined linear system of equations $Y - Y_{\mathrm{bl}} = AU$ in $A$. Therefore, in order to be able to determine $A$ uniquely from $U$ and $Y - Y_{\mathrm{bl}}$, we need $U$ to be of full row rank. This condition is not satisfied for the given inputs $U_d$ ($\mathrm{rank}(U_d) = 862$). Taking into account the sparsity constraint, however, we may still be able to recover a unique solution. This would be the case if every nontrivial element in the left null space of $U_d$ is a dense vector. We consider two special cases that lead to rank deficiency of $U$:

• Zero inputs cannot affect the output, and therefore removing them from the model leads to an equivalent reduced model $Y - Y_{\mathrm{bl}} = A_{\mathrm{red}} U_{\mathrm{red}}$. Going back from the reduced model to the original model can be done by inserting arbitrary columns in $A_{\mathrm{red}}$ at the places of the removed inputs. In order to get as sparse a solution as possible, however, we have to insert zero columns in $A_{\mathrm{red}}$. The interpretation of this augmentation is to declare that, according to the model, certain inputs (the zero inputs $u_{d,j}$) have no effect on the outputs.

• Inputs that are multiples of other inputs lead to essential nonuniqueness of the solution, which cannot be resolved by the sparsity constraint.

Based on the above considerations, we first remove from the data, in a preprocessing step, the zero inputs and the multiple copies of the same input; then identify (solving problems (6) and (9)) a model with the reduced set of inputs; and finally, in a postprocessing step, recover from the reduced model an equivalent model with the original number of inputs. Apart from reducing the size of the problem, the pre- and postprocessing steps have the advantage of revealing nonuniqueness of the solution. For the data $(u_d, y_d)$, it turns out that all inputs that are copies of other inputs are the constant 1, i.e., inputs that are active at all times. Such inputs have the net effect of an offset on the output. However, an offset is already included in the autonomous part of the model, see (1), so we remove all inputs that are identically 1. Thus, the convention of including the offset in the autonomous part of the model has the advantage of resolving the nonuniqueness of the solution. A sketch of the pre- and postprocessing steps is given below.
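The following Python/NumPy sketch illustrates the two steps; for brevity, the duplicate detection here checks only exact copies of rows, whereas the paper also treats inputs that are scalar multiples of other inputs.

```python
import numpy as np

def preprocess_inputs(Ud):
    """Remove zero rows of Ud and rows that are exact copies of earlier rows
    (step 1 of Algorithm 1). Returns the reduced matrix and the kept indices."""
    keep = []
    for i, row in enumerate(Ud):
        if not np.any(row):
            continue  # a zero input cannot affect the outputs
        if any(np.array_equal(row, Ud[j]) for j in keep):
            continue  # an exact copy of an already kept input
        keep.append(i)
    return Ud[keep], keep

def postprocess_A(A_red, keep, m):
    """Insert zero columns in A for the removed inputs (step 13 of
    Algorithm 1), declaring that they have no effect on the outputs."""
    A = np.zeros((A_red.shape[0], m))
    A[:, keep] = A_red
    return A
```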

5. Results

The proposed identification method is summarized in Algorithm 1. The total execution time for identifying model (3) is about 3 hours on a PC with an Intel Core Duo 2.13GHz CPU and 2GB RAM. The authors of the challenge provide an influence matrix $W$ that shows which inputs affect which outputs. In the evaluation of the identified model, we assume that the nonzero elements of $W$ indicate the "true" inputs affecting each output. The total number of true inputs affecting the outputs is 2321. The total number of inputs affecting the outputs, according to our method, is 1796, and the total number of correctly identified inputs is 507. Thus, 28% of the inputs identified by our method are correct. The percentage of correctly identified inputs for each output is shown in the following plot. More detailed results are given in Markovsky (2008).

[Plot: percentage of correctly identified inputs (vertical axis, 0–100%) for each output (horizontal axis).]


References

G. Golub and V. Pereyra. Separable nonlinear least squares: the variable projection method and its applications. Inverse Problems, 19:1–26, 2003.

M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming. http://stanford.edu/~boyd/cvx, 2008a.

M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, pages 95–110. Springer, 2008b.

S. Kay and S. Marple. Spectrum analysis–a modern perspective. Proceedings of the IEEE, 69(11):1380–1419, 1981.

I. Markovsky. Results on the PASCAL challenge "Simple causal effects in time series". Report 16779, ECS, University of Southampton, http://eprints.ecs.soton.ac.uk/16779/, 2008.

Causality Workbench team. PROMO: Simple causal effects in time series, August 2008. URL http://www.zurich.ibm.com/~jep/causality/promo.html.

Algorithm 1: Algorithm for identification of model (3).

Input: $U_d \in \mathbb{R}^{m \times T}$ and $Y_d \in \mathbb{R}^{p \times T}$.
1: Preprocessing: detect and remove redundant inputs, i.e., zero rows of $U_d$ and rows that are multiple copies of other rows of $U_d$. The subsequent steps are applied to the reduced matrix $U_d$.
2: for $j = 1$ to $p$ do
3: Let $f'$ be a (local) minimum of (6) with $\omega_j = 6\pi/T$ and let $\phi_j'$ be a corresponding optimal solution.
4: Let $f''$ be a (local) minimum of (6) with $\omega_j = 12\pi/T$ and let $\phi_j''$ be a corresponding optimal solution.
5: if $f' < f''$ then let $\omega_j^* := 6\pi/T$ and $\phi_j^* := \phi_j'$
6: else let $\omega_j^* := 12\pi/T$ and $\phi_j^* := \phi_j''$.
7: end if
8: Let $\gamma_j := \|(y_{d,j} - y_{\mathrm{bl},j})\, U_d(1{:}10, :)^{+}\|_1$ (here we use MATLAB notation for indexing).
9: Let $a_j'$ be a solution to (9) with parameters $\phi_j = \phi_j^*$ and $\omega_j = \omega_j^*$.
10: Determine the sparsity pattern of $a_j'$ by truncating to zero the elements smaller than the convergence tolerance.
11: Let $(b_j^*, c_j^*, a_j^*)$ be a solution to problem (8) with the sparsity pattern determined in the previous step and with parameters $\phi_j = \phi_j^*$ and $\omega_j = \omega_j^*$. (This is a linear least squares problem.)
12: end for
13: Postprocessing: add zero columns in $A^*$ corresponding to the inputs removed in the preprocessing step.
Output: $Y_{\mathrm{bl}}(b^*, c^*, \omega^*, \phi^*)$ and $A^*$.
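Steps 10 and 11 can be sketched as follows (Python/NumPy; for brevity, the refit shown re-solves only for $a_j$ on the detected support, while step 11 of Algorithm 1 also re-fits $b_j$ and $c_j$).

```python
import numpy as np

def truncate_and_refit(a_l1, y_res, U, tol=1e-6):
    """Threshold the l1 solution to obtain the sparsity pattern (step 10),
    then re-solve the least squares problem restricted to that pattern
    (step 11). y_res is the output with the baseline removed; tol stands
    in for the solver's convergence tolerance."""
    support = np.abs(a_l1) > tol          # detected sparsity pattern
    a = np.zeros_like(a_l1)
    # Unconstrained least squares over the nonzero entries only.
    a[support] = np.linalg.lstsq(U[support, :].T, y_res, rcond=None)[0]
    return a
```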

Software reproducing the results is available from: http://www.ecs.soton.ac.uk/~im/challenge.tar
