2013 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 22–25, 2013, SOUTHAMPTON, UK

ROBUST KERNEL-BASED REGRESSION USING ORTHOGONAL MATCHING PURSUIT

George Papageorgiou, Pantelis Bouboulis, Sergios Theodoridis∗

Department of Informatics and Telecommunications
University of Athens
Athens, Greece, 157 84
Emails: geopapag, [email protected], [email protected]

ABSTRACT

Kernel methods are widely used for the approximation of non-linear functions in classic regression problems, using standard techniques, e.g., Least Squares, for denoising data samples in the presence of white Gaussian noise. However, the approximation deviates greatly when impulse noise, outlying the data, enters the scene. We present a robust kernel-based method, which exploits greedy selection techniques, in particular Orthogonal Matching Pursuit (OMP), in order to recover the sparse support of the outlying vector; at the same time, it approximates the non-linear function via the mapping to a Reproducing Kernel Hilbert Space (RKHS).

Index Terms— Robust Least Squares, Greedy Algorithms, Outliers, Orthogonal Matching Pursuit (OMP), Kernel-Based Regression, Reproducing Kernel Hilbert Space (RKHS)

1. INTRODUCTION

A task of major interest in Machine Learning has always been that of parameter estimation. Regression analysis is the statistical technique for establishing the relation among a set of input-output variables. The performance of regression analysis methods, in practice, depends on the data generating process, where the existence of noise plays a key role. Usually, the assumption of additive white (Gaussian) noise is the way to model, attack and finally solve many practical problems, including classic regression ones. The main drawback of this modelling is its lack of robustness: a number of issues arise concerning the performance of the method in the presence of non-Gaussian noise, e.g., noise with extreme values. In the last few years, the modelling of outliers has gained in importance in the context of what is known as Big Data applications. Within our study, we will follow a path that explicitly models outliers and makes use of appropriate regularization techniques.

∗ This work was carried out under the 621 ARISTEIA program, co-financed by the Greek Secretariat for Research and Development and the EU.


In a set of numerical data, any value that is markedly smaller or larger than the surrounding values (a locally extreme value) is called an outlier [1]. In general, outliers are difficult to define; there is as much controversy over what constitutes an outlier as over whether to remove them or not, and most often it is up to the analyst to decide. In a more formal mathematical approach, an outlier is "information" that is not an inlier, i.e., it originates neither from the original data source nor from a common (usually expected) noise source. However, a key characteristic of outliers is sparsity, i.e., extreme values are expected to be relatively few. Lately, there has been an increased interest in the development of robust methods for denoising data samples containing outliers, since it is known that classic ones, e.g., Least Squares, fail. Our focus in this paper is on the development of a robust method, which detects the outlier support as well as the "extreme" values themselves, while at the same time obtaining estimates of the original data using kernel functions.

At the heart of our method lies a greedy selection algorithm. In particular, the performance of Orthogonal Matching Pursuit (OMP) in terms of error reduction provided the necessary spark to turn our focus towards this specific direction. Furthermore, it is known that OMP performs best when trying to recover very sparse vectors, which is usually the case when dealing with outlying observations. Although OMP lacks stability and consistency towards recovering the sparsest vector in the general case of a redundant dictionary [2, 3, 4], it turns out that the special structure of the matrix employed by the proposed algorithm ensures that, in most cases, the exact support of the outlier vector is recovered (see Section 4).

2. PROBLEM MODELLING AND PRIOR WORK

Consider a finite set of training points (y_k, x_k), k = 1, 2, ..., n, with y_k ∈ R and x_k ∈ R^m. The goal of a typical regression

task is to estimate the input-output relation via a model of the form

$$y_k = f(x_k) + \eta_k, \quad k = 1, 2, \dots, n, \qquad (1)$$

where η_k is an unobservable noise sequence, usually assumed to be white (Gaussian) noise. In the case where f is a linear function, the problem reduces to computing the corresponding coefficients that define the hyperplane on which the solution lies. In the more general case, where f is a non-linear function, we will assume that f belongs to a space of "smooth" functions H, which will be assumed to have the structure of a Reproducing Kernel Hilbert Space (RKHS). The kernel function and the norm induced by the inner product are κ : R^m × R^m → R and ||·||_H = sqrt(<·,·>_H), respectively. The Representer Theorem guarantees that, over the training set, the minimizer of the regularized minimization problem

$$\min_{f\in\mathcal{H}} \sum_{k=1}^{n} \big(y_k - f(x_k)\big)^2 + \mu\|f\|_{\mathcal{H}}^2, \quad \mu \ge 0, \qquad (2)$$

admits a representation f̂(x) = Σ_{i=1}^n α_i κ(x, x_i), where the α_i are the unknown (real) coefficients. The regularization term in (2) is used in order to guard our method against overfitting, a standard technique in Machine Learning tasks. For more details see [5], [6]. Problem (2) is actually a Least Squares task in the RKHS H. Although similar tasks can be successfully applied to remove Gaussian noise [7, 8, 9], it has been established that the presence of outliers causes their solution to overfit [10]. Hence, a data sequence containing outliers should not be modelled via (1). To this end, a sequence, u_k, associated with the outliers is explicitly modelled and the input-output relation takes the form:

$$y_k = f(x_k) + u_k + \eta_k, \quad k = 1, 2, \dots, n. \qquad (3)$$

As outliers are expected to comprise only a small fraction of the training sample, most of the values of u_k are zeros. In general, a percentage of less than 20% of non-zero values is expected; thus u := (u_1, u_2, ..., u_n)^T is modelled as a sparse vector. Now that we have paved the way, it seems appropriate to reveal the gains of working under the sparse approximation umbrella. Prior knowledge of the sparsity of vector u provides the tools to form the nonconvex minimization problem

$$\min_{u,\, f\in\mathcal{H}} \sum_{k=1}^{n} \big(y_k - f(x_k) - u_k\big)^2 + \mu\|f\|_{\mathcal{H}}^2 + \lambda\|u\|_0, \qquad (4)$$

where μ > 0 is a user-defined parameter controlling the trade-off between the two main goals of this task, i.e., minimizing the error while keeping the complexity of the model, i.e., ||f||_H, low. Values of λ ≥ 0 are set in order to control the sparsity level of vector u. This formulation was introduced in [11]. In this paper, we cast the task in the following formulation,

$$\min_{u,\, f\in\mathcal{H}} \|u\|_0 \quad \text{s.t.} \quad \sum_{k=1}^{n} \big(y_k - f(x_k) - u_k\big)^2 + \lambda\|f\|_{\mathcal{H}}^2 \le \varepsilon, \qquad (5)$$

for fixed threshold parameters ε ≥ 0 and λ > 0. Although problem (4) (or (5)) is of an NP-hard combinatorial nature, greedy selection algorithms succeed in recovering the solution for certain data and model parameters. Another notable fact is the relation between (4) and the variational Least Trimmed Squares (VLTS) problem, for certain levels of sparsity of vector u; see [11].

2.1. Convex relaxation

It is evident that problem (4) is a nonconvex optimization task. To achieve stable solutions and robust properties, many authors prefer to consider an alternative convex task, which (in some sense) is close to the original problem. This is the popular convex relaxation technique. In our case, substituting the ℓ0 with the ℓ1 norm, problem (4) can be cast as:

$$\min_{u,\, f\in\mathcal{H}} \sum_{k=1}^{n} \big(y_k - f(x_k) - u_k\big)^2 + \mu\|f\|_{\mathcal{H}}^2 + \lambda\|u\|_1.$$

The preceding problem, presented in [11], was solved using an alternating direction method (ADM). Despite the fact that in this case we deal with a convex problem, the relaxation loses some of the immediacy of the original formulation; as a result, the potential gains in estimation accuracy or error reduction may not be fully achieved.

3. KERNEL REGULARIZED OMP (KROMP)

The standard sparse approximation denoising problem has been studied in an extensive list of papers [3, 4, 12, 13, 14, 15, 16]. Our focus in this work is to provide a robust kernel-based denoising method that can efficiently remove not only the typical Gaussian noise, but also impulses and types of noise with heavy-tailed distributions. To this end, we aim at efficiently solving problem (5).

Let φ(·) : R^m → H, φ(x) = κ(·, x), denote the feature map of H, which transforms the data from the input space to the feature space H, where H is the RKHS induced by the kernel κ. Under this framework, we map the data to a high-dimensional space, H, which gives us the luxury of adopting linear tools to attack the specific problem. Furthermore, the reproducing property of the RKHS, i.e., κ(x, x') = <φ(x), φ(x')>_H, ensures that the actual structure of the space may be ignored, as the computation of any inner product can be performed via the kernel function. Recall that for every set of points x_i, x_j, i, j = 1, 2, ..., n, the Gram matrix K_ij = κ(x_i, x_j) is a positive (semi-)definite matrix. Although a variety of kernel functions are available, the most standard, which is also used in our experiments, is the Gaussian radial basis function kernel with parameter σ, see [5]. In the following, we make the a priori assumption that the estimated function, f, can be expressed as a finite linear combination of kernel functions centered at the training data, i.e.,

$$f = \sum_{k=1}^{n} \alpha_k \kappa(\cdot, x_k) + c,$$

where the α_k and c play the roles of expansion coefficients and bias term, respectively.
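To make this parametrization concrete, the following minimal Python/NumPy sketch (our illustration, not code from the paper; the helper names are hypothetical) evaluates such an expansion at a new point x, using the Gaussian RBF kernel mentioned above (its explicit form is given in Section 3.1):

import numpy as np

def gaussian_kernel(x, y, sigma):
    # Gaussian RBF kernel k_sigma(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))

def predict(x, X_train, alpha, c, sigma):
    # Evaluate f(x) = sum_k alpha_k * k(x, x_k) + c for the kernel expansion
    kx = np.array([gaussian_kernel(x, xk, sigma) for xk in X_train])
    return kx @ alpha + c

Under this assumption, estimating f reduces to estimating the finite-dimensional vector (α, c), which is what the reformulation below exploits.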

Hence, instead of solving problem (5), we target our efforts at estimating the solution of

$$\min_{u,\alpha,c} \|u\|_0 \quad \text{s.t.} \quad \|K\alpha + c\mathbf{1} + u - y\|_2^2 + \lambda\|\alpha\|_2^2 + \lambda c^2 \le \varepsilon, \qquad (6)$$

where α ∈ R^n and c ∈ R are the kernel expansion coefficients and the bias, 1 ∈ R^n is the vector of ones and y, u ∈ R^n are the measurement and outlier vectors, respectively. At this point, it is important to make the following remarks. The quadratic inequality constraint in (6) can also be written as J(z) = ||Az − y||_2^2 + λ z^T B z ≤ ε, where

$$A = \begin{pmatrix} K & \mathbf{1} & I_n \end{pmatrix}, \qquad z = \begin{pmatrix} \alpha \\ c \\ u \end{pmatrix}, \qquad B = \begin{pmatrix} I_n & \mathbf{0} & O_n \\ \mathbf{0}^T & 1 & \mathbf{0}^T \\ O_n & \mathbf{0} & O_n \end{pmatrix},$$

while I_n denotes the identity matrix, 0 the zero vector and O_n the all-zero square matrix. At each step, according to the OMP rationale, our algorithm selects the most correlated column and attempts to solve min_z J(z). First of all, notice that the square symmetric matrix B is a projection matrix, i.e., B = B^2, mapping vectors of R^{2n+1} onto a subspace of dimension n + 1. Substituting in J(z) and reformulating, our minimization problem becomes equivalent to

$$\min_{z} \left\| \begin{pmatrix} A \\ \sqrt{\lambda}\, B \end{pmatrix} z - \begin{pmatrix} y \\ \mathbf{0} \end{pmatrix} \right\|^2, \qquad (7)$$

which can be viewed as a classic Least Squares problem. Next, we would like to see whether J(z) in (7) attains a minimum and whether it is unique. Note that, for any data set (y_k, x_k), k = 1, 2, ..., n, and for all ε ≥ 0, we can find z such that J(z) ≤ ε; for example, if we select z = (0, 0, y)^T, then J(z) = 0. This means that the feasible set of (6) is always nonempty. Finally, recall that (7) admits a unique solution if and only if the nullspaces of A and B intersect only trivially, i.e., N(A) ∩ N(B) = {0} ([17], [18]). For simplicity, let D denote the stacked matrix of (7), i.e., D = (A^T, √λ B^T)^T. It is straightforward to prove that the set of normal equations obtained from (7) is (A^T A + λB) z = A^T y, where (A^T A + λB) is invertible, as the following proposition establishes:

Proposition 1. Matrix A^T A + λB is (strictly) positive definite, hence invertible.

To prove this, decompose a nonzero vector x into three parts (according to the block dimensions of A) and verify that x^T (A^T A + λB) x is a strictly positive quantity for arbitrary x ≠ 0. Moreover, we make use of the following well-known theorem.

Theorem 1. Matrix A^T A + λB is (strictly) positive definite if and only if the columns of matrix D are linearly independent, i.e., rank(D) = 2n + 1.

Consequently, the minimizer z* ∈ R^{2n+1} of (7) is unique; see also [19]. Another formulation of (7), for δ > 0, is

$$\min_{z} \|Az - y\|_2^2 \quad \text{s.t.} \quad \|Bz\|_2 \le \delta. \qquad (8)$$
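Before turning to the equivalence of (7) and (8), a small computational aside: the sketch below (a Python/NumPy illustration of the block structure above, under our own naming; it is not the authors' code) assembles A and B for a given Gram matrix K and solves the normal equations (A^T A + λB) z = A^T y with a Cholesky factorization, which Proposition 1 justifies:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_regularized_ls(K, y, lam):
    # Solve (A^T A + lam*B) z = A^T y for the full z = (alpha, c, u),
    # with A = [K  1  I_n] and B penalizing only the (alpha, c) block, as in (6)-(7).
    n = K.shape[0]
    A = np.hstack([K, np.ones((n, 1)), np.eye(n)])      # n x (2n+1)
    B = np.zeros((2 * n + 1, 2 * n + 1))
    B[: n + 1, : n + 1] = np.eye(n + 1)                 # projector onto the (alpha, c) block
    M = A.T @ A + lam * B                               # strictly positive definite (Prop. 1)
    z = cho_solve(cho_factor(M), A.T @ y)
    alpha, c, u = z[:n], z[n], z[n + 1:]
    return alpha, c, u

In KROMP itself, only a restricted set of columns of A enters this solve at each step, as described in Section 3.1.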

Equivalence between (8) and (7) is well established in [20]. Now suppose z_{λ*} is the unique minimizer of (7) for a certain λ* > 0. Then z_{λ*} solves problem (8) with δ = ||B z_{λ*}||_2. Problem (8) reveals the physical properties of the task and opens the way to interpret the performance of the greedy algorithm.

3.1. Algorithm implementation

The basic concept of the algorithm lies in restricting the column selection set to the last part of matrix A, i.e., matrix I_n, whose columns form an orthonormal basis of R^n. This is due to our modelling: the outliers are expected to appear in the third part of vector z, i.e., vector u, which is known to be sparse. The major task of our new algorithm, KROMP, is therefore to perform a selection over this active set of columns. An overview of the algorithm and its convergence properties (error reduction per step) is given below.

Prior to the implementation, computation of the kernel matrix K is required, given the input vectors x_k, k = 1, 2, ..., n, and using the Gaussian radial basis kernel function with kernel parameter σ, i.e., κ_σ(x, y) = exp(−||x − y||^2 / (2σ^2)). At this point, we should emphasize that a careful tuning of the kernel parameter is needed, since its correct selection determines whether the algorithm identifies the actual outlier support or not. This is also the case for the λ and ε values. The method used in the present work is cross validation, for the λ as well as the σ values. The same level of sensitivity also holds for the convex relaxation problem; see [11].

At each step, we define two separate sets, one for the active and another for the inactive columns of matrix A, denoted S_ac and S_inac, respectively. Initially, we fix S_inac to include the indices of the first n + 1 columns of matrix A and define a) A_inac^(0) as the matrix that contains the columns of A whose indices belong to S_inac, and b) B_inac^(0) as the matrix containing the rows of B whose indices belong to S_inac. During the algorithmic process, A_inac is augmented by an optimally selected column of A, and B_inac is augmented by zeros, in order to match the column dimension of A_inac.

Let z^(0) denote the solution of the regularized problem (8), restricted to the initial inactive columns of matrix A, and let r^(0) = A_inac^(0) z^(0) − y denote the initial residual. Since this is a noise removal method, it is expected that y ∉ R(A_inac^(0)), where R(·) denotes the range of a matrix, irrespective of whether the added noise sequence is Gaussian, impulsive (or both), or heavy-tailed. Suppose that the ℓ2 norm of the residual r^(0) is below our threshold parameter ε. This assumption reflects the fact that no impulsive outlying noise exists, so that the problem simplifies to a regularized Least Squares problem; in this case, the algorithm stops. However, if outliers are present, the algorithm will proceed in order to approximate the sparse outlier vector support.
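As a concrete illustration of this initialization step, the following Python/NumPy sketch (our own naming and helper functions, not the authors' code) builds the Gaussian Gram matrix and performs the regularized Least Squares solve over the first n + 1 columns [K 1] only, returning z^(0) and the initial residual r^(0):

import numpy as np

def gaussian_gram(X, sigma):
    # Gaussian RBF Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), X of shape (n, m)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def initial_solution(K, y, lam):
    # Regularized LS over the inactive columns [K 1]: returns z0 = (alpha, c) and r0
    n = K.shape[0]
    A0 = np.hstack([K, np.ones((n, 1))])                # first n + 1 columns of A
    M = A0.T @ A0 + lam * np.eye(n + 1)                 # penalty lam * (||alpha||^2 + c^2)
    z0 = np.linalg.solve(M, A0.T @ y)
    r0 = A0 @ z0 - y                                    # initial residual
    return z0, r0

If ||r^(0)||_2 ≤ ε, the algorithm would stop here; otherwise the greedy selection described next takes over.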

At each iteration step, k, KROMP selects the index j_k:

$$j_k := \arg\min_{j \in S_{ac}} \|r^{(k-1)} - \langle e_j, r^{(k-1)}\rangle e_j\|_2^2 = \arg\max_{j \in S_{ac}} |r_j^{(k-1)}|, \qquad (9)$$

where r_j^{(k−1)} is the j-th coordinate of the residual vector r^{(k−1)} and e_j is the unit-norm vector from the standard orthonormal basis of R^n. Then, S_inac is enlarged by j_k and matrix A_inac^{(k−1)} is augmented by e_{j_k}. Next, the solution of (8) is computed, i.e., z_*^{(k)} = (α^{(k)}, c^{(k)}, u_{j_1}, ..., u_{j_k})^T ∈ R^{n+k+1}, taking into account the replacement of matrix A with matrix A_inac^{(k)}. Finally, the residual is calculated as r^{(k)} = A_inac^{(k)} z^{(k)} − y.

At the (k+1)-th step, the process is repeated and another column e_{j_{k+1}} is added to matrix A_inac^{(k)}. At this stage we have

$$A_{inac}^{(k+1)} = \begin{pmatrix} K & \mathbf{1} & e_{j_1} & \cdots & e_{j_k} & e_{j_{k+1}} \end{pmatrix} = \begin{pmatrix} A_{inac}^{(k)} & e_{j_{k+1}} \end{pmatrix}.$$

Now, let z_*^{(k+1)} ∈ R^{n+k+2} be the unique minimizer of L_{k+1}(z) = ||A_inac^{(k+1)} z − y||_2^2 subject to the constraint ||B_inac^{(k)} z||_2 ≤ ε (i.e., the minimization function of (8) at the current step). It can be shown that the residual obtained at each iteration cycle is strictly decreasing. To this end, consider the vector z̃^{(k)}, which denotes z_*^{(k)} augmented by the opposite value of the j_{k+1}-th coordinate of the residual vector r^{(k)}, i.e., z̃^{(k)} = (z_*^{(k)T}, −r_{j_{k+1}}^{(k)})^T. Observe that z̃^{(k)} belongs to the feasible set defined by the inequality constraint of (8) at the current step (geometrically, the feasible set remains the same, while matrix B is augmented by zero elements at each step). Hence, L_{k+1}(z_*^{(k+1)}) ≤ L_{k+1}(z̃^{(k)}). Moreover, we have that

$$L_{k+1}(\tilde z^{(k)}) = \|A_{inac}^{(k+1)} \tilde z^{(k)} - y\|^2 = \left\| \begin{pmatrix} A_{inac}^{(k)} & e_{j_{k+1}} \end{pmatrix} \begin{pmatrix} z_*^{(k)} \\ -r_{j_{k+1}}^{(k)} \end{pmatrix} - y \right\|^2 = \|A_{inac}^{(k)} z_*^{(k)} - r_{j_{k+1}}^{(k)} e_{j_{k+1}} - y\|^2 = \|r^{(k)} - r_{j_{k+1}}^{(k)} e_{j_{k+1}}\|^2 < \|r^{(k)}\|^2, \qquad (10)$$

where the last strict inequality is due to the fact that |r_{j_{k+1}}^{(k)}| > 0, as j_{k+1} is selected according to (9). Thus, we conclude that

$$\|r^{(k+1)}\|^2 = L_{k+1}(z_*^{(k+1)}) \le L_{k+1}(\tilde z^{(k)}) < \|r^{(k)}\|^2,$$

which proves our claim. Moreover, we can see that the residual will eventually drop below the predefined threshold, ε, no matter how small this is. However, if the user selects a very small ε, the proposed procedure will continue and model all noise samples (even those originating from a Gaussian source) as impulses, filling up the vector u, which will no longer be sparse. Hence, sensible tuning of ε is important. Algorithm 1 describes the procedure in detail.

Algorithm 1: Kernel Regularized OMP (KROMP)
Input: K, y, λ, ε
Initialization: k := 0
  S_inac = {1, 2, ..., n + 1}, S_ac = {n + 2, ..., 2n + 1}
  A_inac = [K 1], A_ac = I_n = [e_1 ··· e_n]
  Solve: z^(0) := arg min_z ||A_inac z − y||_2^2 + λ||α||_2^2 + λc^2
  Initial residual: r^(0) = A_inac z^(0) − y
while ||r^(k)||_2 > ε do
  k := k + 1
  Find: j_k := arg max_{j ∈ S_ac} |r_j^(k−1)|
  Update support: S_inac = S_inac ∪ {j_k}, S_ac = S_ac − {j_k}
    A_inac = [A_inac e_{j_k}]
  Update current solution: z^(k) := arg min_z ||A_inac z − y||_2^2 + λ||α||_2^2 + λc^2
  Update residual: r^(k) = A_inac z^(k) − y
end while
Output: z = (α, c, u)^T after k iterations

Each step of KROMP involves solving a linear system using a Cholesky decomposition, which has a complexity of O((n + k)^3), where k is the current iteration count.
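For completeness, the following self-contained Python/NumPy sketch follows Algorithm 1 step by step (our illustration under the stated assumptions, with hypothetical function and variable names; it is not the authors' released code and omits the cross-validated tuning of λ and σ):

import numpy as np

def gaussian_gram(X, sigma):
    # Gram matrix for the Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def _regularized_ls(A_inac, y, lam, n):
    # Solve min_z ||A_inac z - y||^2 + lam*||alpha||^2 + lam*c^2.
    # Only the first n + 1 coordinates (alpha, c) are penalized, as in Algorithm 1.
    p = A_inac.shape[1]
    reg = np.zeros(p)
    reg[: n + 1] = lam
    M = A_inac.T @ A_inac + np.diag(reg)
    return np.linalg.solve(M, A_inac.T @ y)   # the paper uses a Cholesky solve here

def kromp(K, y, lam, eps, max_iter=None):
    # Kernel Regularized OMP: returns (alpha, c, u) with u the sparse outlier estimate
    n = K.shape[0]
    if max_iter is None:
        max_iter = n                           # practical safeguard, not part of Algorithm 1
    A_inac = np.hstack([K, np.ones((n, 1))])   # columns [K 1]
    S_ac = list(range(n))                      # residual indices of the identity columns
    selected = []                              # j_1, j_2, ..., j_k
    z = _regularized_ls(A_inac, y, lam, n)
    r = A_inac @ z - y
    while np.linalg.norm(r) > eps and len(selected) < max_iter:
        j = max(S_ac, key=lambda i: abs(r[i])) # selection rule (9)
        S_ac.remove(j)
        selected.append(j)
        e_j = np.zeros((n, 1))
        e_j[j, 0] = 1.0
        A_inac = np.hstack([A_inac, e_j])      # augment with the chosen column
        z = _regularized_ls(A_inac, y, lam, n) # update current solution
        r = A_inac @ z - y                     # update residual
    alpha, c = z[:n], z[n]
    u = np.zeros(n)
    u[selected] = z[n + 1:]
    return alpha, c, u

# Toy usage with hypothetical parameter values: a smooth signal plus a few impulses.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.linspace(0, 1, 100).reshape(-1, 1)
    y = np.sinc(4 * X[:, 0]) + 0.05 * rng.standard_normal(100)
    y[[10, 40, 70]] += 5.0                     # impulsive outliers
    alpha, c, u = kromp(gaussian_gram(X, sigma=0.1), y, lam=0.5, eps=1.0)
    print("estimated outlier support:", np.nonzero(u)[0])

With cross-validated σ and λ and a sensibly chosen ε, the selected indices form the estimated outlier support, while (α, c) provide the denoised kernel estimate of f.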