A Gentle Introduction to. Support Vector Machines in Biomedicine. Alexander
Statnikov*, Douglas Hardin#,. Isabelle Guyon†, Constantin F. Aliferis*. (Materials
...
A Gentle Introduction to Support Vector Machines in Biomedicine Alexander Statnikov*, Douglas Hardin#, Isabelle Guyon†, Constantin F. Aliferis* (Materials about SVM Clustering were contributed by Nikita Lytkin*) *New
York University, #Vanderbilt University, †ClopiNet
Part I • Introduction • Necessary mathematical concepts • Support vector machines for binary classification: classical formulation • Basic principles of statistical machine learning
2
Introduction
3
About this tutorial Main goal: Fully understand support vector machines (and important extensions) with a modicum of mathematics knowledge. • This tutorial is both modest (it does not invent anything new) and ambitious (support vector machines are generally considered mathematically quite difficult to grasp). • Tutorial approach: learning problem main idea of the SVM solution geometrical interpretation math/theory basic algorithms extensions case studies. 4
Data-analysis problems of interest 1. Build computational classification models (or “classifiers”) that assign patients/samples into two or more classes. - Classifiers can be used for diagnosis, outcome prediction, and other classification tasks. - E.g., build a decision-support system to diagnose primary and metastatic cancers from gene expression profiles of the patients: Primary Cancer
Classifier model Patient
Biopsy
Metastatic Cancer
Gene expression profile 5
Data-analysis problems of interest 2. Build computational regression models to predict values of some continuous response variable or outcome. - Regression models can be used to predict survival, length of stay in the hospital, laboratory test values, etc. - E.g., build a decision-support system to predict optimal dosage of the drug to be administered to the patient. This dosage is determined by the values of patient biomarkers, and clinical and demographics data: 1 2.2 3 423 2 3 92 2 1 8
Patient
Biomarkers, clinical and demographics data
Regression model
Optimal dosage is 5 IU/Kg/week
6
Data-analysis problems of interest 3. Out of all measured variables in the dataset, select the smallest subset of variables that is necessary for the most accurate prediction (classification or regression) of some variable of interest (e.g., phenotypic response variable). - E.g., find the most compact panel of breast cancer biomarkers from microarray gene expression data for 20,000 genes: Breast cancer tissues Normal tissues
7
Data-analysis problems of interest 4. Build a computational model to identify novel or outlier patients/samples. - Such models can be used to discover deviations in sample handling protocol when doing quality control of assays, etc. - E.g., build a decision-support system to identify aliens.
8
Data-analysis problems of interest 5. Group patients/samples into several clusters based on their similarity. - These methods can be used to discovery disease sub-types and for other tasks. - E.g., consider clustering of brain tumor patients into 4 clusters based on their gene expression profiles. All patients have the same pathological sub-type of the disease, and clustering discovers new disease subtypes that happen to have different characteristics in terms of patient survival and time to recurrence after treatment.
Cluster #1 Cluster #2
Cluster #3
Cluster #4
9
Basic principles of classification
• Want to classify objects as boats and houses. 10
Basic principles of classification
• All objects before the coast line are boats and all objects after the coast line are houses. • Coast line serves as a decision surface that separates two classes.
11
Basic principles of classification These boats will be misclassified as houses
12
Basic principles of classification Longitude
Boat
Latitude
House
• The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example. • First all objects are represented geometrically.
13
Basic principles of classification Longitude
Boat
Latitude
Then the algorithm seeks to find a decision surface that separates classes of objects
House
14
Basic principles of classification Longitude These objects are classified as houses
?
?
? ?
?
?
These objects are classified as boats
Latitude
Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if the fall above it
15
The Support Vector Machine (SVM) approach • • •
Support vector machines (SVMs) is a binary classification algorithm that offers a solution to problem #1. Extensions of the basic SVM algorithm can be applied to solve problems #1-#5. SVMs are important because of (a) theoretical reasons: - Robust to very large number of variables and small samples - Can learn both simple and highly complex classification models - Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
16
Main ideas of SVMs Gene Y
Normal patients
Cancer patients
Gene X
• Consider example dataset described by 2 genes, gene X and gene Y • Represent patients geometrically (by “vectors”)
17
Main ideas of SVMs Gene Y
Normal patients
Cancer patients
Gene X
• Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”); 18
Main ideas of SVMs Gene Y Cancer
Cancer
Decision surface
kernel Normal Normal Gene X
• If such linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found; • The feature space is constructed via very clever mathematical projection (“kernel trick”). 19
History of SVMs and usage in the literature • Support vector machine classifiers have a long history of development starting from the 1960’s. • The most important milestone for development of modern SVMs is the 1992 paper by Boser, Guyon, and Vapnik (“A training algorithm for optimal margin classifiers”) 10000 9000
Use of Support Vector Machines in the Literature 8,180
General sciences Biomedicine
8000 7000
30000
Use of Linear Regression in the Literature
8,860
General sciences Biomedicine
25000
6,660
19,200
20000
24,100 22,200 18,700
20,100 20,000 19,500
19,100 17,700
6000 4,950
5000 4000
3,530
10000
3000
0
14,900
15,500
14,900
16,000
13,500 9,770
10,800
12,000
2,330
2000 1000
15000
19,600 18,300
17,700
1,430 359
906
621 4
12
46
99
1998
1999
2000
2001
201
351
521
726
917
1,190
5000
0
2002
2003
2004
2005
2006
2007
1998
1999
2000
2001
2002
2003
2004
2005
2006
20
2007
Necessary mathematical concepts
21
How to represent samples geometrically? Vectors in n-dimensional space (Rn) • Assume that a sample/patient is described by n characteristics (“features” or “variables”) • Representation: Every sample/patient is a vector in Rn with tail at point with 0 coordinates and arrow-head at point with the feature values. • Example: Consider a patient described by 2 features: Systolic BP = 110 and Age = 29. This patient can be represented as a vector in R2: Age (110, 29)
(0, 0)
Systolic BP
22
How to represent samples geometrically? Vectors in n-dimensional space (Rn) Patient 3
Patient 4
Age (years)
60
Patient 1
40
Patient 2
20
0 200 150
300
100
200 50
100 0
Patient id 1 2 3 4
Cholesterol (mg/dl) 150 250 140 300
Systolic BP (mmHg) 110 120 160 180
0
Age (years) 35 30 65 45
Tail of the vector (0,0,0) (0,0,0) (0,0,0) (0,0,0)
Arrow-head of the vector (150, 110, 35) (250, 120, 30) (140, 160, 65) (300, 180, 45)
23
How to represent samples geometrically? Vectors in n-dimensional space (Rn) Patient 3
Patient 4
Age (years)
60
Patient 1
40
Patient 2
20
0 200 150
300
100
200 50
100 0
0
Since we assume that the tail of each vector is at point with 0 coordinates, we will also depict vectors as points (where the arrow-head is pointing). 24
Purpose of vector representation • Having represented each sample/patient as a vector allows now to geometrically represent the decision surface that separates two groups of samples/patients. A decision surface in 7
A decision surface in R3
R2 7 6
6
5
5
4 4
3 3
2
2
1
1
0 10
0 0
1
2
3
4
5
6
7
5 0
0
1
2
3
4
5
6
7
• In order to define the decision surface, we need to introduce some basic math elements… 25
Basic operation on vectors in Rn 1. Multiplication by a scalar Consider a vector a = (a1 , a2 ,..., an ) and a scalar c Define: ca = (ca1 , ca2 ,..., can ) When you multiply a vector by a scalar, you “stretch” it in the same or opposite direction depending on whether the scalar is positive or negative. a = (1,2) c=2 c a = (2,4)
ca
4 3 2 1
4
a = (1,2) c = −1 c a = (−1,−2)
a
3 2 1
-2
-1
0
a 1
2
3
4
-1 0
1
2
3
4
ca
-2
26
Basic operation on vectors in Rn 2. Addition Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) Define: a + b = (a1 + b1 , a2 + b2 ,..., an + bn )
a = (1,2) b = (3,0) a + b = (4,2)
4 3 2 1
0
a 1
Recall addition of forces in classical mechanics.
a +b 2
b
3
4
27
Basic operation on vectors in Rn 3. Subtraction Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) Define: a − b = (a1 − b1 , a2 − b2 ,..., an − bn ) a = (1,2) b = (3,0) a − b = (−2,2)
4 3 2
-3
a −b
1
-2
0
-1
a 1
2
b
3
4
What vector dowe need to add to b to get a ? I.e., similar to subtraction of real numbers.
28
Basic operation on vectors in Rn 4. Euclidian length or L2-norm Consider a vector a = (a1 , a2 ,..., an ) 2 2 2 ... a = a + a + + a Define the L2-norm: 1 2 n 2 We often denote the L2-norm without subscript, i.e. a a = (1,2) a 2 = 5 ≈ 2.24
3 2 1
0
a 1
Length of this vector is ≈ 2.24 2
3
4
L2-norm is a typical way to measure length of a vector; other methods to measure length also exist.
29
Basic operation on vectors in Rn 5. Dot product
Consider vectors a = (a1 , a2 ,..., an ) and b = (b1 , b2 ,..., bn ) n Define dot product: a ⋅ b = a1b1 + a2b2 + ... + anbn = ∑ ai bi i =1
The law of cosines says that a⋅ b =|| a ||2 || b ||2 cos θ where θ is the angle between a and b. Therefore, when the vectors are perpendicular a ⋅b = 0. a = (1,2) b = (3,0) a ⋅b = 3
a = (0,2) b = (3,0) a ⋅b = 0
4 3 2 1
0
a
θ 1
2
b
3
4
4 3 2
a
1
θ 0
1
2
b
3
4
30
Basic operation on vectors in Rn 5. Dot product (continued) n a ⋅ b = a1b1 + a2b2 + ... + anbn = ∑ ai bi i =1
2 • Property: a ⋅ a = a1a1 + a2 a2 + ... + an an = a 2 • In the classical regression equation y = w ⋅ x + b the response variable y is just a dot product of the vector representing patient characteristics (x ) and the regression weights vector (w ) which is common across all patients plus an offset b. 31
Hyperplanes as decision surfaces • A hyperplane is a linear decision surface that splits the space into two parts; • It is obvious that a hyperplane is a binary classifier. A hyperplane in R2 is a line
A hyperplane in R3 is a plane 7
7
6 6
5 5
4 4
3
3
2 1
2
0 10
1
5 0 0
1
2
3
4
5
6
7
0
0
1
2
3
A hyperplane in Rn is an n-1 dimensional subspace
4
5
6
7
32
Equation of a hyperplane First we show with show the definition of hyperplane by an interactive demonstration.
Click here for demo to begin
or go to http://www.dsl-lab.org/svm_tutorial/planedemo.html
Source: http://www.math.umn.edu/~nykamp/ 33
Equation of a hyperplane Consider the case of R3:
w
P
x − x0
x
An equation of a hyperplane is defined by a point (P0) and a perpendicular vector to the plane (w) at that point.
P0
x0 O
Define vectors: x0 = OP0 and x = OP , where P is an arbitrary point on a hyperplane.
A condition for P to be on the plane is that the vector x − x0 is perpendicular to w : w ⋅ ( x − x0 ) = 0 or define b = − w ⋅ x0 w ⋅ x − w ⋅ x0 = 0 w⋅ x + b = 0 The above equations also hold for Rn when n>3.
34
Equation of a hyperplane Example
w = (4,−1,6) P0 = (0,1,−7) b = − w ⋅ x0 = −(0 − 1 − 42) = 43 ⇒ w ⋅ x + 43 = 0 ⇒ (4,−1,6) ⋅ x + 43 = 0 ⇒ (4,−1,6) ⋅ ( x(1) , x( 2 ) , x(3) ) + 43 = 0 ⇒ 4 x(1) − x( 2 ) + 6 x(3) + 43 = 0
+ direction
w P0
w ⋅ x + 50 = 0 w ⋅ x + 43 = 0 w ⋅ x + 10 = 0
- direction
What happens if the b coefficient changes? The hyperplane moves along the direction of w . We obtain “parallel hyperplanes”.
Distance between two parallel hyperplanes w ⋅ x + b1 = 0 and w ⋅ x + b2 = 0 is equal to D = b1 − b2 / w .
35
(Derivation of the distance between two parallel hyperplanes) tw w x2
w ⋅ x + b2 = 0 w ⋅ x + b1 = 0
x1
x2 = x1 + tw D = tw = t w w ⋅ x2 + b2 = 0 w ⋅ ( x1 + tw) + b2 = 0 2 w ⋅ x1 + t w + b2 = 0 2 ( w ⋅ x1 + b1 ) − b1 + t w + b2 = 0 2 − b1 + t w + b2 = 0 2 t = (b1 − b2 ) / w ⇒ D = t w = b1 − b2 / w 36
Recap We know… • How to represent patients (as “vectors”) • How to define a linear decision surface (“hyperplane”) We need to know… • How to efficiently compute the hyperplane that separates two classes with the largest “gap”? Gene Y
Need to introduce basics of relevant optimization theory Normal patients
Cancer patients
Gene X 37
Basics of optimization: Convex functions • A function is called convex if the function lies below the straight line segment connecting two points, for any two points in the interval. • Property: Any local minimum is a global minimum!
Local minimum Global minimum
Convex function
Global minimum
Non-convex function 38
Basics of optimization: Quadratic programming (QP) • Quadratic programming (QP) is a special optimization problem: the function to optimize (“objective”) is quadratic, subject to linear constraints. • Convex QP problems have convex objective functions. • These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).
39
Basics of optimization: Example QP problem Consider x = ( x1 , x 2 ) 1 2 || x ||2 subject to x1 +x 2 −1 ≥ 0 Minimize 2 quadratic objective
linear constraints
This is QP problem, and it is a convex QP as we will see later We can rewrite it as: Minimize
1 2 ( x1 + x22 ) subject to x1 +x 2 −1 ≥ 0 2 quadratic objective
linear constraints 40
Basics of optimization: Example QP problem f(x1,x2)
x1 +x 2 −1 ≥ 0 x1 +x 2 −1
1 2 ( x1 + x22 ) 2
x1 +x 2 −1 ≤ 0 x2 x1 The solution is x1=1/2 and x2=1/2.
41
Congratulations! You have mastered all math elements needed to understand support vector machines. Now, let us strengthen your knowledge by a quiz 42
Quiz 1) Consider a hyperplane shown with white. It is defined by equation: w ⋅ x + 10 = 0 Which of the three other hyperplanes can be defined by equation: w ⋅ x + 3 = 0 ?
w P0
- Orange - Green - Yellow 4
a
3
2) What is the dot product between vectors a = (3,3) and b = (1,−1) ?
2 1
-2
-1
0 -1
1
2
3
4
b
-2
43
Quiz 4
a
3
3) What is the dot product between vectors a = (3,3) and b = (1,0) ?
2 1
-2
-1
0
b
1
2
3
4
3
4
-1 -2
4 3
4) What is the length of a vector a = (2,0) and what is the length of all other red vectors in the figure?
2
a
1
-2
-1
0
1
2
-1 -2
44
Quiz 5) Which of the four functions is/are convex?
1
2
3
4 45
Support vector machines for binary classification: classical formulation
46
Case 1: Linearly separable data; “Hard-margin” linear SVM Given training data:
Negative instances (y=-1)
x1 , x2 ,..., x N ∈ R n
y1 , y2 ,..., y N ∈ {−1,+1}
Positive instances (y=+1)
• Want to find a classifier (hyperplane) to separate negative instances from the positive ones. • An infinite number of such hyperplanes exist. • SVMs finds the hyperplane that maximizes the gap between data points on the boundaries (so-called “support vectors”). • If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well. 47
Statement of linear SVM classifier w⋅ x + b = 0
w ⋅ x + b = −1
w ⋅ x + b = +1
The gap is distance between parallel hyperplanes:
w ⋅ x + b = −1 and w ⋅ x + b = +1
Or equivalently:
w ⋅ x + (b + 1) = 0 w ⋅ x + (b − 1) = 0
We know that Negative instances (y=-1)
Positive instances (y=+1)
D = b1 − b2 / w
Therefore:
D = 2/ w
Since we want to maximize the gap,
we need to minimize w
or equivalently minimize
1 2
2 w
(
1 2
is convenient for taking derivative later on) 48
Statement of linear SVM classifier w ⋅ x + b ≤ −1
w⋅ x + b = 0 w ⋅ x + b ≥ +1
In addition we need to impose constraints that all instances are correctly classified. In our case:
w ⋅ xi + b ≤ −1 if yi = −1 w ⋅ xi + b ≥ +1 if yi = +1 Equivalently:
yi ( w ⋅ xi + b) ≥ 1
Negative instances (y=-1)
Positive instances (y=+1)
In summary: 2 1 w Want to minimize 2 subject to yi ( w ⋅ xi + b) ≥ 1 for i = 1,…,N Then given a new instance x, the classifier is f ( x ) = sign( w ⋅ x + b) 49
SVM optimization problem: Primal formulation n
Minimize
1 2
2 w ∑ i i =1
subject to
Objective function
yi ( w ⋅ xi + b) − 1 ≥ 0
for i = 1,…,N
Constraints
• This is called “primal formulation of linear SVMs”. • It is a convex quadratic programming (QP) optimization problem with n variables (wi, i = 1,…,n), where n is the number of features in the dataset.
50
SVM optimization problem: Dual formulation • The previous problem can be recast in the so-called “dual form” giving rise to “dual formulation of linear SVMs”. • It is also a convex quadratic programming problem but with N variables (αi ,i = 1,…,N), where N is the number of samples. N Maximize ∑ α i − ∑ α iα j yi y j xi ⋅ x j subject to α i ≥ 0 and ∑ α i yi = 0 . N
i =1
N
1 2
i , j =1
i =1
Objective function
Constraints
N Then the w-vector is defined in terms of αi: w = ∑ α i yi xi i =1
And the solution becomes: f ( x ) = sign(∑ α i yi xi ⋅ x + b) N
i =1
51
SVM optimization problem: Benefits of using dual formulation 1) No need to access original data, need to access only dot products. Objective function: ∑ α i − ∑ α iα j yi y j xi ⋅ x j N
i =1
N
1 2
i , j =1
Solution: f ( x ) = sign(∑ α i yi xi ⋅ x + b) N
i =1
2) Number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems). E.g., if a microarray dataset contains 20,000 genes and 100 patients, then need to find only up to 100 parameters! 52
(Derivation of dual formulation) n
Minimize
1 2
2 w subject to y ( w ⋅ xi + b) − 1 ≥ 0 for i = 1,…,N ∑ i i i =1
Constraints
Objective function
Apply the method of Lagrange multipliers.
( Λ w Define Lagrangian P , b, α ) =
( w − α y ( w ∑ ∑ i i ⋅ xi + b) − 1) n
1 2
i =1
N
2 i
i =1
a vector with n elements a vector with N elements
We need to minimize this Lagrangian with respect to w, b and simultaneously require that the derivative with respect to α vanishes , all subject to the constraints that α i ≥ 0.
53
(Derivation of dual formulation) If we set the derivatives with respect to w, b to 0, we obtain: N ∂Λ P (w, b, α ) = 0 ⇒ ∑ α i yi = 0 ∂b i =1 N ∂Λ P (w, b, α ) α 0 w y x = ⇒ = ∑ i i i ∂w i =1
We substitute the above into the equation for Λ P (w,b, α ) and obtain “dual formulation of linear SVMs”:
Λ D (α ) = ∑ α i − ∑ α iα j yi y j xi ⋅ x j N
i =1
N
1 2
i , j =1
We seek to maximize the above Lagrangian with respect to α , subject to the N
constraints that α i ≥ 0 and ∑ α i yi = 0 . i =1
54
Case 2: Not linearly separable data; “Soft-margin” linear SVM What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear.
0 0 0
0
0 0
0 0
0 0
0 0 0
0
0
Want to handle this case without changing the family of decision functions. Approach: Assign a “slack variable” to each instance ξ i ≥ 0 , which can be thought of distance from the separating hyperplane if an instance is misclassified and 0 otherwise. N 2 1 Want to minimize 2 w + C ∑ ξ i subject to yi ( w ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N i =1 f ( x ) = sign ( w ⋅ x + b) Then given a new instance x, the classifier is
55
Two formulations of soft-margin linear SVM Primal formulation: n
Minimize
1 2
N
w + C ξ y w ( ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N subject to ∑ ∑ i i i =1
2 i
i =1
Objective function
Constraints
Dual formulation: N Minimize ∑ α i − 12 ∑ α iα j yi y j xi ⋅ x j subject to 0 ≤ α i ≤ C and ∑ α i yi = 0 n
N
i =1
i , j =1
for i = 1,…,N.
Objective function
i =1
Constraints
56
Parameter C in soft-margin SVM Minimize
1 2
N 2 w + C ∑ ξ i subject to yi ( w ⋅ xi + b) ≥ 1 − ξ i for i = 1,…,N i =1
C=100
C=1
C=0.15
C=0.1
• When C is very large, the softmargin SVM is equivalent to hard-margin SVM; • When C is very small, we admit misclassifications in the training data at the expense of having w-vector with small norm; • C has to be selected for the distribution at hand as it will be discussed later in this tutorial. 57
Case 3: Not linearly separable data; Kernel trick Gene 2
Tumor
Tumor
?
kernel
Φ
?
Normal
Normal Gene 1
Data is not linearly separable in the input space
Data is linearly separable in the feature space obtained by a kernel
Φ : RN → H
58
Kernel trick Original data x (in input space) f ( x) = sign( w ⋅ x + b)
Data in a higher dimensional feature space Φ (x ) f ( x) = sign( w ⋅ Φ ( x ) + b)
N w = ∑ α i yi xi
N w = ∑ α i yi Φ ( xi )
i =1
i =1
f ( x) = sign(∑ α i yi Φ ( xi ) ⋅ Φ ( x ) + b) N
i =1
f ( x) = sign(∑ α i yi K ( xi ,x ) + b) N
i =1
Therefore, we do not need to know Ф explicitly, we just need to define function K(·, ·): RN × RN R. Not every function RN × RN R can be a valid kernel; it has to satisfy so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable. 59
Popular kernels A kernel is a dot product in some feature space:
K ( xi , x j ) = Φ ( xi ) ⋅ Φ ( x j )
Examples:
K ( xi , x j ) = xi ⋅ x j 2 K ( xi , x j ) = exp(−γ xi − x j ) K ( xi , x j ) = exp(−γ xi − x j ) q K ( xi , x j ) = ( p + xi ⋅ x j ) 2 K ( xi , x j ) = ( p + xi ⋅ x j ) q exp(−γ xi − x j ) K ( xi , x j ) = tanh(kxi ⋅ x j − δ )
Linear kernel Gaussian kernel Exponential kernel Polynomial kernel Hybrid kernel Sigmoidal 60
Understanding the Gaussian kernel 2 Consider Gaussian kernel: K ( x , x j ) = exp(−γ x − x j ) Geometrically, this is a “bump” or “cavity” centered at the training data point x j : "bump” “cavity”
The resulting mapping function is a combination of bumps and cavities.
61
Understanding the Gaussian kernel Several more views of the data is mapped to the feature space by Gaussian kernel
62
Understanding the Gaussian kernel Linear hyperplane that separates two classes
63
Understanding the polynomial kernel 3 Consider polynomial kernel: K ( xi , x j ) = (1 + xi ⋅ x j ) Assume that we are dealing with 2-dimensional data (i.e., in R2). Where will this kernel map the data? 2-dimensional space
x(1)
x( 2 )
kernel 10-dimensional space
1 x(1)
x( 2 )
x(21)
x(22 )
x(1) x( 2 )
x(31)
x(32 )
x(1) x(22 )
x(21) x( 2 ) 64
Example of benefits of using a kernel x( 2 ) x1 x3
x4 x2
• Data is not linearly separable in the input space (R2). • Apply kernel K ( x , z ) = ( x ⋅ z ) 2 to map data to a higher x(1) dimensional space (3dimensional) where it is linearly separable. 2
2 2 x(1) z(1) = x(1) z(1) + x( 2 ) z( 2 ) = K ( x, z ) = ( x ⋅ z ) = ⋅ x z ( 2 ) ( 2 ) x(21) z(21) 2 2 2 2 = x(1) z(1) + 2 x(1) z(1) x( 2 ) z( 2 ) + x( 2 ) z( 2 ) = 2 x(1) x( 2 ) ⋅ 2 z(1) z( 2 ) = Φ ( x ) ⋅ Φ ( z ) 2 2 x z ( 2) ( 2) 65
[
]
Example of benefits of using a kernel x(21) Therefore, the explicit mapping is Φ ( x ) = 2 x(1) x( 2 ) 2 x ( 2) x( 2 )
x(22 )
x1 x3
x4
x2
x1 , x2
kernel x(1)
Φ
x3 , x4
x(21)
2 x(1) x( 2 )
66
Comparison with methods from classical statistics & regression • Need ≥ 5 samples for each parameter of the regression model to be estimated: Number of variables
Polynomial degree
Number of parameters
Required sample
2
3
10
50
10
3
286
1,430
10
5
3,003
15,015
100
3
176,851
884,255
100
5
96,560,646
482,803,230
• SVMs do not have such requirement & often require much less sample than the number of variables, even when a high-degree polynomial kernel is used. 67
Basic principles of statistical machine learning
68
Generalization and overfitting • Generalization: A classifier or a regression algorithm learns to correctly predict output from given inputs not only in previously seen samples but also in previously unseen samples. • Overfitting: A classifier or a regression algorithm learns to correctly predict output from given inputs in previously seen samples but fails to do so in previously unseen samples. • Overfitting Poor generalization.
69
Example of overfitting and generalization There is a linear relationship between predictor and outcome (plus some Gaussian noise). Algorithm 2 Outcome of Interest Y
Outcome of Interest Y
Algorithm 1 Training Data Test Data
Predictor X
Predictor X
• Algorithm 1 learned non-reproducible peculiarities of the specific sample available for learning but did not learn the general characteristics of the function that generated the data. Thus, it is overfitted and has poor generalization. • Algorithm 2 learned general characteristics of the function that produced the data. Thus, it generalizes. 70
“Loss + penalty” paradigm for learning to avoid overfitting and ensure generalization • Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem:
Minimize (Loss + λ Penalty) • Loss measures error of fitting the data • Penalty penalizes complexity of the learned function • λ is regularization parameter that balances Loss and Penalty
71
SVMs in “loss + penalty” form SVMs build the following classifiers: f ( x ) = sign( w ⋅ x + b)
Consider soft-margin linear SVM formulation: Find w and b that N 2 1 w + C ξ Minimize 2 ∑ i subject to yi (w ⋅ xi + b) ≥ 1 − ξi for i = 1,…,N i =1
This can also be stated as: Find w and b that N 2 Minimize ∑ [1 − yi f ( xi )]+ +λ w 2 i =1
Loss (“hinge loss”)
Penalty
(in fact, one can show that λ = 1/(2C)). 72
Meaning of SVM loss function [ 1 ( y f x − Consider loss function: ∑ i i )]+ N
i =1
• Recall that […]+ indicates the positive part • For a given sample/patient i, the loss is non-zero if 1 − yi f ( xi ) > 0 • In other words, yi f ( xi ) < 1 • Since yi = {−1,+1}, this means that the loss is non-zero if f ( xi ) < 1 for yi = +1 f ( xi ) > −1 for yi= -1
• In other words, the loss is non-zero if w ⋅ xi + b < 1 for yi = +1 w ⋅ xi + b > −1 for yi= -1
73
Meaning of SVM loss function • If the instance is negative, it is penalized only in regions 2,3,4 • If the instance is positive, it is penalized only in regions 1,2,3
w ⋅ x + b = −1
w⋅ x + b = 0 w ⋅ x + b = +1
1 2 3 4
Negative instances (y=-1)
Positive instances (y=+1)
74
Flexibility of “loss + penalty” framework Minimize (Loss + λ Penalty) Loss function
Penalty function
[ 1 − y f ( x ∑ i i )]+ N
Hinge loss:
i =1
Mean squared error:
∑(y i =1
i
Mean squared error: N
∑(y i =1
i
∑ ( yi − f ( xi )) 2 N
2
SVMs
Ridge regression
λ w1
− f ( xi )) 2
Mean squared error:
2
λw2
− f ( xi )) 2
N
λw2
Resulting algorithm
Lasso
λ1 w 1 + λ2 w 2 2
Elastic net
i =1
Hinge loss:
[ 1 − y f ( x ∑ i i )]+ N
i =1
λ w1
1-norm SVM 75
Part 2 • Model selection for SVMs • Extensions to the basic SVM model: 1. 2. 3. 4. 5. 6.
SVMs for multicategory data Support vector regression Novelty detection with SVM-based methods Support vector clustering SVM-based variable selection Computing posterior class probabilities for SVM classifiers 76
Model selection for SVMs
77
Need for model selection for SVMs Gene 2
Gene 2 Tumor
Tumor
Normal Gene 1
• It is impossible to find a linear SVM classifier that separates tumors from normals! • Need a non-linear SVM classifier, e.g. SVM with polynomial kernel of degree 2 solves this problem without errors.
Normal
Gene 1
• We should not apply a non-linear SVM classifier while we can perfectly solve this problem using a linear SVM classifier! 78
A data-driven approach for model selection for SVMs • Do not know a priori what type of SVM kernel and what kernel parameter(s) to use for a given dataset? • Need to examine various combinations of parameters, e.g. consider searching the following grid: Polynomial degree d
Parameter C
(0.1, 1)
(1, 1)
(10, 1)
(100, 1)
(1000, 1)
(0.1, 2)
(1, 2)
(10, 2)
(100, 2)
(1000, 2)
(0.1, 3)
(1, 3)
(10, 3)
(100, 3)
(1000, 3)
(0.1, 4)
(1, 4)
(10, 4)
(100, 4)
(1000, 4)
(0.1, 5)
(1, 5)
(10, 5)
(100, 5)
(1000, 5)
• How to search this grid while producing an unbiased estimate 79 of classification performance?
Nested cross-validation Recall the main idea of cross-validation: test data
train
train test train
What combination of SVM parameters to apply on training data?
test
valid
train
train
valid
train
Perform “grid search” using another nested loop of cross-validation. 80
Example of nested cross-validation
data
Consider that we use 3-fold cross-validation and we want to optimize parameter C that takes values “1” and “2”.
{
Outer Loop P1
Training Testing Average C Accuracy set set Accuracy P1, P2 P3 1 89% 83% P1,P3 P2 2 84% P2, P3 P1 1 76%
P2 …
P3
…
Inner Loop Training Validation set set
P1 P2 P1 P2
P2 P1 P2 P1
C 1 2
Accuracy
86% 84% 70% 90%
Average Accuracy
85% 80%
}
choose C=1 81
On use of cross-validation • Empirically we found that cross-validation works well for model selection for SVMs in many problem domains; • Many other approaches that can be used for model selection for SVMs exist, e.g.:
Generalized cross-validation Bayesian information criterion (BIC) Minimum description length (MDL) Vapnik-Chernovenkis (VC) dimension Bootstrap
82
SVMs for multicategory data
83
One-versus-rest multicategory SVM method Gene 2
? Tumor I
**** * * * * * * * ** * * * *
?
Tumor II
Tumor III Gene 1
84
One-versus-one multicategory SVM method Gene 2
Tumor I
* * * * * Tumor II * * * * * * * * * * * * ? ?
Tumor III Gene 1
85
DAGSVM multicategory SVM method AML vs.vs. ALL T-cell AML ALL T-cell Not AML
Not ALL T-cell
ALL ALLB-cell B-cellvs. vs.ALL ALLT-cell T-cell Not ALL B-cell
ALL T-cell
Not ALL T-cell
AML vs. ALL B-cell Not AML
ALLALL B-cell B-cell
Not ALL B-cell
AML
86
SVM multicategory methods by Weston and Watkins and by Crammer and Singer Gene 2
? ?
* * * * * * * * * * * * * * * * *
Gene 1
87
Support vector regression
88
ε-Support vector regression (ε-SVR) Given training data: y
*
** ** * * * * *** * ** * *
x1 , x2 ,..., x N ∈ R n y1 , y2 ,..., y N ∈ R +ε -ε
x
Main idea: Find a function f ( x ) = w ⋅ x + b that approximates y1,…,yN : • it has at most ε derivation from the true values yi • it is as “flat” as possible (to avoid overfitting)
E.g., build a model to predict survival of cancer patients that can admit a one month error (= ε ).
89
Formulation of “hard-margin” ε-SVR y
* * *
* * * * * * ** * * * *
+ε w ⋅ x + b = 0 -ε f ( x ) = w⋅ x + b Find
2 w subject
by minimizing to constraints: yi − ( w ⋅ x + b ) ≤ ε yi − ( w ⋅ x + b) ≥ −ε 1 2
for i = 1,…,N. x
I.e., difference between yi and the fitted function should be smaller than ε and larger than -ε all points yi should be in the “ε-ribbon” around the fitted function. 90
Formulation of “soft-margin” ε-SVR y
*
* * * * * * ** * * * * * * * *
+ε -ε
x
If we have points like this (e.g., outliers or noise) we can either: a) increase ε to ensure that these points are within the new ε-ribbon, or b) assign a penalty (“slack” variable) to each of this points (as was done for “soft-margin” SVMs)
91
Formulation of “soft-margin” ε-SVR y
*
ξi
* * * * * * ** * * ξi* * * * * * *
+ε -ε
f ( x ) = w⋅ x + b Find N 2 by minimizing 12 w + C ∑ (ξ i + ξ i* ) i =1 subject to constraints: yi − ( w ⋅ x + b ) ≤ ε + ξ i yi − ( w ⋅ x + b) ≥ −ε − ξ i*
ξ i , ξ i* ≥ 0 for i = 1,…,N. x
Notice that only points outside ε-ribbon are penalized! 92
Nonlinear ε-SVR y
* * *** * * ** ** *
+ε * * * -ε
y
** *
** ** * * * * *** *
x +Φ(ε) -Φ(ε)
Φ(x)
Cannot approximate well this function with small ε!
kernel
Φ
Φ
−1
y
* * *** * * ** ** *
* * * +ε -ε
x 93
ε-Support vector regression in “loss + penalty” form
Build decision function of the form: f ( x ) = w ⋅ x + b Find w and b that
2 Minimize ∑ max(0, | yi − f ( xi ) | −ε ) +λ w 2 N
i =1
Loss (“linear ε-insensitive loss”)
Penalty
Loss function value
Error in approximation
94
Comparing ε-SVR with popular regression methods Loss function Linear ε-insensitive loss: y − f x max( 0 , | ( ∑ i i ) | −ε ) N
Penalty function
2
2
2
2
λw2
Resulting algorithm ε-SVR
i =1
Quadratic ε-insensitive loss: 2 max( 0 , ( ( y − f x ∑ i i )) − ε ) N
λw2
Another variant of ε-SVR
i =1
Mean squared error: 2 − y f x ( ( ∑ i i )) N
λw2
Ridge regression
i =1
Mean linear error: N
| y − f ( x ∑ i i)| i =1
λw2
Another variant of ridge regression 95
Comparing loss functions of regression methods Linear ε-insensitive loss Loss function value
-ε
ε Error in approximation
Mean squared error Loss function value
-ε
ε Error in approximation
Quadratic ε-insensitive loss Loss function value
-ε
ε Error in approximation
Mean linear error Loss function value
-ε
ε Error in approximation 96
Applying ε-SVR to real data In the absence of domain knowledge about decision functions, it is recommended to optimize the following parameters (e.g., by cross-validation using grid-search): • parameter C • parameter ε • kernel parameters (e.g., degree of polynomial) Notice that parameter ε depends on the ranges of variables in the dataset; therefore it is recommended to normalize/re-scale data prior to applying ε-SVR. 97
Novelty detection with SVM-based methods
98
What is it about? Decision function = +1 Predictor Y
• Find the simplest and most compact region in the space of predictors where the majority of data samples “live” (i.e., with the highest density of samples). • Build a decision function that takes value +1 in this region and -1 elsewhere. • Once we have such a decision function, we can identify novel or outlier samples/patients in the data.
******* ******** ***** ******** ** * * * **************** * * ************************** * * Decision function = -1
* *
Predictor X
99
Key assumptions •
•
We do not know classes/labels of samples (positive or negative) in the data available for learning this is not a classification problem All positive samples are similar but each negative sample can be different in its own way
Thus, do not need to collect data for negative samples!
100
Sample applications “Normal” “Novel”
“Novel” Modified from: www.cs.huji.ac.il/course/2004/learns/NoveltyDetection.ppt
“Novel” 101
Sample applications Discover deviations in sample handling protocol when doing quality control of assays. Protein Y
* * *
Samples with low quality of processing from the lab of Dr. Smith
* * Samples with low quality of processing from ICU patients
********* ** ************* * *********** * *** * * ** ** * * **
Samples with high-quality of processing
Samples with low quality of processing from infants
Protein X
Samples with low quality of processing from patients with lung cancer
102
Sample applications Identify websites that discuss benefits of proven cancer treatments.
** ** *
Weighted frequency of word Y Websites that discuss cancer prevention methods
** * * * ** ******* * *********** * * * * * * * ******** ***** * ** ********** * * * * * ** *** *
Websites that discuss benefits of proven cancer treatments
Websites that discuss side-effects of proven cancer treatments
* ** *
*** * Blogs of cancer patients ** *
Weighted frequency of word X
Websites that discuss unproven cancer treatments 103
One-class SVM Main idea: Find the maximal gap hyperplane that separates data from the origin (i.e., the only member of the second class is the origin).
w⋅ x + b = 0
ξi Origin is the only member of the second class
ξj Use “slack variables” as in soft-margin SVMs to penalize these instances
104
Formulation of one-class SVM: linear case
Given training data: x1 , x2 ,..., x N ∈ R n
Find f ( x ) = sign( w ⋅ x + b) by minimizing
w⋅ x + b = 0
ξi
ξj
i.e., the decision function should be positive in all training samples except for small deviations
1 2
w
2
1 + νN
N
∑ξ i =1
i
+b
subject to constraints: w ⋅ x + b ≥ −ξ i ξi ≥ 0 upper bound on for i = 1,…,N.
the fraction of outliers (i.e., points outside decision surface) allowed in the data 105
Formulation of one-class SVM: linear and non-linear cases Linear case
Non-linear case (use “kernel trick”)
Find f ( x ) = sign( w ⋅ x + b) by minimizing
1 2
2 1 w + νN
subject to constraints: w ⋅ x + b ≥ −ξ i ξi ≥ 0 for i = 1,…,N.
Find f ( x ) = sign( w ⋅ Φ ( x ) + b)
N
∑ξ i =1
i
+b
by minimizing
1 2
2 1 w + νN
N
∑ξ i =1
i
subject to constraints: w ⋅ Φ ( x ) + b ≥ −ξ i ξi ≥ 0 for i = 1,…,N.
106
+b
More about one-class SVM • One-class SVMs inherit most of properties of SVMs for binary classification (e.g., “kernel trick”, sample efficiency, ease of finding of a solution by efficient optimization method, etc.); • The choice of other parameter ν significantly affects the resulting decision surface. • The choice of origin is arbitrary and also significantly affects the decision surface returned by the algorithm.
107
Support vector clustering Contributed by Nikita Lytkin
108
Goal of clustering (aka class discovery)
Given a heterogeneous set of data points x1 , x2 ,..., xN ∈ R n Assign labels y1 , y2 ,..., y N ∈ {1,2,..., K } such that points with the same label are highly “similar“ to each other and are distinctly different from the rest
Clustering process
109
Support vector domain description • Support Vector Domain Description (SVDD) of the data is a set of vectors lying on the surface of the smallest hyper-sphere enclosing all data points in a feature space – These surface points are called Support Vectors
kernel
Φ
R
R R
110
SVDD optimization criterion Formulation with hard constraints: Minimize
subject to
R2
Squared radius of the sphere
|| Φ ( xi ) − a ||2 ≤ R 2
for i = 1,…,N
Constraints
R
R
a' R'
a
R
111
Main idea behind Support Vector Clustering • Cluster boundaries in the input space are formed by the set of points that when mapped from the input space to the feature space fall exactly on the surface of the minimal enclosing hyper-sphere – SVs identified by SVDD are a subset of the cluster boundary points
Φ
R
−1
R
a
R
112
Cluster assignment in SVC • Two points xi and x j belong to the same cluster (i.e., have the same label) if every point of the line segment ( xi , x j ) projected to the feature space lies within the hypersphere Some points lie outside the hyper-sphere
Φ
R
R
a
Every point is within the hypersphere in the feature space
R
113
Cluster assignment in SVC (continued) • Point-wise adjacency matrix is constructed by testing the line segments between every pair of points • Connected components are extracted • Points belonging to the same connected component are assigned the same label
C A
E
B D
A B C D E
C A
A
B
C
D
E
1
1
1
0
0
1
1
0
0
1
0
0
1
1 1
E
B D
114
Effects of noise and cluster overlap • In practice, data often contains noise, outlier points and overlapping clusters, which would prevent contour separation and result in all points being assigned to the Noise same cluster Ideal data
Typical data
Outliers
Overlap 115
SVDD with soft constraints • SVC can be used on noisy data by allowing a fraction of points, called Bounded SVs (BSV), to lie outside the hyper-sphere – BSVs are not considered as cluster boundary points and are not assigned to clusters by SVC Noise
Noise
Typical data
kernel
Φ Outliers
Outliers
R
R
a
R
Overlap Overlap
116
Soft SVDD optimization criterion Primal formulation with soft constraints: Minimize
R 2 subject to
Squared radius of the sphere
Introduction of slack variables ξ i mitigates the influence of noise and overlap on the clustering process
|| Φ ( xi ) − a ||2 ≤ R 2 + ξ i ξ i ≥ 0 for i = 1,…,N Soft constraints Outliers Noise
R
R + ξi
Overlap
R
a
117
Dual formulation of soft SVDD Minimize W = ∑ β i K ( xi , xi ) − ∑ β i β j K ( xi , x j ) i
i, j
subject to 0 ≤ β i ≤ C for i = 1,…,N Constraints
• As before, K ( xi , x j ) = Φ ( xi ) ⋅ Φ ( x j ) denotes a kernel function • Parameter 0 < C ≤ 1 gives a trade-off between volume of the sphere and the number of errors (C=1 corresponds to hard constraints) 2 • Gaussian kernel K ( xi , x j ) = exp(−γ xi − x j ) tends to yield tighter contour representations of clusters than the polynomial kernel • The Gaussian kernel width parameter γ > 0 influences tightness of cluster boundaries, number of SVs and the number of clusters • Increasing γ causes an increase in the number of clusters 118
SVM-based variable selection
119
Understanding the weight vector w Recall standard SVM formulation:
w⋅ x + b = 0
Find w and b that minimize 2 1 subject to yi ( w ⋅ xi + b) ≥ 1 2 w
for i = 1,…,N.
Use classifier: f ( x ) = sign( w ⋅ x + b)
Negative instances (y=-1)
Positive instances (y=+1)
• The weight vector w contains as many elements as there are input n variables in the dataset, i.e. w ∈ R . • The magnitude of each element denotes importance of the corresponding variable for classification task.
120
Understanding the weight vector w x2
w = (1,1) 1x1 + 1x2 + b = 0
w = (1,0) 1x1 + 0 x2 + b = 0
x2
x1
X1 and X2 are equally important x2
x1
X1 is important, X2 is not
w = (0,1) 0 x1 + 1x2 + b = 0
x3
w = (1,1,0) 1x1 + 1x2 + 0 x3 + b = 0 x2
x1
X2 is important, X1 is not
x1
X1 and X2 are equally important, X3 is not 121
Understanding the weight vector w Gene X2
Melanoma Nevi
SVM decision surface
w = (1,1) 1x1 + 1x2 + b = 0
Decision surface of another classifier w = (1,0) 1x1 + 0 x2 + b = 0
True model X1 Gene X1
• In the true model, X1 is causal and X2 is redundant X2 Phenotype •SVM decision surface implies that X1 and X2 are equally important; thus it is locally causally inconsistent •There exists a causally consistent decision surface for this example •Causal discovery algorithms can identify that X1 is causal and X2 is redundant 122
Simple SVM-based variable selection algorithm Algorithm: 1. Train SVM classifier using data for all variables to estimate vector w 2. Rank each variable based on the magnitude of the corresponding element in vector w 3. Using the above ranking of variables, select the smallest nested subset of variables that achieves the best SVM prediction accuracy.
123
Simple SVM-based variable selection algorithm Consider that we have 7 variables: X1, X2, X3, X4, X5, X6, X7 The vector w is: (0.1, 0.3, 0.4, 0.01, 0.9, -0.99, 0.2) The ranking of variables is: X6, X5, X3, X2, X7, X1, X4 Classification accuracy
Subset of variables X6
X5
X3
X2
X7
X1
X6
X5
X3
X2
X7
X1
X6
X5
X3
X2
X7
X6
X5
X3
X2
X6
X5
X3
X6
X5
X6
X4
0.920 0.920 0.919 0.852
Best classification accuracy Classification accuracy that is statistically indistinguishable from the best one
0.843 0.832 0.821
Select the following variable subset: X6, X5, X3, X2 , X7
124
Simple SVM-based variable selection algorithm • SVM weights are not locally causally consistent we may end up with a variable subset that is not causal and not necessarily the most compact one. • The magnitude of a variable in vector w estimates the effect of removing that variable on the objective function of SVM (e.g., function that we want to minimize). However, this algorithm becomes suboptimal when considering effect of removing several variables at a time… This pitfall is addressed in the SVM-RFE algorithm that is presented next. 125
SVM-RFE variable selection algorithm Algorithm: 1. Initialize V to all variables in the data 2. Repeat 3. Train SVM classifier using data for variables in V to estimate vector w 4. Estimate prediction accuracy of variables in V using the above SVM classifier (e.g., by cross-validation) 5. Remove from V a variable (or a subset of variables) with the smallest magnitude of the corresponding element in vector w 6. Until there are no variables in V 7. Select the smallest subset of variables with the best prediction accuracy 126
SVM-RFE variable selection algorithm 10,000 genes
SVM model Prediction accuracy
5,000 genes
SVM model Prediction accuracy
Important for classification
5,000 genes
Discarded
Not important for classification
2,500 genes
…
Important for classification
2,500 genes
Discarded
Not important for classification
• Unlike simple SVM-based variable selection algorithm, SVMRFE estimates vector w many times to establish ranking of the variables. • Notice that the prediction accuracy should be estimated at each step in an unbiased fashion, e.g. by cross-validation. 127
SVM variable selection in feature space The real power of SVMs comes with application of the kernel trick that maps data to a much higher dimensional space (“feature space”) where the data is linearly separable. Gene 2
Tumor
Tumor
?
kernel
Φ
?
Normal
Normal Gene 1
input space
feature space 128
SVM variable selection in feature space • We have data for 100 SNPs (X1,…,X100) and some phenotype. • We allow up to 3rd order interactions, e.g. we consider: • X1,…,X100 • X12,X1X2, X1X3,…,X1X100 ,…,X1002 • X13,X1X2X3, X1X2X4,…,X1X99X100 ,…,X1003
• Task: find the smallest subset of features (either SNPs or their interactions) that achieves the best predictive accuracy of the phenotype. • Challenge: If we have limited sample, we cannot explicitly construct and evaluate all SNPs and their interactions (176,851 features in total) as it is done in classical statistics. 129
SVM variable selection in feature space Heuristic solution: Apply algorithm SVM-FSMB that: 1. Uses SVMs with polynomial kernel of degree 3 and selects M features (not necessarily input variables!) that have largest weights in the feature space. E.g., the algorithm can select features like: X10, (X1X2), (X9X2X22), (X72X98), and so on. 2. Apply HITON-MB Markov blanket algorithm to find the Markov blanket of the phenotype using M features from step 1. 130
Computing posterior class probabilities for SVM classifiers
131
Output of SVM classifier 1. SVMs output a class label (positive or negative) for each sample: sign( w ⋅ x + b) 2. One can also compute distance from the hyperplane that separates classes, e.g. w ⋅ x + b. These distances can be used to compute performance metrics like area under ROC curve.
w⋅ x + b = 0
Negative samples (y=-1)
Positive samples (y=+1)
Question: How can one use SVMs to estimate posterior class probabilities, i.e., P(class positive | sample x)? 132
Simple binning method 1. Train SVM classifier in the Training set. 2. Apply it to the Validation set and compute distances from the hyperplane to each sample. Sample # 1
2
3
4
5
Distance
-1
8
3
4
2
...
98 99
100
-2
0.8
0.3
Training set
Validation set
3. Create a histogram with Q (e.g., say 10) bins using the above distances. Each bin has an upper and lower value in terms of distance.
Testing set
Number of samples in validation set
25 20 15 10 5 0 -15
-10
-5
0 Distance
5
10
15
133
Simple binning method 4.
Given a new sample from the Testing set, place it in the corresponding bin. E.g., sample #382 has distance to hyperplane = 1, so it is placed in the bin [0, 2.5]
Training set
Number of samples in validation set
25 20
10
Testing set
5 0 -15
5.
Validation set
15
-10
-5
0 Distance
5
10
15
Compute probability P(positive class | sample #382) as a fraction of true positives in this bin. E.g., this bin has 22 samples (from the Validation set), out of which 17 are true positive ones , so we compute P(positive class | sample #382) = 17/22 = 0.77
134
Platt’s method Convert distances output by SVM to probabilities by passing them through the sigmoid filter:
1 P ( positive class | sample) = 1 + exp( Ad + B) where d is the distance from hyperplane and A and B are parameters. 1 0.9
P(positive class|sample)
0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 -10
-8
-6
-4
-2
0 Distance
2
4
6
8
10
135
Platt’s method 1. Train SVM classifier in the Training set. 2. Apply it to the Validation set and compute distances from the hyperplane to each sample. Sample # 1
2
3
4
5
Distance
-1
8
3
4
3.
2
...
98 99
100
-2
0.8
0.3
Determine parameters A and B of the sigmoid function by minimizing the negative log likelihood of the data from the Validation set.
Training set
Validation set Testing set
4. Given a new sample from the Testing set, compute its posterior probability using sigmoid function.
136
Part 3 • Case studies (taken from our research) 1. 2. 3. 4. 5. 6.
Classification of cancer gene expression microarray data Text categorization in biomedicine Prediction of clinical laboratory values Modeling clinical judgment Using SVMs for feature selection Outlier detection in ovarian cancer proteomics data
• Software • Conclusions • Bibliography
137
1. Classification of cancer gene expression microarray data
138
Comprehensive evaluation of algorithms for classification of cancer microarray data Main goals: • Find the best performing decision support algorithms for cancer diagnosis from microarray gene expression data; • Investigate benefits of using gene selection and ensemble classification methods.
139
Classifiers • • • • • • • • • • •
K-Nearest Neighbors (KNN) Backpropagation Neural Networks (NN) Probabilistic Neural Networks (PNN) Multi-Class SVM: One-Versus-Rest (OVR) Multi-Class SVM: One-Versus-One (OVO) Multi-Class SVM: DAGSVM Multi-Class SVM by Weston & Watkins (WW) Multi-Class SVM by Crammer & Singer (CS) Weighted Voting: One-Versus-Rest Weighted Voting: One-Versus-One Decision Trees: CART
instance-based neural networks
kernel-based
voting decision trees 140
Ensemble classifiers dataset
Classifier 1
Classifier 2
…
Classifier N
Prediction 1
Prediction 2
…
Prediction N
dataset
Ensemble Classifier Final Prediction 141
Gene selection methods Highly discriminatory genes Uninformative genes
1.
Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion;
2.
Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion;
3.
Kruskal-Wallis nonparametric one-way ANOVA (KW);
4.
Ratio of genes betweencategories to within-category sum of squares (BW).
genes
142
Performance metrics and statistical comparison 1. Accuracy + can compare to previous studies + easy to interpret & simplifies statistical comparison
2. Relative classifier information (RCI) + easy to interpret & simplifies statistical comparison + not sensitive to distribution of classes + accounts for difficulty of a decision problem
• Randomized permutation testing to compare accuracies of the classifiers (α=0.05) 143
Microarray datasets Number of Dataset name Sam- Variables Cateples (genes) gories
Reference
11_Tumors
174
12533
11
Su, 2001
14_Tumors
308
15009
26
Ramaswamy, 2001
9_Tumors
60
5726
9
Staunton, 2001
Brain_Tumor1
90
5920
5
Pomeroy, 2002
Brain_Tumor2
50
10367
4
Nutt, 2003
Leukemia1
72
5327
3
Golub, 1999
Leukemia2
72
11225
3
Armstrong, 2002
Lung_Cancer
203
12600
5
Bhattacherjee, 2001
SRBCT
83
2308
4
Khan, 2001
10509
2
Singh, 2002
5469
2
Shipp, 2002
Prostate_Tumor 102 DLBCL
77
Total: •~1300 samples •74 diagnostic categories •41 cancer types and 12 normal tissue types
144
Summary of methods and datasets Cross-Validation Designs (2)
Gene Selection Methods (4)
Performance Metrics (2)
Statistical Comparison
10-Fold CV
S2N One-Versus-Rest
Accuracy
LOOCV
S2N One-Versus-One
RCI
Randomized permutation testing
Non-param. ANOVA
Gene Expression Datasets (11) 11_Tumors
One-Versus-One
14_Tumors
Ensemble Classifiers (7) Majority Voting
KNN Backprop. NN Prob. NN Decision Trees One-Versus-Rest One-Versus-One
Based on MCSVM outputs
Method by WW
MC-SVM OVR MC-SVM OVO MC-SVM DAGSVM Decision Trees
Brain Tumor1 Brain_Tumor2 Leukemia1
Lung_Cancer SRBCT
Majority Voting Decision Trees
9_Tumors
Leukemia2
Binary Dx
DAGSVM
Multicategory Dx
One-Versus-Rest
Method by CS
WV
BW ratio
Based on outputs of all classifiers
MC-SVM
Classifiers (11)
Prostate_Tumors DLBCL 145
em
ia1
ia2 Lu ng _C an ce r SR Pr BC os T tat e_ Tu mo DL r BC L
Le uk
or s
or 1
or 2
or s
em
_T um
_T um
_T um
_T um
Tu mo rs
Le uk
11
Br ain
Br ain
14
9_ Accuracy, % 60
40 OVR OVO DAGSVM WW CS KNN NN PNN
20
0
146
MC-SVM
Results without gene selection
100
80
Results with gene selection Improvement of diagnostic performance by gene selection (averages for the four datasets)
Diagnostic performance before and after gene selection 100
70
9_Tumors
14_Tumors
60
60 40
50
Accuracy, %
Improvement in accuracy, %
80
40 30
20
100
Brain_Tumor1
Brain_Tumor2
80
20 60
10
40 20
OVR
OVO DAGSVM SVM
WW
CS
KNN
NN non-SVM
PNN
SVM
non-SVM
SVM
non-SVM
Average reduction of genes is 10-30 times 147
Comparison with previously published results Multiclass SVMs (this study)
100
Accuracy, %
80
60
Multiple specialized classification methods (original primary studies)
40
20
ia2 Lu ng _C an ce r SR Pr BC os T tat e_ Tu mo DL r BC L
em
ia1
Le uk
em
or s
Le uk
_T um
or 1 11
_T um
or 2 Br ain
_T um
or s Br ain
_T um
14
9_
Tu mo rs
0
148
Summary of results • Multi-class SVMs are the best family among the tested algorithms outperforming KNN, NN, PNN, DT, and WV. • Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms; • Ensemble classification does not improve performance of SVM and other classifiers; • Results obtained by SVMs favorably compare with the literature. 149
Random Forest (RF) classifiers • Appealing properties – – – – – –
Work when # of predictors > # of samples Embedded gene selection Incorporate interactions Based on theory of ensemble learning Can work with binary & multiclass tasks Does not require much fine-tuning of parameters
• Strong theoretical claims • Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods 150
Key principles of RF classifiers
Testing data Training data
4) Apply to testing data & combine predictions
1) Generate bootstrap samples
2) Random gene selection
3) Fit unpruned decision trees
151
Results without gene selection
• SVMs nominally outperform RFs is 15 datasets, RFs outperform SVMs in 4 datasets, algorithms are exactly the same in 3 datasets. • In 7 datasets SVMs outperform RFs statistically significantly. • On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI. 152
Results with gene selection
• SVMs nominally outperform RFs is 17 datasets, RFs outperform SVMs in 3 datasets, algorithms are exactly the same in 2 datasets. • In 1 dataset SVMs outperform RFs statistically significantly. • On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI. 153
2. Text categorization in biomedicine
154
Models to categorize content and quality: Main idea
1. Utilize existing (or easy to build) training corpora 1
2
3 4
2. Simple document representations (i.e., typically stemmed and weighted words in title and abstract, Mesh terms if available; occasionally addition of Metamap CUIs, author info) as “bag-of-words” 155
Models to categorize content and quality: Main idea Unseen Examples
3. Train SVM models that capture implicit categories of meaning or quality criteria
Labeled Examples
Labeled
4. Evaluate models’ performances - with nested cross-validation or other appropriate error estimators - use primarily AUC as well as other metrics (sensitivity, specificity, PPV, Precision/Recall curves, HIT curves, etc.) 5. Evaluate performance prospectively & compare to prior cross-validation estimates
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Txmt
Diag
Estimated Performance
Prog
Etio
2005 Performance
156
Models to categorize content and quality: Some notable results Category
Average AUC
Range over n folds
Treatment
0.97*
0.96 - 0.98
Etiology
0.94*
0.89 – 0.95
Prognosis
0.95*
0.92 – 0.97
Diagnosis
0.95*
0.93 - 0.98
1. SVM models have excellent ability to identify high-quality PubMed documents according to ACPJ gold standard
Method
Treatment AUC
Etiology AUC
Prognosis AUC
Diagnosis AUC
Google Pagerank
0.54
0.54
0.43
0.46
Yahoo Webranks
0.56
0.49
0.52
0.52
Impact Factor 2005
0.67
0.62
0.51
0.52
Web page hit count
0.63
0.63
0.58
0.57
Bibliometric Citation Count
0.76
0.69
0.67
0.60
Machine Learning Models
0.96
0.95
0.95
0.95
2. SVM models have better classification performance than PageRank, Yahoo ranks, Impact Factor, Web Page hit counts, and bibliometric citation counts on the Web according to ACPJ gold standard 157
Models to categorize content and quality: Some notable results Treatment - Fixed Sensitivity
Gold standard: SSOAB
Area under the ROC curve*
SSOAB-specific filters
0.893
Citation Count
0.791
ACPJ Txmt-specific filters 0.548 Impact Factor (2001)
0.549
Impact Factor (2005)
0.558
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0.98
0.98
Treatment - Fixed Specificity
0.89 0.71
Fixed Sens Query Filters
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
Spec
Sens
Learning Models
0.98
0.75 0.44
Query Filters
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
3. SVM models have better classification performance than PageRank, Impact Factor and Citation count in Medline for SSOAB gold standard
0.96
Query Filters
0.91
0.91
Sens Query Filters
Fixed Spec Learning Models
Diagnosis - Fixed Specificity
0.88 0.68
Fixed Sens
0.94
Spec
0.96
Learning Models
0.68
Learning Models
Diagnosis - Fixed Sensitivity 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0.91
Etiology - Fixed Specificity
0.98
Fixed Sens
0.91
Fixed Spec
Query Filters
Etiology - Fixed Sensitivity 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0.95 0.8
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
Spec
0.97
Sens
Learning Models
Fixed Spec
Query Filters
Prognosis - Fixed Sensitivity
0.97
0.82 0.65
Learning Models
Prognosis - Fixed Specificity 1
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0.8
0.8
Fixed Sens Query Filters
0.87 0.71
Spec Learning Models
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0.8
Sens Query Filters
0.77
0.77
Fixed Spec Learning Models
4. SVM models have better sensitivity/specificity in PubMed than CQFs at 158 comparable thresholds according to ACPJ gold standard
Other applications of SVMs to text categorization Model
Area Under the Curve
Machine Learning Models 0.93 Quackometer*
0.67
Google
0.63
1. Identifying Web Pages with misleading treatment information according to special purpose gold standard (Quack Watch). SVM models outperform Quackometer and Google ranks in the tested domain of cancer treatment. 2. Prediction of future paper citation counts (work of L. Fu and C.F. Aliferis, AMIA 2008) 159
3. Prediction of clinical laboratory values
160
Dataset generation and experimental design • StarPanel database contains ~8·106 lab measurements of ~100,000 inpatients from Vanderbilt University Medical Center. • Lab measurements were taken between 01/1998 and 10/2002. For each combination of lab test and normal range, we generated the following datasets.
01/1998-05/2001
Training
06/2001-10/2002
Validation (25% of Training)
Testing
161
Query-based approach for prediction of clinical cab values Training data
Data model
database
Testing sample
Train SVM classifier Validation data SVM classifier Performance These steps are performed for every data model
Testing data
These steps are performed for every testing sample
Optimal data model Prediction 162
Classification results Excluding cases with K=0 (i.e. samples with no prior lab measurements)
Area under ROC curve (without feature selection)
Area under ROC curve (without feature selection)
BUN
>1 75.9%
Range of normal values 2.5 1 digital image available Diagnoses made in 2004
Features Lesion location
Family history of melanoma
Irregular Border
Streaks (radial streaming, pseudopods)
Max-diameter
Fitzpatrick’s Photo-type
Number of colors
Slate-blue veil
Min-diameter
Sunburn
Atypical pigmented network
Whitish veil
Evolution
Ephelis
Abrupt network cut-off
Globular elements
Age
Lentigos
Regression-Erythema
Comedo-like openings, milia-like cysts
Gender
Asymmetry
Hypo-pigmentation
Telangiectasia
Method to explain physician-specific SVM models
FS
Build DT
Build SVM SVM ”Black Box” Regular Learning
Apply SVM
Meta-Learning
Results: Predicting physicians’ judgments
All
RFE
(features)
HITON_PC (features)
HITON_MB (features)
(features)
Expert 1
0.94 (24)
0.92 (4)
0.92 (5)
0.95 (14)
Expert 2
0.92 (24)
0.89 (7)
0.90 (7)
0.90 (12)
Expert 3
0.98 (24)
0.95 (4)
0.95 (4)
0.97 (19)
NonExpert 1
0.92 (24)
0.89 (5)
0.89 (6)
0.90 (22)
NonExpert 2
1.00 (24)
0.99 (6)
0.99 (6)
0.98 (11)
NonExpert 3
0.89 (24)
0.89 (4)
0.89 (6)
0.87 (10)
Physicians
Results: Physician-specific models
Results: Explaining physician agreement Patient 001
Expert 1 AUC=0.92 R2=99%
Blue veil
irregular border
streaks
yes
no
yes
Expert 3 AUC=0.95 R2=99%
Results: Explain physician disagreement Patient 002
Expert 1 AUC=0.92 R2=99%
Blue veil
irregular border
streaks
number of colors
evolution
no
no
yes
3
no
Expert 3 AUC=0.95 R2=99%
Results: Guideline compliance Physician
Reported guidelines
Compliance
Experts1,2,3, non-expert 1
Pattern analysis
Non-compliant: they ignore the majority of features (17 to 20) recommended by pattern analysis.
Non expert 2
ABCDE rule
Non compliant: asymmetry, irregular border and evolution are ignored.
Non expert 3
Non-standard. Reports using 7 features
Non compliant: 2 out of 7 reported features are ignored while some nonreported ones are not
On the contrary: In all guidelines, the more predictors present, the higher the likelihood of melanoma. All physicians were compliant with this principle.
5. Using SVMs for feature selection
176
Feature selection methods Feature selection methods (non-causal) • SVM-RFE This is an SVM-based • Univariate + wrapper feature selection method • Random forest-based • LARS-Elastic Net • RELIEF + wrapper • L0-norm • Forward stepwise feature selection • No feature selection Causal feature selection methods • HITON-PC This method outputs a • HITON-MB Markov blanket of the • IAMB response variable • BLCD (under assumptions) • K2MB … 177
13 real datasets were used to evaluate feature selection methods Dataset name
Domain
Number of variables
Number of samples
Infant_Mortality
Clinical
86
5,337
Died within the first year
discrete
Ohsumed
Text
14,373
5,000
Relevant to nenonatal diseases
continuous
ACPJ_Etiology
Text
28,228
15,779
Relevant to eitology
continuous
Lymphoma
Gene expression
7,399
227
3-year survival: dead vs. alive
continuous
Gisette
Digit recognition
5,000
7,000
Separate 4 from 9
continuous
Dexter
Text
19,999
600
Relevant to corporate acquisitions
continuous
Sylva
Ecology
216
14,394
Ponderosa pine vs. everything else
continuous & discrete
Proteomics
2,190
216
Cancer vs. normals
continuous
Thrombin
Drug discovery
139,351
2,543
Binding to thromin
discrete (binary)
Breast_Cancer
Gene expression
17,816
286
Estrogen-receptor positive (ER+) vs. ER-
continuous
Hiva
Drug discovery
1,617
4,229
Activity to AIDS HIV infection
discrete (binary)
Nova
Text
16,969
1,929
Separate politics from religion topics
discrete (binary)
Financial
147
7,063
Personal bankruptcy
continuous & discrete
Ovarian_Cancer
Bankruptcy
Target
Data type
178
Classification performance vs. proportion of selected features Magnified
Original 1 0.9 Classification performance (AUC)
Classification performance (AUC)
0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 2
HITON-PC with G test RFE
0.55 0.5
0
0.5 Proportion of selected features
1
0.89 0.88 0.87 0.86 0.85
HITON-PC with G2 test RFE 0.05 0.1 0.15 0.2 Proportion of selected features 179
Statistical comparison of predictivity and reduction of features Predicitivity
SVM-RFE (4 variants)
Reduction
P-value
Nominal winner
P-value
Nominal winner
0.9754
SVM-RFE
0.0046
HITON-PC
0.8030
SVM-RFE
0.0042
HITON-PC
0.1312
HITON-PC
0.3634
HITON-PC
0.1008
HITON-PC
0.6816
SVM-RFE
• Null hypothesis: SVM-RFE and HITON-PC perform the same; • Use permutation-based statistical test with alpha = 0.05. 180
Simulated datasets with known causal structure used to compare algorithms
181
Comparison of SVM-RFE and HITON-PC
182
Comparison of all methods in terms of causal graph distance SVM-RFE
HITON-PC based causal methods
183
Summary results HITON-PC based causal methods
HITON-PCFDR methods
SVM-RFE
184
Statistical comparison of graph distance
Comparison average HITON-PC-FDR with G2 test vs. average SVM-RFE
Sample size = 200 Nominal P-value winner