Foundations of Machine Learning Lecture 9

Ranking
Mehryar Mohri, Courant Institute and Google Research

Motivation

Very large data sets:
• too large to display or process.
• limited resources, need priorities.
• ranking more desirable than classification.

Applications:
• search engines, information extraction.
• decision making, auctions, fraud detection.

Can we learn to predict ranking accurately?

Related Problem

Rank aggregation: given n candidates and k voters, each giving a ranking of the candidates, find an ordering as close as possible to these.
• closeness measured in number of pairwise misrankings.
• problem NP-hard even for k = 4 (Dwork et al., 2001).

This Talk
• Score-based ranking
• Preference-based ranking

Score-Based Setting

Single stage: learning algorithm receives a labeled sample of pairwise preferences; returns a scoring function h: U → R.

Drawbacks:
• h induces a linear ordering of the full set U.
• does not match a query-based scenario.

Advantages:
• efficient algorithms.
• good theory: VC bounds, margin bounds, stability bounds (FISS 03, RCMS 05, AN 05, AGHHR 05, CMR 07).

Score-Based Ranking

Training data: sample of i.i.d. labeled pairs drawn from U × U according to some distribution D,

    S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)),

with

    y_i = +1 if x_i' >_pref x_i;   y_i = 0 if x_i' =_pref x_i or no information;   y_i = -1 if x_i' <_pref x_i.

Problem: find a hypothesis h: U → R in H with small expected pairwise misranking error R(h).

Empirical margin loss for ranking: for ρ > 0,

    \hat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^m \Phi_\rho\big(y_i [h(x_i') - h(x_i)]\big) \le \frac{1}{m} \sum_{i=1}^m 1_{y_i [h(x_i') - h(x_i)] \le \rho}.

Marginal Rademacher Complexities

Distributions:
• D_1: marginal distribution with respect to the first element of the pairs.
• D_2: marginal distribution with respect to the second element of the pairs.

Samples:

    S_1 = ((x_1, y_1), ..., (x_m, y_m)),   S_2 = ((x_1', y_1), ..., (x_m', y_m)).

Marginal Rademacher complexities:

    \mathfrak{R}_m^{D_1}(H) = E[\hat{\mathfrak{R}}_{S_1}(H)],   \mathfrak{R}_m^{D_2}(H) = E[\hat{\mathfrak{R}}_{S_2}(H)].

Ranking Margin Bound
(Boyd, Cortes, MM, and Radovanovic, 2012; MM, Rostamizadeh, and Talwalkar, 2012)

Theorem: let H be a family of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 - δ over the choice of a sample of size m, the following holds for all h ∈ H:

    R(h) \le \hat{R}_\rho(h) + \frac{2}{\rho}\Big(\mathfrak{R}_m^{D_1}(H) + \mathfrak{R}_m^{D_2}(H)\Big) + \sqrt{\frac{\log(1/\delta)}{2m}}.

Ranking with SVMs
(see for example Joachims, 2002)

Optimization problem: application of SVMs.

    \min_{w, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \xi_i
    subject to:  y_i \, w \cdot \big[\Phi(x_i') - \Phi(x_i)\big] \ge 1 - \xi_i,   \xi_i \ge 0,   \forall i \in [1, m].

Decision function:

    h: x \mapsto w \cdot \Phi(x) + b.
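A minimal sketch of this pairwise formulation, assuming a linear feature map Φ(x) = x and reusing scikit-learn's LinearSVC on difference vectors; the toy data and all names below are illustrative, not from the lecture.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Toy pairwise preferences: each row pairs x_i with x_i', label y_i in {-1, +1}.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))         # x_i
    X_prime = rng.normal(size=(200, 5))   # x_i'
    w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = np.where(X_prime @ w_true > X @ w_true, 1, -1)   # +1 if x_i' preferred to x_i

    # The constraints y_i w.(Phi(x_i') - Phi(x_i)) >= 1 - xi_i reduce to a standard
    # SVM on the difference vectors Psi(x, x') = Phi(x') - Phi(x), with no intercept.
    diffs = X_prime - X
    svm = LinearSVC(C=1.0, fit_intercept=False, loss="hinge", max_iter=10000)
    svm.fit(diffs, y)

    # Scoring function h(x) = w.Phi(x); ranking a query set = sorting by h.
    w = svm.coef_.ravel()
    scores = X @ w
    ranking = np.argsort(-scores)   # indices from most to least preferred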

Notes

The algorithm coincides with SVMs using the feature mapping

    \Psi(x, x') = \Phi(x') - \Phi(x).

Can be used with kernels:

    K'((x_i, x_i'), (x_j, x_j')) = \Psi(x_i, x_i') \cdot \Psi(x_j, x_j')
                                 = K(x_i, x_j) + K(x_i', x_j') - K(x_i', x_j) - K(x_i, x_j').

Algorithm directly based on the margin bound.
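A small sketch of that pairwise kernel built from an arbitrary base kernel K; the Gaussian base kernel below is only an illustrative choice.

    import numpy as np

    def pair_kernel(K, xi, xi_p, xj, xj_p):
        # K'((x_i, x_i'), (x_j, x_j')) induced by Psi(x, x') = Phi(x') - Phi(x).
        return K(xi, xj) + K(xi_p, xj_p) - K(xi_p, xj) - K(xi, xj_p)

    def gaussian_K(a, b, gamma=0.5):
        # Example base kernel: K(a, b) = exp(-gamma ||a - b||^2).
        return float(np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2)))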

Boosting for Ranking

Use a weak ranking algorithm to create a stronger ranking algorithm.
Ensemble method: combine base rankers returned by the weak ranking algorithm.
Finding simple, relatively accurate base rankers is often not hard.
How should base rankers be combined?

RankBoost
(Freund et al., 2003; Rudin et al., 2005)

H ⊆ {0, 1}^X.  For s ∈ {-1, 0, +1}, let

    ε_t^s(h) = Pr_{(x, x') ∼ D_t}[ sgn(f(x, x')(h(x') - h(x))) = s ],   so that   ε_t^0 + ε_t^+ + ε_t^- = 1.

RankBoost(S = ((x_1, x_1', y_1), ..., (x_m, x_m', y_m)))
 1  for i ← 1 to m do
 2      D_1(x_i, x_i') ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base ranker in H with smallest ε_t^- - ε_t^+ = -E_{i∼D_t}[ y_i (h_t(x_i') - h_t(x_i)) ]
 5      α_t ← (1/2) log(ε_t^+ / ε_t^-)
 6      Z_t ← ε_t^0 + 2 [ε_t^+ ε_t^-]^{1/2}    ▹ normalization factor
 7      for i ← 1 to m do
 8          D_{t+1}(x_i, x_i') ← D_t(x_i, x_i') exp(-α_t y_i [h_t(x_i') - h_t(x_i)]) / Z_t
 9  f ← Σ_{t=1}^T α_t h_t
10  return f
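A compact sketch of this loop in Python, assuming {0, 1}-valued base rankers that are threshold functions on single features; the weak-learner choice and all names are mine, not from the slides.

    import numpy as np

    def rankboost(X, X_prime, y, T=20):
        # RankBoost on labeled pairs (x_i, x_i', y_i) with y_i in {-1, 0, +1}.
        # Base rankers: stumps h(x) = 1[x[j] > theta].
        X, X_prime = np.asarray(X, float), np.asarray(X_prime, float)
        y = np.asarray(y, float)
        m, d = X.shape
        D = np.full(m, 1.0 / m)                       # distribution over pairs
        thresholds = np.unique(np.concatenate([X, X_prime]).ravel())
        alphas, stumps = [], []
        for t in range(T):
            best, best_stump = None, None
            for j in range(d):
                for theta in thresholds:
                    diff = (X_prime[:, j] > theta).astype(float) - (X[:, j] > theta).astype(float)
                    r = float(np.sum(D * y * diff))   # eps_t^+ - eps_t^-
                    if best is None or r > best[0]:
                        best, best_stump = (r, diff), (j, theta)
            r, diff = best
            eps_plus = np.sum(D * (y * diff > 0))
            eps_minus = np.sum(D * (y * diff < 0))
            if eps_plus == 0 or eps_minus == 0:       # degenerate weak ranker: stop early
                break
            alpha = 0.5 * np.log(eps_plus / eps_minus)
            alphas.append(alpha)
            stumps.append(best_stump)
            D = D * np.exp(-alpha * y * diff)         # reweight pairs
            D = D / D.sum()                           # normalize (Z_t)
        def f(x):                                     # final scoring function
            return sum(a * float(x[j] > th) for a, (j, th) in zip(alphas, stumps))
        return f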

Notes

Distributions D_t over pairs of sample points:
• originally uniform.
• at each round, the weight of a misclassified example is increased.
• observation:

    D_{t+1}(x, x') = \frac{e^{-y \sum_{s=1}^t \alpha_s [h_s(x') - h_s(x)]}}{|S| \prod_{s=1}^t Z_s},

  since

    D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-y \alpha_t [h_t(x') - h_t(x)]}}{Z_t} = \frac{1}{|S|} \frac{e^{-y \sum_{s=1}^t \alpha_s [h_s(x') - h_s(x)]}}{\prod_{s=1}^t Z_s}.

Weight assigned to base classifier h_t: α_t directly depends on the accuracy of h_t at round t.

Coordinate Descent RankBoost

Objective function: convex and differentiable; the exponential upper-bounds the 0-1 pairwise loss.

    F(\alpha) = \sum_{(x, x', y) \in S} e^{-y [f_T(x') - f_T(x)]} = \sum_{(x, x', y) \in S} \exp\Big(-y \sum_{t=1}^T \alpha_t [h_t(x') - h_t(x)]\Big),

with f_T = \sum_{t=1}^T \alpha_t h_t.

• Direction: unit vector e_t with

    e_t = \mathrm{argmin}_t \ \frac{dF(\alpha + \eta e_t)}{d\eta}\Big|_{\eta = 0}.

Since F(\alpha + \eta e_t) = \sum_{(x, x', y) \in S} e^{-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]} \, e^{-\eta y [h_t(x') - h_t(x)]},

    \frac{dF(\alpha + \eta e_t)}{d\eta}\Big|_{\eta = 0}
      = -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]\Big)
      = -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \, D_{T+1}(x, x') \, m \prod_{s=1}^T Z_s
      = -\big[\epsilon_t^+ - \epsilon_t^-\big] \, m \prod_{s=1}^T Z_s.

Thus, the direction corresponds to the base classifier selected by the algorithm.

• Step size: obtained via

    \frac{dF(\alpha + \eta e_t)}{d\eta} = 0
    \Leftrightarrow -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \exp\Big(-y \sum_{s=1}^T \alpha_s [h_s(x') - h_s(x)]\Big) e^{-\eta y [h_t(x') - h_t(x)]} = 0
    \Leftrightarrow -\sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \, D_{T+1}(x, x') \, m \prod_{s=1}^T Z_s \, e^{-\eta y [h_t(x') - h_t(x)]} = 0
    \Leftrightarrow \sum_{(x, x', y) \in S} y [h_t(x') - h_t(x)] \, D_{T+1}(x, x') \, e^{-\eta y [h_t(x') - h_t(x)]} = 0
    \Leftrightarrow \epsilon_t^+ e^{-\eta} - \epsilon_t^- e^{\eta} = 0
    \Leftrightarrow \eta = \frac{1}{2} \log \frac{\epsilon_t^+}{\epsilon_t^-}.

Thus, the step size matches the base classifier weight used in the algorithm.

Bipartite Ranking

Training data:
• sample of negative points drawn according to D_-:
    S_- = (x_1, ..., x_m) ⊆ U.
• sample of positive points drawn according to D_+:
    S_+ = (x_1', ..., x_{m'}') ⊆ U.

Problem: find a hypothesis h: U → R in H with small generalization error

    R_D(h) = \Pr_{x \sim D_-, \, x' \sim D_+}\big[h(x') < h(x)\big].

Notes

Connection between AdaBoost and RankBoost (Cortes & MM, 04; Rudin et al., 05):
• if constant base ranker used.
• relationship between objective functions.

More efficient algorithm in this special case (Freund et al., 2003).

Bipartite ranking results typically reported in terms of the AUC.

AdaBoost and CD RankBoost

Objective functions: comparison.

    F_{Ada}(\alpha) = \sum_{x_i \in S_- \cup S_+} \exp(-y_i f(x_i))
                    = \sum_{x_i \in S_-} \exp(+f(x_i)) + \sum_{x_i \in S_+} \exp(-f(x_i))
                    = F_-(\alpha) + F_+(\alpha).

    F_{Rank}(\alpha) = \sum_{(i, j) \in S_- \times S_+} \exp\big(-[f(x_j) - f(x_i)]\big)
                     = \Big(\sum_{x_i \in S_-} \exp(+f(x_i))\Big)\Big(\sum_{x_j \in S_+} \exp(-f(x_j))\Big)
                     = F_-(\alpha)\, F_+(\alpha).

AdaBoost and CD RankBoost
(Rudin et al., 2005)

Property: AdaBoost (non-separable case) with constant base learner h = 1:
• equal contribution of positive and negative points (in the limit).
• consequence: AdaBoost asymptotically achieves the optimum of the CD RankBoost objective.

Observation: if F_+(\alpha) = F_-(\alpha), then

    d(F_{Rank}) = F_+ \, d(F_-) + F_- \, d(F_+) = F_+ \big[d(F_-) + d(F_+)\big] = F_+ \, d(F_{Ada}).

Bipartite RankBoost - Efficiency

Decomposition of the distribution: for (x, x') ∈ (S_-, S_+),

    D(x, x') = D_-(x)\, D_+(x').

Thus,

    D_{t+1}(x, x') = \frac{D_t(x, x')\, e^{-\alpha_t [h_t(x') - h_t(x)]}}{Z_t}
                   = \frac{D_{t,-}(x)\, e^{\alpha_t h_t(x)}}{Z_{t,-}} \cdot \frac{D_{t,+}(x')\, e^{-\alpha_t h_t(x')}}{Z_{t,+}},

with

    Z_{t,-} = \sum_{x \in S_-} D_{t,-}(x)\, e^{\alpha_t h_t(x)},   Z_{t,+} = \sum_{x' \in S_+} D_{t,+}(x')\, e^{-\alpha_t h_t(x')}.
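A small sketch of this factored update: keeping the two marginal weight vectors avoids ever materializing the |S_-| × |S_+| pair distribution (the function and variable names are illustrative).

    import numpy as np

    def bipartite_update(D_neg, D_pos, h_neg, h_pos, alpha):
        # One round of the factored RankBoost update.
        # D_neg, D_pos: current weights over negative / positive points.
        # h_neg, h_pos: base-ranker values h_t(x) on negatives / positives.
        D_neg = D_neg * np.exp(+alpha * h_neg)     # negatives: factor e^{+alpha h_t(x)}
        D_pos = D_pos * np.exp(-alpha * h_pos)     # positives: factor e^{-alpha h_t(x')}
        return D_neg / D_neg.sum(), D_pos / D_pos.sum()   # normalize by Z_{t,-}, Z_{t,+}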

ROC Curve
(Egan, 1975)

Definition: the receiver operating characteristic (ROC) curve is a plot of the true positive rate (TP) vs. false positive rate (FP), obtained by sliding a threshold along the sorted scores h(x).
• TP: percentage of positive points correctly labeled positive.
• FP: percentage of negative points incorrectly labeled positive.

[Figure: ROC curve, true positive rate vs. false positive rate, traced by moving a threshold along the sorted scores.]

Area under the ROC Curve (AUC)
(Hanley and McNeil, 1982)

Definition: the AUC is the area under the ROC curve; it is a measure of ranking quality.

[Figure: ROC curve with the area under it marked AUC.]

Equivalently,

    \mathrm{AUC}(h) = \frac{1}{m m'} \sum_{i=1}^{m} \sum_{j=1}^{m'} 1_{h(x_j') > h(x_i)} = \Pr_{x \sim \hat{D}_-, \, x' \sim \hat{D}_+}\big[h(x') > h(x)\big] = 1 - \hat{R}(h).
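A direct sketch of that pairwise formula with O(m m') comparisons; the scores below are illustrative, and ties are simply counted as misranked in this minimal version.

    import numpy as np

    def auc_pairwise(scores_neg, scores_pos):
        # AUC(h): fraction of (negative, positive) pairs that h ranks correctly.
        s_neg = np.asarray(scores_neg)[:, None]   # shape (m, 1)
        s_pos = np.asarray(scores_pos)[None, :]   # shape (1, m')
        return float(np.mean(s_pos > s_neg))      # Pr[h(x') > h(x)] under the empirical distributions

    # Example scores assigned by some h to negative and positive points.
    print(auc_pairwise([0.1, 0.4, 0.35], [0.8, 0.65, 0.9, 0.7]))   # 1.0: perfectly ranked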

This Talk
• Score-based ranking
• Preference-based ranking

Preference-Based Setting

Definitions:
• U: universe, full set of objects.
• V: finite query subset to rank, V ⊆ U.
• σ*: target ranking for V (random variable).

Two stages: can be viewed as a reduction.
• learn preference function h: U × U → [0, 1].
• given V, use h to determine a ranking of V.

Running time: measured in terms of the number of calls to h.

Preference-Based Ranking Problem

Training data: pairs (V, σ*) sampled i.i.d. according to D:

    (V_1, σ_1*), (V_2, σ_2*), ..., (V_m, σ_m*),   V_i ⊆ U,

subsets ranked by different labelers; learn a preference function h: U × U → [0, 1].

Problem: for any query set V ⊆ U, use h to return a ranking σ_{h,V} close to the target σ* with small average error

    R(h) = E_{(V, \sigma^*) \sim D}\big[L(\sigma_{h,V}, \sigma^*)\big].

Preference Function

h(u, v) close to 1 when u is preferred to v, close to 0 otherwise. For the analysis, h(u, v) ∈ {0, 1}.

Assumed pairwise consistent:

    h(u, v) + h(v, u) = 1.

May be non-transitive, e.g., we may have h(u, v) = h(v, w) = h(w, u) = 1.

Output of a classifier or 'black-box'.

Loss Functions

(for a fixed (V, σ*))

Preference loss:

    L(h, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \ne v} h(u, v)\, \sigma^*(v, u).

Ranking loss:

    L(\sigma, \sigma^*) = \frac{2}{n(n-1)} \sum_{u \ne v} \sigma(u, v)\, \sigma^*(v, u).
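Both losses are the same pairwise-disagreement count with different first arguments. A tiny sketch, representing a ranking σ as an indicator function with σ(u, v) = 1 if u is ranked above v; the representation and names are my own.

    import itertools

    def pairwise_loss(pref_a, pref_b, items):
        # (2 / (n(n-1))) * sum_{u != v} pref_a(u, v) * pref_b(v, u):
        # fraction of pairs on which the two preference/ranking functions disagree.
        n = len(items)
        total = sum(pref_a(u, v) * pref_b(v, u)
                    for u, v in itertools.permutations(items, 2))
        return 2.0 * total / (n * (n - 1))

    # Example: target ranks a > b > c; h flips only the (a, c) pair.
    order = ["a", "b", "c"]
    sigma_star = lambda u, v: 1.0 if order.index(u) < order.index(v) else 0.0
    h = lambda u, v: 1.0 - sigma_star(u, v) if {u, v} == {"a", "c"} else sigma_star(u, v)
    print(pairwise_loss(h, sigma_star, order))   # 1/3: one of the three pairs is misordered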

(Weak) Regret

Preference regret:

    R'_{class}(h) = E_{V, \sigma^*}\big[L(h_{|V}, \sigma^*)\big] - E_V \min_{\tilde{h}} E_{\sigma^* | V}\big[L(\tilde{h}, \sigma^*)\big].

Ranking regret:

    R'_{rank}(A) = E_{V, \sigma^*, s}\big[L(A_s(V), \sigma^*)\big] - E_V \min_{\tilde{\sigma} \in S(V)} E_{\sigma^* | V}\big[L(\tilde{\sigma}, \sigma^*)\big].

Deterministic Algorithm
(Balcan et al., 07)

Stage one: standard classification. Learn preference function h: U × U → [0, 1].

Stage two: sort-by-degree using the comparison function h (a sketch follows below).
• sort by number of points ranked below.
• quadratic time complexity O(n^2).
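A minimal sketch of sort-by-degree, assuming h(u, v) ∈ {0, 1} returns 1 when u is preferred to v; the helper name is mine.

    def sort_by_degree(V, h):
        # Rank V by how many other points each element beats under h: O(n^2) calls to h.
        degree = {u: sum(h(u, v) for v in V if v is not u) for u in V}
        return sorted(V, key=lambda u: degree[u], reverse=True)   # most preferred first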

Randomized Algorithm
(Ailon & MM, 08)

Stage one: standard classification. Learn preference function h: U × U → [0, 1].

Stage two: randomized QuickSort (Hoare, 61) using h as the comparison function.
• comparison function non-transitive, unlike the textbook description.
• but time complexity shown to be O(n log n) in general.

Randomized QS

[Figure: a random pivot u is chosen; each v with h(v, u) = 1 goes to the left recursion, each v with h(u, v) = 1 goes to the right recursion.]
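A sketch of that recursion with h as a (possibly non-transitive) comparator, returning the most-preferred elements first; it assumes h(u, v) ∈ {0, 1} with h(u, v) + h(v, u) = 1, and the names are illustrative.

    import random

    def quicksort_rank(V, h, rng=random):
        # Rank the list V using preference function h; expected O(n log n) calls to h.
        if len(V) <= 1:
            return list(V)
        pivot = rng.choice(V)
        left = [v for v in V if v is not pivot and h(v, pivot) == 1]    # preferred to the pivot
        right = [v for v in V if v is not pivot and h(pivot, v) == 1]   # pivot preferred to them
        return quicksort_rank(left, h, rng) + [pivot] + quicksort_rank(right, h, rng)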

Deterministic Algo. - Bipartite Case (V = V_+ ∪ V_-)
(Balcan et al., 07)

Bounds: for the deterministic sort-by-degree algorithm,
• expected loss:

    E_{V, \sigma^*}\big[L(A(V), \sigma^*)\big] \le 2\, E_{V, \sigma^*}\big[L(h, \sigma^*)\big].

• regret:

    R'_{rank}(A(V)) \le 2\, R'_{class}(h).

Time complexity: Ω(|V|^2).

Randomized Algo. - Bipartite Case (V = V_+ ∪ V_-)
(Ailon & MM, 08)

Bounds: for randomized QuickSort (Hoare, 61),
• expected loss (equality):

    E_{V, \sigma^*, s}\big[L(Q_s^h(V), \sigma^*)\big] = E_{V, \sigma^*}\big[L(h, \sigma^*)\big].

• regret:

    R'_{rank}(Q_s^h(\cdot)) \le R'_{class}(h).

Time complexity:
• full set: O(n log n).
• top k: O(n + k log k).

Proof Ideas

QuickSort decomposition:

    p_{uv} + \frac{1}{3} \sum_{w \notin \{u, v\}} p_{uvw} \big[h(u, w)\, h(w, v) + h(v, w)\, h(w, u)\big] = 1.

Bipartite property: for all u, v, w,

    \sigma^*(u, v) + \sigma^*(v, w) + \sigma^*(w, u) = \sigma^*(v, u) + \sigma^*(w, v) + \sigma^*(u, w).

Lower Bound

Theorem: for any deterministic algorithm A, there is a bipartite distribution for which

    R'_{rank}(A) \ge 2\, R'_{class}(h).

• thus, the factor of 2 is the best possible in the deterministic case.
• randomization is necessary for a better bound.

Proof: take the simple case U = V = {u, v, w} and assume that h induces a cycle: h(u, v) = h(v, w) = h(w, u) = 1. Up to symmetry, A returns u, v, w or w, v, u.

Lower Bound

If A returns u, v, w, choose the bipartite target σ* illustrated in the first figure; if A returns w, v, u, choose the symmetric one in the second. In both cases,

    L(h, \sigma^*) = \frac{1}{3},   L(A, \sigma^*) = \frac{2}{3},

so A incurs twice the loss of h.

[Figures: the two bipartite (-/+) labelings of {u, v, w}, one for each case.]

Guarantees - General Case

Loss bound for QuickSort:

    E_{V, \sigma^*, s}\big[L(Q_s^h(V), \sigma^*)\big] \le 2\, E_{V, \sigma^*}\big[L(h, \sigma^*)\big].

Comparison with the optimal ranking (see (CSS 99)):

    E_s\big[L(Q_s^h(V), \sigma_{optimal})\big] \le 2\, L(h, \sigma_{optimal}),
    E_s\big[L(h, Q_s^h(V))\big] \le 3\, L(h, \sigma_{optimal}),

where \sigma_{optimal} = \mathrm{argmin}_\sigma L(h, \sigma).

Weight Function

Generalization: weighted pairwise loss based on a weight function ω over positions,

    \sigma_\omega(u, v) = \sigma(u, v)\, \omega(\sigma(u), \sigma(v)),

where σ(u) denotes the position of u in σ.

Properties needed for all previous results to hold:
• symmetry: ω(i, j) = ω(j, i) for all i, j.
• monotonicity: ω(i, j), ω(j, k) ≤ ω(i, k) for i < j < k.
• triangle inequality: ω(i, j) ≤ ω(i, k) + ω(k, j) for all triplets i, j, k.

Weight Function - Examples

Kemeny:    ω(i, j) = 1 for all i, j.

Top-k:     ω(i, j) = 1 if i ≤ k or j ≤ k; 0 otherwise.

Bipartite: ω(i, j) = 1 if i ≤ k and j > k; 0 otherwise.

k-partite: can be defined similarly.

(Strong) Regret Definitions

Ranking regret:

    R_{rank}(A) = E_{V, \sigma^*, s}\big[L(A_s(V), \sigma^*)\big] - \min_{\tilde{\sigma}} E_{V, \sigma^*}\big[L(\tilde{\sigma}_{|V}, \sigma^*)\big].

Preference regret:

    R_{class}(h) = E_{V, \sigma^*}\big[L(h_{|V}, \sigma^*)\big] - \min_{\tilde{h}} E_{V, \sigma^*}\big[L(\tilde{h}_{|V}, \sigma^*)\big].

All previous regret results hold if, for all V_1, V_2 ⊇ {u, v},

    E\big[\sigma^*_{|V_1}(u, v)\big] = E\big[\sigma^*_{|V_2}(u, v)\big]

for all u, v (pairwise independence on irrelevant alternatives).

References

Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. (2005). Generalization bounds for the area under the ROC curve. JMLR 6, 393–425.

Agarwal, S., and Niyogi, P. (2005). Stability and generalization of bipartite ranking algorithms. COLT 2005 (pp. 32–47).

Nir Ailon and Mehryar Mohri. An efficient reduction of ranking to classification. In Proceedings of COLT 2008. Helsinki, Finland, July 2008. Omnipress.

Balcan, M.-F., Bansal, N., Beygelzimer, A., Coppersmith, D., Langford, J., and Sorkin, G. B. (2007). Robust reductions from ranking to classification. In Proceedings of COLT 2007 (pp. 604–619). Springer.

Stephen Boyd, Corinna Cortes, Mehryar Mohri, and Ana Radovanovic (2012). Accuracy at the top. In NIPS 2012.

Cohen, W. W., Schapire, R. E., and Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research (JAIR), 10, 243–270.

Cossock, D., and Zhang, T. (2006). Subset ranking using regression. COLT 2006 (pp. 605–619).

Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems (NIPS 2003), 2004. MIT Press.

Cortes, C., Mohri, M., and Rastogi, A. (2007). An alternative ranking problem for search engines. In Proceedings of WEA 2007 (pp. 1–21). Rome, Italy: Springer.

Corinna Cortes and Mehryar Mohri. Confidence intervals for the area under the ROC curve. In Advances in Neural Information Processing Systems (NIPS 2004), 2005. MIT Press.

Crammer, K., and Singer, Y. (2001). Pranking with ranking. In Proceedings of NIPS 2001 (pp. 641–647). MIT Press.

Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the Web. WWW 10, 2001. ACM Press.

J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.

Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. JMLR 4:933–969, 2003.

J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.

Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, pages 115–132, 2000.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project, 1998.

Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco, California: Holden-Day.

Cynthia Rudin, Corinna Cortes, Mehryar Mohri, and Robert E. Schapire. Margin-based ranking meets boosting in the middle. In Proceedings of COLT 2005, pages 63–78, 2005.

Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD, pages 133–142, 2002.