A General Boosting Method and its Application to Learning Ranking Functions for Web Search

Zhaohui Zheng   Hongyuan Zha   Tong Zhang   Olivier Chapelle   Keke Chen   Gordon Sun



Yahoo! Inc., 701 First Avenue, Sunnyvale, CA 94089
{zhaohui,tzhang,chap,kchen,gzsun}@yahoo-inc.com

College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
[email protected]

Abstract

We present a general boosting method extending functional gradient boosting to optimize complex loss functions that are encountered in many machine learning problems. Our approach is based on optimization of quadratic upper bounds of the loss functions, which allows us to present a rigorous convergence analysis of the algorithm. More importantly, this general framework enables us to use a standard regression base learner, such as a single regression tree, for fitting any loss function. We illustrate an application of the proposed method to learning ranking functions for Web search by combining both preference data and labeled data for training. We present experimental results for Web search using data from a commercial search engine that show significant improvements of our proposed methods over some existing methods.

1 Introduction

There has been much interest in developing machine learning methods involving complex loss functions beyond those used in regression and classification problems [13]. Many methods have been proposed dealing with a wide range of problems including ranking problems, learning conditional random fields, and other structured learning problems [1, 3, 4, 5, 6, 7, 11, 13]. In this paper we propose a boosting framework that can handle a wide variety of complex loss functions. The proposed method uses a regression black box to optimize a general loss function based on quadratic upper bounds, which also allows us to present a rigorous convergence analysis of the method. Our approach extends the gradient boosting approach proposed in [8] but can handle substantially more complex loss functions arising from a variety of machine learning problems. As an interesting and important application of the general boosting framework, we apply it to the problem of learning ranking functions for Web search. Specifically, we want to rank a set of documents according to their relevance to a given query. We adopt the following framework: we extract a feature vector x for each query-document pair and learn a function h so that we can rank the documents using the values h(x); that is, documents with larger h(x) values are ranked higher. We call such a function h a ranking function. In Web search, we can identify two types of training data for learning a ranking function: 1) preference data indicating that a document is more relevant than another with respect to a query [11, 12]; and 2) labeled data where documents are assigned ordinal labels representing their degree of relevance. In general, we will have both preference data and labeled data for

training a ranking function for Web search, leading to a complex loss function that can be handled by our proposed general boosting method, which we now describe.
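In this framework, ranking reduces to scoring each query-document feature vector with the learned function h and sorting by score. A minimal sketch of that step, in which the linear scoring function and its weights are hypothetical stand-ins for a learned model:

```python
# Minimal sketch of the ranking framework: score each query-document
# feature vector with a function h, then sort documents by decreasing score.
# The fixed linear h below is a hypothetical stand-in for a learned model.

def h(x):
    # Placeholder scoring function: a fixed linear combination of features.
    weights = [0.6, 0.3, 0.1]
    return sum(w * xi for w, xi in zip(weights, x))

def rank(docs):
    """Return document ids ordered by decreasing score h(x)."""
    return [doc_id for doc_id, x in
            sorted(docs, key=lambda d: h(d[1]), reverse=True)]

docs = [("d1", [0.2, 0.1, 0.9]),
        ("d2", [0.9, 0.8, 0.1]),
        ("d3", [0.5, 0.5, 0.5])]
print(rank(docs))  # documents with larger h(x) come first
```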

2 A General Boosting Method

We consider the following general optimization problem:

    \hat{h} = \arg\min_{h \in \mathcal{H}} R(h),    (1)

where h denotes a prediction function which we are interested in learning from the data, \mathcal{H} is a pre-chosen function class, and R(h) is a risk functional with respect to h. We consider the following form of the risk functional:

    R(h) = \frac{1}{m} \sum_{i=1}^{m} \phi_i\big( h(x_{i,1}), \ldots, h(x_{i,m_i}); y_i \big),    (2)

where each \phi_i is a loss function with respect to its first m_i arguments h(x_{i,1}), \ldots, h(x_{i,m_i}). It can be a one-variable function, such as the squared loss \phi_i(h(x_i); y_i) = (h(x_i) - y_i)^2 used in regression; a two-variable function, such as those in ranking based on pairwise comparisons, \phi_i(h(x_{i,1}), h(x_{i,2}); y_i) with y_i \in \{\pm 1\}, where y_i indicates whether x_{i,1} is preferred to x_{i,2} or not; or it can be a multi-variable function as used in some structured prediction problems.
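As a concrete instance, for preference data the risk can be evaluated with a two-variable loss over pairs. The squared hinge loss and the margin tau in this sketch are illustrative choices, not necessarily the paper's exact definitions:

```python
# Sketch: empirical risk with a two-variable (pairwise) loss.
# For each pair (x_pref, x_other), where x_pref should rank higher,
# penalize margin violations. The squared hinge loss with margin tau
# is an illustrative choice of phi.

def pairwise_risk(h, pairs, tau=1.0):
    total = 0.0
    for x_pref, x_other in pairs:
        total += max(0.0, tau - (h(x_pref) - h(x_other))) ** 2
    return total / len(pairs)

h = lambda x: 2.0 * x              # toy one-feature scoring function
pairs = [(3.0, 1.0), (2.0, 2.5)]   # the second pair violates its preference
print(pairwise_risk(h, pairs))
```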

We assume the availability of a regression weak learner A: given points S = \{\tilde{x}_j\}_{j=1}^{n} with targets r = (r_1, \ldots, r_n), nonnegative weights w = (w_1, \ldots, w_n), and a tolerance \epsilon \ge 0, the call g = A(w, S, r, \epsilon) returns a function g from a base function class G such that

    \sum_{j=1}^{n} w_j \big( g(\tilde{x}_j) - r_j \big)^2 \le \min_{g' \in G} \sum_{j=1}^{n} w_j \big( g'(\tilde{x}_j) - r_j \big)^2 + \epsilon.    (3)

Friedman's gradient boosting [8] applies when the risk decomposes over single examples,

    R(h) = \frac{1}{m} \sum_{i=1}^{m} \phi_i\big( h(x_i); y_i \big),    (4)

whereas problems such as learning from pairwise preference data lead to risk functionals of the form

    R(h) = \frac{1}{m} \sum_{i=1}^{m} \phi_i\big( h(x_{i,1}), h(x_{i,2}); y_i \big).    (5)

At step k, given the current prediction function h_k, we look for an increment g \in G such that h_{k+1} = h_k + \eta_k g decreases the risk. Let \tilde{x}_1, \ldots, \tilde{x}_n be the distinct points among the x_{i,j}, and write \nabla_j = \partial R(h_k) / \partial h(\tilde{x}_j). A second-order expansion of R around h_k, with the Hessian upper bounded by a diagonal matrix W = \mathrm{diag}(w_1, \ldots, w_n), gives the quadratic upper bound

    R(h_k + g) \le \bar{R}_k(g) = R(h_k) + \sum_{j=1}^{n} \nabla_j\, g(\tilde{x}_j) + \frac{1}{2} \sum_{j=1}^{n} w_j\, g(\tilde{x}_j)^2,    (6)

where W is a diagonal matrix upper bounding the Hessian of R between h_k and h_k + g. (We consider that all \tilde{x}_j are different, but some of the x_{i,j} in (2) might have been identical, hence the inequality.) If we define r_j = -\nabla_j / w_j, then \sum_j w_j ( g(\tilde{x}_j) - r_j )^2 is equal to the above quadratic form (up to a constant), so g can be found by calling the regression weak learner A. Since at each step we try to minimize an upper bound \bar{R}_k of R, if we let the minimizer be g_k, it is clear that R(h_k + g_k) \le \bar{R}_k(g_k) \le R(h_k). This means that by optimizing the problem \bar{R}_k, which can be handled by A, we also make progress with respect to optimizing R. The algorithm based on this idea is listed in Algorithm 1 for the loss function in (5).
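The reduction to regression amounts to computing, at each point, the gradient of the risk and a diagonal bound on its Hessian, then forming weighted targets r_j = -grad_j / w_j. A sketch for a pairwise loss max(0, h(y) - h(x) + tau)^2 with x preferred over y, where the loss and the global Hessian bound of 2 are illustrative choices:

```python
# Sketch: turn the quadratic upper bound into a weighted regression
# problem. For the pairwise loss max(0, h(y) - h(x) + tau)^2
# (x preferred over y; an illustrative choice), each diagonal entry of
# the Hessian is at most 2, so we take the global weight w_j = 2 and
# the regression targets r_j = -grad_j / w_j.

def regression_targets(scores, pairs, tau=1.0):
    """scores: current h values per point; pairs: (i, j) with i preferred."""
    n = len(scores)
    grad = [0.0] * n
    for i, j in pairs:
        slack = max(0.0, scores[j] - scores[i] + tau)
        grad[i] += -2.0 * slack   # d phi / d h(x_i)
        grad[j] += 2.0 * slack    # d phi / d h(x_j)
    w = [2.0] * n                 # global diagonal upper bound on the Hessian
    r = [-g / wj for g, wj in zip(grad, w)]
    return r, w

scores = [0.0, 0.0, 1.0]
pairs = [(0, 1), (2, 1)]          # x0 preferred over x1, x2 preferred over x1
r, w = regression_targets(scores, pairs)
print(r)                          # points that should move up get positive targets
```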

Convergence analysis of this algorithm can be established using the ideas summarized above; see the appendix for details. In practice, however, instead of the quadratic upper bound (for which a theoretical guarantee is easier to derive), one may also consider minimizing an approximation based on the Taylor expansion, which is closer to a Newton-type method.

Algorithm 1 Greedy Algorithm with Quadratic Approximation

Input: S = {\tilde{x}_j}, j = 1, ..., n
let h_0 = 0
for k = 0, 1, 2, ...
    let \nabla_j = \partial R(h_k) / \partial h(\tilde{x}_j), j = 1, ..., n
    pick weights w_j, j = 1, ..., n, with either
        w_j = \partial^2 R(h_k) / \partial h(\tilde{x}_j)^2   % Newton-type method with diagonal Hessian
    or
        w_j = a global diagonal upper bound on the Hessian   % upper-bound minimization
    let r_j = -\nabla_j / w_j, j = 1, ..., n
    pick \epsilon_k \ge 0
    let g_k = A(w, S, r, \epsilon_k)
    pick step-size \eta_k \ge 0, typically by line search on R
    let h_{k+1} = h_k + \eta_k g_k
end
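To make the loop concrete, here is a minimal runnable sketch of the greedy quadratic-approximation algorithm, specialized to a pairwise squared hinge loss max(0, h(y) - h(x) + tau)^2 for x preferred over y. The weak learner is idealized: it fits the weighted regression targets exactly (a stand-in for a regression tree), and a fixed step size eta replaces the line search; all of these are illustrative assumptions:

```python
# Sketch of the boosting loop for a pairwise squared hinge loss.
# The "weak learner" fits the regression targets exactly (a stand-in
# for a regression tree); eta is a fixed step size instead of a line search.

def risk(scores, pairs, tau=1.0):
    return sum(max(0.0, scores[j] - scores[i] + tau) ** 2
               for i, j in pairs) / len(pairs)

def boost(n, pairs, rounds=50, tau=1.0, eta=0.2):
    h = [0.0] * n                        # h_0 = 0
    for _ in range(rounds):
        grad = [0.0] * n                 # gradient of the risk at h_k
        for i, j in pairs:               # pair (i, j): i preferred over j
            slack = max(0.0, h[j] - h[i] + tau)
            grad[i] -= 2.0 * slack / len(pairs)
            grad[j] += 2.0 * slack / len(pairs)
        w = 2.0 / len(pairs)             # global diagonal Hessian bound
        g = [-gj / w for gj in grad]     # idealized weak learner fits r_j exactly
        h = [hj + eta * gj for hj, gj in zip(h, g)]   # h_{k+1} = h_k + eta * g_k
    return h

pairs = [(0, 1), (1, 2)]                 # x0 preferred over x1, x1 over x2
h = boost(3, pairs)
print(h)
```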

The main conceptual difference between our view and that of Friedman is that he views regression as a "reasonable" approximation to the first-order gradient, while our work views it as a natural consequence of a second-order approximation of the objective function (in which the quadratic term serves as an upper bound of the Hessian, either locally or globally). This leads to algorithmic differences. In our approach, a good choice of the second-order upper bound (leading to a tighter bound) may require non-uniform weights w_j. This is in line with earlier boosting work in which sample reweighting was a central idea. In our framework, the reweighting naturally occurs when we choose a tight second-order approximation, and different reweightings can affect the rate of convergence in our analysis. The other main difference with Friedman is that he only considered objective functions of the form (4); we propose a natural extension to those of the form (5).

3 Learning Ranking Functions

We now apply Algorithm 1 to the problem of learning ranking functions. We use preference data as well as labeled data for training the ranking function. For preference data, we use x \succ y to mean that x is preferred over y, i.e., x should be ranked higher than y, where x and y are the feature vectors of the corresponding items to be ranked. We denote the set of available preferences as \{ \langle x_i, y_i \rangle : x_i \succ y_i,\ i = 1, \ldots, N \}.
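Labeled data with ordinal grades can be turned into preference data by preferring, within the same query, any document with a strictly higher grade than another. A sketch of that construction, with illustrative grade names and values:

```python
# Sketch: derive preference pairs from ordinal labels within one query.
# A document with a strictly higher grade is preferred over one with a
# lower grade. The grade names and numeric values are illustrative.

GRADES = {"perfect": 2, "good": 1, "bad": 0}

def preferences_from_labels(labeled_docs):
    """labeled_docs: list of (doc_id, grade); returns (preferred, other) pairs."""
    pairs = []
    for a, ga in labeled_docs:
        for b, gb in labeled_docs:
            if GRADES[ga] > GRADES[gb]:
                pairs.append((a, b))
    return pairs

docs = [("d1", "perfect"), ("d2", "good"), ("d3", "bad")]
print(preferences_from_labels(docs))
```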