
IEEE SIGNAL PROCESSING LETTERS, VOL. 23, NO. 1, JANUARY 2016

Similarity Learning with Top-heavy Ranking Loss for Person Re-identification Jin Wang, Nong Sang, Zheng Wang, and Changxin Gao

Abstract—Person re-identification is the task of finding a person of interest across a network of cameras. In this paper, we propose a new similarity learning method for person re-identification. Conventional metric learning methods generally learn a linear transformation by employing sparse pairwise or triplet constraints. Since many negative matching pairs or triplets are abandoned, the discriminative information is not fully exploited. Similarity learning methods with the AUC loss can utilize all valid triplet constraints. However, the AUC loss has its own limitation in that it treats false ranks occurring at different positions equally. To address this limitation, we propose to extend the AUC loss to a top-heavy ranking loss by assigning large weights to the top positions of the ranking list. Moreover, we introduce an explicit nonlinear transformation function for the original feature space and learn an inner product similarity under the structured output learning framework. Our approach achieves very promising results on the challenging VIPeR, CUHK Campus and PRID 450S datasets.

Index Terms—Learning to rank, person re-identification, structural learning.

I. INTRODUCTION

PERSON re-identification (re-id) refers to the task of matching a given probe image against a set of gallery images. It has drawn considerable attention in recent years [1]–[8] because of its important applications in video surveillance, such as cross-camera tracking, multi-camera behavior analysis and pedestrian search. Existing methods for person re-identification can be roughly divided into two categories: image descriptor construction and similarity/distance metric learning. Methods of the first category aim to design or learn image descriptors that are robust to illumination and viewpoint changes. Several effective person image descriptors have been proposed, including SCNCD [9], kBiCov [10], saliency [11], MLF [4] and LOMO [3]. However, it is generally difficult to construct

Manuscript received August 25, 2015; revised October 14, 2015; accepted November 14, 2015. Date of publication November 20, 2015; date of current version November 25, 2015. This work was supported by the National Natural Science Foundation of China under Grant 61401170, and by the Open Foundation of first level Zhejiang key in key discipline of Control Science and Engineering. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Alexandre X. Falcao. J. Wang, N. Sang, and C. Gao are with the National Key Laboratory of Science and Technology on Multi-spectral Information Processing, School of Automation, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]; [email protected]; [email protected]). Z. Wang is with the School of Computer, Wuhan University, Wuhan 430072, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2015.2502271

Fig. 1. A toy example illustrating the limitation of the AUC loss. Curves of loss value versus number of false ranks are shown in (c). Suppose there are 10 irrelevant images (yellow triangles) for the probe (green circle), four possible ranking lists are considered, and the ground truth is the list in which the green circle is ranked in the top position. In (a), we lose a rank at the top of the list, and the loss increases by 0.1 for the AUC loss and by 0.64 for the top-heavy loss. In (b), we lose a rank at the bottom of the list and the AUC increment is still 0.1, while the top-heavy increment decreases to 0.01. This example shows that the AUC loss treats false ranks occurring at different positions equally, which is not desirable in the re-id task.

descriptors that maintain not only robustness against the various variations caused by lighting, pose, viewpoint, etc., but also discriminative power across different identities. Besides, given the constructed descriptors, re-identification performance can be further improved by metric learning methods. Based on this consideration, this paper focuses on the second category, where a distance metric is learned to measure similarities of person images effectively. Conventional metric learning methods [12]–[15] generally learn a Mahalanobis distance by employing pairwise or triplet constraints. To balance the number of positive and negative matching pairs while controlling the computation cost, only a small portion of training pairs or triplets is adopted to learn the metric [13], [14], [16]. Consequently, the discriminative information existing in the training set is not fully exploited [17]. To address this limitation, listwise constraints such as the AUC loss (see Fig. 1(c)) have been introduced [11], [18]. The AUC loss measures the difference between two partial orders: one is the ground truth, the other is induced by sorting gallery images according to their similarity scores. The ground-truth partial order keeps all valid triplet constraints in its order information. Though all valid constraints are considered in the training stage, the AUC loss has its own limitation. From its definition we can see that the loss value is linear in the number of false ranks. Fig. 1 shows the limitation of this linearity. Specifically, for two false ranks, one occurring at the top of the list (Fig. 1(a)) and the other at the bottom (Fig. 1(b)), the AUC loss increases by the same amount. However, in order to motivate relevant images to be ranked higher, we expect the loss function to penalize case (a) more. In this paper, we extend the AUC loss to the top-heavy ranking loss, which has large weights for

1070-9908 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


the top few false ranks. The penalty incurred by a false rank in the top few positions is larger than that in the bottom positions. By optimizing a top-heavy ranking loss, we can not only make full use of the existing discriminative information but also motivate relevant images to be ranked higher. Besides, conventional metric learning methods usually learn a linear transformation of the original feature space, which is not sufficient to capture the nonlinear manifold on which person images usually lie [19]. To better fit the data distribution, local metric learning [20], [21] and multi-task metric learning [7] have been proposed. In addition, nonlinear embedding techniques have been proposed for person re-identification [22], [23], RGB-D sensor-based scene classification [24], saliency detection [25], etc. In this paper, we explicitly embed features by a parameterized nonlinear function, and then learn an inner product similarity by optimizing the top-heavy ranking loss. Our main contribution is twofold. First, we propose to learn the similarity function by optimizing the top-heavy ranking loss, which is specifically designed for the re-id task. Second, we propose to use a nonlinear embedding function and learn an inner product similarity under the structural learning framework.

II. METHODOLOGY

In this section, we first introduce the proposed nonlinear embedding function, then formulate the similarity learning problem as a special case of the structured output learning problem, next extend the AUC loss to the top-heavy ranking loss, and finally discuss optimization details.

A. Nonlinear Embedding Function

We utilize a neural network to map representations of person images from the original feature space to the target space, and employ the inner product as the similarity measure of person image pairs. Specifically, given a pair of person image features $x_i, x_j \in \mathbb{R}^{d}$, the similarity between them is defined as:

$$ s(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j) \qquad (1) $$

Here $\phi(\cdot)$ is the nonlinear embedding function and is defined as:

$$ \phi(x) = \frac{\tanh(Wx + b)}{\lVert \tanh(Wx + b) \rVert_{2}} \qquad (2) $$

where $W \in \mathbb{R}^{t \times d}$ is a projection matrix, $b \in \mathbb{R}^{t}$ is a bias vector, and $d$ and $t$ are the dimensions of the original feature space and the target space respectively. $\phi(\cdot)$ can be viewed as a single-layer neural network followed by an L2-normalization layer. This normalization operation is important since it makes the similarity score independent of the magnitudes of the embedded features. For the activation function, we have tested relu, tanh and sigmoid, and found that tanh demonstrated the best performance.

B. Similarity Learning with Structured Loss

Notations. The probe and gallery sets are denoted as $P = \{p_i\}_{i=1}^{n}$ and $G = \{g_j\}_{j=1}^{m}$ respectively, where $p_i$ is the $i$-th probe image with identity $\ell(p_i)$ and $n$ is the total number of images in $P$. Similar notations apply to the gallery set. Given a probe image $p_i$, the gallery set is divided into its relevant part and irrelevant part:

$$ G_i^{+} = \{\, g_j \in G : \ell(g_j) = \ell(p_i) \,\}, \qquad G_i^{-} = G \setminus G_i^{+} \qquad (3) $$

In the following text, we will abuse notation by simply writing $G^{+}$, $G^{-}$ for $G_i^{+}$ and $G_i^{-}$ respectively without causing any confusion. Given the similarity function and a certain probe image, we expect its relevant images to be ranked before all irrelevant ones. Thus the output of the re-id task is a partial order:

$$ y = \{\, y_{jk} : g_j \in G^{+},\; g_k \in G^{-} \,\} \qquad (4) $$

where $y_{jk} = +1$ ($-1$) represents that $g_j$ is ranked before (after) $g_k$. $y$ is a set with $|G^{+}||G^{-}|$ total elements.

Formulation as a Structural Learning Problem. We define a compatibility function that measures how well the output $y$ matches the given input $p_i$ as follows:

$$ F(p_i, y) = \frac{1}{|G^{+}||G^{-}|} \sum_{g_j \in G^{+}} \sum_{g_k \in G^{-}} y_{jk}\,\big( s(p_i, g_j) - s(p_i, g_k) \big) \qquad (5) $$

If the similarity function assigns a higher score to every positive matching pair than to all negative pairs, as expected, the compatibility function will be maximized at the ground-truth partial order $y_i^{*}$ for the given input $p_i$. To learn such a similarity function, we follow the commonly used $n$-slack margin re-scaling approach [26] and formulate the objective function as follows:

$$ \min_{W,\,b,\,\xi \ge 0} \; \frac{\lambda}{2}\lVert W \rVert_{F}^{2} + \frac{1}{n}\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad F(p_i, y_i^{*}) - F(p_i, y) \ge \Delta(y_i^{*}, y) - \xi_i, \;\; \forall i, \; \forall y \in \mathcal{Y} \setminus \{y_i^{*}\} \qquad (6) $$

where $\lambda$ is a regularization parameter, $\Delta(y_i^{*}, y)$ is the structured loss and $\mathcal{Y}$ is the space consisting of all possible partial orders. We will discuss $\Delta$ in detail in the next section. The constraints in Eq. (6) mean that the compatibility of the ground-truth input-output pair $(p_i, y_i^{*})$ must be greater than that of all incorrect pairs by a margin, which is determined by the loss function and the slack variable $\xi_i$. In the testing phase, the ranking can be easily obtained by sorting the gallery images according to their similarity scores $s(p_i, g_j)$.

C. Top-heavy Ranking Loss

Review of AUC Loss. The Area Under the ROC Curve (AUC) loss is the commonly used structured loss for person re-identification [11], [18]. It is computed from the number of falsely ranked images:

$$ N_f(y_i^{*}, y) = \sum_{g_j \in G^{+}} \sum_{g_k \in G^{-}} \mathbb{1}\,[\, y_{jk} = -1 \,] \qquad (7) $$
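As a concrete reference, the embedding of Eq. (2), the inner-product similarity of Eq. (1) and the false-rank count of Eq. (7) can be sketched in a few lines of NumPy. This is a minimal sketch: the function names are illustrative, and ties are counted as false ranks by assumption.

```python
import numpy as np

def embed(X, W, b):
    """Nonlinear embedding of Eq. (2): a tanh layer followed by L2
    normalization, so every embedded feature has unit length."""
    Z = np.tanh(X @ W.T + b)                      # shape (n, t)
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def similarity(xi, xj, W, b):
    """Inner-product similarity of Eq. (1) between two embedded features."""
    phi = embed(np.stack([xi, xj]), W, b)
    return float(phi[0] @ phi[1])

def num_false_ranks(scores_pos, scores_neg):
    """Number of falsely ranked (relevant, irrelevant) pairs, as in Eq. (7):
    a false rank occurs whenever an irrelevant image scores at least as high
    as a relevant one (ties counted as errors by assumption)."""
    return sum(sn >= sp for sp in scores_pos for sn in scores_neg)
```

Because of the L2 normalization, the similarity always lies in [-1, 1] and is independent of the magnitudes of the embedded features, as noted in Sec. II-A.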


The AUC loss is defined as:

$$ \Delta_{\mathrm{AUC}}(y_i^{*}, y) = \frac{N_f(y_i^{*}, y)}{|G^{+}||G^{-}|} \qquad (8) $$

From the definition we can see that $\Delta_{\mathrm{AUC}}$ increases by the same amount for each false rank regardless of where the false rank occurs, which means $\Delta_{\mathrm{AUC}}$ treats errors occurring in different positions equally. However, for the re-id task, we hope the structured loss focuses more on the top of the list in order to motivate relevant images to be ranked higher.

Top-heavy Listwise Loss. In [27], a class of general ranking error functions was defined which can be utilized to design the expected top-heavy listwise loss. While the authors of [27] formulated the loss as an extension of the pairwise classification loss, we reformulate it in the listwise approach as follows:

$$ \Delta_{\mathrm{TOP}}(y_i^{*}, y) = L\big(N_f(y_i^{*}, y)\big) \qquad (9) $$

where $L(\cdot)$ transforms the number of false ranks into a loss:

$$ L(N) = \sum_{r=1}^{N} \alpha_r \qquad (10) $$

and $\alpha_r$ is the weight assigned to a false rank occurring in position $r$. With $\{\alpha_r\}$ defined as a decreasing sequence, we assign heavy weights to false ranks occurring at top positions in the list. In addition, $\Delta_{\mathrm{AUC}}$ is a special case of the above general loss function in which $\alpha_r$ is constant, independent of $r$. In our experiments, we set the $\alpha_r$ as a decreasing sequence divided by a normalization term $Z$, which guarantees that the $\alpha_r$ sum to 1.
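Mirroring the toy example of Fig. 1 (one relevant image and ten irrelevant ones), the two losses can be compared numerically. A minimal sketch: the harmonic weight sequence `alpha_r` proportional to `1/r` is an assumption chosen for illustration, not necessarily the sequence used in our experiments.

```python
import numpy as np

def auc_loss(n_false, n_pos, n_neg):
    """AUC loss of Eq. (8): fraction of falsely ranked (relevant, irrelevant) pairs."""
    return n_false / (n_pos * n_neg)

def top_heavy_loss(n_false, alphas):
    """Top-heavy loss of Eqs. (9)-(10): sum of the first n_false decreasing weights."""
    return float(np.sum(alphas[:n_false]))

# Decreasing weights for a list with 10 irrelevant images; alpha_r ~ 1/r is an
# illustrative choice, normalized so the weights sum to 1.
alphas = 1.0 / np.arange(1, 11)
alphas /= alphas.sum()

# One extra false rank at the TOP of the list costs alpha_1 under the top-heavy
# loss, while one at the BOTTOM costs only alpha_10; the AUC loss charges both
# positions the same amount, 1 / (1 * 10) = 0.1.
cost_top = top_heavy_loss(1, alphas)
cost_bottom = top_heavy_loss(10, alphas) - top_heavy_loss(9, alphas)
```

With these weights `cost_top` exceeds `cost_bottom` by an order of magnitude, which is exactly the top-heavy behavior Fig. 1 illustrates.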

Fig. 2. (a) CMC curves of different metric learning algorithms. (b) CMC curves of different loss functions on VIPeR.
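For reference, CMC curves like those reported in Fig. 2 can be computed from a probe-gallery similarity matrix as follows. A minimal sketch assuming the single-shot setting (exactly one true gallery match per probe); `cmc_curve` is an illustrative name.

```python
import numpy as np

def cmc_curve(sim, gt):
    """Cumulative Match Characteristic curve.
    sim: (n_probe, n_gallery) similarity scores.
    gt:  index of the true gallery match for each probe.
    Returns cmc, where cmc[k-1] is the fraction of probes whose true match
    appears within the top-k positions of the ranked gallery."""
    n_probe, n_gallery = sim.shape
    order = np.argsort(-sim, axis=1)   # gallery indices sorted by score, best first
    ranks = np.array([int(np.where(order[i] == gt[i])[0][0]) for i in range(n_probe)])
    return np.array([(ranks < k).mean() for k in range(1, n_gallery + 1)])

# Two probes, three gallery images: probe 0 is matched at rank 1, probe 1 at rank 2,
# so the rank-1 rate is 0.5 and the rank-2 rate is 1.0.
sim = np.array([[0.9, 0.2, 0.4],
                [0.1, 0.3, 0.8]])
cmc = cmc_curve(sim, np.array([0, 1]))
```

Averaging such curves over the 10 random train/test partitions gives the matching rates reported in the tables below.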

III. EXPERIMENTS

A. Experimental Settings

To evaluate the effectiveness of the proposed SLTRL² method, we perform extensive experiments on the VIPeR [29], CUHK Campus [30] and PRID 450S [31] datasets. The newly proposed LOMO [3] feature is used for all the following experiments, and its dimensionality is reduced to 600 by PCA. For our SLTRL method, the learning rate, the regularization parameter and the weight sequence are fixed across runs; these hyper-parameters are selected by cross-validation. The bias b is initialized to zero and the weight matrix W is initialized with 1 on the diagonal and 0 elsewhere. Each dataset is randomly divided into two subsets, one for training and the other for testing. Specifically, there are 316, 485 and 225 individuals in the training sets of the VIPeR, CUHK Campus and PRID 450S datasets respectively. This partition is repeated 10 times and the average result is reported.

B. Experiments on VIPeR

D. Optimization

We solve problem (6) with a gradient descent algorithm. Specifically, the objective function (6) is reformulated in an unconstrained form:

$$ \min_{W,\,b} \; \frac{\lambda}{2}\lVert W \rVert_{F}^{2} + \frac{1}{n}\sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \big[ \Delta(y_i^{*}, y) + F(p_i, y) - F(p_i, y_i^{*}) \big] \qquad (11) $$

In the computation of gradients, we need to find the most violated constraint for each probe image:

$$ \hat{y}_i = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \; \Delta(y_i^{*}, y) + F(p_i, y) \qquad (12) $$

Problem (12) reduces to finding an optimal interleaving of relevant and irrelevant images in the gallery set, and is solved in the same way as for the AP loss [28]. After reformulating problem (6) into the unconstrained form (11), the gradient calculation is straightforward. Please refer to Section II in the supplementary material¹ for more details.

¹https://github.com/jinw1004/SLTRL/blob/master/sup/SLTL-sup-R1.pdf
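To make the loss-augmented inference of Eq. (12) concrete, the most violated constraint can be found by exhaustive search when a probe has a single relevant image and a tiny irrelevant set; the efficient interleaving procedure of [28] replaces this exponential search in practice. A minimal sketch: the function name and the 1/n normalization of the compatibility are illustrative assumptions.

```python
import itertools
import numpy as np

def most_violated(s_pos, s_neg, alphas):
    """Brute-force version of Eq. (12) for ONE relevant image: enumerate every
    sign pattern y_k in {+1, -1} (y_k = -1 means irrelevant image k is falsely
    ranked before the relevant one) and maximize Delta(y*, y) + F(p, y)."""
    n = len(s_neg)
    margins = s_pos - np.asarray(s_neg)            # s(p, g+) - s(p, g_k^-)
    best_y, best_val = None, -np.inf
    for y in itertools.product([1, -1], repeat=n):
        n_false = y.count(-1)                      # number of false ranks
        delta = float(np.sum(alphas[:n_false]))    # top-heavy loss, Eqs. (9)-(10)
        compat = float(np.dot(y, margins)) / n     # compatibility, Eq. (5)
        if delta + compat > best_val:
            best_y, best_val = y, delta + compat
    return best_y, best_val
```

When the relevant image already scores far above an irrelevant one, the most violated output keeps the correct order; when the margin is small relative to the loss weight, flipping the pair pays off, which is what drives the gradient updates.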

VIPeR [29] is a challenging person re-identification dataset that has been widely used for performance evaluation. It contains 632 individuals captured in outdoor scenarios, and each person has two images observed from different camera views. All images are normalized to the same size for the experiments.

Comparison with Metric Learning Algorithms. We first compare the proposed SLTRL algorithm with several conventional metric learning algorithms: ITML [32], PCCA [13], LMNN [12], LFDA [20] and KISSME [14]. To make a fair comparison, we use the same LOMO feature and the same train/test split for each of the algorithms. The final results are listed in Table I, and the CMC curves are shown in Fig. 2(a). It can be seen that the SLTRL method performs better than the compared metric learning algorithms, especially at the top few ranks. In particular, PCCA learns a linear transformation by optimizing the pairwise loss and LMNN optimizes the triplet loss. From the table we can see that our SLTRL method, which learns a nonlinear transformation and optimizes the top-heavy listwise loss, consistently performs better than PCCA and LMNN.

Comparison with Different Losses. To verify the effectiveness of the proposed top-heavy ranking loss, we conduct experiments on the VIPeR dataset with different loss functions including

²Source code available at https://github.com/jinw1004/SLTRL


TABLE I COMPARISON WITH DIFFERENT METRIC LEARNING ALGORITHMS WITH THE SAME FEATURE SET ON VIPER DATASET

TABLE III COMPARISON WITH THE STATE-OF-THE-ART ON CUHK CAMPUS DATASET

TABLE II COMPARISON WITH THE STATE-OF-THE-ART METHODS ON VIPER DATASET

$\Delta_{\mathrm{AUC}}$, $\Delta_{\mathrm{TOP}}$, $\Delta_{1}$ and the triplet loss (see Section III in the supplementary material). The $\Delta_{1}$ loss is set with $\alpha_1 = 1$ and all other weights equal to 0. Fig. 2(b) shows the experimental results. It can be seen that among the four types of losses, learning with $\Delta_{\mathrm{TOP}}$ achieves the best results, which is in accordance with our expectation. $\Delta_{1}$ is the extreme case of the top-heavy loss with all weight concentrated on rank 1; however, it has the worst performance among the four loss functions, possibly because it treats false ranks occurring from position 2 to N with no difference.

Comparison with the State of the Art. Finally, we compare the performance of the proposed SLTRL method with the state-of-the-art results reported on the VIPeR dataset. The results are summarized in Table II. The CMC matching rates of ImpDLA are read from figures in [36]. From Table II we can see that the best performance is achieved by the CBRA [37] method, in which several complementary ranking lists are combined using a ranking aggregation method. The proposed SLTRL method is complementary to CBRA because SLTRL can be used as a weak ranker to generate the original ranking list. Furthermore, our model is very flexible and can be easily integrated with feature learning methods like CNNs [38] in a single framework to learn the feature representation and similarity metric jointly.

C. Experiment on CUHK Campus

The CUHK Campus dataset [30] has 971 identities, with 2 images per person in each view. Images in the CUHK Campus dataset are scaled to a fixed size for evaluation. Since SLTRL is a


TABLE IV COMPARISON WITH THE STATE-OF-THE-ART METHODS ON PRID 450S DATASET

listwise method, in which all irrelevant images are needed to compute the loss value, our method may run into scalability problems as the number of irrelevant images grows. One possible solution is to randomly sample a fixed number of irrelevant images in each iteration. In this experiment, to accelerate the computation, we randomly sample 80 irrelevant images for each probe image in the training stage. We compare our approach against MLF [4], SalMatch [11], SDALF [39], LMNN [12], ITML [32], ImpDLA [36] and LOMO+XQDA [3] in Table III. From the table we can see that the proposed SLTRL method achieves better results than the state-of-the-art method LOMO+XQDA except at rank 1.

D. Experiment on PRID 450S

The PRID 450S dataset [31] contains 450 single-shot person pairs captured by two disjoint cameras. Each image is rescaled to a fixed size for evaluation. We compare our method against [40], EIML [41], SCNCD [9], ECM [18] and CBRA [37]. Moreover, we compare with the state-of-the-art method LOMO+XQDA [3] by running the original implementation on the PRID 450S dataset. As shown in Table IV, we achieve a 1.16% improvement at rank 1 over LOMO+XQDA and greatly outperform the other methods.

IV. CONCLUSION

In this paper, we propose an efficient nonlinear similarity learning method for person re-identification. Different from conventional metric learning algorithms, which optimize a pairwise, triplet or linear structured loss, we optimize a top-heavy listwise loss. Extensive experiments are conducted on VIPeR using the same set of features to show the superiority of the SLTRL method, and promising results are obtained on the challenging VIPeR, CUHK Campus and PRID 450S datasets.


REFERENCES [1] Y. Xie, H. Yu, X. Gong, Z. Dong, and Y. Gao, “Learning visual-spatial saliency for multiple-shot person re-identification,” IEEE Signal Process. Lett., vol. 22, no. 11, pp. 1854–1858, Nov. 2015. [2] G. Lisanti, I. Masi, A. Bagdanov, and A. Del Bimbo, “Person re-identification by iterative re-weighted sparse ranking,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 37, no. 8, pp. 1629–1642, Aug. 2015. [3] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in CVPR, 2015, pp. 2197–2206. [4] R. Zhao, W. Ouyang, and X. Wang, “Learning mid-level filters for person re-identification,” in CVPR, 2014, pp. 144–151. [5] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith, “Learning locally-adaptive decision functions for person verification,” in CVPR, 2013, pp. 3610–3617. [6] W. Zheng, S. Gong, and T. Xiang, “Towards open-world person re-identification by one-shot group-based verification,” IEEE Trans. Patt. Anal. Mach. Intell., vol. PP, no. 99, pp. 1–1, 2015. [7] L. Ma, X. Yang, and D. Tao, “Person re-identification over camera networks using multi-task distance metric learning,” IEEE Trans. Image Process., vol. 23, no. 8, pp. 3656–3670, 2014. [8] X. Wang, W. Zheng, X. Li, and J. Zhang, “Cross-scenario transfer person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. PP, no. 99, pp. 1–1, 2015. [9] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, and S. Z. Li, “Salient color names for person re-identification,” in Computer Vision–ECCV, 2014, pp. 536–551. [10] B. Ma, Y. Su, and F. Jurie, “Covariance descriptor based on bio-inspired features for person re-identification and face verification,” Image Vis. Comput., vol. 32, no. 6, pp. 379–390, 2014. [11] R. Zhao, W. Ouyang, and X. Wang, “Person re-identification by salience matching,” in ICCV, 2013, pp. 2528–2535. [12] K. Q. Weinberger and L. K. 
Saul, “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10, pp. 207–244, 2009. [13] A. Mignon and F. Jurie, “Pcca: A new approach for distance learning from sparse pairwise constraints,” in CVPR, 2012, pp. 2666–2672. [14] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in CVPR, 2012, pp. 2288–2295. [15] M. Hirzer, P. M. Roth, M. Köstinger, and H. Bischof, “Relaxed pairwise learned metric for person re-identification,” in Computer Vision–ECCV, 2012, pp. 780–793. [16] W.-S. Zheng, S. Gong, and T. Xiang, “Reidentification by relative distance comparison,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 35, no. 3, pp. 653–668, 2013. [17] J. Chen, Z. Zhang, and Y. Wang, “Relevance metric learning for person re-identification by exploiting listwise similarities,” IEEE Trans. Image Process., vol. PP, no. 99, pp. 1–1, 2015. [18] X. Liu, H. Wang, Y. Wu, J. Yang, and M.-H. Yang, “An ensemble color model for human re-identification,” in WACV, 2015, pp. 868–875. [19] J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in CVPR, 2014, pp. 1875–1882. [20] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, “Local fisher discriminant analysis for pedestrian re-identification,” in CVPR, 2013, pp. 3318–3325. [21] K. Liu, Z. Zhao, and A. Cai, “Datum-adaptive local metric learning for person re-identification,” IEEE Signal Process. Lett., vol. 22, no. 9, pp. 1457–1461, 2015.

[22] F. Xiong, M. Gou, O. Camps, and M. Sznaier, “Person re-identification using kernel-based metric learning methods,” in Computer Vision–ECCV, 2014, pp. 1–16. [23] H. Liu, M. Qi, and J. Jiang, “Kernelized relaxed margin components analysis for person re-identification,” IEEE Signal Process. Lett., vol. 22, no. 7, pp. 910–914, Jul. 2015. [24] D. Tao, L. Jin, Z. Yang, and X. Li, “Rank preserving sparse learning for kinect based scene classification,” IEEE Trans. Cybernetics, vol. 43, no. 5, pp. 1406–1417, 2013. [25] D. Tao, J. Cheng, M. Song, and X. Lin, “Manifold ranking-based matrix factorization for saliency detection,” IEEE Trans. Neural Netw. Learn. Syst., vol. PP, no. 99, pp. 1–1, 2015. [26] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” J. Machine Learning Research 2005, pp. 1453–1484. [27] N. Usunier, D. Buffoni, and P. Gallinari, “Ranking with ordered weighted pairwise classification,” in Proc. 26th Annu. Int. Conf. Machine Learning, 2009, pp. 1057–1064. [28] Y. Yue, T. Finley, F. Radlinski, and T. Joachims, “A support vector method for optimizing average precision,” in Proc. 30th Annu. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, 2007, pp. 271–278. [29] D. Gray, S. Brennan, and H. Tao, “Evaluating appearance models for recognition, reacquisition, and tracking,” IEEE Int. Workshop on Performance Evaluation for Tracking and Surveillance (PETS), vol. 3, no. 5, 2007. [30] W. Li and X. Wang, “Locally aligned feature transforms across views,” in CVPR, 2013, pp. 3594–3601. [31] P. M. Roth, M. Hirzer, M. Köstinger, C. Beleznai, and H. Bischof, “Mahalanobis distance learning for person re-identification,” in Person Re-Identification. Berlin, Germany: Springer, 2014, pp. 247–267. [32] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Informationtheoretic metric learning,” in Proc. 24th Int. Conf. Machine learning, 2007, pp. 209–216. [33] D. 
Tao, L. Jin, Y. Wang, Y. Yuan, and X. Li, “Person re-identification by regularized smoothing kiss metric learning,” IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 10, pp. 1675–1685, 2013. [34] D. Tao, L. Jin, Y. Wang, and X. Li, “Person reidentification by minimum classification error-based kiss metric learning,” IEEE Trans. Cybernetics, vol. 45, no. 2, pp. 242–252, 2015. [35] L. An, M. Kafai, S. Yang, and B. Bhanu, “Person re-identification with reference descriptor,” IEEE Trans. Circuits Syst. Video Technol., vol. PP, no. 99, pp. 1–1, 2015. [36] E. Ahmed, M. Jones, and T. K. Marks, “An improved deep learning architecture for person re-identification,” in CVPR, 2015, pp. 25–34. [37] R. F. de C. Prates and W. R. Schwartz, “Cbra: Color-based ranking aggregation for person re-identification,” in IEEE Int. Conf. IEEE Image Processing (ICIP), 2015, 2015, pp. 4141–4145. [38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advanced Neural Information Processing Systems, 2012, pp. 1097–1105. [39] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person re-identification by symmetry-driven accumulation of local features,” in CVPR, 2010, pp. 2360–2367. [40] P. M. Roth, M. Hirzer, M. Köstinger, C. Beleznai, and H. Bischof, “Mahalanobis distance learning for person re-identification,” in Person Re-Identification. Berlin, Germany: Springer, 2014, pp. 247–267. [41] M. Hirzer, P. M. Roth, and H. Bischof, “Person re-identification by efficient impostor-based metric learning,” in IEEE Ninth Int. Conf. Advanced Video and Signal-Based Surveillance (AVSS), 2012, 2012, pp. 203–208.