DOG BREED CLASSIFICATION VIA LANDMARKS

Xiaolong Wang, Vincent Ly, Scott Sorensen and Chandra Kambhamettu

Video/Image Modeling and Synthesis Laboratory, Department of Computer and Information Sciences, University of Delaware, Newark, DE

ABSTRACT

Object recognition is an important problem with a wide range of applications. It is also a challenging problem, especially for animal categorization, where the differences among breeds can be subtle. In this paper, building on statistical techniques for landmark-based shape representation, we propose to model the shape of a dog breed as points on the Grassmann manifold and to treat dog breed categorization as a classification problem on this manifold. The proposed scheme is tested on a dataset of 8,351 images covering 133 different breeds. Experimental results demonstrate that the advocated scheme outperforms state-of-the-art approaches by nearly 20%.

Index Terms— Dog breed classification, geometry, machine learning.

1. INTRODUCTION

Object recognition is a widespread research topic in computer vision and image processing, with many significant potential applications including automatic robotic navigation and manipulation, scene understanding, etc. Many approaches have been proposed in the past few years. Previous works mainly focused on using texture information to discriminate categories; among them, the bag-of-features model is one of the most widely used and has been shown to be very effective for object classification [5, 9, 24]. This method is adopted from text document classification [17]: local features extracted from the object are regarded as “visual words” of a certain “topic”. The approach is effective at distinguishing different categories of objects, such as cars, airplanes and humans, mainly because of the presence of highly discriminative features; the wheel, for example, is highly predictive of a car. Recently, many works have focused on subordinate-level categorization, which involves categories of similar objects within the same basic-level class, such as different kinds of dogs or cats. In this paper, our work targets dog breed categorization. Compared to previous works, this problem is more challenging: the differences between classes are subtle [26], and fewer discriminative features are available than in basic-level categorization, which makes it a harder problem.


Fig. 1. Illustration of different dog breeds. From the examples listed, we can see that the distinction between different dog breeds is often very subtle.

Nevertheless, the problem has many useful applications, such as helping to find lost dogs, correcting mislabeled uploaded images, and determining the breed composition of mixed-breed dogs for health and safety. Our work can also provide new insight into related subordinate-level categorization problems. As previously discussed, although the bag-of-features model is effective for basic-level category classification, it is not good at capturing the subtle differences between classes within the same basic category [26]. Another reason for this shortcoming is its weak representation of geometry [8]. Considering these weaknesses, studies on dog breed classification combine appearance and geometry information to infer the class [15, 18]. In [18], the deformable part model [10] is used to characterize the shape of the animals. This deformable model is built on the pictorial structures framework [11], which represents an object as a set of connected parts; within each part, a HOG filter [6] is applied to extract a local appearance feature, and the deformable structure is characterized by spring-like connections between parts. In addition, a bag-of-words model is used to characterize appearance.



Based on their results, the deformable part model is effective at object detection but is not a good indicator of geometric differences among objects. In [15], Liu et al. advocated first detecting semantic landmarks on the dog face and then extracting appearance features at these specific locations. Their landmark detection algorithm is based on the approach advocated in [2]: they first localize the eyes and nose, then hypothesize the locations of other parts, such as the ears. SIFT [16] is used to extract appearance features at eight points. Their experimental results show that using features localized at specific object parts improves classification performance. Although these works use geometric information [15, 18], appearance features still play the major role in the final classification result. Appearance differences between dog breeds are not easy to capture, not only for computers but even for humans; as illustrated in Fig. 1, it is very difficult to distinguish a Belgian Malinois from a German Shepherd. Under pose and viewing angle variation, the problem becomes even more difficult when relying on appearance information.

In this paper, we attempt to characterize the geometry of dog faces to discriminate different breeds. Our contributions are: 1) We use only extracted landmark information to discriminate breeds; the geometric interpretation of the landmark representation is affine-invariant to head pose, and the scheme is simpler and more efficient than feature-based schemes. 2) We implement the algorithms advocated in [18] and [15] as baselines to measure the performance of the proposed scheme; experimental results illustrate that simply using geometric information outperforms previous algorithms by a large margin of nearly 20%. 3) We evaluate the different landmarks and determine which are most discriminative; to our knowledge, this is the first time that this analysis has been performed.

The rest of the paper is organized as follows: our proposed approach is described in Section 2, experimental results are presented in Section 3, and conclusions are given in Section 4.

2. APPROACHES

In previous literature, fusing shape characteristics with appearance features has improved performance [18], but the improvement is not significant compared with using appearance features alone; extracting and using shape information is still an open problem. We model the facial geometry of dog breeds based on 2-D landmarks. The general scheme of our algorithm is illustrated in Fig. 3, and the differences among dog breeds can be observed from the face structure illustrated in Fig. 2. We show that by simply using shape spaces and their associated geometry, one can obtain significant performance improvements in dog breed categorization. We propose to model dog facial geometry as points on the Grassmann manifold and then map dog breed categorization to a classification problem on this manifold.

Fig. 2. Illustration of different dog breeds and their associated landmarks. We find that the geometries associated with different breeds are discriminative; within the same category, the shape constructed by the landmarks is similar.

2.1. Grassmann Manifold

Recently, the Grassmann manifold has found many applications in computer vision and image processing. Lin et al. [14] applied the Grassmann manifold to extract geometric traits that help optimize the performance of informative projections. Hamm and Lee [13] proposed a unifying view of subspace-based learning, treating every subspace as a point in the Grassmann space and applying a kernel on that space. In [4], Chang et al. showed that face images can be well represented on the Grassmann manifold in a way that is invariant to pose, illumination and expression. Wu et al. [25] used the Grassmann manifold to model facial shapes for age estimation and cross-age face recognition. Turaga et al. [20] performed a comprehensive analysis of applications of the Grassmann and Stiefel manifolds in computer vision, such as action recognition and video-based face recognition.

In this work, we apply the Grassmann manifold to represent the geometry of different dog breeds. The 2-D landmarks of a dog face can be represented by a p × 2 matrix; assume the facial landmark points of a given face are denoted as X = [(x1, y1), (x2, y2), ..., (xp, yp)]. Considering pose variations and view changes, we can regard the shape of a given face as a transformation of a pre-defined shape basis. Based on this analysis, we extract a tall-thin orthonormal matrix to represent the corresponding subspace: the SVD X = AΣB^T is first applied, and any arbitrary face shape is considered a spatial transformation of the base face. This means that if two facial shapes correspond to the same basis shape, they span the same subspace. Z = AA^T is used as the projector of X onto the Grassmann manifold.
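To make the projection step concrete, the following is a minimal NumPy sketch of mapping a p × 2 landmark matrix to an orthonormal basis A and its projector Z = AA^T. The function name and the centering step are our own illustrative choices; the paper only states that the thin SVD is applied and the projector is formed.

```python
import numpy as np

def landmarks_to_subspace(X):
    """Map a p x 2 landmark matrix to an orthonormal basis A and projector Z = A A^T.

    Centering the landmarks first is an assumption on our part; the paper only
    states that the thin SVD X = A Sigma B^T is applied and Z = A A^T is used.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                            # assumed: remove translation
    A, _, _ = np.linalg.svd(X, full_matrices=False)   # A is p x 2 with orthonormal columns
    Z = A @ A.T                                       # projector onto the 2-D subspace span(A)
    return A, Z

# Example with p = 8 landmarks (random stand-ins for the eye/nose/ear points).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(8, 2))
A, Z = landmarks_to_subspace(X)
print(A.shape, np.allclose(A.T @ A, np.eye(2)))       # (8, 2) True
```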



Fig. 3. Illustration of the proposed scheme. Given the dog face, eight landmarks are extracted. We project the landmarks onto the Grassmann manifold; then, using the equivalence property, the points on the Grassmann manifold are projected into the ambient space. Tangent vectors are obtained to represent the feature of the given shape and serve as input to the classifier.

A Grassmann manifold G_{m,k} [7] is the space whose points are the k-dimensional linear subspaces of R^m; it can be viewed as a quotient of the orthogonal group SO(m) [12]. This means that geodesics in G_{m,k} can be represented using the one-parameter exponential flow t → exp(tE), where E is a skew-symmetric matrix constructed as E = [[0, D^T], [−D, 0]] with sub-matrix D ∈ R^{(m−k)×k} representing the tangent vector that determines the direction and speed of the geodesic flow. Given a point X_0 on the Grassmann manifold, indicated by its orthonormal basis Z_0, the geodesic path starting from X_0 is specified by Φ(t) = Q exp(tE) J, where Q ∈ SO(m) with Q^T Z_0 = J, and J = [I_k; 0_{(m−k)×k}]. The matrix D is the tangent vector associated with the point Z_0; we refer the reader to [19] for details.

Given the landmarks of the dog faces, we model a face by applying a transformation that warps the average face onto the given face. Given the set of sample points X_t on the Grassmann manifold, represented by their projectors Z_t, the average face is calculated as Z_avg = (1/P) Σ_{i=1}^{P} Z_i, where P is the total number of training samples. The deformation warping the average face to the given face is used as the representation of the geometry of the given shape. Given a subspace Z_0 and Z_avg, the problem becomes how to measure their “difference” on the Grassmann manifold. In this work, we use the direction matrix D of the geodesic connecting the two subspaces as the feature representing the shape of the given dog face; it can be computed using the inverse exponential map on the Grassmann manifold, and the details of the computation can be found in [12].
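The tangent feature D can be computed with the Grassmann logarithm map. Below is a rough NumPy sketch under two assumptions the paper does not spell out: the logarithm is realized with the standard closed-form arctan construction, and an orthonormal basis for the mean shape is recovered from the averaged projector Z_avg by eigendecomposition.

```python
import numpy as np

def grassmann_log(Y1, Y2):
    """Tangent vector (direction matrix D) at span(Y1) pointing toward span(Y2).

    Uses a standard closed-form Grassmann logarithm; the paper only refers to
    [12] for this step, so treat this as one reasonable realization.
    Y1, Y2: p x k matrices with orthonormal columns.
    """
    M = Y1.T @ Y2
    # Component of Y2 orthogonal to span(Y1), aligned with Y1 through M^{-1}.
    L = (np.eye(Y1.shape[0]) - Y1 @ Y1.T) @ Y2 @ np.linalg.inv(M)
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U @ np.diag(np.arctan(s)) @ Vt              # p x k tangent matrix

def mean_shape_basis(bases, k=2):
    """Average the projectors Z_i = A_i A_i^T and keep the dominant k-D subspace.

    The paper averages projectors; recovering an orthonormal basis from the
    average via eigendecomposition is our own assumption.
    """
    Z_avg = sum(A @ A.T for A in bases) / len(bases)
    _, V = np.linalg.eigh(Z_avg)
    return V[:, -k:]                                    # eigenvectors of the k largest eigenvalues

# Feature for one face: tangent vector from the mean shape to the face's
# subspace, flattened into a vector for the SVM, e.g.
#   D = grassmann_log(A_mean, A_face).ravel()
```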

2.2. Classification

In this work, SVMs [3] are applied to learn the classifier that maps the feature to the corresponding category. The histogram intersection kernel is used for its good performance in image classification [1]. For feature vectors α, y ∈ R^n, the intersection kernel is k(α, y) = Σ_{i=1}^{n} min(α(i), y(i)). Classification is based on evaluating sign(d(α)), where d(α) = Σ_{j=1}^{m} a_j y_j k(α, α_j) + b and y_j is the label associated with the feature α_j. We train one-vs-all SVMs for each breed based on the tangent feature extracted from the landmarks. To prevent overfitting, a validation set is used to optimize the parameters; the testing data is not used for parameter optimization.
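As a sketch of the classification stage, the snippet below trains one-vs-all SVMs with a precomputed histogram intersection kernel using scikit-learn. The feature matrix and labels are random placeholders, and the specific API choices (OneVsRestClassifier, SVC with a precomputed kernel) are one reasonable realization rather than the authors' exact setup.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def intersection_kernel(X, Y):
    """Histogram intersection kernel: K[i, j] = sum_d min(X[i, d], Y[j, d])."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

# features: (n_samples, n_dims) flattened tangent vectors D; labels: breed ids.
# Placeholder data for illustration only.
rng = np.random.default_rng(0)
features, labels = rng.random((60, 16)), rng.integers(0, 3, 60)
train, test = slice(0, 40), slice(40, 60)

K_train = intersection_kernel(features[train], features[train])
K_test = intersection_kernel(features[test], features[train])

clf = OneVsRestClassifier(SVC(kernel="precomputed", C=1.0))
clf.fit(K_train, labels[train])
print("accuracy:", (clf.predict(K_test) == labels[test]).mean())
```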

3. EXPERIMENTS

In this section, we describe the experiments in detail. The methods of [15, 18] are implemented as baselines to evaluate the proposed scheme, and an analysis of the experimental results is given.

3.1. Dataset

In this work, we use the dataset collected in [15]. It contains 8,351 images covering 133 different categories (breeds), collected from popular image-sharing websites such as Flickr, ImageNet, etc. Image samples belonging to the same category vary significantly in ear position, color, pose, lighting, etc., which makes the categorization problem very challenging. Amazon Mechanical Turk was used to label the dog breed images in this dataset: workers labeled eight points for each dog face, indicating the locations of the eyes, the nose, the top of the head, the ear tips and the inner ears.

3.2. Experimental Results

We follow the same experimental setting as [15]: one-vs-all SVMs are trained for each breed, and we keep the same training and testing divisions as [15]. We do not compare against the other popular algorithms (MKL [21], LLC [23], BOW [22]) since their performance has already been evaluated in [15] and shown to be lower than the scheme advocated there. Given the eight landmarks extracted on the dog face, the methods described in Section 2 are applied to obtain the tangent feature vector D, and the SVM classification function f predicts the category as f(D). To visualize the performance of the different matching schemes, we draw ROC curves, as shown in Fig. 4. The accuracy of the different schemes, i.e., the percentage of samples correctly classified, is listed in Table 1. This is the same as the “first guess” evaluation criterion used in [15].

5239

ICIP 2014

Table 1. Experimental results comparison. In this paper, we use the landmarks labeled in the dataset. (Liu et al.* indicates results obtained using automatically detected landmarks; all other algorithms use the labeled landmarks (ground truth).)

Method               Accuracy
Parkhi et al. [18]   75.1%
Liu et al. [15]      77.2%
Liu et al.* [15]     67.0%
Our method           96.5%

Fig. 5. Illustration of landmark positions on the given dog face.

Fig. 6. Comparison results using different landmarks. The numbers shown correspond to the point labels illustrated in Fig. 5.

Fig. 4. The ROC curves of dog breed classification using different approaches.

There is nearly 20% improvement compared to the state of the art. These results show that the shape feature is a good representation for discriminating different dog breeds. Feature extraction from localized facial parts [15] also performs better than using global images [18]. Our proposed scheme is simpler and more efficient for dealing with this problem.

3.3. Effects of Landmarks on the Classification

The proposed method uses eight landmarks, which are illustrated in Fig. 5. The question is whether fewer landmarks are sufficient and which landmarks are most useful for classification. We have conducted several experiments using different landmark subsets to investigate this; the results are illustrated in Fig. 6. Based on these results, we find that the relative placement of the eyes is more helpful than the nose for categorizing breeds. Although dogs can move their ears, the ears' relative locations still preserve characteristics of the breed.
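The landmark-subset experiments of this section can be reproduced, in spirit, by simply dropping rows of the landmark matrix before the subspace projection. In the sketch below, the landmark ordering and the subset names are hypothetical, since the paper does not specify the annotation order.

```python
import numpy as np

# Hypothetical index order for the eight annotated points; the dataset's actual
# ordering is not specified in the paper, so these names are assumptions.
LANDMARKS = ["left_eye", "right_eye", "nose", "head_top",
             "left_ear_tip", "right_ear_tip", "left_inner_ear", "right_inner_ear"]
SUBSETS = {
    "eyes_nose": [0, 1, 2],
    "eyes_ears": [0, 1, 4, 5, 6, 7],
    "all":       list(range(8)),
}

def subspace(X):
    """Orthonormal basis of the (centered) landmark matrix, as in Section 2.1."""
    X = X - X.mean(axis=0)
    A, _, _ = np.linalg.svd(X, full_matrices=False)
    return A

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(8, 2))        # stand-in for one annotated face
for name, idx in SUBSETS.items():
    A = subspace(X[idx, :])                 # re-run the pipeline on a landmark subset
    print(name, A.shape)
```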

4. CONCLUSION

In this work, we have proposed a new scheme to model the characteristics of 2-D landmarks extracted from dog faces. The Grassmann manifold is applied to describe the geometry of a given breed's facial structure. Our algorithm is tested on a large-scale dataset, and the experiments demonstrate significant improvement over the state of the art. The advocated algorithm is also simpler and more efficient. Experimental results demonstrate that projecting the face geometry onto the Grassmann manifold preserves the category information, and this feature is robust to the unconstrained conditions under which the data was collected. We show that excellent results can be obtained by simply using geometry-based features. This work not only provides a new algorithm for dog breed classification, but also gives insight into the broader subordinate-level object classification problem.

Acknowledgements The authors would like to thank J. Liu from Columbia University for providing the database used, and for helpful discussions.



REFERENCES

[1] A. Barla, F. Odone, and A. Verri. Histogram intersection kernel for image classification. In ICIP, volume 3, pages III–513, 2003.
[2] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In CVPR, pages 545–552, 2011.
[3] C. J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[4] J.-M. Chang, M. Kirby, and C. Peterson. Set-to-set face recognition under variations in pose and illumination. In Biometrics Symposium, pages 1–6, 2007.
[5] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in ECCV, volume 1, pages 1–2, 2004.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893, 2005.
[7] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[9] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, volume 2, pages 524–531, 2005.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[11] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005.
[12] K. A. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren. Efficient algorithms for inferences on Grassmann manifolds. In IEEE Workshop on Statistical Signal Processing, pages 315–318, 2003.
[13] J. Hamm and D. D. Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. In Proceedings of the 25th International Conference on Machine Learning, pages 376–383. ACM, 2008.
[14] D. Lin, S. Yan, and X. Tang. Pursuing informative projection on Grassmann manifold. In CVPR, volume 2, pages 1727–1734, 2006.
[15] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur. Dog breed classification using part localization. In ECCV, pages 172–185, 2012.
[16] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[17] A. McCallum, K. Nigam, et al. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48, 1998.
[18] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar. Cats and dogs. In CVPR, pages 3498–3505, 2012.
[19] A. Srivastava and E. Klassen. Bayesian and geometric subspace tracking. Advances in Applied Probability, 36(1):43–56, 2004.
[20] P. Turaga, A. Veeraraghavan, A. Srivastava, and R. Chellappa. Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. PAMI, 33(11):2273–2286, 2011.
[21] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In CVPR, pages 606–613, 2009.
[22] A. Vedaldi and A. Zisserman. Image classification practical. http://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htm, 2011.
[23] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367, 2010.
[24] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan. Categorizing nine visual classes using local appearance descriptors. illumination, 17:21, 2004.
[25] T. Wu, P. Turaga, and R. Chellappa. Age estimation and face verification across aging using landmarks. IEEE Transactions on Information Forensics and Security, 2012.
[26] S. Yang, L. Bo, J. Wang, and L. G. Shapiro. Unsupervised template learning for fine-grained object recognition. In NIPS, pages 3131–3139, 2012.

