Silhouette-based Object Phenotype Recognition using 3D Shape Priors

Yu Chen¹    Tae-Kyun Kim²    Roberto Cipolla¹
¹Department of Engineering, University of Cambridge, Cambridge, UK
²Department of Electrical Engineering, Imperial College London, UK

Abstract

This paper tackles the novel and challenging problem of 3D object phenotype recognition from a single 2D silhouette. To bridge the large pose (articulation or deformation) and camera-viewpoint changes between the gallery images and the query image, we propose a novel probabilistic inference algorithm based on 3D shape priors. Our approach combines generative and discriminative learning. We use latent probabilistic generative models to capture 3D shape and pose variations from a set of 3D mesh models. Based on these 3D shape priors, we generate a large number of projections for different phenotype classes, poses, and camera viewpoints, and use Random Forests to efficiently solve the shape and pose inference problems. By model selection in terms of the silhouette coherency between the query and the projections of 3D shapes synthesized from the galleries, we obtain the phenotype recognition result as well as a fast approximate 3D reconstruction of the query. To verify the efficacy of the proposed approach, we present new datasets which contain over 500 images of various human and shark phenotypes and motions. The experimental results clearly show the benefit of using the 3D priors in the proposed method over previous 2D-based methods.

1. Introduction

Recognizing 3D objects from one or more 2D views is a fundamental problem in computer vision. There have been increasing attempts to solve this problem, which embraces a number of research issues such as view-invariant object instance/category recognition [20, 21, 25, 35, 47, 49, 52, 24], object pose recognition [2, 9, 17, 22, 27, 32, 37, 43, 46, 44, 48, 50], object viewpoint classification [15], gait recognition [28], face recognition across pose and expression [30, 53], etc. However, to the best of our knowledge, the problem of classifying generic object phenotypes (shapes) under 3D object pose and camera-viewpoint changes has not been tackled. Successful solutions would be widely useful for potential applications such as automatic human body shape monitoring, in relation to recent food recognition studies in computer vision, and wild animal (such as horse and fish) tracking.

In this work, we address this novel and challenging shape recognition task, i.e., classifying phenotypes of a 3D object from a single 2D silhouette input (see Fig. 1 for an example of human body shapes). Here, phenotypes refer to the intrinsic shape differences across given human instances, e.g., fat vs. thin, tall vs. short, muscular vs. unmuscular. The major difficulty of this problem is that the query silhouette can undergo large pose and camera-viewpoint changes, so traditional 2D-based approaches fail to capture the intrinsic shape (dis-)similarity between the query and gallery silhouettes. In view of this problem, we propose a novel generative+discriminative solution using 3D shape priors, i.e., knowledge learnt from previously seen 3D shapes. Our approach is motivated by the observation that humans can perceive the 3D shape of an object from a single image, provided that they have seen similar 3D shapes. Once 3D shapes are estimated from single images (single-view reconstruction), camera-viewpoint- and pose-invariant object recognition becomes achievable. The problem we tackle therefore conjoins single-view reconstruction with 3D object recognition. The novelties and main contributions lie in:

• Going beyond pose recognition: object pose recognition and tracking with 3D template models has been widely studied [22, 32, 37, 40, 43]. This work attempts to capture more subtle 3D shape variations on top of the estimated pose and camera viewpoint.
• Recognising generic deformable objects: our framework does not require strong class-specific knowledge, such as the jointed human body skeleton used in [3, 12, 16, 42] or the face shape models defined by manual control points in [53], and is thus applicable to different object categories. Previous studies [21, 25, 34, 49, 52, 24] are limited to rigid object classes.
• Exploiting shape cues (vs. textures): whereas the majority of existing 3D object recognition work relies on image appearance or textures (e.g., affine-invariant patches [35, 47]), we exploit shape cues, silhouettes, which remain useful when there is no overlap in views between a model and a query, or no consistent textures, e.g., due to changing clothes.
• Transferring 3D models to images: we learn from 3D models and perform recognition on images, which contrasts with previous work matching only among 3D models [8] or only among 2D images.

Figure 1. Phenotype recognition problem. Given a silhouette gallery of different body shapes, the goal is to classify the body shape of a query silhouette in the presence of pose and/or camera viewpoint changes.
1.1. Related Work

There has been growing interest in view-invariant object instance or category recognition [35, 47]. The building blocks of these methods are often image patches that are invariant up to affine transformations, together with the structural relations captured among the patches. Texture-based object recognition is useful in many scenarios, but it becomes inherently ambiguous when there are no consistent textures between a model and a query: no overlapping views, changing clothes, or textureless objects.

Shape (a silhouette or edge map) is another useful cue which has long been explored for object recognition; however, most relevant studies have been conducted in 2D [11, 39, 45]. They do not explicitly capture the 3D shapes, poses, and camera viewpoints of objects, relying instead on a large number of model images, and they generally fail when query images exhibit considerably different poses or viewpoints from those of the models. Studies on 2D shape representation [14, 26] have tackled the problem of recognizing articulated objects, but they model the articulation on a 2D basis and have difficulties dealing with self-occlusions and large 3D camera pose changes.

3D templates and shape models have been widely incorporated into object pose recognition for hands [37, 46] or human bodies [17, 27, 32, 37, 43, 48, 50], but these models are designed for pose estimation, often without consideration of shape variations. Whereas they do not explicitly handle the classification of phenotypes or 3D shapes, we capture and discriminate 3D shape variations in a manner invariant to object poses and camera viewpoints.

Single view reconstruction is an active research field. To name a few examples, Prasad et al. [29] reconstructed curved objects from wireframes; Han and Zhu [18] applied Bayesian reconstruction to polyhedral objects, trees, and grass. Black and colleagues [4, 3, 12, 16, 42] estimated detailed human body shapes using parametric morphable models. In [42], a discriminative+generative method was proposed to help initialise the body parameters for reconstruction. In [3], Bălan and Black touched on the phenotype classification/recognition problem in a preliminary way, testing their shape estimation approach on a binary gender classification task and achieving good accuracy. In [16], shading cues are incorporated for single view reconstruction. Although these methods demonstrate detailed shape recovery, it does not seem easy, in general, to solve the regression problem over the huge parametric space of joint angles, or to extend the approach to other object categories. Chen et al. [7] tackled more general deformable object categories: their shape and pose generators require only a small number of latent variables to be estimated, yet are able to capture complex 3D object shapes. Another work close to ours is [34], in which a unified method to segment, infer 3D shapes, and recognise object categories is proposed; it uses a crude voxel representation for the shape prior and is applied to object categories such as cups, mugs, and plates, but it is limited to simple and rigid objects. In [25, 49], 3D geometric models are learnt to detect objects in images, but similarly, no articulation or deformation is considered.

The following branches of study differ conceptually from our work. Human gait recognition [28] performs identification from video sequences (rather than single images) under a fixed camera viewpoint. Image-based face recognition across pose is an intensively studied area [30, 53]; representative methods exploit active face shape models for viewpoint invariance [53] or expression invariance [31], but these models are specifically designed for faces and involve many manually defined control points. Studies on 3D object retrieval are also quite different, as they match one 3D model with another.

2. Phenotype Recognition and Shape Reconstruction Based on Classifiers

In this paper, the phenotype recognition problem is formulated as follows. We assume that a gallery of 2D phenotype silhouettes $\mathcal{G} = \{S^G_c\}_{c=1}^{N_c}$ of $N_c$ instances, containing one sample silhouette $S^G_c$ for each phenotype class $c$ (see Fig. 1 for examples), is provided as the reference, and that all the silhouettes in $\mathcal{G}$ are in a common canonical pose.

Figure 2. The graphical model for the 3D shape recognition and reconstruction problem (left). Example trees of the Random Forests for the phenotype classifier $F_S$ (middle) and the pose classifier $F_A$ (right): they show the class histogram at each split node and the phenotype/pose class at each leaf node. Note that the trees shown here are kept small for visualisation purposes.

We hope to find the phenotype label $c^* \in \{1, 2, \cdots, N_c\}$ for a query silhouette $S^q$ in an arbitrary pose and camera viewpoint.

To handle the difficulties caused by pose and camera-view changes, our approach learns 3D shape priors $\mathcal{M}$ from available 3D data. Gaussian Process latent variable models (GPLVMs) [23] have been shown to be powerful for human pose estimation [2, 10, 38, 50, 51] and 3D shape modelling [7, 6, 36, 37]. We adopt the framework of [7], in which two GPLVMs, the shape generator $\mathcal{M}_S$ and the pose generator $\mathcal{M}_A$, are learned separately and then jointly used to synthesize new 3D shapes. Each 3D shape $\mathbf{V}$ is then parametrized by a phenotype latent variable $\mathbf{x}_S$ and a pose latent variable $\mathbf{x}_A$, as Fig. 2 (left) shows. Please refer to [7] for details of the model training.

Given a silhouette image $S^q$, we infer its underlying pose latent parameters $\mathbf{x}_A$ and camera parameters $\gamma$ so that we can neutralise the influence of pose and camera-viewpoint changes in the recognition. Inferring these parameters can be done through the optimisation process of the generative model in [7]. However, the back-projection from 2D to 3D is usually multi-modal, which results in a non-convex objective function with multiple local optima that is difficult to solve. To avoid such a non-convex optimisation, previous studies have combined generative and discriminative approaches; this strategy has been exploited for articulated pose recognition and tracking [9, 33, 44] and for 3D shape recovery [37, 42]. Here, we introduce a discriminative approach based on random forest (RF) classifiers for quickly hypothesizing the shape (phenotype), pose, and camera parameters. Random forests have shown excellent performance in pose recognition and multi-modal mapping problems [32, 40]. In our approach, three RFs, the phenotype classifier $F_S$, the pose classifier $F_A$, and the camera classifier $F_C$, are learned from a large number of silhouettes synthesized by $\mathcal{M}_S$ and $\mathcal{M}_A$ with different camera parameters $\gamma$ (see Section 3 for details of learning these RFs). $F_S$ predicts the shape parameter $\mathbf{x}_S$ from a gallery silhouette $S^G_c$, while $F_A$ and $F_C$ predict the pose and camera parameters $\{\mathbf{x}_A, \gamma\}$ from the query silhouette $S^q$. The silhouette $S^q$ or $S^G_c$ is passed down each tree in the forest, and the leaf nodes of these trees quickly provide multiple candidates for the corresponding shape, pose, or camera parameters (see Fig. 2 for examples). Finally, the 3D shape $\mathbf{V}$ of the query $S^q$ is recovered from the estimated pose latent values $\mathbf{x}_A$ of $S^q$ and the shape latent values $\mathbf{x}_S$ of each gallery instance $S^G_c$ (see Section 2.2), and recognition is achieved by model selection, i.e., assigning the phenotype class $c^*$ that yields the best match between the query silhouette $S^q$ and the projection of the reconstructed 3D shape $\mathbf{V}$ under the camera viewpoint $\gamma$ (see Section 2.1).
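To make the pipeline concrete, the following is a minimal sketch of this hypothesize-and-verify loop (Python; the forest and generator interfaces, the `ocm` callable, and the pruning threshold are hypothetical placeholders rather than the authors' implementation, and the determinant normalisation of the silhouette likelihood is omitted for brevity):

```python
# Minimal sketch of the hypothesize-and-verify recognition pipeline.
import numpy as np

def recognise_phenotype(query_sil, gallery_sils, forests, generators, ocm,
                        sigma_s2=1.0, thresh=0.05):
    """Return the index c* of the gallery class that best explains the query silhouette."""
    F_S, F_A, F_C = forests                               # phenotype, pose and camera forests
    pose_cands = F_A.predict_candidates(query_sil)        # [(weight h_Aj, x_Aj), ...]
    cam_cands = F_C.predict_candidates(query_sil)         # [(weight h_Ck, gamma_k), ...]
    scores = []
    for gallery_sil in gallery_sils:                      # one canonical-pose silhouette per class
        shape_cands = F_S.predict_candidates(gallery_sil) # [(weight h_Si, x_Si), ...]
        posterior = 0.0
        for h_S, x_S in shape_cands:
            for h_A, x_A in pose_cands:
                for h_C, gamma in cam_cands:
                    if min(h_S, h_A, h_C) < thresh:       # prune low-weight candidates
                        continue
                    mesh = generators.synthesize(x_S, x_A)   # GPLVM shape + pose sample
                    proj = generators.project(mesh, gamma)   # mean projected silhouette
                    lik = np.exp(-ocm(proj, query_sil) / (2.0 * sigma_s2))
                    posterior += h_S * h_A * h_C * lik       # weighted sum over candidates
        scores.append(posterior)
    return int(np.argmax(scores))                          # phenotype label c*
```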

2.1. Phenotype Recognition

Phenotype recognition is formulated as a model-selection problem. Based on the graphical model in Fig. 2 (left), we infer the label $c^*$ of the query instance by maximizing the posterior probability given the pre-learned shape priors $\mathcal{M} = \{\mathcal{M}_S, \mathcal{M}_A\}$ and classifiers $\mathcal{F} = \{F_S, F_A, F_C\}$:

$$
\begin{aligned}
P(c^* \mid S^q, \tilde{S}^q, \{S^G_c\}_{c=1}^{N_c}, \mathcal{M}, \mathcal{F})
&\propto P(\tilde{S}^q \mid c^*, S^q, \{S^G_c\}_{c=1}^{N_c}, \mathcal{M}, \mathcal{F})\, P(c^*) \\
&= P(c^*) \int_{\mathbf{x}_S, \mathbf{x}_A, \gamma} P(\mathbf{x}_S \mid S^G_{c^*}, F_S)\, P(\mathbf{x}_A \mid S^q, F_A)\, P(\gamma \mid S^q, F_C)\, P(\tilde{S}^q \mid \mathbf{x}_S, \mathbf{x}_A, \gamma, \mathcal{M})\, d\mathbf{x}_S\, d\mathbf{x}_A\, d\gamma,
\end{aligned}
\quad (1)
$$

where $\tilde{S}^q$ denotes the mirror node of $S^q$. Here, we assume that the class prior $P(c^*)$ follows a uniform distribution, i.e., $P(c^*) = 1/N_c$.

In (1), the first three terms describe the priors over the shape and pose latent parameters $(\mathbf{x}_S, \mathbf{x}_A)$ and the camera parameters $\gamma$ given by the random forest classifiers $F_S$, $F_A$, and $F_C$, respectively. The shape classifier $F_S$ predicts $N_S$ candidate phenotype shapes $\{\mathbf{x}^{c^*}_{S,i}\}_{i=1}^{N_S}$ for the canonically posed gallery silhouette $S^G_{c^*}$ of each class $c^*$, while the pose classifier $F_A$ and the camera-viewpoint classifier $F_C$ predict $N_A$ candidate poses $\{\mathbf{x}_{A,j}\}_{j=1}^{N_A}$ and $N_K$ candidate camera parameters $\{\gamma_k\}_{k=1}^{N_K}$ for the query silhouette $S^q$. Mathematically, these three terms can be written as sums of delta impulses:

$$P(\mathbf{x}_S \mid S^G_{c^*}, F_S) = \sum_{i=1}^{N_S} h^{c^*}_{S,i}\, \delta(\mathbf{x}_S - \mathbf{x}^{c^*}_{S,i}), \quad (2)$$

$$P(\mathbf{x}_A \mid S^q, F_A) = \sum_{j=1}^{N_A} h_{A,j}\, \delta(\mathbf{x}_A - \mathbf{x}_{A,j}), \quad (3)$$

$$P(\gamma \mid S^q, F_C) = \sum_{k=1}^{N_K} h_{C,k}\, \delta(\gamma - \gamma_k), \quad (4)$$

where $h^{c^*}_{S,i}$, $h_{A,j}$, and $h_{C,k}$ are the class-histogram values voted by the trees in $F_S$, $F_A$, and $F_C$, respectively; they satisfy $\sum_{i=1}^{N_S} h_{S,i} = \sum_{j=1}^{N_A} h_{A,j} = \sum_{k=1}^{N_K} h_{C,k} = 1$. (For robustness and acceleration, we discard all low-weight candidates below the thresholds $h^{c^*}_{S,i} < 0.05$, $h_{A,j} < 0.05$, and $h_{C,k} < 0.05$ in the experiments.)

In the last term of the model, each combination of shape and pose latent parameters $(\mathbf{x}_S, \mathbf{x}_A)$ and camera pose $\gamma$ is verified by the silhouette likelihood of the query image $\tilde{S}^q$, which can be formulated as follows [7]:

$$P(\tilde{S}^q \mid \mathbf{x}_S, \mathbf{x}_A, \gamma, \mathcal{M}) \approx \frac{1}{Z_S \sqrt{\det\!\left(\mathbf{I} + \frac{1}{\sigma_s^2}\boldsymbol{\Sigma}_W\right)}}\, e^{-OCM\left(\boldsymbol{\mu}_W,\, \tilde{S}^q\right)/2\sigma_s^2}, \quad (5)$$

where $\mathbf{W} \sim \mathcal{N}(\boldsymbol{\mu}_W, \boldsymbol{\Sigma}_W)$ denotes the projected silhouette of the latent 3D shape $\mathbf{V}$ under the camera viewpoint $\gamma$ and is defined as a multi-variate Gaussian distribution; $\boldsymbol{\mu}_W$ and $\boldsymbol{\Sigma}_W$ are the mean and the covariance matrix of $\mathbf{W}$, respectively (please refer to equations (12) and (15) in [7] for their detailed forms); $\sigma_s^2$ and $Z_S$ are normalisation factors. We use the oriented Chamfer matching (OCM) distance [46] to measure the similarity between the mean projected silhouette $\boldsymbol{\mu}_W$ and the silhouette of the query image $\tilde{S}^q$; the detailed formulation of OCM is given in Section 4.2.

Given all the probability terms, the final posterior probability in (1) can be computed as

$$P(c^* \mid S^q, \tilde{S}^q, \{S^G_c\}_{c=1}^{N_c}, \mathcal{M}, \mathcal{F}) \approx \frac{1}{N_C} \sum_{i=1}^{N_S} \sum_{j=1}^{N_A} \sum_{k=1}^{N_K} h^{c^*}_{S,i}\, h_{A,j}\, h_{C,k}\, P(\tilde{S}^q \mid \mathbf{x}^{c^*}_{S,i}, \mathbf{x}_{A,j}, \gamma_k, \mathcal{M}). \quad (6)$$

2.2. Single View 3D Shape Reconstruction

As a by-product, our framework can also be used to quickly predict an approximate 3D shape $\mathbf{V}$ from the query silhouette $S^q$. This shape reconstruction problem can be formulated probabilistically as

$$
\begin{aligned}
P(\mathbf{V} \mid S^q, \{S^G_c\}_{c=1}^{N_c}, \mathcal{M}, \mathcal{F})
&= \sum_{c=1}^{N_c} \int_{\mathbf{x}_S, \mathbf{x}_A} P(\mathbf{V}, \mathbf{x}_S, \mathbf{x}_A, c \mid S^q, S^G_c, \mathcal{M}, \mathcal{F})\, d\mathbf{x}_S\, d\mathbf{x}_A \\
&= \sum_{c=1}^{N_c} P(c) \int_{\mathbf{x}_S, \mathbf{x}_A} P(\mathbf{x}_S \mid S^G_c, F_S)\, P(\mathbf{x}_A \mid S^q, F_A)\, P(\mathbf{V} \mid \mathbf{x}_S, \mathbf{x}_A, \mathcal{M})\, d\mathbf{x}_S\, d\mathbf{x}_A \\
&\approx \frac{1}{N_C} \sum_{c=1}^{N_c} \sum_{i=1}^{N_S} \sum_{j=1}^{N_A} h^{c}_{S,i}\, h_{A,j}\, \mathcal{N}\!\left(\mathbf{V} \mid \boldsymbol{\mu}_V, \boldsymbol{\Sigma}_V\right),
\end{aligned}
\quad (7)
$$

where $\boldsymbol{\mu}_V = \boldsymbol{\mu}_V(\mathbf{x}_{A,j}, \mathbf{x}^{c}_{S,i})$ and $\boldsymbol{\Sigma}_V = \boldsymbol{\Sigma}_V(\mathbf{x}_{A,j}, \mathbf{x}^{c}_{S,i})$ denote the mean and covariance functions of the 3D shape distribution $\mathbf{V}$, respectively; their detailed formulations can be found in [7]. Compared with the optimisation approach in [7], the classifier-based approach in this paper provides fairly good qualitative results and is much more efficient in computation time (see Section 4.4).
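As a rough illustration of how this reconstruction by-product can be obtained, the sketch below simply keeps the mean shape of the highest-weight mixture component in (7); the helper methods are assumed interfaces, not the authors' code:

```python
# Sketch: approximate single-view reconstruction by picking the highest-weight
# (shape, pose) candidate pair and returning its mean 3D shape mu_V.
def approximate_reconstruction(query_sil, gallery_sils, forests, generators, thresh=0.05):
    F_S, F_A, _ = forests
    pose_cands = F_A.predict_candidates(query_sil)            # [(h_Aj, x_Aj), ...]
    best_weight, best_mesh = 0.0, None
    for gallery_sil in gallery_sils:                          # mixture components over classes c
        for h_S, x_S in F_S.predict_candidates(gallery_sil):
            for h_A, x_A in pose_cands:
                w = h_S * h_A                                 # mixture weight (up to 1/N_C)
                if w < thresh or w <= best_weight:
                    continue
                best_weight = w
                best_mesh = generators.mean_shape(x_S, x_A)   # mu_V(x_A, x_S)
    return best_mesh
```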

3. Training Random Forest Classifiers

To learn the random forest classifiers $\mathcal{F} = \{F_S, F_A, F_C\}$, we use the shape and pose generators $\{\mathcal{M}_S, \mathcal{M}_A\}$ to synthesize a large number of silhouettes with different latent parameters $\{\mathbf{x}_S, \mathbf{x}_A\}$ and camera viewpoints $\gamma$.

The shape classifier $F_S$, an ensemble of randomised decision trees, encodes the phenotype information of each gallery silhouette $S^G_c$ in the canonical pose. It is trained on a dataset $D_1$ consisting of canonical-posed silhouettes of $N = 50$ phenotype samples $\{\mathbf{x}_{S,i}\}_{i=1}^{N}$ uniformly sampled from the latent space of the shape generator $\mathcal{M}_S$. For each sample with phenotype label $i \in \{1, 2, \cdots, N\}$, we generate $R = 250$ sample silhouettes from the 3D mesh model with minor pose perturbations and camera-parameter changes, e.g., slight camera rotations and focal-length changes. All $N \times R = 12{,}500$ binary images are aligned and normalised to the same size.

The pose classifier $F_A$ and the camera classifier $F_C$, on the other hand, are used to predict the pose and camera viewpoint of the query silhouette $S^q$. We train them on another dataset $D_2$ with large pose and camera-viewpoint variations as well as phenotype variations. We uniformly sample $M = 50$ pose samples $\{\mathbf{x}_{A,j}\}_{j=1}^{M}$ from the latent space of the pose generator $\mathcal{M}_A$ and $K = 50$ camera-viewpoint samples $\{\gamma_k\}_{k=1}^{K}$ uniformly distributed in 3D space, and generate 3D shapes using the same $N = 50$ phenotype samples $\{\mathbf{x}_{S,i}\}_{i=1}^{N}$ used in the shape-classifier training stage. This yields $N \times M \times K = 125{,}000$ silhouette instances, each labeled by $(i, j, k)$ representing phenotype, pose, and camera viewpoint, respectively. The ensembles of decision trees for $F_A$ and $F_C$ are grown using the pose label $j$ and the camera label $k$, respectively. (In our implementation, the tree number $N_T$ of all the forests $F_S$, $F_A$, and $F_C$ is set to 30 and the maximum depth $d_{max}$ of a single tree to 30.) See below for the random features and split criteria used.
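The construction of the two training sets can be outlined as follows (an illustrative sketch only; the sampling helpers and rendering interface are assumptions, not the authors' exact protocol):

```python
# Sketch of synthesizing the training sets D1 and D2 from the generators.
def build_training_sets(gen, sample_camera, perturb_camera, N=50, R=250, M=50, K=50):
    shapes = [gen.sample_shape() for _ in range(N)]    # {x_S,i} from the M_S latent space
    poses = [gen.sample_pose() for _ in range(M)]      # {x_A,j} from the M_A latent space
    cameras = [sample_camera() for _ in range(K)]      # {gamma_k} uniform over viewpoints

    # D1: canonical pose with minor pose/camera perturbations -> trains F_S (label i).
    D1 = [(gen.render(x_S, gen.canonical_pose(), perturb_camera()), i)
          for i, x_S in enumerate(shapes) for _ in range(R)]

    # D2: every (phenotype, pose, camera) combination -> trains F_A (label j) and F_C (label k).
    D2 = [(gen.render(x_S, x_A, gamma), (i, j, k))
          for i, x_S in enumerate(shapes)
          for j, x_A in enumerate(poses)
          for k, gamma in enumerate(cameras)]
    return D1, D2
```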


3.1. Error Tolerant Features

We generate $D = 12000$ random rectangle pairs $\{(R_{d,1}, R_{d,2})\}_{d=1}^{D}$ with arbitrary centroid locations $(\mathbf{l}_{d,1}, \mathbf{l}_{d,2})$, heights $h_d$, and widths $w_d$ (see Fig. 3(a) for an example). For each binary silhouette image $I$ used for training, the difference of mean image intensities within each pair of rectangles is computed as the split feature
$$f_d = \frac{1}{w_d h_d}\left(\sum_{\mathbf{l} \in R_{d,1}} I(\mathbf{l}) - \sum_{\mathbf{l}' \in R_{d,2}} I(\mathbf{l}')\right).$$
In this way, each training instance $I$ is converted into a 12000-D feature vector $\mathbf{f} = [f_d]_{d=1}^{D}$. These features are efficient to compute and capture spatial context [41]; the sketch below illustrates how they can be evaluated with an integral image.

When training the phenotype classifier $F_S$, we also introduce a feature-correction scheme. Since $F_S$ is trained on synthetic silhouettes generated by the shape priors $\mathcal{M}$, which are clean and unclothed, its discriminative power is usually reduced when working on noisy gallery silhouettes segmented from real images. To model the systematic errors between the synthetic and real silhouettes, we follow the approach in [7] and create an extra silhouette set consisting of $N_e$ pairs of real and synthetic silhouettes that describe different clothing and segmentation errors and capture different phenotypes. We then extract features from these images using the same set of random rectangle pairs. Let $\tilde{\mathbf{f}}^e_m$ $(m = 1, 2, \cdots, N_e)$ denote the features extracted from the real silhouette images and $\mathbf{f}^e_m$ those from the corresponding synthetic silhouette images. The feature errors can thus be modeled as $\mathbf{e}_m = \tilde{\mathbf{f}}^e_m - \mathbf{f}^e_m$. To compensate for the systematic silhouette errors when training $F_S$, we correct all the synthetic training data with these error vectors $\{\mathbf{e}_m\}_{m=1}^{N_e}$. For each feature vector $\mathbf{f}$ of instance $I$, we find its $T$ nearest-neighbour synthetic features among $\{\mathbf{f}^e_m\}_{m=1}^{N_e}$ (we choose $T = 3$) and use the corresponding error vectors $\mathbf{e}_t$ to correct $\mathbf{f}$ as $\tilde{\mathbf{f}}_t = \mathbf{e}_t + \mathbf{f}$ $(t = 1, 2, \cdots, T)$. Finally, all $N \times R \times T$ corrected feature vectors $\tilde{\mathbf{f}}_t$ of the $N \times R$ training instances are used as training samples for $F_S$.
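A possible way to evaluate the paired-rectangle features efficiently is via an integral image, as in the following sketch (the rectangle-sampling details are assumptions; only the feature formula above is taken from the text):

```python
# Illustrative computation of the paired-rectangle split features with an integral image.
import numpy as np

def sample_rectangle_pairs(height, width, num_pairs=12000, rng=None):
    """Sample pairs of rectangles sharing the same size (h_d, w_d) but different centroids."""
    rng = rng or np.random.default_rng(0)
    pairs = []
    for _ in range(num_pairs):
        h = int(rng.integers(4, height // 2))            # rectangle height h_d
        w = int(rng.integers(4, width // 2))              # rectangle width  w_d
        tops = rng.integers(0, height - h, size=2)        # top-left corners of R_{d,1}, R_{d,2}
        lefts = rng.integers(0, width - w, size=2)
        pairs.append((int(tops[0]), int(lefts[0]), int(tops[1]), int(lefts[1]), h, w))
    return pairs

def rectangle_pair_features(silhouette, pairs):
    """f_d = (sum of I over R_{d,1} - sum over R_{d,2}) / (w_d * h_d), via an integral image."""
    I = silhouette.astype(np.float64)
    S = np.pad(I.cumsum(0).cumsum(1), ((1, 0), (1, 0)))   # integral image with a zero border

    def box_sum(top, left, h, w):
        return S[top + h, left + w] - S[top, left + w] - S[top + h, left] + S[top, left]

    feats = np.empty(len(pairs))
    for d, (t1, l1, t2, l2, h, w) in enumerate(pairs):
        feats[d] = (box_sum(t1, l1, h, w) - box_sum(t2, l2, h, w)) / (w * h)
    return feats
```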

Figure 3. (a) Random paired-rectangle features. (b) A 3 × 3 example of the dissimilarity weighting matrix Π for human phenotype classes.

3.2. Similarity-Aware Criteria Functions

When training a random forest classifier $F^* \in \{F_S, F_A, F_C\}$, each instance $I$ is pushed through every tree in $F^*$ starting from the root node.

At each split node, the error-corrected random feature $\tilde{\mathbf{f}} = [\tilde{f}_d]_{d=1}^{D}$ is evaluated for every training instance (see Section 3.1). Then, based on the result of the binary test $\tilde{f}_d > \tau_{th}$, $I$ is sent to the left or the right child node. The feature dimension index $d$ and the split value $\tau_{th}$ at a split node $n$ are chosen to maximize
$$\Delta C(n) = C(n) - \frac{|n_L|\, C(n_L) + |n_R|\, C(n_R)}{|n_L| + |n_R|},$$
where $C$ measures the distribution purity of the node, and $n_L$ and $n_R$ denote the left and right children of node $n$. For the criteria function $C$, we generalise Gibbs and Martin's diversity index [13] and take the class similarity into account:
$$C(n) = \mathbf{p}_n^{T} \boldsymbol{\Pi}\, \mathbf{p}_n, \quad (8)$$
where $\mathbf{p}_n = [p_{n,1}, p_{n,2}, \cdots, p_{n,N_B}]$ denotes the class distribution of node $n$; $N_B$ denotes the number of class labels of the random forest $F^*$; and the weighting matrix $\boldsymbol{\Pi} = \{\pi_{ij} = 1 - e^{-\|\Delta \mathbf{V}_{i,j}\|^2/\sigma^2}\}_{N_B \times N_B}$ is defined by the average spatial mesh distance $\|\Delta \mathbf{V}_{i,j}\|^2$ between classes $i$ and $j$ (see Fig. 3(b) for an example on phenotype classes). When $\boldsymbol{\Pi} = \mathbf{1} - \mathbf{I}$, equation (8) reduces to the standard diversity index $C(n) = 1 - \sum_{c=1}^{N_B} p_{n,c}^2$. Intuitively, (8) assigns a lower misclassification penalty between two visually similar classes. Our experiments show that this similarity weighting scheme notably improves the recognition rate (see Section 4.3).
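The similarity-aware impurity and the corresponding split gain could be computed as in the following sketch (the dissimilarity matrix Π is assumed to be precomputed from the mean mesh distances between classes):

```python
# Sketch of the similarity-aware impurity C(n) and the split gain Delta C(n).
import numpy as np

def similarity_aware_impurity(labels, Pi):
    """C(n) = p_n^T Pi p_n, where p_n is the class distribution at node n."""
    num_classes = Pi.shape[0]
    p = np.bincount(labels, minlength=num_classes) / max(len(labels), 1)
    return p @ Pi @ p

def split_gain(labels, feature_values, threshold, Pi):
    """Delta C(n) = C(n) - (|nL| C(nL) + |nR| C(nR)) / (|nL| + |nR|) for the test f_d > tau."""
    labels = np.asarray(labels)
    go_right = np.asarray(feature_values) > threshold
    left, right = labels[~go_right], labels[go_right]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    parent = similarity_aware_impurity(labels, Pi)
    children = (len(left) * similarity_aware_impurity(left, Pi)
                + len(right) * similarity_aware_impurity(right, Pi)) / len(labels)
    return parent - children

# With Pi = 1 - I (all-ones matrix minus the identity), C(n) reduces to the standard
# diversity index 1 - sum_c p_{n,c}^2; a smoother Pi penalises confusions between
# visually similar classes less.
```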

4. Experimental Results

4.1. Datasets

We have verified the efficacy of our approach on two shape categories: humans and sharks. For the human data, we train the shape model $\mathcal{M}_S$ on the CAESAR database, which contains over 2000 different body shapes of North American and European adults in a common standing pose, and train the pose model $\mathcal{M}_A$ on 42 walking and jumping-jack sequences from the CMU Mocap dataset. For the shark data, we learn $\mathcal{M}_S$ on a shape dataset that contains eleven 3D shark models of different shark species available from the Internet [19], and $\mathcal{M}_A$ on an animatable 3D MEX shark model used to generate an 11-frame sequence of shark tail-waving motion [7]. The mesh resolutions are 3678 vertices/7356 faces for the human data and 1840 vertices/3676 faces for the shark data, respectively. We empirically set the latent-space dimension of the shape model $\mathcal{M}_S$ to 6 for the human data and 3 for the shark data, while for the pose model $\mathcal{M}_A$ we set the latent dimension to 2 for both, similarly to [7].

As there are no suitable public datasets for evaluating the proposed approach, we have collected two new silhouette datasets which capture a wide span of phenotype, pose, and camera-viewpoint changes (see Fig. 4 for examples). The Human motion dataset captures two different human motions: walking (184 images of 16 human instances) and jumping-jack motion (170 images of 13 human instances). The images are cropped from video sequences on YouTube and from publicly available human motion datasets, e.g., HumanEva [43]. For each instance, a canonical standing-pose image is provided (see Fig. 1 and Rows 1-2 of Fig. 4). All the instances are in tightly-fitting clothing. The Shark motion dataset includes 168 images of 13 shark instances of 5 sub-species. These images are cropped from underwater swimming sequences downloaded from the Internet. For each instance, a profile-view image is provided as the canonical-pose gallery image. The silhouettes are manually segmented from the images, normalised by their height, and resized to a resolution of 121 × 155. For both datasets, $N_e = 20$ additional images are collected for modelling the feature errors (see Section 3.1).

Figure 4. Examples of images and their corresponding silhouettes in our phenotype-class recognition datasets. Rows 1, 2: dataset structure: one canonical-posed instance and several arbitrary-posed instances; Row 3: human jumping-jack motion; Row 4: human walking; Row 5: shark underwater swimming motion.

4.2. Comparative Methods

For comparison, we also implemented three state-of-the-art methods based on 2D shape matching: 1) Shape Contexts (SC) [1], 2) Inner-Distance Shape Context (IDSC) [26], and 3) Oriented Chamfer Matching (OCM) [46], as well as two methods using the 3D shape priors: 4) the single-view 3D shape reconstruction method based on a Mixture of Experts [42] and 5) an RF implementation that directly uses the shape class labels. Nearest-neighbour classification is performed using the similarity provided by each compared method.

Histogram of Shape Context (HoSC). Shape contexts (SC) [5] are rich local shape-based histograms encoding contour information, and they have been widely used for shape matching and pose recognition. Since SCs are defined locally at every silhouette point, representing the whole shape with them can be expensive. To reduce the dimensionality, Agarwal and Triggs introduced a bag-of-features scheme called the histogram of shape contexts (HoSC) [1] for human pose estimation. In HoSC, k-means clustering yields an L-dimensional codebook of cluster means (L = 100 in this paper), and all the shape contexts of a silhouette are then softly binned into a quantized L-dimensional histogram. We implemented a 2D approach, HoSC-χ², which compares the χ²-distances between the HoSC features extracted from the query and each gallery silhouette.

Inner-Distance Shape Context (IDSC). Recent research on shape matching has addressed the problem of finding articulation-invariant distance measures for 2D shapes. A representative recent work is the Inner-Distance Shape Context (IDSC) of Ling and Jacobs [26], which has proved successful in 2D shape classification problems. The authors' own code is used.

2D Oriented Chamfer Matching (OCM). Chamfer matching and its variants have been widely used for shape matching and pose recognition. Among them, oriented Chamfer matching has proved to be an effective method for shape-based template matching [46]. The query silhouette $S^q = \{\mathbf{s}^q_k\}_{k=1}^{N_q}$ and the gallery silhouettes $S^G_c = \{\mathbf{s}^G_{c,j}\}_{j=1}^{N^G_c}$ $(c = 1, 2, \cdots, N_c)$, where $\mathbf{s}^q_k$ and $\mathbf{s}^G_{c,j}$ denote edge points, are divided into $N_{ch}$ orientation channels $\{S^q_t\}_{t=1}^{N_{ch}}$ and $\{S^G_{c,t}\}_{t=1}^{N_{ch}}$, respectively. In our implementation, we set $N_{ch} = 8$. To minimise the orientation-allocation error of image edges, an edge point $\mathbf{s}^G_{c,j}$ is assigned to both adjacent channels when its orientation lies near the channel border. The OCM distance between $S^G_c$ and $S^q$ is calculated as the sum of independent Chamfer distances over the orientation channels:

$$OCM(S^G_c, S^q) = \frac{1}{N_q} \sum_{t=1}^{N_{ch}} \sum_{\mathbf{s}^G_{c,j} \in S^G_{c,t}} \min_{\mathbf{s}^q_k \in S^q_t} \|\mathbf{s}^q_k - \mathbf{s}^G_{c,j}\|^2. \quad (9)$$
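One way to evaluate this distance efficiently is to precompute a distance transform of the query edges in each orientation channel, as in the sketch below (SciPy is assumed; the double assignment of border-orientation points to adjacent channels is omitted for brevity):

```python
# Sketch of oriented Chamfer matching via per-channel distance transforms.
import numpy as np
from scipy.ndimage import distance_transform_edt

def oriented_chamfer(query_edges, query_orient, gallery_points, gallery_orient,
                     shape, n_channels=8):
    """Sum of squared distances from gallery edge points to the nearest query edge
    point within the same orientation channel, normalised by the query edge count."""
    q_bins = (query_orient / np.pi * n_channels).astype(int) % n_channels
    g_bins = (gallery_orient / np.pi * n_channels).astype(int) % n_channels
    total, n_q = 0.0, max(len(query_edges), 1)
    for t in range(n_channels):
        pts = query_edges[q_bins == t]                  # (row, col) query edges in channel t
        if len(pts) == 0:
            continue
        mask = np.ones(shape, dtype=bool)
        mask[pts[:, 0], pts[:, 1]] = False              # False pixels are the EDT targets
        dist = distance_transform_edt(mask)             # distance to nearest query edge
        g = gallery_points[g_bins == t]
        if len(g) > 0:
            total += np.sum(dist[g[:, 0], g[:, 1]] ** 2)
    return total / n_q
```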

Figure 5. Phenotype recognition accuracy on the human and shark datasets. (a) Comparison over 8 different approaches; (b) performance under different maximum tree depths $d_{max}$; and (c) different tree numbers $N_T$ of the random forests $F_S$, $F_A$, and $F_C$.

Mixture of Experts for shape reconstruction. We implemented a 3D shape recognition approach, called HoSC-MoE-Chamfer, based on the shape reconstruction framework proposed in [42], in which mappings from HoSC features to shape and pose parameters are learned using a Mixture of Experts (MoE) model. In their work, weighted linear regressors are used as mixture components; we fix the number of components to 10 in our implementation. For a fair comparison, the same training sets $D_1$ and $D_2$ and shape priors $\mathcal{M}$ are used, and the recognition is likewise based on the OCM distance between the predicted shape and the query silhouette.

Single Random Forest Shape Verification. We also compare our framework with a straightforward classification approach based on a single shape random forest, in which $F_S$ is directly learned on the large pose and camera-viewpoint variation dataset $D_2$ using the phenotype label $i$ (see Section 3). An arbitrary input silhouette $S^q$, in the form of a binary image, is pushed through each tree $T_r$ $(r = 1, 2, \cdots, N_F)$ starting at the root, applying the corresponding sequence of binary tests until it reaches a leaf node, which stores a phenotype label $i_r$. The phenotype prediction of the forest $F_S$ is given by a histogram $\mathbf{h}_q = \{h_{q,i}\}_{i=1}^{D}$ which summarises the phenotype votes from the trees $T_r$ $(r = 1, 2, \cdots, N_F)$. The phenotype similarity between the query silhouette $S^q$ and a gallery silhouette $S^G_c$ $(c = 1, 2, \cdots, N_C)$ can then be measured by the $\chi^2$-distance between their random forest prediction histograms $\mathbf{h}_q$ and $\mathbf{h}^G_c = \{h^G_{c,i}\}_{i=1}^{D}$:

$$\chi^2(S^q, S^G_c) = \sum_{i=1}^{D} \frac{(h_{q,i} - h^G_{c,i})^2}{2(h_{q,i} + h^G_{c,i})}. \quad (10)$$

Here $\bar{\mathbf{h}}_{q_{1/2}} = \frac{1}{N}\sum_{i=1}^{N} h_{q_{1/2},i}\,\mathbf{1}_N$ denotes a vector consisting of the average bin value of the histogram $\mathbf{h}_{q_1}$ ($\mathbf{h}_{q_2}$).
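A compact sketch of this baseline, with a hypothetical per-tree prediction interface, is given below:

```python
# Sketch of the single-forest baseline: per-tree phenotype votes accumulated into a
# histogram, compared with the chi-square distance of Eq. (10).
import numpy as np

def vote_histogram(forest, silhouette, num_classes):
    """One vote per tree for the phenotype label stored at the reached leaf node."""
    votes = np.zeros(num_classes)
    for tree in forest:                           # forest: iterable of decision trees
        votes[tree.predict_label(silhouette)] += 1.0
    return votes / len(forest)

def chi_square(h_q, h_g, eps=1e-12):
    """Chi-square distance between two normalised vote histograms."""
    return float(np.sum((h_q - h_g) ** 2 / (2.0 * (h_q + h_g) + eps)))
```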

4.3. Numerical Results of Phenotype Recognition

We perform cross-validations by randomly selecting 5 different instances, using their canonical-posed images as the galleries and all their other poses as queries.

Fig. 5(a) reports the recognition rates of the different approaches. In general, the 3D-based approaches (single RF, HoSC-MoE-Chamfer, and the proposed method G+D) outperform the 2D-based ones (OCM, HoSC-χ², and IDSC) on the phenotype recognition tasks; the best 2D shape measure, IDSC, comes closest to the 3D approaches. This indicates the benefit of using 3D shape priors to handle pose deformations and camera-viewpoint changes.

On the other hand, given the same training data, our approach (G+D) performs best among the three 3D approaches in all settings. Compared to the single shape RF, our framework, which factorizes the three types of variation in the training stage, better captures subtle shape variations. In most cases, object pose and camera-viewpoint changes affect the silhouette appearance more strongly than phenotype variations, and hence they greatly weaken the discriminative power of a single RF learnt directly on the mixed-variation dataset $D_2$ with shape labels. Instead, we learn the phenotype classifier $F_S$ on the canonical-posed dataset $D_1$, which does not include large pose and camera-viewpoint changes, and train the pose and camera classifiers on the mixed-variation dataset $D_2$ with the pose and camera labels, respectively; the pose and camera parameters are much more reliably estimated than the shape parameters given the same training data. The comparison between our approach and HoSC-MoE-Chamfer shows that, given the same training data, the random forests and rectangle features we use also outperform the combination of MoE and HoSC features for phenotype discrimination. This could partly be owing to the feature selection performed during RF training and to the generation of multiple hypotheses for a single input in the RF prediction stage.

We carried out further experiments to understand how the recognition performance is affected by the features and parameters of the RFs, and by noise in the input silhouettes.


Figure 6. (a) A silhouette corrupted by random salt-and-pepper noise of different levels. (b) Phenotype recognition performance under different noise levels.

To evaluate the benefit of the feature correction (Section 3.1) and the similarity-based criteria function (Section 3.2), we implemented two variants of the proposed approach (G+D): the framework without error modelling (G+D-E) and the framework using the standard diversity index [13] as the criteria function (G+D-S). These results are also presented in Fig. 5; both schemes improve the recognition performance of our approach to some extent on all three datasets.

We then investigated the influence of the random forest parameters, i.e., the maximum tree depth $d_{max}$ and the tree number $N_T$ of $F_S$, $F_A$, and $F_C$. As shown in Fig. 5(b) and 5(c), the accuracy does not vary much beyond a depth of 25, while increasing the number of trees in each forest gradually improves the recognition rate.

Finally, we examined the noise sensitivity of the proposed phenotype recognition approach. In this experiment, all the test silhouettes are corrupted by random salt-and-pepper noise at levels of 0%, 1%, 2.5%, 5%, 7.5%, and 10% (see Fig. 6(a)), and the algorithm (G+D) is then run on the corrupted silhouettes. The recognition rates achieved at the different noise levels are given in Fig. 6(b). The recognition performance degrades slowly when the noise level is below 5% and drops more rapidly when the noise level reaches 7.5%, indicating fairly good robustness to moderate noise in the input silhouette. We believe this robustness comes mainly from the paired-rectangle features, which smooth out the effect of noise to some extent. We also find that the performance curves can sometimes be non-monotonic, partly due to the randomness of the features and decision thresholds in the RF classifiers.

4.4. Approximate Single View Reconstruction

In our framework, the intermediate 3D shape candidates $\mathbf{V}$ obtained during the recognition process can be used for an approximate 3D reconstruction from a single silhouette input, as mentioned in Section 2.2.

In Fig. 7, we show some qualitative 3D outputs for different phenotypes obtained with our framework, in contrast with those generated by the approach in [7]. In general, the highest-weight shape candidates returned by the random forest classifiers often include meaningful shapes that serve as fairly good approximate reconstructions, albeit with lower silhouette coherency and less accurate pose estimation. However, some results are still in the wrong phenotype (e.g., instance 5) or in a wrong pose or camera viewpoint (e.g., instance 9). This is mainly due to silhouette ambiguity or to the limited discriminative power of the random forest classifiers given our training set.

We also measured the running time of both approaches on a 2.8 GHz CPU. The average time for generating a 3D shape with our generative+discriminative framework is less than 10 seconds using unoptimised Matlab code, whereas the approach in [7] takes about 10 to 15 minutes to generate 10 candidates. This improvement in computational efficiency owes much to using RFs to hypothesize $\mathbf{x}_S$, $\mathbf{x}_A$, and $\gamma$, which greatly narrows down the search space of the algorithm.

Figure 7. Approximate single view reconstruction using the shape candidates from the random forest classifiers. (a) Input query images and silhouettes; (b) the highest-weight 3D shape candidates from $F_S$ and $F_A$ for each query silhouette (in two different views); (c) results generated by the approach in [7] (in two different views).

5. Conclusions

We have presented a framework that combines both generative and discriminative cues for recognizing the phenotype class of an object from a single silhouette input and for reconstructing its approximate 3D shape. We learn 3D probabilistic shape priors of the object category with GPLVMs to handle camera-viewpoint changes and pose deformations, and use random forests for efficient inference of the phenotype, pose, and camera parameters. Experiments on human and shark silhouettes have shown the advantage of our approach over both standard 2D-based methods and relevant 3D-based methods.


The present accuracy on the datasets we provide, especially on the shark dataset, is limited by the descriptive power of the shape and pose generators used to synthesize silhouettes and by the insufficient number of 3D shapes and motion sequences used for training; more extensive 3D training data would improve the accuracy. Another limitation of the current framework is its requirement for silhouette segmentation, which could be alleviated by, e.g., a Kinect camera that yields reliable foreground-background segmentation in real time. As future work, we plan to build a larger-scale phenotype recognition dataset covering different categories of objects and make it publicly available, which would help evaluate our approach and support comparative studies.

References [1] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 28(1):44–58, 2006. 6 [2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3d pose estimation and tracking by detection. IEEE Conference on Computer Vision and Pattern Recognition, 2010. 1, 3 [3] A. Balan and M. Black. The naked truth: estimating body shape under clothing. European Conference on Computer Vision, LNCS 5303:15–29, 2008. 1, 2 [4] A. Balan, L. Sigal, M. Black, J. Davis, and H. Haussecker. Detailed human shape and pose from images. IEEE Conference on Computer Vision and Pattern Recognition, 2007. 2 [5] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 24(24):509–522, 2002. 6 [6] Y. Chen and R. Cipolla. Single and sparse view 3d reconstruction by learning shape priors. Computer Vision and Image Understanding, 115(5):586–602, 2011. 3 [7] Y. Chen, T.-K. Kim, and R. Cipolla. Inferring 3d shapes and deformations from single views. European Conference on Computer Vision, LNCS 6313:300–313, 2010. 2, 3, 4, 5, 6, 8, 9

[8] C. Cyr and B. Kimia. 3D object recognition using shape similaritybased aspect graph. International Conference on Computer Vision, pages 254–261, 2001. 2 [9] D. Demirdjian, L. Taycher, G. Shakhnarovich, K. Grauman, and T. Darrell. Avoiding the “streetlight effect”: Tracking by exploring likelihood modes. International Conference on Computer Vision, 2005. 1, 3 [10] C. Ek, P. Torr, and N. Lawrence. Gaussian process latent variable model for human pose estimation. Machine Learning for Multimodal Interaction, LNCS 4892:132–143, 2007. 3 [11] V. Ferrari, F. Jurie, and C. Schmid. From images to shape models for object detection. International Journal of Computer Vision, Mar. 2010. 2 [12] O. Freifeld, A. Weiss, S. Zuffi, and M. Black. Contour people: A parameterized model of 2D articulated human shape. IEEE Conference on Computer Vision and Pattern Recognition, 2010. 1, 2 [13] J. Gibbs and W. Martin. Urbanization, technology and the division of labor. American Sociological Review, 27:66–77, 1962. 5, 8 [14] R. Gopalan, P. Turaga, and R. Chellappa. Articulation-invariant representation of non-planar shapes. European Conference on Computer Vision, LNCS 6313:286–299, 2010. 2 [15] C. Gu and X. Ren. Discriminative mixture-of-templates for viewpoint classification. European Conference on Computer Vision, LNCS 6315:408–421, 2010. 1 [16] P. Guan, A. Weiss, A. Balan, and M. Black. Estimating human shape and pose from a single image. IEEE International Conference on Computer Vision, 2009. 1, 2 [17] A. Gupta, T. Chen, F. Chen, D. Kimber, and L. Davis. Context and observation driven latent variable model for human pose estimation. IEEE Conference on Computer Vision and Pattern Recognition, 2008. 1, 2 [18] F. Han and S. Zhu. Bayesian reconstruction of 3D shapes and scenes from a single image. HLK’03: Proceedings of the 1st IEEE International Workshop on Higher-Level Knowledge, 2003. 2 [19] T. Hassner and R. Basri. Example based 3D reconstruction from single 2D images. Beyond Patches Workshop at CVPR, page 15, 2006. 6 [20] D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, 2007. 1 [21] W. Hu and S. Zhu. Learning a probabilistic model mixing 3d and 2d primitives for view invariant object recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2010. 1

[22] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. British Machine Vision Conference, 2010. 1 [23] N. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 16:329–336, 2004. 3 [24] Y. Li, L. Gu, and T. Kanade. A robust shape model for multi-view car alignment. IEEE Conference on Computer Vision and Pattern Recognition, 2009. 1 [25] J. Liebelt and C. Schmid. Multi-view object class detection with a 3d geometric model. IEEE Conference on Computer Vision and Pattern Recognition, 2010. 1, 2 [26] H. Ling and D. Jacob. Shape classification through structured learning of matching measures. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 29(2):286–299, 2007. 2, 6 [27] R. Navaratnam, A. Fitzgibbon, and R. Cipolla. Semi-supervised joint manifold learning for multi-valued regression. IEEE International Conference on Computer Vision, 2007. 1, 2 [28] M. Nixon, T. Tan, and R. Chellappa. Human Identification based on Gait. Springer, 2005. 1, 2 [29] M. Prasad, A. Zisserman, A. Fitzgibbon, and L. Gool. Finding nemo: Deformable object class modelling using curve matching. IEEE Conference on Computer Vision and Pattern Recognition, 2010. 2 [30] S. Prince, J. Elder, J. Warrell, and F. Felisberti. Tied factor analysis for face recognition across large pose differences. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 30(6):970–984, 2008. 1, 2 [31] Z. Riaz, C. Mayer, M. Wimmer, and B. Radig. Model based face recognition across facial expressions. Journal of Information and Communication Technology, 2008. 2 [32] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P. Torr. Randomized trees for human pose detection. IEEE Conference on Computer Vision and Pattern Recognition, 2008. 1, 2, 3 [33] R. Rosales and S. Sclaroff. Combining generative and discriminative models in a framework for articulated pose estimation. International Journal of Computer Vision, 67(3):251–276, 2006. 3 [34] D. Rother and G. Sapiro. Seeing 3D objects in a single 2D image. IEEE International Conference on Computer Vision, 2009. 1, 2 [35] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3d object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision, 66(3):231–259, 2006. 1, 2 [36] M. Salzmann, F. Moreno-Noguer, V. Lepetit, and P. Fua. Closedform solution to non-rigid 3D surface registration. European Conference on Computer Vision, LNCS 5305:581–594, 2008. 3 [37] M. Salzmann and R. Urtasun. Combining discriminative and generative methos for 3D deformable surface and articulated pose reconstruction. IEEE Conference on Computer Vision and Pattern Recognition, 2010. 1, 2, 3 [38] A. Shon, K. Grochow, A. Hertzmann, and R. Rao. Learning shared latent structure for image synthesis and robotic imitation. Advances in Neural Information Processing Systems, 16:1233–1240, 2006. 3 [39] J. Shotton, A. Blake, and R. Cipolla. Multi-scale categorical object recognition using contour fragments. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 30(7):1270–1281, 2008. 2 [40] J. Shotton, A. Fitzgibbon, M. Cook, and A. Blake. Real-time human pose recognition in parts from single depth images. IEEE Conference on Computer Vision and Pattern Recognition, 2011. 1, 3 [41] J. Shotton, J. Winn, C. Rother, and A. Criminisi. 
Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture. International Journal of Computer Vision, 81:2–23, 2009. 5 [42] L. Sigal, A. Bˇalan, and M. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. Advances in Neural Information Processing System, 2007. 1, 2, 3, 6, 7

[43] L. Sigal, A. Bˇalan, and M. Black. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1):4–27, 2010. 1, 2, 6 [44] M. Sminchisescu, A. Kanaujia, and D. Metaxas. Learning joint topdown and bottom-up processes for 3d visual inference. IEEE Conference on Computer Vision and Pattern Recognition, 2006. 1, 3 [45] P. Srinivasan, Q. Zhu, and J. Shi. Many-to-one contour matching for describing and discriminating object shape. IEEE Conference on Computer Vision and Pattern Recognition, 2010. 2 [46] B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla. Model-based hand tracking using a hierarchical bayesian filter. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 28(9):1372–1384, 2006. 1, 2, 4, 6 [47] J. Sun, M. Ovsjanivkov, and L. Guibas. A concise and provably informative multi-scale signature based on heat diffusion. Eurographics Symposium on Geometry Processing, 28(5):1383–1392, 2009. 1, 2 [48] G. Taylor, L. Sigal, D. Fleet, and G. Hinton. Dynamical binary latent variable models for 3d human pose tracking. IEEE Conference on Computer Vision and Pattern Recognition, 2010. 1, 2 [49] A. Toshev, A. Makadia, and K. Daniilidis. Shape-based object recognition in videos using 3d synthetic object models. IEEE Conference on Computer Vision and Pattern Recognition, 2009. 1, 2 [50] R. Urtasun, D. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. IEEE International Conference on Computer Vision, 2005. 1, 2, 3 [51] R. Urtasun, D. Fleet, A. Hertzmann, and P. Fua. 3D people tracking with gaussian process dynamical models. IEEE Conference on Computer Vision and Pattern Recognition, 2006. 3 [52] P. Yan, S. Khan, and M. Shah. 3d model based object class detection in an arbitrary view. International Conference on Computer Vision, 2007. 1 [53] X. Zhang and Y. Gao. Face recognition across pose: A review. Pattern Recognition, 42:2876–2896, 2009. 1, 2