Strict Pyramidal Deep Architectures for Person Re-Identification

Sara Iodice¹, Alfredo Petrosino¹ and Ihsan Ullah¹,²

¹ CVPR Lab, Department of Science and Technology, University of Naples Parthenope, Italy
² Department of Computer Science, University of Milan, Italy

Abstract. We report a strict 3D pyramidal neural network model, based on convolutional neural networks and the concept of image pyramids, for person re-identification in video surveillance. The main advantage of the model is that it maintains the spatial topology of the input image, while presenting a simple connection scheme with lower computational and memory costs than other neural networks. Results are reported for person re-identification in challenging real-world environments.

1 Introduction

Person re-identification (PRe-ID) is an open problem in computer vision that tries to answer questions like: "Have I seen this person before?" [12], or "Is this the same person?". More formally, it is the task of recognizing an individual in different locations over a set of non-overlapping camera views. It is an important computer vision task with applications in many contexts of intelligent video surveillance, such as people tracking and behaviour analysis. PRe-ID usually needs to match person images captured by surveillance cameras working in wide-angle mode. The resolution of the person images is therefore very low (e.g., around 48×128 pixels) and the lighting conditions are unstable. Furthermore, the direction of the cameras and the pose of the persons are arbitrary. These factors make re-identification of persons under surveillance greatly challenging because of two distinctive properties: large intra-class variations and inter-class ambiguities. In addition, aspects like camera view change, pose variation, non-rigid deformation, unstable illumination and low resolution also play a key role in making the task difficult.

The majority of existing methods include two separate phases: extracting the most discriminative features, and then looking for a better way to compare those features, called metric learning. The features usually come from separate sources, e.g. colour and texture; some of them are designed by hand, while others are learned. Finally, they are concatenated or fused by simple strategies to enhance performance [5, 1, 9]. Recent methods go deeper to learn more discriminative and diverse features. They combine the two separate modules, feature extraction and metric learning, in a unified framework, like the Siamese Convolutional Neural Network (SCNN) with Deep Metric Learning (DML) [11]. SCNN with DML has an extra edge due to the direct learning of a similarity metric from image pixels; the parameters of each layer are continuously updated, guided by a common objective function, and extract discriminative features better than the hand-crafted features of traditional computer vision models. Furthermore, with multichannel kernels, as in convolutional neural networks (CNN), different kinds of features are combined more naturally, which yields a clear advantage over classical fusion schemes such as feature concatenation or the sum rule. CNNs extract features of low discriminative power at the lower layers, and the power increases with depth. However, one limitation of deep CNN models lies in the ambiguity introduced at feature extraction when an increasing number of feature maps is generated from these low-power features. It would instead be preferable to use a structure that extracts many features from the raw information and then refines them in the higher layers to make them more distinctive. A viable model is the Strict Pyramidal Structure we propose, which can refine the features and reduce their ambiguity for better discrimination. In this context, our main contributions can be summarized as follows:

1. modifying the existing SCNN with DML model to follow a Strict Pyramidal Structure;
2. providing a deeper analysis of the impact of the batch size on performance;
3. demonstrating that unbalanced datasets can be handled by introducing a regularization parameter into the cost function;
4. showing that the proposed pyramidal structure enhances Rank-1 performance.

Section 2 of the paper overviews the basics of the strict pyramidal model and describes the architecture of the proposed SCNN model with pyramidal structure, Section 3 reports results on the well-known VIPeR dataset, together with a specific analysis of the behaviour of the proposed architecture, and Section 4 draws conclusions.

2 Strictly Pyramidal Structure

To adopt a pyramidal structure for decision making, as done in the brain, we use a strict 3D pyramidal architecture: starting from a large input and first layer, the features are refined at each higher layer until a reduced, most discriminative set of features is obtained. The model takes inspiration from the early pyramidal neural network model [3] for the strict pyramidal structure, and from its more recent version, PyraNet [8], which learns its parameters from the input up to the output. The objective is to show that strictly following a pyramidal structure can enhance performance compared to unrestricted models, even with a simple structure, fewer feature maps and fewer hidden layers.

2.1 PyraNet

The PyraNet model [8] was inspired by the pyramidal neural network (PNN) model reported in [3], with 2D and 1D layers. Differently from the original model in [3], the coefficients of the receptive fields are adaptive; the lower 2D layers perform feature extraction and reduction, followed by 1D layers at the top for the classification of an image. The model is almost equivalent to a CNN [10] with the pooling layers removed. However, there are two main differences: firstly, it does not perform convolution but rather a weighted sum (correlation) operation; secondly, the weights are not arranged in a shared kernel, but each output neuron is assigned its own local kernel, determined by its input neurons and their respective weights. In addition, no pooling layers are used to reduce the dimensions; rather, the dimensions are reduced by the stride of the receptive field at each layer. Both PNN and PyraNet follow the same second-order back-propagation technique for learning the parameters with a cross-entropy loss function. PyraNet achieved 96.3% accuracy on gender recognition, similar to an SVM and 5% higher than a CNN with the same input size. An important aspect of popular convolutional deep models is the weight-sharing concept that gives them an edge over other neural network models: it greatly reduces the number of learning parameters, but places a larger burden on the few parameters that remain.
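To make the difference from a convolutional layer concrete, the following is a minimal sketch of a 2D PyraNet-style layer, assuming one adaptive weight per input position and a receptive field slid with a stride so that no pooling is needed; the receptive field size, overlap and tanh activation are illustrative choices, not values taken from [8].

import numpy as np

def pyranet_2d_layer(x, w, b=0.0, rf=4, overlap=1):
    """One 2-D PyraNet-style layer (illustrative sketch).

    x  : (H, W) input map
    w  : (H, W) adaptive weights, one per input neuron (no shared kernel)
    rf : receptive field size; overlap : overlap between neighbouring fields
    """
    stride = rf - overlap
    H, Wd = x.shape
    out_h = (H - rf) // stride + 1
    out_w = (Wd - rf) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # weighted sum (correlation) over the receptive field: since the
            # weights belong to input positions, each output neuron effectively
            # sees its own unique local kernel
            y[i, j] = np.tanh(np.sum(x[r:r + rf, c:c + rf] * w[r:r + rf, c:c + rf]) + b)
    return y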

2.2 Architecture

Our model takes inspiration from CNN-DML [11], a siamese neural network whose structure is well suited to the person re-identification problem. As is well known, neural networks usually work in a standalone mode, where the input is a sample and the output is a predicted label, as in several pattern recognition problems, e.g. handwritten digit recognition [10], object recognition [4], human action recognition [7], and others, in which the training and test sets are characterized by the same classes. The siamese architecture [2] instead provides a different pattern, where the input is an image pair and the output is a binary value revealing whether or not the two images come from the same subject. Another important aspect is that the weights of the two sub-networks forming the siamese architecture can be shared or not. For the person re-identification problem, the best choice is clearly to share the parameters, in order to capture the peculiarities of an individual in different poses or in images acquired from different views. Our architecture, named Strict Pyramidal DML, is composed of two siamese Strict Pyramidal CNN blocks, B1 and B2, a connection function C and a cost function J. Driven by a common objective function, this model can learn simultaneously both the features and an optimal metric to compare them. The Strictly Pyramidal CNN block is composed of 2 convolutional layers, C1 and C3, 2 max pooling layers, S2 and S4, and a fully-connected layer F. In particular, the C1 layer exploits NF1 = 32 filters with kernel size 7×7, while the C3 layer uses NF3 = 25 filters of size 5×5. Finally, the Strictly Pyramidal CNN block outputs a vector x ∈ R^500 containing the salient features of an individual. Specifically, the proposed architecture differs from CNN-DML in:

1. replacing each CNN block with a strictly pyramidal CNN block;
2. using the full image, rather than dividing it into three parts given as input to three CNNs;
3. using the simple hyperbolic tangent instead of ReLU [6] as the activation function of each layer;
4. not using a cross-channel normalization unit;
5. not zero-padding the input before a CNN block.

In contrast to a standard CNN, the Strictly Pyramidal CNN has its largest number of filters in the first layer; the number of filters then decreases going deeper, layer by layer. This agrees with our conjecture that, due to its direct interaction with the input, the outer layer requires a wide set of filters to extract salient features from the input images, whereas, going deeper, the number of filters should decrease to avoid redundant features. We opt for the architecture based on the full image in order to reduce the number of parameters and make the learning process faster; indeed, with three CNNs the learning process becomes slower because of the higher number of parameters to update. Furthermore, we choose the hyperbolic tangent rather than ReLU as activation function, because it provides better results in our experiments. Finally, the cross-channel normalization unit and the zero padding do not give any improvement in our results, so we avoid using them, reducing the complexity of the whole architecture.
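As a concrete illustration, the following PyTorch-style sketch reproduces the layer sizes stated above (C1: 32 filters of 7×7, C3: 25 filters of 5×5, a 500-dimensional fully connected output, tanh activations, no zero padding); the 2×2 max-pooling windows for S2 and S4 and the cosine similarity used as connection function C are our assumptions, not specifications from the paper.

import torch
import torch.nn as nn

class StrictPyramidalCNN(nn.Module):
    """Sketch of one Strict Pyramidal CNN block (B1/B2)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7),   # C1: 32 filters, 7x7, no padding
            nn.Tanh(),
            nn.MaxPool2d(2),                   # S2 (2x2 window: an assumption)
            nn.Conv2d(32, 25, kernel_size=5),  # C3: 25 filters, 5x5
            nn.Tanh(),
            nn.MaxPool2d(2),                   # S4 (2x2 window: an assumption)
        )
        # for a 3x128x48 input the feature maps are 25x28x8 = 5600 values
        self.fc = nn.Linear(25 * 28 * 8, 500)  # F: 500-d feature vector

    def forward(self, x):
        x = self.features(x)
        return torch.tanh(self.fc(x.flatten(1)))

class StrictPyramidalDML(nn.Module):
    """Siamese wrapper: the two branches share the same parameters; the
    connection function C compares the two 500-d vectors (cosine similarity
    here, as one plausible choice)."""
    def __init__(self):
        super().__init__()
        self.branch = StrictPyramidalCNN()     # weights shared by B1 and B2

    def forward(self, xa, xb):
        fa, fb = self.branch(xa), self.branch(xb)
        return nn.functional.cosine_similarity(fa, fb)

The key point of the sketch is the decreasing filter count (32 then 25) from the input towards the output, which is what distinguishes the Strict Pyramidal block from the usual widening CNN pipeline.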

3 Experiments and analysis

In this section, we present the results achieved on the well-known VIPeR dataset. The VIPeR dataset contains 632 pedestrian image pairs taken from arbitrary viewpoints under varying illumination conditions. The data were collected in an academic setting over the course of several months, and each image is scaled to 128×48 pixels. We generated 11 splits by randomly selecting 316 disjoint subjects for the training set and the remaining 316 for the test set. The first split (Dev. view) is used for parameter tuning, such as the number of training epochs, the kernel sizes, the number of kernels of each layer and so on, while the other 10 splits (Test view) are used for reporting the results. We adopted the same learning rate and weight decay approach of the ConvNet model in [6]. All training images from cameras A and B are merged for training purposes, randomly shuffled, and given as input both to our modified version of Deep Metric Learning and to our Strict Pyramidal Deep Metric Learning (SP-DML).
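For completeness, a minimal sketch of the evaluation protocol just described follows; the split sizes come from the text, while the random seed and the indexing scheme are placeholders.

import numpy as np

def viper_splits(n_subjects=632, n_train=316, n_splits=11, seed=0):
    """Generate 11 random divisions of the 632 VIPeR subjects into disjoint
    training and test halves; the first split serves as the Dev. view for
    parameter tuning, the remaining 10 as Test views for reporting results."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_splits):
        perm = rng.permutation(n_subjects)
        splits.append((perm[:n_train], perm[n_train:]))  # (train ids, test ids)
    dev_view, test_views = splits[0], splits[1:]
    return dev_view, test_views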

The impact of two main factors of the model will be analysed: (1) the batch size used in the training stage, through the stochastic mini-batch gradient descent technique; (2) the regularization parameter.

3.1 Batch Analysis

We adopted the stochastic gradient descent approach for updating the learning parameters in our experiments, as done in [6]. Fig. 1(a) shows the increase in performance as we increase the batch size and reduce the difference between the numbers of negative and positive pairs. Beyond a certain limit, however, the performance clearly decreases, as can be seen in Fig. 1(b). This shows that the batch size should not be too large, otherwise the batch becomes too unbalanced, i.e. the negative pairs greatly outnumber the positive pairs, and the performance drops. As a result, the best ratio between n1 and n2 turns out to be 28/378, obtained by setting sizeBatch = 28.
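The 28/378 ratio is consistent with the following reading of the batch construction, which we state only as an assumption: a batch of sizeBatch subjects, each observed once in camera A and once in camera B, yields one positive pair per subject and one mismatched cross-camera pair per pair of distinct subjects.

def pair_counts(size_batch):
    """Positive (n1) and negative (n2) pair counts for a batch of size_batch
    subjects, under the assumed construction described above."""
    n1 = size_batch                          # same-subject (positive) pairs
    n2 = size_batch * (size_batch - 1) // 2  # different-subject (negative) pairs
    return n1, n2

print(pair_counts(28))  # -> (28, 378), the ratio reported above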

Fig. 1. (a) CMC for increasing batch size; (b) CMC when the batch size is increased over the threshold, which reduces performance

3.2 Regularization term: Asymmetric Cost analysis

After selecting the best batch size, we can further improve the performance with a regularization parameter that gives different emphases to positive and negative pairs; in this way, the dataset effectively becomes better balanced. In our experiments we varied the value of c between 2.0 and 3.5 in steps of 0.5, and found that increasing it provides better results. As can be judged from Fig. 2, c = 3.5 gives the best results over the largest range of rank values.
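As an illustration of how such a regularization parameter can rebalance the pairs, the sketch below up-weights the loss terms of the positive pairs by c; the binomial-deviance-style per-pair loss and the choice of weighting the positives are our assumptions, not the exact cost function used in the paper.

import torch

def asymmetric_cost(scores, labels, c=3.5):
    """Asymmetric pair cost (illustrative sketch): positive pairs, which are
    far rarer than negatives in a batch, get their loss term multiplied by
    the regularization parameter c.

    scores : similarity values in [-1, 1] (float tensor)
    labels : +1 for same-subject pairs, -1 for different-subject pairs (float tensor)
    """
    # binomial-deviance-style per-pair loss on the similarity scores
    per_pair = torch.log1p(torch.exp(-2.0 * labels * scores))
    weights = torch.where(labels > 0,
                          torch.full_like(scores, c),
                          torch.ones_like(scores))
    return (weights * per_pair).mean()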

Fig. 2. CMC performance by varying the regularization parameter c

4 Conclusions

We propose a strictly pyramidal deep architecture, well suited to PRe-ID in real-world scenarios. The model improves Rank-1 performance by learning information from different views with shared kernels; furthermore, it shows efficient or comparable results even with fewer learning parameters than other similar state-of-the-art methodologies. Effective training strategies are adopted to train the network well for the target application. We plan to extend our results to other datasets in order to further assess the efficiency of the proposed model. Note that in our experiments we did not use data augmentation, which in principle could strongly enhance the results.

References

1. Bedagkar-Gala, A., Shah, S.K.: A survey of approaches and trends in person re-identification. Image and Vision Computing 32(4), 270–286 (2014)
2. Bromley, J., Bentz, J.W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Säckinger, E., Shah, R.: Signature verification using a "siamese" time delay neural network. IJPRAI 7(4), 669–688 (1993)
3. Cantoni, V., Petrosino, A.: Neural recognition in a pyramidal structure. IEEE Transactions on Neural Networks 13(2), 472–480 (2002)
4. Szegedy, C., Toshev, A., Erhan, D.: Deep neural networks for object detection. In: Advances in Neural Information Processing Systems 26, pp. 2553–2561 (2013)
5. Iodice, S., Petrosino, A.: Salient feature based graph matching for person re-identification. Pattern Recognition 48(4), 1070–1081 (2014), http://dx.doi.org/10.1016/j.patcog.2014.09.011
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
7. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. In: Human Behavior Understanding (2011)
8. Phung, S.L., Bouzerdoum, A.: A pyramidal neural network for visual pattern recognition. IEEE Transactions on Neural Networks 18(2), 329–343 (2007)
9. Vezzani, R., Baltieri, D., Cucchiara, R.: People reidentification in surveillance and forensics. ACM Computing Surveys 46(2), 1–37 (2013)
10. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
11. Yi, D., Lei, Z., Li, S.Z.: Deep metric learning for practical person re-identification. arXiv preprint (2014)
12. Zajdel, W., Zivkovic, Z., Kröse, B.J.A.: Keeping track of humans: Have I seen this person before? In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2081–2086 (2005)