Multi-layered Deep Convolutional Neural Network for Object Detection

D. Hema¹, S. Kannan²

¹ Assistant Professor, Department of Computer Science, Lady Doak College, Madurai-02, India
² Professor, Department of Computer Applications, Madurai Kamaraj University, Madurai-21, India
[email protected], [email protected]

Abstract-- Image classification and object recognition using deep learning are the state-of-the-art technology in the fields of computer vision and robotics. Deep Convolutional Neural Networks are used for the tasks of image detection and classification. This research work focuses on building an efficient and robust multi-layer Deep Convolutional Neural Network (DCNN) to classify images into two classes. The various stages in a DCNN, such as convolution, activation functions, pooling, flattening and full connection, are discussed in detail along with their operation. This work also compares different values for hyper-parameters such as the number of epochs and the number of hidden units of the neural network. These values are chosen in order to increase the accuracy and decrease the logarithmic loss of the deep learning model.

Keywords-- convolution, activation function, loss function, image augmentation

I. DEEP LEARNING AND OBJECT DETECTION

Large-scale image classification has been a challenge in recent decades. Researchers have come up with state-of-the-art techniques in image processing and computer vision to raise the accuracy of detecting objects of a particular class. Deep learning is one such field, with the innate potential to train on and detect an enormous number of images relative to machine learning. Machine learning requires hand-crafted features extracted manually from images, whereas deep learning gains attention for learning features from the images on its own. Deep learning is a category of machine learning algorithms that uses a cascade of multiple layers of (nonlinear) processing units for feature extraction. Each successive layer uses the output of the previous layer as its input. It can learn in supervised (e.g., classification) and/or unsupervised (e.g., pattern analysis) settings. Deep learning models learn multiple levels of representations that correspond to various levels of abstraction. Most deep learning models are based on Artificial Neural Networks (ANNs), Deep Belief Networks and Deep Boltzmann Machines. There is no doubt that deep learning will transform the field of Artificial Intelligence to a greater level. Deep learning can also be applied to speech recognition, social network filtering, natural language processing, bioinformatics [1] and drug design.

II. CONVOLUTIONAL NEURAL NETWORK

Yann LeCun, the father of the CNN, applied deep learning to various tasks such as digit recognition on the MNIST dataset, document recognition [2] using a 7-layer network, image recognition and handwritten zip code recognition. AlexNet [3] brought a drastic improvement to deep learning for image classification over 1000 image classes. It was trained on two GTX 580 GPUs for five to six days. A slightly modified AlexNet model called ZFNet [4] emerged in 2013 and was trained on a GTX 580 GPU for twelve days. In [4], feature activations are visualized using a technique named the Deconvolutional Network (DeConvNet). The idea behind ZFNet was to examine what type of structures stimulate a given feature map. In 2014, VGGNet [5] came into existence. VGGNet used 3x3 filters, whereas AlexNet used 11x11 filters in the first layer and ZFNet used 7x7 filters. VGGNet was trained on 4 Nvidia Titan Black GPUs for two to three weeks and performed the tasks of both classification and localization.


In 2015, GoogLeNet [6], a 22-layer CNN, achieved an error rate of 6.7%. GoogLeNet made use of average and max pooling layers along with an inception module. Also in 2015, ResNet [7], a 152-layer network architecture, set new records in classification, detection and localization with an incredible error rate of 3.6%.

III. ARCHITECTURE: MULTI-LAYERED DEEP CONVOLUTIONAL NEURAL NETWORK

Convolutional Neural Networks are the most popular deep learning architecture for large-scale image recognition. This research paper presents a multi-layer DCNN implemented for binary-class image detection. A Deep Convolutional Neural Network (DCNN) performs various operations across multiple layers, starting from a convolutional layer and ending at a dense/fully connected layer. The DCNN implemented in this research work has 2 convolutional layers, 1 pooling layer, 1 convolutional layer, 1 pooling layer, 1 flatten layer and 2 full connection layers. The architecture of this 8-layered DCNN is given in Fig. 1. This 8-layered network can still be modelled deeper; building a deeper model can elevate the accuracy level in detecting objects.

Fig. 1. 8-layered DCNN architecture
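For concreteness, the eight-layer stack of Fig. 1 could be written in Keras roughly as follows. This is a minimal sketch under assumptions: the paper does not name its framework, and the 64-unit width of the first dense layer is one of the two settings compared in Section VII.

```python
# A minimal sketch of the 8-layer stack in Fig. 1. Filter counts and kernel
# sizes follow the text (20/30/50 kernels of 5x5, 2x2 average pooling);
# the framework choice and the 64-unit dense width are assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

model = Sequential([
    Conv2D(20, (5, 5), activation='relu', input_shape=(64, 64, 3)),  # 1: 20 feature maps, 64x64 -> 60x60
    Conv2D(30, (5, 5), activation='relu'),                           # 2: 30 feature maps, 60x60 -> 56x56
    AveragePooling2D(pool_size=(2, 2)),                              # 3: subsampling, 56x56 -> 28x28
    Conv2D(50, (5, 5), activation='relu'),                           # 4: 50 feature maps, 28x28 -> 24x24
    AveragePooling2D(pool_size=(2, 2)),                              # 5: subsampling, 24x24 -> 12x12
    Flatten(),                                                       # 6: feature maps -> single vector
    Dense(64, activation='relu'),                                    # 7: full connection (64 hidden units)
    Dense(1, activation='sigmoid'),                                  # 8: binary-class output
])
model.summary()
```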

A. Convolution

The first convolutional layer filters the 64x64x3 input image with 20 kernels of size 5x5x3. The convolution operation given in equation (1) is performed over the entire input image, and 20 different kernels are applied to obtain 20 feature maps. A feature map is the output of a single filter applied to the preceding layer: the filter is drawn across the entire preceding layer, moved one pixel at a time, and convolved with the input from the preceding layer.

(f ∗ g)(t) = ∫ f(τ) · g(t − τ) dτ    (1)
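In discrete form, a single filter pass over one channel can be sketched in a few lines of NumPy (an illustrative snippet, not the paper's code; as is common in CNN practice, the kernel is applied without flipping, i.e. as cross-correlation):

```python
import numpy as np

def convolve2d(image, kernel):
    """Discrete analogue of equation (1): slide `kernel` over `image` one
    pixel at a time (stride 1, no padding) and sum the element-wise
    products at each position, producing one feature map."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.random.rand(64, 64)        # one channel of a 64x64 input
edge = np.array([[-1, 0, 1]] * 3)     # a simple 3x3 edge kernel
print(convolve2d(image, edge).shape)  # (62, 62), matching equation (2) below
```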

Multiple filters can be applied in a convolutional layer to create multiple feature maps. The second convolutional layer performs convolution on the output of the first convolutional layer with 30 kernels of size 5x5 to produce 30 feature maps. The third convolutional layer is introduced after the first pooling layer and makes use of 50 kernels of size 5x5 to produce 50 feature maps. Kernels in a DCNN, such as blur, emboss, edge, smoothen and sharpen, are selected by the model itself; the most prominently used kernel is an edge filter. While performing the convolution, the DCNN preserves the spatial relations between pixels: tiny features are not eliminated but retained. The output size (O) of an image after performing convolution is given by the formula in (2):

O = ((W − K + 2P) / S) + 1    (2)

where W is the input height/width of the image, K is the kernel size, P is the padding and S is the stride. For instance, if the input image height (W) is 64, the kernel (K) is 3x3 with no padding (P) and a stride (S) of 1 is used, then the output image size (O) is 62x62.

B. ReLU - Rectified Linear Unit

The output of a convolutional layer contains negative values in the obtained feature maps. In order to remove all negative values and keep only the positive ones, an activation function called ReLU is applied on the feature maps. This also increases the non-linearity of the model:

φ(x) = max(x, 0),    y = φ( Σ_{i=1}^{m} w_i x_i )

Fig. 2. Rectifier Function
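The rectifier of Fig. 2 reduces to a single NumPy call; the feature-map values below are made up for illustration:

```python
import numpy as np

def relu(x):
    # phi(x) = max(x, 0): zero out negatives, keep positives unchanged
    return np.maximum(x, 0)

feature_map = np.array([[-0.5, 1.2],
                        [ 3.0, -2.1]])
print(relu(feature_map))  # [[0.  1.2]
                          #  [3.  0. ]]
```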

The ReLU function is applied after all convolutional layers to increase non-linearity while retaining the precise features of the image. In [8], ReLU is applied to Restricted Boltzmann Machines to improve their performance, but ReLU can be applied in DCNNs as well to improve model efficiency.

C. Average Pooling / Subsampling

Feature map images may be oriented in different directions, occluded or rotated. Pooling reduces the number of parameters, thereby reducing overfitting, and also reduces the processing time of the network. Pooling permits spatial invariance: it does not matter if the features are rotated, scaled or occluded, as the salient features are preserved by the pooling layer. An evaluation of different pooling operations in convolutional networks is given in [9]. In LeNet [2], max pooling/downsampling is used, which replaces a neighborhood by its maximum. The research work discussed here makes use of average pooling/subsampling, which replaces the pixels by the weighted average of their neighborhood. A few models combine both max pooling and average pooling techniques. Two pooling layers are added after the second and third convolutional layers respectively: a 2x2 average pooling filter is applied on the output of each to produce a reduced feature map. The output size (O) of an image after performing pooling is given by the formula in (3):

O = ((W − K) / S) + 1    (3)

where W is the height/width of the image from the convolutional layer, K is the kernel size and S is the stride. For instance, if the image height (W) from the convolutional layer is 62, the kernel (K) is the 2x2 pooling filter and a pool stride (S) of 2 is used, then the output image size (O) is 31x31.
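Both output-size formulas can be checked with a small helper (a hypothetical utility, not from the paper; integer division models the usual floor behaviour):

```python
def conv_output(W, K, P=0, S=1):
    # equation (2): O = ((W - K + 2P) / S) + 1
    return (W - K + 2 * P) // S + 1

def pool_output(W, K, S):
    # equation (3): O = ((W - K) / S) + 1
    return (W - K) // S + 1

print(conv_output(64, 3))     # 62: the example following equation (2)
print(pool_output(62, 2, 2))  # 31: the 2x2 average pooling example above
```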

D. Flattening

The output of a pooling layer is a reduced feature map, which has to be converted into a single vector. Flattening performs this task and feeds the converted feature map to the neural network.

E. Full / Dense Connection


A full connection is a neural network layer whose neurons have full connections to all activations in the previous layer; their activations are a matrix multiplication followed by a bias offset. There are two full connection layers in this model. The model calculates the loss function given in equation (4) and performs backward propagation to adjust the weights while training on the training images. The test images are then fed into the model to find the loss/cost function, and the accuracy of the multi-layered architecture is calculated. The output layer of the full connection implements a sigmoid activation function to detect the object in one of two classes. For multi-class object detection and classification, a softmax activation function can be implemented instead.
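Stripped to its essentials, a full connection is just the matrix multiplication and bias offset described above (an illustrative NumPy sketch with made-up shapes, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b):
    # full connection: every output neuron sees every input activation
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.random(64)            # activations from the previous layer
W, b = rng.random((1, 64)), rng.random(1)
print(dense(x, W, b))         # one probability for the binary decision
```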

H(p, q) = − Σ_x p(x) · log q(x)    (4)
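Equation (4) can be verified numerically; the probability vectors here are made up for illustration:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # equation (4): H(p, q) = -sum over x of p(x) * log q(x)
    return float(-np.sum(p * np.log(q + eps)))

p = np.array([1.0, 0.0])                       # true one-hot distribution
print(cross_entropy(p, np.array([0.9, 0.1])))  # ~0.105: confident and right
print(cross_entropy(p, np.array([0.1, 0.9])))  # ~2.303: confident and wrong
```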

IV. IMAGE AUGMENTATION

DCNNs require a huge amount of training data to achieve good performance. In order to build a powerful image classifier using very little training data, image augmentation is usually employed to boost the performance of deep networks. Image augmentation is the process of artificially creating training images through different kinds of processing, or combinations of processing, such as random rotations, shifts, shears and flips. In this model, rescaling, zooming, shearing and flipping are used for the image augmentation process.

V. TRAINING AND TEST DATASET

The training and testing data should be in a ratio of 4:1, and care should be taken to avoid overlap between the training and testing data. In this research work, 3000 training images (1500 in each class) and 750 testing images (375 in each class) are used. The dataset is extracted from the Caltech-UCSD Birds 200 and INRIA Person datasets. The training and testing images should be of the same size; the size of the images used in this model is 64x64.

VI. COMPILING AND FITTING

The DCNN is compiled using the adam optimizer, and the loss function is binary cross-entropy. The model is fitted for hyper-parameters such as the number of epochs, the batch size and the number of hidden units. The model is trained for various numbers of epochs (1, 5, 10, 15, 20 and 25) with a batch size of 32. Hence, the entire dataset is split into batches of 32 and a total of 93 batches are fed per epoch, with different image batches passed in each epoch.
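Sections IV-VI map naturally onto a Keras data pipeline and training call. The sketch below reuses the `model` sketched in Section III; the directory layout and generator argument values are assumptions, since the paper only names the transformation types, the adam optimizer, binary cross-entropy, a batch size of 32 and up to 25 epochs:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Section IV: rescale, shear, zoom and flip augmentation (parameter values assumed)
train_gen = ImageDataGenerator(rescale=1. / 255, shear_range=0.2,
                               zoom_range=0.2, horizontal_flip=True)
test_gen = ImageDataGenerator(rescale=1. / 255)

# Section V: 3000 training and 750 test images resized to 64x64, in a
# hypothetical one-folder-per-class directory layout
training_set = train_gen.flow_from_directory('dataset/training_set',
                                             target_size=(64, 64),
                                             batch_size=32, class_mode='binary')
test_set = test_gen.flow_from_directory('dataset/test_set',
                                        target_size=(64, 64),
                                        batch_size=32, class_mode='binary')

# Section VI: adam + binary cross-entropy; 3000 images / batch size 32 ~ 93 batches
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(training_set, steps_per_epoch=93, epochs=25, validation_data=test_set)
```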

VII. RESULTS

Two models, with 64 and 128 hidden units respectively, are trained for up to 25 epochs on a dual-core CPU running at 2.3-2.8 GHz. (In [10], CNNs are instead optimized on an embedded FPGA (Field Programmable Gate Array) for object detection.) The loss on training and validation images for the 64 and 128 hidden unit models (25 epochs) is given in Fig. 3(a) and (b) respectively. The accuracy on training and validation images for the 64 and 128 hidden unit models (25 epochs) is given in Fig. 4(a) and (b) respectively.

Fig. 3. (a) Loss for training and validation of the 64 hidden unit model. (b) Loss for training and validation of the 128 hidden unit model.

Fig. 4. (a) Accuracy for training and validation of the 64 hidden unit model. (b) Accuracy for training and validation of the 128 hidden unit model.

While training the model for different numbers of epochs, the loss should decrease and the accuracy should increase. After a certain number of epochs the accuracy becomes stable, and at this point the training could be terminated. Underfitting and overfitting might occur during the training of a DCNN model. Overfitting happens when the training loss is lower than the validation loss, and underfitting happens when the training loss is higher than the validation loss; a good fit is when the training loss is approximately equal to the validation loss. Overfitting can be prevented by providing a sufficient amount of data, by designing a smaller neural network model or by introducing regularization techniques into the network. To avoid underfitting, the opposite methods can be tried, such as reducing regularization or increasing the size of the neural network. A spot between underfitting and overfitting, where the error percentage is low, is said to be a good fit for the model. From Figs. 3 and 4 it is clear that increasing the model's hidden units to 128 decreases the accuracy to below 80%, whereas the model with 64 hidden units has an accuracy above 80%. The model with 64 hidden units in the neural network is a good fit because its training and validation accuracy are nearly equal. Increasing the number of epochs to 25 also yields an accuracy above 80%, whereas at 15-20 epochs it is below 80%. Hence, optimizing hyper-parameters such as the number of epochs and hidden units increases the DCNN's efficiency and robustness.

REFERENCES

[1] Seonwoo Min, Byunghan Lee and Sungroh Yoon, 2016, "Deep Learning in Bioinformatics", Briefings in Bioinformatics, Volume 18, Issue 5, pp. 851-869.
[2] Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner, 1998, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 86(11), pp. 2278-2324.
[3] Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, 2012, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems.
[4] Matthew D. Zeiler and Rob Fergus, 2013, "Visualizing and Understanding Convolutional Networks".
[5] Karen Simonyan and Andrew Zisserman, 2014, "Very Deep Convolutional Networks for Large-Scale Image Recognition".
[6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich, 2015, "Going Deeper with Convolutions", CVPR.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, 2015, "Deep Residual Learning for Image Recognition".
[8] V. Nair and G. E. Hinton, 2010, "Rectified linear units improve restricted Boltzmann machines", In Proc. 27th International Conference on Machine Learning.
[9] Dominik Scherer, Andreas Muller and Sven Behnke, September 2010, "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition", 20th International Conference on Artificial Neural Networks (ICANN).
[10] Ruizhe Zhao, Xinyu Niu, Yajie Wu, Wayne Luk and Qiang Liu, March 2017, "Optimizing CNN-Based Object Detection Algorithms on Embedded FPGA Platforms", International Symposium on Applied Reconfigurable Computing.
