Character Recognition of License Plate Number Using Convolutional Neural Network

Syafeeza Ahmad Radzi 1,2 and Mohamed Khalil-Hani 2

1 Faculty of Electronics & Computer Engineering, Universiti Teknikal Malaysia Melaka (UTeM)
2 VLSI-eCAD Research Laboratory (VeCAD), Faculty of Electrical Engineering, Universiti Teknologi Malaysia (UTM)
[email protected], [email protected]

Abstract. This paper presents the recognition of machine-printed characters acquired from license plates using a convolutional neural network (CNN). A CNN is a special type of feed-forward multilayer perceptron trained in supervised mode using a gradient-descent backpropagation learning algorithm that enables automated feature extraction. Common methods usually combine a handcrafted feature extractor with a trainable classifier, which may lead to sub-optimal results and low accuracy. CNNs have achieved state-of-the-art results in tasks such as optical character recognition, generic object recognition, real-time face detection and pose estimation, speech recognition and license plate recognition. A CNN combines three architectural concepts, namely local receptive fields, shared weights and subsampling. The combination of these concepts with the optimization methods resulted in an accuracy of around 98%. In this paper, the methods implemented to increase the performance of character recognition using a CNN are proposed and discussed.

Keywords: Character recognition, convolutional neural network, back propagation, license plate recognition, parallel architecture, Visual Informatics.

1 Introduction

A Convolutional Neural Network (CNN) is a special type of multilayer perceptron: a feed-forward neural network trained in supervised mode using a gradient-descent backpropagation learning algorithm that minimizes a loss function [1, 2]. It is one of the most successful machine learning architectures in computer vision and has achieved state-of-the-art results in tasks such as optical character recognition [1], generic object recognition, real-time face detection [3] and pose estimation, speech recognition and license plate recognition [4-6]. Its strategy is to extract simple features at a higher resolution and transform them into complex features at a lower resolution. The lower resolution is obtained by subsampling the feature maps of the previous layer. Like other artificial neural networks, a CNN can benefit in speed from parallel architectures. A parallel implementation helps to speed up CNN simulation,

H. Badioze Zaman et al. (Eds.): IVIC 2011, Part I, LNCS 7066, pp. 45–55, 2011. © Springer-Verlag Berlin Heidelberg 2011


allowing more complicated architectures to be used in real time. It also significantly speeds up the training process, which could take days on non-parallel architectures. This work is the initial step towards license plate recognition, in which the individual characters are manually extracted and recognized using CNNs. The outline of this paper is as follows: Section 2 discusses the theory of CNNs. The methodology of the proposed architecture is discussed in Section 3. Section 4 presents the results and discussion, and the final section gives the conclusion of the overall work.

2 Theory

A CNN consists of layers of neurons and is optimized for two-dimensional pattern recognition. It has three types of layers, namely convolutional layers, subsampling layers and fully connected layers. These layers are arranged in a feed-forward structure as shown in Fig. 1.

Fig. 1. Architecture of CNN [2]. Input 1@32x32; C1: feature maps 6@28x28 (5x5 convolution); S2: feature maps 6@14x14 (2x2 subsampling); C3: feature maps 16@10x10 (5x5 convolution); S4: feature maps 16@5x5 (2x2 subsampling); F5: layer 120; F6: layer 10 (full connection)

A convolutional layer consists of several two-dimensional planes of neurons known as feature maps. Each neuron in a feature map is connected to a neighborhood of neurons in the previous layer, its so-called receptive field. The convolution between each input feature map and its respective kernel is computed. These convolution outputs are summed together with a trainable bias term, and the result is passed to a non-linear activation function such as the hyperbolic tangent to obtain a new feature value [2]. Weights are shared in the convolution matrices, so that large images can be processed with a reduced set of weights. A convolutional layer acts as a feature extractor that extracts salient features of the input, such as corners, edges and endpoints (or non-visual features in other signals), using the concepts of local receptive fields and shared weights [1, 3]. Shared weights have several advantages: they reduce the number of free parameters to train, reduce the complexity of the machine, and reduce the gap between test error and training error. An interesting property of convolutional layers is that if the input image is shifted, the feature map output will be shifted by the same amount, but it will be left unchanged otherwise. This property is the basis of the robustness of CNNs to shifts and distortions of the input. All units in one feature map share an identical weight vector. A complete convolutional layer is composed of several feature maps (with different weight vectors) so that multiple features can be extracted at each location. This is illustrated in Fig. 2. The output $y_n^{(l)}$ of a feature map $n$ in a convolutional layer $l$ (as described in [2, 3]) is given by

$$
y_n^{(l)}(x, y) = \phi^{(l)}\left( \sum_{m \in M_n^{(l)}} \sum_{(i,j) \in K^{(l)}} w_{mn}^{(l)}(i, j) \cdot y_m^{(l-1)}\left(x \cdot h^{(l)} + i,\; y \cdot v^{(l)} + j\right) + b_n^{(l)} \right) \tag{1}
$$

where $K^{(l)} = \{(i, j) \in \mathbb{N}^2 \mid 0 \le i < k_x^{(l)};\ 0 \le j < k_y^{(l)}\}$, $k_x^{(l)}$ and $k_y^{(l)}$ are the width and the height of the convolution kernels $w_{mn}^{(l)}$ of layer $l$, and $b_n^{(l)}$ is the bias of feature map $n$ in layer $l$. The set $M_n^{(l)}$ contains the feature maps in the preceding layer $l-1$ that are connected to feature map $n$ in layer $l$. The values $h^{(l)}$ and $v^{(l)}$ describe the horizontal and vertical step size of the convolution in layer $l$, while $\phi^{(l)}$ is the activation function of layer $l$.

Fig. 2. (a) Spatial convolution [1] (b) CNN convolutional layer [2]
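As an illustration of Eq. (1), the following is a minimal pure-Python sketch of how one output feature map could be computed. It is not the MATLAB implementation used in this work; the function name `conv_feature_map`, its argument layout and the row/column orientation of the input are assumptions made only for this example.

```python
import math

def conv_feature_map(prev_maps, kernels, bias, connected, h=1, v=1, phi=math.tanh):
    """Compute one feature map n of a convolutional layer, following Eq. (1).

    prev_maps: list of 2-D lists (feature maps of layer l-1), indexed [row][col]
    kernels:   dict m -> 2-D kernel w_mn, indexed [j][i], for each m in `connected`
    bias:      scalar bias b_n of the output map
    connected: indices m of the previous maps feeding this map (the set M_n)
    h, v:      horizontal and vertical step sizes of the convolution
    phi:       activation function (hyperbolic tangent by default)
    """
    first = kernels[connected[0]]
    ky, kx = len(first), len(first[0])
    in_rows, in_cols = len(prev_maps[0]), len(prev_maps[0][0])
    out_rows = (in_rows - ky) // v + 1
    out_cols = (in_cols - kx) // h + 1
    out = []
    for y in range(out_rows):
        row = []
        for x in range(out_cols):
            total = bias
            for m in connected:              # sum over connected input maps
                w = kernels[m]
                for j in range(ky):          # sum over kernel positions
                    for i in range(kx):
                        total += w[j][i] * prev_maps[m][y * v + j][x * h + i]
            row.append(phi(total))
        out.append(row)
    return out

# Example: a padded character image (assumed here to be 24 rows by 14 columns)
# convolved with a single 5x5 kernel yields a 20x10 feature map.
image = [[0.0] * 14 for _ in range(24)]
kernel = [[0.01] * 5 for _ in range(5)]
fmap = conv_feature_map([image], {0: kernel}, bias=0.0, connected=[0])
```

Because the same kernel is reused at every position, the number of trainable values stays at the kernel size plus one bias regardless of the image size.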

Once a feature has been detected, its exact location becomes less important. Only its approximate position relative to other features is relevant. A simple way to reduce the precision is to reduce the spatial resolution of the feature map. This can be achieved with a so-called subsampling layer, which performs local averaging. The subsampling layer reduces the resolution of the feature map and thereby reduces the sensitivity of the output to shifts and distortions, since the feature maps are sensitive to translation in the input [1]. This layer reduces the outputs of adjacent neurons from the previous layer (normally a 2x2 block) to a single value by averaging them. Next, the value is multiplied by a trainable weight (trainable coefficient), a bias is added, and the result is passed to a non-linear activation function such as the hyperbolic tangent. The trainable coefficient and bias control the effect of the sigmoid nonlinearity. If the coefficient is small, the unit operates in a quasi-linear mode and the subsampling merely blurs the input. If the coefficient is large, subsampling units can be seen as performing a “noisy OR” or a “noisy AND” function depending on the value of the bias. This is illustrated in Fig. 3. The output $y_n^{(l)}$ of a feature map $n$ in a subsampling layer $l$ (as described in [2, 3]) is given by

$$
y_n^{(l)}(x, y) = \phi^{(l)}\left( w_n^{(l)} \cdot \sum_{(i,j) \in S^{(l)}} y_n^{(l-1)}\left(x \cdot s_x^{(l)} + i,\; y \cdot s_y^{(l)} + j\right) + b_n^{(l)} \right) \tag{2}
$$

where $S^{(l)} = \{(i, j) \in \mathbb{N}^2 \mid 0 \le i < s_x^{(l)};\ 0 \le j < s_y^{(l)}\}$, $s_x^{(l)}$ and $s_y^{(l)}$ define the width and height of the subsampling kernel of layer $l$, and $b_n^{(l)}$ is the bias of feature map $n$ in layer $l$. The value $w_n^{(l)}$ is the weight of feature map $n$ in layer $l$ and $\phi^{(l)}$ is the activation function of layer $l$.

Fig. 3. (a) Example of subsampling process [1] (b) CNN subsampling layer [2]
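The subsampling operation of Eq. (2) can be sketched in the same way. This is an illustrative pure-Python version, not the authors' code; the name `subsample_feature_map` is hypothetical. Note that Eq. (2) sums the block, so an average corresponds to a weight of 1/(s_x s_y) times the trainable coefficient.

```python
import math

def subsample_feature_map(prev_map, weight, bias, sx=2, sy=2, phi=math.tanh):
    """Subsample one feature map following Eq. (2).

    Each non-overlapping sy x sx block of `prev_map` is summed, multiplied by
    the single trainable `weight`, offset by `bias` and passed through `phi`.
    """
    in_rows, in_cols = len(prev_map), len(prev_map[0])
    out = []
    for y in range(in_rows // sy):
        row = []
        for x in range(in_cols // sx):
            block_sum = sum(prev_map[y * sy + j][x * sx + i]
                            for j in range(sy) for i in range(sx))
            row.append(phi(weight * block_sum + bias))
        out.append(row)
    return out

# Example: with weight 0.25 and an identity activation, a 2x2 block is averaged.
averaged = subsample_feature_map([[1.0, 2.0], [3.0, 4.0]],
                                 weight=0.25, bias=0.0, phi=lambda a: a)
```

Only two values per feature map (the coefficient and the bias) are trainable in this layer.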

The final layer is a fully connected layer. In this layer, every neuron is connected to all neurons of the previous layer. The fully connected layer acts as a normal classifier, similar to the layers in traditional Multilayer Perceptron (MLP) networks. The output of a fully connected layer (as described in [2]) is given by

$$
y^{(l)}(j) = \phi^{(l)}\left( \sum_{i=1}^{N^{(l-1)}} y^{(l-1)}(i) \cdot w^{(l)}(i, j) + b^{(l)}(j) \right) \tag{3}
$$

where $N^{(l-1)}$ is the number of neurons in the preceding layer $l-1$, $w^{(l)}(i, j)$ is the weight of the connection from neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$, $b^{(l)}(j)$ is the bias of neuron $j$ in layer $l$, and $\phi^{(l)}$ represents the activation function of layer $l$. The number of layers depends on the application, and each neuron in a layer is an input to the following layer. The current layer receives input only from the preceding layer, plus a bias input that is usually 1. Each neuron applies weights to each of its inputs and sums all the weighted inputs. The total weighted value is passed through a non-linear function, such as a sigmoid function, to limit the neuron’s output to a range of values. Multiple planes are used in each layer so that multiple features can be detected.
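For completeness, Eq. (3) can be sketched as follows. This is a minimal illustration with a hypothetical function name and a list-of-lists weight layout chosen for readability; it makes no claim about how the layer was coded in this work.

```python
import math

def fully_connected(prev_outputs, weights, biases, phi=math.tanh):
    """Compute the outputs of a fully connected layer following Eq. (3).

    prev_outputs: outputs y(i) of the N neurons in layer l-1
    weights:      weights[i][j] is the weight from neuron i (layer l-1)
                  to neuron j (layer l)
    biases:       biases[j] is the bias of neuron j in layer l
    """
    outputs = []
    for j in range(len(biases)):
        total = biases[j]
        for i, y_i in enumerate(prev_outputs):
            total += y_i * weights[i][j]
        outputs.append(phi(total))
    return outputs

# Example with an identity activation so the weighted sums are visible.
result = fully_connected([1.0, 2.0],
                         weights=[[1.0, 0.0], [0.0, 1.0]],
                         biases=[0.5, -0.5],
                         phi=lambda a: a)
```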

3 Methodology

This section describes the research methodology used to develop the character recognition system.

3.1 Preparing the Training and Test Data Set

The data available for training are divided into two different sets: a training set and a validation set. These two sets must not overlap, so that the generalization capacity of the neural network can be improved. This technique is called cross validation [3]. The true performance of a network is only revealed when it is tested with test data, which measures how well the network performs on data that were not seen during training. The testing is designed to assess the generalization capability of


the network. Good generalization means that the network performs correctly on data that are similar to, but different from, the training data. The training and test data are limited to Malaysian license plates. The alphabetic characters involved are all letters except ‘I’, ‘O’ and ‘Z’, as those three characters are not common on Malaysian license plates. The numeric characters are 0 to 9. Therefore, there are 33 character classes in total. Initially, character recognition is performed to test the algorithm. For that purpose, the characters are extracted from license plates captured at different angles, binarized, padded from 22x12 pixels to 24x14 pixels and labeled. With the input padded, the feature units are centered on the border, and each 5x5 convolution layer followed by 2x2 subsampling reduces the feature size from n to (n-4)/2. Since the initial input size is 22x12, the nearest size that yields integer feature-map sizes through the two convolution layers is 24x14. This eases the training process. The training and test data sets contain 750 images and 434 images, respectively. A CNN can actually deal with raw data, but for simplicity it is suggested to perform simple image processing on the training and test data, as done in [1]. Fig. 4 shows a sample of the test set images.

Fig. 4. Sample of testing set images
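The choice of 24x14 as the padded input size can be checked with a short calculation. The sketch below assumes 5x5 convolutions without border padding and non-overlapping 2x2 subsampling, as in the architecture of Section 3.3; the helper names are hypothetical.

```python
def trace_sizes(n, kernel=5, pool=2):
    """Trace one input dimension through C1, S2 and C3.

    Returns (c1, s2, c3), or None if C1 cannot be split evenly by the
    subsampling step or if a layer would have no units left.
    """
    c1 = n - kernel + 1            # valid convolution: n -> n - 4 for a 5x5 kernel
    if c1 <= 0 or c1 % pool != 0:
        return None
    s2 = c1 // pool                # 2x2 subsampling halves the size
    c3 = s2 - kernel + 1           # second valid convolution
    if c3 <= 0:
        return None
    return c1, s2, c3

def smallest_common_padding(height, width):
    """Smallest number of pixels added to both dimensions that makes both valid."""
    pad = 0
    while trace_sizes(height + pad) is None or trace_sizes(width + pad) is None:
        pad += 1
    return pad

print(trace_sizes(24), trace_sizes(14))   # (20, 10, 6) (10, 5, 1)
print(trace_sizes(12))                    # None: the 12-pixel side is too small
print(smallest_common_padding(22, 12))    # 2, i.e. 22x12 -> 24x14
```

Under these assumptions, the unpadded 12-pixel side leaves no units for the second convolution, and adding one pixel gives an odd C1 size, so two pixels is the smallest padding that works for both sides.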

3.2 Developing a MATLAB Model

The CNN algorithm is implemented as a MATLAB program that simulates and evaluates the performance of the feature extraction process on the character image database. Fig. 5 shows the CNN architecture for character recognition. The CNN is trained with a gradient-based backpropagation method. The purpose of this learning algorithm is to minimize the error function after each training example by adjusting the weights of the neural network. The simplest error function used is the mean square error. All training patterns, along with the expected outputs, are fed into the network. Next, the network error (the difference between the actual and the expected output) is backpropagated through the network, and the gradient of the network error is computed with respect to the weights. This gradient is then used to update the weight values according to specific rules and settings such as stochastic updating, momentum, an appropriate learning rate and the activation function. The training process continues until the network is well trained [1]. The supervised training is shown in Fig. 6. Once the network is trained, a test image is fed to the trained system to perform pattern classification, as shown in Fig. 7.

3.3 The Proposed Architecture

The architecture shown in Fig. 5 comprises 5 layers, excluding the input, all of which contain trainable weights. The actual character size is 22x12; it is padded to 24x14 so that features at the border of the character image can be extracted. Layer C1 is a convolutional layer with 6 feature maps. The size of the feature maps is 20x10 pixels. The total number of neurons is 1200 (20x10x6). There are ((5x5+1)x6)=156 trainable weights. The "+1" is for the bias. Each of the 1200 neurons has



Fig. 5. The proposed architecture

Fig. 6. Supervised training diagram

Fig. 7. Testing process diagram

26 connections, giving 31200 connections in total from layer C1 to the preceding layer. At this point, one of the benefits of a convolutional "shared weight" neural network should become clearer: because the weights are shared, only 156 weights/parameters are needed to control the 31200 connections. As a consequence, only 156 weights need training. In comparison, a traditional "fully connected" neural network would need a unique weight for each connection, and would therefore require training of 31200 different weights. Layer S2 is a subsampling layer with 6 feature maps of size 10x5 pixels. Each unit in each feature map is connected to a 2x2 neighbourhood in the corresponding feature map in C1. The 2x2 receptive fields are non-overlapping, so the feature maps in S2 have half the number of rows and columns of the feature maps in C1. Therefore, there are a total of 10x5x6 = 300 neurons in layer S2, 2x6 = 12 weights, and 300x(2x2+1) = 1500 connections.
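The neuron, weight and connection counts quoted for C1 and S2 follow from simple arithmetic, which the following sketch reproduces. The function names are hypothetical and the code only restates the counting rules given in the text.

```python
def conv_layer_counts(n_maps, map_h, map_w, kernel=5, inputs_per_map=1):
    """Counts for a convolutional layer whose maps each read `inputs_per_map` inputs."""
    neurons = n_maps * map_h * map_w
    per_unit = inputs_per_map * kernel * kernel + 1   # kernel weights + 1 bias
    weights = n_maps * per_unit                        # shared within each map
    connections = neurons * per_unit                   # every unit uses all of them
    return neurons, weights, connections

def subsampling_layer_counts(n_maps, map_h, map_w, pool=2):
    """Counts for a subsampling layer with one coefficient and one bias per map."""
    neurons = n_maps * map_h * map_w
    weights = 2 * n_maps
    connections = neurons * (pool * pool + 1)          # 2x2 inputs + 1 bias
    return neurons, weights, connections

print(conv_layer_counts(6, 20, 10))         # (1200, 156, 31200) for C1
print(subsampling_layer_counts(6, 10, 5))   # (300, 12, 1500) for S2
```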



Layer C3 is a convolutional layer with 16 feature maps. Each unit in each feature map is connected to several 5x5 neighbourhoods at identical locations in a subset of the S2 feature maps. The reason for this partial connection scheme is to keep the number of connections within reasonable bounds. There are therefore 6x1x16 = 96 neurons in layer C3, 1516 weights, and 9096 connections. Table 1 shows the connections between S2 and C3.

Table 1. Each column indicates which feature maps in S2 are combined by the units in a particular feature map of C3 [1]
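The C3 figures can be reproduced under one assumption: that the S2-to-C3 connection scheme is the one used in [1], where six C3 maps read three S2 maps each, nine read four each, and one reads all six. The sketch below only counts the fan-in per map; it does not reproduce which S2 maps are selected, and the constant name `FAN_IN` is introduced here for illustration.

```python
# Assumed number of S2 maps feeding each of the 16 C3 maps (scheme of [1]).
FAN_IN = [3] * 6 + [4] * 9 + [6]

def c3_counts(fan_in, map_h=6, map_w=1, kernel=5):
    """Neurons, weights and connections of a partially connected conv layer."""
    units_per_map = map_h * map_w
    neurons = len(fan_in) * units_per_map
    # Each map has one 5x5 kernel per connected input map, plus one bias.
    weights = sum(k * kernel * kernel + 1 for k in fan_in)
    # Every unit of a map uses all of that map's weights once.
    connections = weights * units_per_map
    return neurons, weights, connections

print(c3_counts(FAN_IN))   # (96, 1516, 9096)
```

With this fan-in pattern the totals match the 96 neurons, 1516 weights and 9096 connections stated above.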

Layer C4 is a fully-connected layer with 120 units. This number was chosen because optimal capacity is reached with 120 hidden units for 33 classes. Since it is fully connected, each of the 120 neurons in the layer is connected to all 96 neurons in the previous layer. Therefore, there are a total of 120 neurons in layer C4, 120x(96+1) = 11640 weights, and 120x97 = 11640 connections. The output layer is a fully-connected layer with 33 units. Since it is fully connected, each of the 33 neurons in the layer is connected to all 120 neurons in the previous layer. There are therefore 33 neurons in the output layer, 33x(120+1) = 3993 weights, and 33x121 = 3993 connections. A neuron with the value “+1” corresponds to the “winning” neuron, while “-1” corresponds to the other neurons. No specific rules are given for deciding the number of layers and feature maps needed to obtain an optimum architecture. As long as sufficient information can be extracted for the classification task, the architecture is considered acceptable. This can be determined from the misclassification rate obtained by the network. However, a minimum number of layers and feature maps is preferred to ease the computation. Each unit in each feature map is connected to a 5x5 kernel in the input. The kernel size is chosen to be odd, so that it is centered on a unit, and to give sufficient overlap (around 70%) so that information is not lost. A 3x3 kernel would be too small, with only one unit of overlap, while a 7x7 kernel would be too large and would add computational complexity [7].

3.4 Optimizing the CNN Algorithm

Several actions can be taken to improve CNN performance. Extensions to the backpropagation algorithm can be considered to improve the convergence speed of the algorithm, avoid local minima, improve the generalization of the neural network and, finally, improve the recognition rate. A character database consisting of 33 classes representing alphanumeric characters was created. The system's sensitivity to angles and distortions was reduced by taking 10 samples for each class. Several factors that affect the system performance are discussed below. The parameter values were taken from [1]. Among these techniques, weight decay was not implemented in this work.



Training Mode. There are two principal training modes, which determine the way the weights are updated. The first mode is online training (stochastic gradient). In this mode a single example is chosen randomly from the training set at each iteration t, and the error is calculated before the weights are updated accordingly. The second mode is offline training (batch training). The whole training set is fed into the network and the accumulated error is calculated before the weights are updated. Of these two modes, online learning is much faster than batch learning and results in better generalization for large datasets [1, 3].

Learning Rate. The learning rate is used during the weight update. This parameter is crucial to the convergence and generalization of the neural network. A learning rate that is too small leads to slow convergence, while one that is too large leads to divergence. For this architecture, the values of the global learning rate η are adapted from [1]. The value was decreased using the following schedule: 0.0005 for the first two passes; 0.0002 for the next three; 0.0001 for the next three; 0.00005 for the next four; and 0.00001 thereafter.

Activation Function. The activation function is pre-conditioned for faster convergence. The squashing function used in this convolutional network is f(a) = A tanh(Sa). In [1], A is chosen as 1.7159 and S = 2/3. With this choice of parameters, the equalities f(1) = 1 and f(−1) = −1 are satisfied. Symmetric functions are believed to yield faster convergence, although learning can become extremely slow if the weights are too small.

Second Order Backpropagation. Second order methods have the greatest impact on speeding up the convergence of the neural network, in that they dramatically reduce the number of epochs needed for the weights to converge. All second order techniques aim to increase the speed with which backpropagation converges to the optimal weights. However, most second order techniques are designed for offline mode, which is of little use for neural network training. Neural network training works considerably faster in online mode (stochastic gradient), where the parameters are updated after every training sample. Hence, [1] proposed a stochastic version of the Levenberg-Marquardt algorithm with a diagonal approximation of the Hessian. The diagonal of the Hessian (the square matrix of second-order partial derivatives of a function) of a neural network was shown to be very easy to compute with backpropagation. During the simulation process, the number of subsamples chosen for the diagonal Hessian estimation gave different results, as shown in Table 2.

Size of Training Set. The size of the training set also affects the system performance in terms of accuracy [7]. The training set should be made as large as possible by adding new forms of distorted data. In this way, the network can learn different training patterns, which increases accuracy.

Momentum. A momentum term is added to improve the convergence speed. It adds a fraction of the previous weight change to the current weight change, which damps oscillation [1, 3]. The weight update formula is as follows.

$$
\Delta w_k(n) = -\lambda \frac{\partial E_p}{\partial w_k(n)} + \alpha \, \Delta w_k(n-1) \tag{4}
$$
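The learning-rate schedule, the squashing function and the momentum update of Eq. (4) can be sketched together as below. This is an illustrative Python version under stated assumptions: passes are numbered from 1, weights are held in a flat list, and the function names are hypothetical rather than taken from the MATLAB model.

```python
import math

def learning_rate(epoch):
    """Global learning rate for a 1-based pass number, per the schedule above."""
    if epoch <= 2:
        return 0.0005      # first two passes
    if epoch <= 5:
        return 0.0002      # next three
    if epoch <= 8:
        return 0.0001      # next three
    if epoch <= 12:
        return 0.00005     # next four
    return 0.00001         # thereafter

def squash(a, A=1.7159, S=2.0 / 3.0):
    """Scaled hyperbolic tangent f(a) = A tanh(S a), with f(1) close to 1."""
    return A * math.tanh(S * a)

def momentum_step(weights, grads, prev_deltas, lr, alpha):
    """One weight update following Eq. (4).

    grads[k] is dE_p/dw_k for the current pattern and prev_deltas[k] is the
    previous weight change; returns the new weights and the new changes.
    """
    deltas = [-lr * g + alpha * d for g, d in zip(grads, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas

# Example: one update with momentum 0.5 on a single weight.
w, dw = momentum_step([1.0], grads=[2.0], prev_deltas=[0.5], lr=0.1, alpha=0.5)
```

In the example the gradient term contributes -0.2 and the momentum term +0.25, so the weight moves by about +0.05.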



where $\alpha$ is the momentum rate ($0 \le \alpha < 1$) and $\lambda$ is the learning rate. Table 2 presents the misclassification rate for different momentum values.

Weight Decay. This is a regularization technique in which the term $\frac{\alpha}{2} \sum_x w_x^2$ is added to the error function, as shown in the equation below:

$$
E_p = \frac{1}{2} \, \| o_p - t_p \|^2 + \frac{\alpha}{2} \sum_x w_x^2 \tag{5}
$$

This term penalizes large weights, reduces the flexibility of the neural network and helps to avoid overfitting the data. Performing gradient descent on this function leads to the following update formula:

$$
\Delta w_x = -\lambda \frac{\partial E_p}{\partial w_x} - \lambda \alpha w_x \tag{6}
$$

where $w_x$ refers to all the weights and biases and $\alpha$ is a small positive constant. In Eq. (6), the derivative in the first term refers to the error without the decay term, whose gradient is given by the second term.

Cross Validation. This technique separates the data into two disjoint parts, a training set and a validation set. The purpose of this technique is to improve the generalization capacity of a neural network. Theoretically, the error on the training and validation sets should decrease during the training process. However, at some point the validation error remains constant or even increases. Such an increase indicates that the neural network may have stopped learning the patterns common to the training and validation sets and started to learn noise contained in the training set [3].
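Although weight decay was not used in this work, Eqs. (5) and (6) are simple enough to sketch. The following Python fragment is illustrative only; the function names are hypothetical, and `grads` is assumed to hold the gradient of the squared-error term alone.

```python
def regularized_error(outputs, targets, weights, alpha):
    """Error of Eq. (5): half squared error plus (alpha / 2) * sum of w^2."""
    data_term = 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))
    decay_term = 0.5 * alpha * sum(w * w for w in weights)
    return data_term + decay_term

def weight_decay_step(weights, grads, lr, alpha):
    """Update of Eq. (6): w <- w - lr * grad - lr * alpha * w.

    `grads` is the gradient of the data term only (an assumption of this sketch);
    the second term shrinks every weight towards zero.
    """
    return [w - lr * g - lr * alpha * w for w, g in zip(weights, grads)]

# Example: a weight of 2.0 with gradient 1.0, learning rate 0.1 and alpha 0.5
# moves by -0.1 (gradient) and -0.1 (decay).
updated = weight_decay_step([2.0], grads=[1.0], lr=0.1, alpha=0.5)
```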

4 Results and Discussions

This section discusses the results obtained from a simulation lasting 4 hours, during which the network went through 30 iterations.

4.1 Misclassification Rate

To analyze the results, the performance of the architecture is measured by the misclassification rate, which refers to the proportion of samples that are misrecognized. Thirty iterations through the entire training data were performed for each session. The global learning rate η was decreased using the following schedule: 0.0005 for the first two passes; 0.0002 for the next three; 0.0001 for the next three; 0.00005 for the next four; and 0.00001 thereafter. A wide range of momentum values is available for a CNN. To select the optimum momentum value, several experiments were conducted, with the results given in Table 2. From Table 2, it can be seen that the optimum momentum is approximately 0.5. According to Fig. 8, the resulting misclassification rate is 1.21% for 434 test images and 750 training images with momentum = 0.5. This means the neural network correctly recognized 428 patterns and misrecognized 6 patterns.

Table 2. Misclassification rate for different momentum values

Momentum    Misclassification rate (percent)
            Train       Test
0.0         0.9091      2.4242
0.1         0.9091      4.8485
0.2         0.6061      3.0303
0.3         0.3030      1.8182
0.4         0.9091      3.6370
0.5         0.0000      1.2121
0.6         1.2121      3.0303
0.7         0.3030      1.8182
0.8         0.0000      3.0303
0.9         0.3030      3.0303

As shown in Fig. 8, the training and test errors stabilize after 20 iterations. In certain cases, the error increases after reaching a stable condition. It is assumed that at this point the neural network starts to overtrain on the data and its generalization decreases, because the network has learned the noise specific to that dataset. It may also be due to overfitting of the classifier when the training set is not large enough to improve the classification accuracy. To avoid this situation, the “early stopping” technique could be applied once the error starts to stabilize [3].

Fig. 8. Misclassification rate graph for momentum=0.5
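The early stopping idea mentioned above can be sketched as a scan over the per-iteration validation errors. This is a hypothetical helper, not part of the reported experiments; the `patience` parameter (how many non-improving iterations to tolerate) is an assumption introduced for the example.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the 1-based iteration with the lowest validation error seen
    before the error fails to improve for `patience` consecutive iterations.
    Returns 0 if `val_errors` is empty.
    """
    best_error = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, error in enumerate(val_errors, start=1):
        if error < best_error:
            best_error, best_epoch, waited = error, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break          # stop training; keep the weights of best_epoch
    return best_epoch

# Example: the error stops improving after the third iteration.
stop_at = early_stopping_epoch([5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0], patience=3)
```

In practice the weights saved at the returned iteration would be kept. A larger `patience` trades extra training time for a lower risk of stopping on a temporary plateau.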

5 Conclusion

Several factors that affect the system performance have been discussed. The system performance can be increased by using online mode, choosing an appropriate learning rate and activation function, using second order backpropagation (the stochastic version of the Levenberg-Marquardt algorithm), expanding the training set with different forms of distorted images, applying momentum and weight decay, and finally applying cross validation. The number of training and test images used



in this research is 750 and 434, respectively, with a 1.21% misclassification rate, or 98.79% accuracy. The approach is similar to those in [4-6, 8], except that some of them applied geometrical rules to extract the characters. Their training and test data sets contain more than 2000 samples, far more than used in this work, while their accuracy is more or less the same. In conclusion, this work achieves comparable accuracy with a smaller number of samples.

Acknowledgments. This work is supported by the Ministry of Science, Technology & Innovation of Malaysia (MOSTI) under TECHNOFUND Grant TF0106C313, UTM Vote No. 79900.

References

1. LeCun, Y., et al.: Gradient-Based Learning Applied to Document Recognition. In: Intelligent Signal Processing, pp. 306–351. IEEE Press (2001)
2. Strigl, D., Kofler, K., Podlipnig, S.: Performance and Scalability of GPU-Based Convolutional Neural Networks. In: 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2010), pp. 317–324 (2010)
3. Duffner, S.: Face Image Analysis with Convolutional Neural Networks. Doctoral Dissertation (2007)
4. Zhihong, Z., Shaopu, Y., Xinna, M.: Chinese License Plate Recognition Using a Convolutional Neural Network. In: Pacific-Asia Workshop on Computational Intelligence and Industrial Application, PACIIA (2008)
5. Chen, Y.-N., et al.: The Application of a Convolution Neural Network on Face and License Plate Detection. In: 18th International Conference on Pattern Recognition, pp. 552–555 (2006)
6. Han, C.-C., et al.: License Plate Detection and Recognition Using a Dual-Camera Module in a Large Space. In: 41st Annual IEEE International Carnahan Conference on Security Technology, pp. 307–312 (2007)
7. Simard, P.Y., Steinkraus, D., Platt, J.C.: Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. In: Seventh International Conference on Document Analysis and Recognition. Institute of Electrical and Electronics Engineers, Inc. (2003)
8. Johnson, M.: A Unified Architecture for the Detection and Classification of License Plates. In: 2008 10th International Conference on Control, Automation, Robotics and Vision, Hanoi, Vietnam (2008)