2016 15th IEEE International Conference on Machine Learning and Applications

Advanced Image Classification using Wavelets and Convolutional Neural Networks

Travis Williams
Department of Electrical & Computer Engineering
North Carolina A&T State University
Greensboro, NC, USA
[email protected]

Robert Li
Department of Electrical & Computer Engineering
North Carolina A&T State University
Greensboro, NC, USA
[email protected]

Abstract— Image classification is a vital technology that people in all arenas of human life utilize. It is pervasive in every facet of the social, economic, and corporate spheres of influence, worldwide. This need for more accurate, detail-oriented classification increases the need for modifications, adaptations, and innovations to deep learning algorithms. This paper uses Convolutional Neural Networks (CNN) to classify handwritten digits in the MNIST database, and scenes in the CIFAR-10 database. Our proposed method preprocesses the data in the wavelet domain to attain greater accuracy and comparable efficiency to spatial domain processing. By separating the image into different subbands, important feature learning occurs over varying low to high frequencies. The fusion of the learned low and high frequency features, and processing of the combined feature mapping, results in an increase in detection accuracy. Comparing the proposed methods to spatial domain CNN and the Stacked Denoising Autoencoder (SDA), experimental findings reveal a substantial increase in accuracy.

Keywords— CNN, SDA, Neural Network, Deep Learning, Wavelet, Classification, Fusion, Machine Learning, Object Recognition

I. INTRODUCTION

Image classification is one of the many applications of machine learning that are impacting business, medicine, technology, research, finance, and other fields [1]. The desire to create faster, more efficient machine learning algorithms to classify images is driving research in universities, corporations, and start-up businesses around the world. Many of these machine learning algorithms fall under the umbrella of deep learning, a subset of machine learning.

Deep learning trains computers to differentiate patterns in data through numerous layers of nonlinear information processing. Each layer represents newly learned features and builds upon them for the next layer, so that more complex, higher-level features derive from the lower-level ones. These discoveries allow the network to distinguish between different classes better the deeper it goes. The result is the classification and organization of huge, messy, disorderly data at speeds much faster than shallower forms of machine learning [2,3]. This paper emphasizes Convolutional Neural Networks (CNN), a type of deep neural network whose structure and approach differ from other deep neural networks. Their strength and design lie in handling two-dimensional data, like images and videos. The CNN is compared to the Stacked Denoising Autoencoder (SDA), which follows the more traditional fully connected structure of neural networks.

Typically, image classification using CNN, SDA, etc. is performed on the raw image pixels. This paper, however, proposes an algorithm that converts the data into the wavelet domain. The subbands of the first-order decomposition are processed via CNN, and the classification results of each CNN are combined using the OR operator, generating a higher classification accuracy than CNN on the spatial image data; the same approach is also applied to SDA for comparison [4]. All simulations are done in MATLAB R2016b. In this paper, handwritten digits from the Mixed National Institute of Standards and Technology (MNIST) database [5] and natural scenes from the Canadian Institute for Advanced Research (CIFAR-10) dataset [6] are classified by each deep learning approach.

The rest of this paper is organized as follows: Section II gives the background, Section III describes the proposed methods, Section IV discusses the experimental results, and Section V gives the summary and conclusion.

II. BACKGROUND

A. Wavelet Transform

Wavelet theory involves representing general functions in terms of simpler, fixed building blocks, referred to as 'wavelets', at different scales and positions. The Discrete Wavelet Transform (DWT) can be regarded as a sequence of numbers that sample a certain continuous function [7-9].

When digital images are handled at multiple resolutions, the DWT is a viable mathematical tool. In addition to its efficient, highly intuitive framework for representation and storage of multiresolution images, the DWT provides powerful insight into an image’s spatial and frequency characteristics.


Let an image, f(x,y), have dimensions M x N. We define the two-dimensional DWT transform pair as

$$W_{\varphi}(j_0, m, n) = \frac{1}{\sqrt{M \cdot N}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\,\varphi_{j_0,m,n}(x,y) \qquad (1)$$

$$W_{\psi}^{i}(j, m, n) = \frac{1}{\sqrt{M \cdot N}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\,\psi_{j,m,n}^{i}(x,y) \qquad (2)$$

We define the Inverse Discrete Wavelet Transform (IDWT) as

$$f(x,y) = \frac{1}{\sqrt{M \cdot N}} \sum_{m} \sum_{n} W_{\varphi}(j_0,m,n)\,\varphi_{j_0,m,n}(x,y) + \frac{1}{\sqrt{M \cdot N}} \sum_{i=H,V,D} \sum_{j=j_0}^{\infty} \sum_{m} \sum_{n} W_{\psi}^{i}(j,m,n)\,\psi_{j,m,n}^{i}(x,y) \qquad (3)$$

where W_φ are the approximation coefficients, W_ψ are the detail coefficients, m and n are the subband dimensions, j is the resolution level, and i indexes the subband set {H, V, D}.

The Fast Wavelet Transform (FWT) can be expressed as:

$$W_{\psi}(j, k) = \sum_{m} h_{\psi}(m - 2k)\, W_{\varphi}(j+1, m) \qquad (4)$$

$$W_{\varphi}(j, k) = \sum_{m} h_{\varphi}(m - 2k)\, W_{\varphi}(j+1, m) \qquad (5)$$

where k is the position parameter. Equations (4) and (5) expose a useful relationship between the DWT coefficients of adjacent scales: the scale-j approximation and detail coefficients can be computed iteratively by convolving W_φ(j+1, n) with the time-reversed scaling and wavelet vectors h_φ(-n) and h_ψ(-n), and subsampling the outcomes.
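To make the filter-bank reading of (4) and (5) concrete, below is a minimal one-dimensional MATLAB sketch under the Haar filters; the example signal and the sign convention of the highpass filter are illustrative assumptions, not taken from the paper.

    % One level of the 1-D FWT, per (4) and (5): convolve with the
    % time-reversed filters, then keep every other sample.
    x     = [4 6 10 12 8 6 5 5];       % example signal (power-of-two length)
    h_phi = [1 1] / sqrt(2);           % Haar scaling (lowpass) filter
    h_psi = [1 -1] / sqrt(2);          % Haar wavelet (highpass) filter

    approx = conv(x, fliplr(h_phi), 'valid');  % filter by h_phi(-n)
    detail = conv(x, fliplr(h_psi), 'valid');  % filter by h_psi(-n)

    W_phi = approx(1:2:end);           % subsample by 2: scale-j approximation
    W_psi = detail(1:2:end);           % subsample by 2: scale-j detail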

Similar to the one-dimensional FWT, the two-dimensional FWT filters the approximation at resolution level j + 1 to acquire the approximation and details at the jth resolution level. In the two-dimensional case, however, we expect three sets of detail coefficients, residing in the horizontal, vertical, and diagonal directions [10-13]. Figure 1 shows a representation of the multilevel wavelet decomposition at level 3.

Figure 1: Level 3 Wavelet Decomposition

The subbands LH_j, HL_j, and HH_j, j = 1, 2, ..., J, hold the detail coefficients, as noted above, where j is the scale and J denotes the largest or coarsest scale in the decomposition [14]. The wavelet decomposition detail coefficients allow image processing applications to be performed in each subband, addressing each subband independently if desired. Once processing occurs in each subband, the image is reconstructed using the inverse discrete wavelet transform (IDWT).
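As a minimal illustration of this decompose-process-reconstruct cycle, the sketch below uses the dwt2 and idwt2 functions from MATLAB's Wavelet Toolbox; the test image is a stand-in, and the subbands are named after the paper's LL/LH/HL/HH convention rather than MATLAB's cA/cH/cV/cD output names.

    % One-level 2-D DWT: split an image into four subbands, then rebuild
    % it with the inverse transform (IDWT).
    img = im2double(imread('cameraman.tif'));   % any grayscale image
    [LL, LH, HL, HH] = dwt2(img, 'haar');       % approximation + details

    % ... per-subband processing would happen here ...

    rec = idwt2(LL, LH, HL, HH, 'haar');        % reconstruction
    fprintf('max reconstruction error: %g\n', max(abs(img(:) - rec(:))));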

B. Convolutional Neural Networks

Convolutional Neural Networks (CNN) are structurally unlike most neural networks. Most neural networks convert the input data into a one-dimensional vector of neurons. CNNs, however, use a spatial structure that matches the structure of the data they capture, which makes them very well suited to classifying images, videos, etc. [5,15].

CNNs use the same tools as other neural networks, i.e., gradient descent, backpropagation, nonlinear activation functions, dropout, etc. However, the change in structure also changes how weights are learned and shared, and how the dimensions are reduced from one layer to the next.

Most neural networks use fully connected layers, where every neuron is connected to each neuron in the next layer. In a CNN, instead of each input neuron being connected to each neuron in the next hidden layer, a region of input neurons is connected to one neuron in the hidden layer. This region, called a local receptive field, is a window of a set size, usually square. Within this field, each connection learns a weight, along with an overall bias for the neuron in the hidden layer [15].

The local receptive field weights and bias are the same for each neuron in the hidden layer; unlike typical neural networks, a CNN shares its weights and biases across the whole mapping from the input layer to the hidden layer [15]. The local receptive field moves across the input layer in a typical window-filter fashion to create the activations for the next layer. These calculations are performed in what is called the convolutional layer, and the process occurs for every feature map in the layer. A mathematical representation of this weight-sharing filter is:

$$y_{j,k} = \sigma\left( b + \sum_{l=0}^{n-1} \sum_{m=0}^{n-1} W_{l,m}\, a_{j+l,\,k+m} \right) \qquad (6)$$

where W_{l,m} represents the shared weights, b represents the bias, a_{j+l,k+m} is the input activation at a given position, and n is the window size of the convolution filter. An example representation of a local receptive field and shared weights is shown in Figure 2.

Figure 2: 5x5 Local Receptive Field + Shared Weights
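A minimal sketch of the shared-weight computation in (6): one 5x5 receptive field slid over a 28x28 input, as in Figure 2. The input, weights, and the sigmoid standing in for σ are illustrative assumptions.

    % Every output neuron reuses the same weights W and bias b, per (6).
    a = rand(28, 28);                  % input activations (e.g., a digit image)
    W = 0.1 * randn(5, 5);             % shared weights of the receptive field
    b = 0.1;                           % shared bias
    sigma = @(z) 1 ./ (1 + exp(-z));   % sigmoid nonlinearity

    % conv2 flips its kernel, so flip W back to get the sliding inner product.
    y = sigma(conv2(a, rot90(W, 2), 'valid') + b);   % 24x24 feature map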


The basic structure of a CNN alternates between a convolutional layer, detailed above, and a pooling layer [5,16]. This pairing continues until the data is reduced enough to combine all of the feature maps into a fully connected layer, which is usually a softmax classifier.

Pooling is another term for subsampling. In this layer, the dimensions of the output of the convolutional layer are condensed by summarizing a region into one neuron value, and this occurs until all neurons have been affected. The two most popular forms of pooling are max-pooling and mean-pooling [15]; other forms, i.e., stochastic [16] and mixed [17] pooling, improve upon the strengths and weaknesses of these methods. Max-pooling takes the maximum value of a region and selects it for the condensed feature map, while mean-pooling takes the average value of a region. The max-pooling function is expressed as:

$$a_{kij} = \max_{(p,q) \in R_{ij}} \left( a_{kpq} \right) \qquad (7)$$

while mean-pooling is given by:

$$a_{kij} = \frac{1}{|R_{ij}|} \sum_{(p,q) \in R_{ij}} a_{kpq} \qquad (8)$$

where R_{ij} is the pooling region. An illustration of both pooling methods is given in Figure 3.

Figure 3: Example of Max and Mean Pooling with Stride of 2
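A minimal sketch of both pooling rules in (7) and (8), using non-overlapping 2x2 regions with a stride of 2 as in Figure 3; the feature-map size is an assumption.

    % Max- and mean-pooling of a feature map over 2x2 regions, stride 2.
    A = rand(24, 24);                          % a feature map (size assumed)
    [M, N]   = size(A);
    maxpool  = zeros(M/2, N/2);
    meanpool = zeros(M/2, N/2);
    for i = 1:M/2
        for j = 1:N/2
            blk = A(2*i-1:2*i, 2*j-1:2*j);     % one pooling region R_ij
            maxpool(i, j)  = max(blk(:));      % eq. (7)
            meanpool(i, j) = mean(blk(:));     % eq. (8)
        end
    end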

A complete CNN connects an alternating flow of convolutional layers and pooling layers. Other layers of processing exist, i.e., rectified linear units (ReLU) [18], dropout [19], batch normalization [20], etc. The purposes of these layers vary, from creating stronger activations to regularizing the network for peak performance. Figure 4 shows an example CNN architecture.

Figure 4: Example CNN Architecture

III. PROPOSED METHODS

Applying CNN to the raw pixels of images generates accurate results. However, the size and complexity of these images in the spatial domain cause the efficiency of the algorithm to decrease. By converting the images into the wavelet domain, they can be processed at a lower dimension, with faster processing times [21].

Furthermore, given the varying frequencies represented in each subband, multiple CNNs performed on each subband, or on a combination of them, can increase the accuracy of the classification. The main steps are outlined below (a sketch of steps 1-3 follows the list):

1.) Convert the raw images into the wavelet domain
2.) Perform Z-score normalization on the subbands [21,22]:

$$Z = \frac{M - mean(M)}{std(M)} \qquad (9)$$

where M is the input, and mean and std represent the 2-D mean and standard deviation of the input.

3.) Normalize all subbands to [0,1], except for the LL band
4.) Perform CNN on the selected subbands
5.) Combine all results using the OR operator to get the final classification [23]
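Below is a minimal MATLAB sketch of steps 1-3, assuming a single grayscale image and the Haar basis adopted later in the paper; the CNN stage (step 4) and the OR combination (step 5) are sketched in later sections.

    % Steps 1-3: wavelet decomposition, Z-score normalization (9), and
    % [0,1] scaling of every subband except LL.
    img = im2double(imread('cameraman.tif'));       % stand-in input image
    [LL, LH, HL, HH] = dwt2(img, 'haar');           % step 1: wavelet domain

    zscore2 = @(M) (M - mean(M(:))) ./ std(M(:));   % eq. (9): 2-D mean/std
    LL = zscore2(LL);  LH = zscore2(LH);            % step 2: Z-score all
    HL = zscore2(HL);  HH = zscore2(HH);            %         subbands

    scale01 = @(M) (M - min(M(:))) ./ (max(M(:)) - min(M(:)));
    LH = scale01(LH);  HL = scale01(HL);            % step 3: [0,1] scaling,
    HH = scale01(HH);                               %         except the LL band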

The application of the wavelet subbands is presented in two different ways. The first way (hereafter called CNN-WAV2) fuses the detail coefficients (LH, HL, HH) together prior to processing the images, according to the formula [21]:

$$HF = \alpha \cdot LH + \beta \cdot HL + \gamma \cdot HH \qquad (10)$$

where α, β, and γ are the weight parameters of each subband, whose values are determined as [23]:

$$\alpha = \frac{TA_{LH}}{TA_{LH} + TA_{HL} + TA_{HH}} \qquad (11)$$

$$\beta = \frac{TA_{HL}}{TA_{LH} + TA_{HL} + TA_{HH}} \qquad (12)$$

$$\gamma = \frac{TA_{HH}}{TA_{LH} + TA_{HL} + TA_{HH}} \qquad (13)$$

where TA is the test accuracy of each individual subband after CNN processing; the Results section gives the formula for test accuracy. The CNN-WAV2 method is shown in Figure 5.

Figure 5: CNN-WAV2 Method
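A minimal sketch of the fusion in (10)-(13); the detail subbands are stand-ins, and the test accuracies are the per-subband MNIST values reported later in Table 3.

    % Fuse the detail subbands into one high-frequency band HF, weighting
    % each subband by its standalone CNN test accuracy, per (10)-(13).
    LH = rand(14);  HL = rand(14);  HH = rand(14);   % stand-in detail bands
    TA_LH = 95.86;  TA_HL = 95.65;  TA_HH = 88.57;   % from Table 3 (MNIST)
    T = TA_LH + TA_HL + TA_HH;

    alpha = TA_LH / T;                               % eq. (11)
    beta  = TA_HL / T;                               % eq. (12)
    gamma = TA_HH / T;                               % eq. (13)
    HF    = alpha*LH + beta*HL + gamma*HH;           % eq. (10): fused band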


The other application (hereafter called CNN-WAV4) uses all four of the first-level wavelet decomposition subbands, and it is executed according to the diagram in Figure 6.

Figure 6: CNN-WAV4 Method

The trade-off between the two approaches is time efficiency versus accuracy, which is discussed in the Results section.

IV. RESULTS AND DISCUSSION

A. MNIST

All CNN experiments use MatConvNet [24] to model the algorithms and obtain results. The network architecture is based on the example MNIST structure from MatConvNet, with batch normalization inserted. All training is done using the ADAM stochastic optimization method [25]; all other parameters are the same. The basic structure used is shown in Figure 7.

Figure 7: CNN Structure Block Diagram

For the CNN-WAV2 and CNN-WAV4 methods, the first two convolution layers use a 3x3 kernel. All SDA [26] experiments originate from the code provided in the Unsupervised Feature Learning and Deep Learning (UFLDL) tutorial at Stanford [27], modified for this research. All parameters, with the exception of the hidden layer sizes, are the same. The structure of the SDA is 784-625-784-10, while SDA-WAV2 and SDA-WAV4 have a structure of 196-169-196-10.

The wavelet basis is the Haar wavelet, chosen mainly for its even, square subbands. Only the first-level wavelet decomposition is used in this paper. All experiments are run on a 64-bit operating system, with an Intel® Xeon® CPU E31225 @ 3.10 GHz processor and 8.0 GB of RAM.
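The smaller SDA input sizes follow directly from the decomposition: one Haar level halves each image dimension, so a 28x28 MNIST digit yields 14x14 = 196-pixel subbands, matching the 196-unit structures above. A minimal check (the random digit is a stand-in):

    % One Haar level: four 14x14 subbands from a 28x28 image.
    digit = rand(28, 28);                    % stand-in for an MNIST digit
    [LL, LH, HL, HH] = dwt2(digit, 'haar');
    disp(size(LL))                           % prints: 14 14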

The input training data and test data come from the MNIST database of handwritten digits. The full training set of 60,000 images is used, as well as the full testing set of 10,000 images. The test accuracy (TA), miss ratio (MR), false ratio (FR), and error ratio (ER) are measured for each method, and these evaluations are defined as [21]:

$$TA = \frac{\#\ of\ correctly\ classified}{\#\ of\ tested\ samples} \times 100\% \qquad (14)$$

$$MR = 100\% - TA \qquad (15)$$

$$FR = \frac{\#\ of\ falsely\ classified}{\#\ of\ correctly\ classified} \times 100\% \qquad (16)$$

$$ER = MR + FR \qquad (17)$$
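A minimal sketch of the four metrics in (14)-(17), assuming column vectors of predicted and true labels; the random vectors here are stand-ins for actual classifier output.

    % Detection metrics per (14)-(17).
    labels = randi([0 9], 10000, 1);                       % ground truth
    pred   = randi([0 9], 10000, 1);                       % predictions

    TA = 100 * sum(pred == labels) / numel(labels);        % eq. (14)
    MR = 100 - TA;                                         % eq. (15)
    FR = 100 * sum(pred ~= labels) / sum(pred == labels);  % eq. (16)
    ER = MR + FR;                                          % eq. (17)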

Below is a sample of the training and testing data in Figure 8.

Figure 8: MNIST Training and Testing Data Samples

Out of all of the approaches, the CNN-WAV4 method achieves the highest classification accuracy. This high accuracy stems from having four subbands that each detect a high number of digits on their own, as well as uniquely detecting digits that no other subband detects. These combined contributions result in CNN-WAV4 surpassing all other methods. The CNN-WAV2 method produces a slightly less accurate classification score, but processes the data more than twice as fast as the CNN-WAV4 method. The fusion of the detail coefficients (LH, HL, HH) drives the speedup over the CNN-WAV4 method, since only two subbands need processing rather than four. The fused band retains most of the accuracy of the individual subbands, but loses the individual contributions and unique digits the CNN-WAV4 method detects.

In both proposed approaches, the LL subband contributes the highest number of correctly classified digits; this subband has the most image detail and similarity to the original image. Below are two tables that display the detection performance and time efficiency of the proposed methods compared to CNN, as well as the wavelet approach applied to SDA:

            Accuracy (%)  Miss Ratio (%)  False Ratio (%)  Error Ratio (%)
CNN              99.11          0.89            0.90             1.79
CNN-WAV2         99.40          0.60            0.60             1.20
CNN-WAV4         99.67          0.33            0.33             0.66
SDA              98.07          1.93            1.97             3.90
SDA-WAV2         98.18          1.82            1.85             3.67
SDA-WAV4         99.09          0.91            0.92             1.83
Table 1: MNIST Detection Performance of CNN and SDA Methods

            Training (s)  Testing (s)  Overall (s)
CNN              1007.10         2.34      1009.44
CNN-WAV2         1361.57         3.01      1364.58
CNN-WAV4         2832.45         6.52      2838.97
SDA              4467.46         1.07      4468.53
SDA-WAV2         1283.57         0.33      1283.90
SDA-WAV4         2400.35         0.67      2401.02
Table 2: MNIST Time Efficiency for Different CNN and SDA Methods

All of the CNN approaches trump the SDA approaches, confirming the CNN's strength in handling two-dimensional data, as compared to the vector approach of traditional neural networks. The weakest CNN method, spatial CNN, still outperforms the strongest SDA method, SDA-WAV4.

The only downside of the proposed CNN methods is that they run slower than their SDA counterparts. Part of this slowdown is that the CNN-based wavelet methods have a greater depth than the SDA; adding more depth to the SDA might give a fairer efficiency comparison between the CNN and SDA wavelet methods.

CNN-WAV4, having four possibilities for correct detection, has a higher chance of positive prediction than CNN-WAV2 and the spatial CNN. Each subband solely detects digits its counterparts miss.

CNN-WAV4 misses 33 digits. A significant number of these misses are understandable, as the writing of some of the digits is ambiguous. Some of the misses appear in line with what the value seems to be, even though they are wrong: for example, a '2' that looks like a '7', or an '8' that looks like a '9'. Without context, most people would also miss these digits.

In many of the missed cases, all of the subbands are in unanimous agreement with the prediction (i.e., predicting an oddly written '6' as a '4'). In other cases, there is unity in the prediction, except for the HH subband, which is the least accurate. Still, there are digits for which each subband predicts a different erroneous value. Figure 9 shows these disparities.

Figure 9: CNN-WAV4 Missed Digits and Subband Predictions

Comparatively, the CNN-WAV2 method misses 60 digits. For most of the missed digits, the two subbands (LL and HF) classify them the same. There is less variance in the missed classifications because the fusion of the detail subbands increases the likelihood of agreement with the more accurate LL subband, even when both are incorrect. Figure 10 shows the missed digits for the CNN-WAV2 method.

Figure 10: CNN-WAV2 Missed Digits and Subband Predictions

One advantage of the proposed methods over operations in the spatial domain comes from the strength of the individual subbands. When one or more subbands make an error, another subband can correctly predict the digit. This allows for a system of error catching not provided in the spatial domain.

For CNN-WAV4, this error catching adds 89 correct classifications. The bar graph in Figure 11 shows the unique detections per subband for CNN-WAV4 (LL: 59, LH: 13, HL: 11, HH: 6).

Figure 11: Unique Digits Detected Per Subband (CNN-WAV4)
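A minimal sketch of the OR combination (step 5 of the proposed method) behind this error catching: a digit counts as correctly classified when at least one subband's CNN predicts it right. The predictions are stand-ins whose per-subband miss rates roughly follow Table 3; they are not the paper's actual outputs.

    n = 10000;
    labels = randi([0 9], n, 1);                       % ground truth
    preds  = repmat(labels, 1, 4);                     % LL, LH, HL, HH columns
    flip   = rand(n, 4) < repmat([.011 .041 .044 .114], n, 1);
    preds(flip) = randi([0 9], nnz(flip), 1);          % corrupt some predictions

    correct  = (preds == repmat(labels, 1, 4));        % per-subband hits
    TA_fused = 100 * mean(any(correct, 2));            % OR across subbands

    % Unique detections: digits only the LL subband catches.
    uniqueLL = sum(correct(:,1) & ~any(correct(:,2:4), 2));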

In comparison, for CNN-WAV2, this adds 382 correct classifications. The large gap between the two methods is partially due to the number of digits shared among some or all of the four subbands in CNN-WAV4, versus the pairing of only two subbands in CNN-WAV2. The bar graph in Figure 12 shows the unique detections per subband (LL: 334, HF: 48).

Figure 12: Unique Digits Detected Per Subband (CNN-WAV2)


The unique digits each subband detects further show the effectiveness of the proposed methods. Of the 89 unique digits for CNN-WAV4, 25 digits are detections CNN misses. For CNN-WAV2, 22 are detections CNN misses. Table 3 breaks down unique digits detected per subband.

                        LL     LH     HL     HH     HF
Accuracy (%)          98.92  95.86  95.65  88.57  96.06
Unique CNN-WAV4          59     13     11      6      -
Unique CNN-WAV2         334      -      -      -     48
Detected vs CNN Miss     11      4      6      4     10
CNN Miss (%)           12.4    4.5    6.7    4.5   11.2
Useful CNN (%)         18.6   30.8   54.5   66.7   20.8
Table 3: Breakdown of Unique Digits Detected per Subband

For clarification, 'Detected vs CNN Miss' refers to the unique digits each subband detects that CNN misses. 'CNN Miss (%)' is the ratio of a subband's unique detections that CNN misses to the total number of misses for CNN (e.g., 11/89). 'Useful CNN (%)' is the ratio of unique detections CNN misses to the total number of unique detections for that subband (e.g., 11/59).
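The last three rows of the table can be computed along the same lines; below is a sketch continuing the OR-combination example above (reusing its correct matrix), with correctCNN a stand-in for the spatial CNN's per-digit hits at 99.11% accuracy.

    correctCNN = rand(10000, 1) < 0.9911;        % stand-in spatial-CNN hits

    uniqLL     = correct(:,1) & ~any(correct(:,2:4), 2);  % unique to LL
    dvcm       = sum(uniqLL & ~correctCNN);      % 'Detected vs CNN Miss'
    cnnMissPct = 100 * dvcm / sum(~correctCNN);  % 'CNN Miss (%)'
    usefulPct  = 100 * dvcm / sum(uniqLL);       % 'Useful CNN (%)'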

B. CIFAR-10

All CNN experiments use MatConvNet. The network architecture is based on the CIFAR-10 structure from Zeiler's stochastic pooling work [16]. All training is done using stochastic gradient descent [28]. All other parameters are the same. The basic structure used is shown in Figure 13.

Figure 13: CNN Structure Block Diagram

All SDA experiments use the autoencoder example in the MATLAB Neural Network Toolbox. The structure of the SDA is 3072-1024-512-10, while SDA-WAV2 and SDA-WAV4 have a structure of 1024-400-100-10. All other parameters are the same.

The input training data and test data come from the CIFAR-10 dataset. The full training set of 50,000 images is used, as well as the full testing set of 10,000 images. A sample of the training and testing data is shown in Figure 14.

Figure 14: CIFAR-10 Training and Testing Data Samples

Like the MNIST results, the CNN-WAV4 method achieves the highest classification accuracy, and the CNN-WAV2 method produces a less accurate classification score but processes the data more than twice as fast as the CNN-WAV4 method. Also similar to the MNIST results, the LL subband contributes the highest number of correctly classified images; this subband has the most image detail and similarity to the original image.

Below are two tables that display the detection performance and time efficiency of the proposed methods compared to CNN, as well as the wavelet approach applied to SDA:

            Accuracy (%)  Miss Ratio (%)  False Ratio (%)  Error Ratio (%)
CNN              77.53         22.47           28.98            51.45
CNN-WAV2         76.42         23.58           30.86            54.44
CNN-WAV4         85.67         14.33           16.73            31.06
SDA              48.64         51.36          100.06           151.42
SDA-WAV2         50.65         49.35           97.43           146.78
SDA-WAV4         67.45         32.55           48.26            80.81
Table 4: CIFAR-10 Detection Performance of CNN and SDA Methods

            Training (s)  Testing (s)  Overall (s)
CNN             8,545.24        25.83     8,571.07
CNN-WAV2       10,800.20        78.13    10,878.33
CNN-WAV4       21,696.66       100.80    21,797.44
SDA            41,847.00         2.19    41,849.19
SDA-WAV2        9,155.52         0.41     9,155.93
SDA-WAV4        8,977.00         0.87     8,977.87
Table 5: CIFAR-10 Time Efficiency for Different CNN and SDA Methods

Like the MNIST results, the CNN approaches outclass the SDA approaches. The weakest CNN method, CNN-WAV2, still outperforms the strongest SDA method, SDA-WAV4. The proposed CNN methods run slower than their SDA counterparts; part of this slowdown is that the CNN-based wavelet methods have a greater depth than the SDA.

The multiple subbands of the proposed methods allow error correction of the incorrect predictions. For CNN-WAV4, this adds 1767 correct classifications. The bar graph in Figure 15 shows the unique detections per subband for CNN-WAV4 (LL: 641, LH: 421, HL: 368, HH: 337).

Figure 15: Unique Scenes Detected Per Subband (CNN-WAV4)

In comparison, for CNN-WAV2, this adds 3100 correct classifications. The bar graph in Figure 16 shows the unique detections per subband (LL: 1953, HF: 1147).


Figure 16: Unique Scenes Detected Per Subband (CNN-WAV2)

The unique scenes each subband detects further show the effectiveness of the proposed methods. Of the 1767 unique scenes for CNN-WAV4, 652 are detections CNN misses. For CNN-WAV2, 527 of the unique scenes are detections CNN misses. Table 6 breaks down the unique scenes detected per subband.

                        LL     LH     HL     HH     HF
Accuracy (%)          64.95  56.72  54.21  48.32  56.89
Unique CNN-WAV4         641    421    368    337      -
Unique CNN-WAV2        1953      -      -      -   1147
Detected vs CNN Miss    171    184    142    155    356
CNN Miss (%)            7.6    8.2    6.3    6.9   15.8
Useful CNN (%)         26.7   43.7   38.6   46.0   31.0
Table 6: Breakdown of Unique Scenes Detected per Subband



V. CONCLUSION


The proposed methods produce more accurate classification results than spatial CNN and SDA because of the versatility of the wavelet subbands. CNN-WAV2 is more time-efficient than CNN-WAV4 because it has fewer subbands to process, while CNN-WAV4 achieves the highest accuracy because its four subbands can correct each other's errors. The unique detections further establish the value of the proposed methods over the spatial CNN approach. Using GPUs would likely reduce, or even equalize, the processing time of both proposed CNN methods.



Future research in this area could expand the algorithm to multiple decomposition levels. Working with a dataset of larger images would also strengthen the claim of time efficiency while maintaining higher accuracy.

REFERENCES

[1] M. Wernick, Y. Yang, J. Brankov, G. Yourganov, and S. Strother, "Machine Learning in Medical Imaging", IEEE Signal Processing Magazine, vol. 27, no. 4, pp. 25-38, July 2010.
[2] I. Arel, D. C. Rose, and T. P. Karnowski, "Deep Machine Learning – A New Frontier in Artificial Intelligence Research", IEEE Computational Intelligence Magazine, vol. 5, no. 4, pp. 13-18, November 2010.
[3] L. Deng and D. Yu, "Deep Learning: Methods and Applications", Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197-387, 2014.
[4] T. Williams and R. Li, "SDA-Based Neural Network Approach to Digit Classification", IEEE SoutheastCon 2016 Proceedings, pp. 1-6, Norfolk, VA, Mar. 30 - Apr. 3, 2016.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, vol. 86, pp. 2278-2324, 1998.
[6] A. Krizhevsky, "Learning multiple layers of features from tiny images", Technical Report TR-2009, University of Toronto, 2009.
[7] C. K. Chui, "An Introduction to Wavelets", New York: Academic Press, 1992.
[8] G. Strang and V. Strela, "Short Wavelets and Matrix Dilation Equations", IEEE Transactions on Signal Processing, vol. 43, no. 1, January 1995.
[9] P. Rieder, J. Gotze, and J. A. Nossek, "Multiwavelet Transforms Based on Several Scaling Functions", Proc. IEEE Int. Sym. on Time-Frequency and Time-Scale, October 1994.
[10] S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation", IEEE Trans. Patt. Anal. Mach. Intell., vol. 11, pp. 674-693, 1989.
[11] G. P. Nason and B. W. Silverman, "The stationary wavelet transform and some statistical applications", Lect. Notes Statist., vol. 103, pp. 281-300, 1995.
[12] G. Strang and T. Nguyen, "Wavelets and Filter Banks", Wellesley: Wellesley-Cambridge Press, 1996.
[13] C. S. Burrus, R. A. Gopinath, and H. Guo, "Introduction to Wavelets and Wavelet Transforms: A Primer", Englewood Cliffs: Prentice Hall, 1998.
[14] R. Sihag, R. Sharma, and V. Setia, "Wavelet Thresholding for Image De-noising", International Conference on VLSI, Communication & Instrumentation, pp. 21-24, 2011.
[15] M. A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015.
[16] M. Zeiler and R. Fergus, "Stochastic pooling for regularization of deep convolutional neural networks", Proceedings of the International Conference on Learning Representations (ICLR), 2013.
[17] D. Yu, H. Wang, P. Chen, and Z. Wei, "Mixed pooling for convolutional neural networks", Rough Sets and Knowledge Technology, Lecture Notes in Computer Science, vol. 8818, pp. 364-375, Springer International Publishing, 2014.
[18] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton, "On rectified linear units for speech processing", Proc. ICASSP, pp. 3517-3521, 2013.
[19] N. Srivastava, "Improving Neural Networks with Dropout", Master's Thesis, Univ. of Toronto, Toronto, ON, Canada, 2013.
[20] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", Proceedings of the 32nd International Conference on Machine Learning, 2015.
[21] J. Tang, C. Deng, G.-B. Huang, and B. Zhao, "Compressed-Domain Ship Detection on Spaceborne Optical Image using Deep Neural Network and Extreme Learning Machine", IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 3, March 2015.
[22] L. A. Shalabi, Z. Shaaban, and B. Kasabeh, "Data Mining: A Preprocessing Engine", Journal of Computer Science, vol. 2, no. 9, pp. 735-739, 2006.
[23] C. Doukim, J. Dargham, A. Chekima, and S. Omatu, "Combining Neural Networks for Skin Detection", Signal & Image Processing: An International Journal, vol. 1, no. 2, December 2010.
[24] A. Vedaldi and K. Lenc, "MatConvNet – Convolutional Neural Networks for MATLAB", Proc. ACM Multimedia (ACMMM), 2015.
[25] D. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization", International Conference for Learning Representations, 2015.
[26] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion", J. Mach. Learn. Res., vol. 11, pp. 3371-3408, December 2010.
[27] A. Ng, "Exercise: Implement Deep Networks for Digit Classification" (UFLDL Tutorial), [online] May 26, 2011, http://ufldl.stanford.edu/wiki/index.php/Exercise:_Implement_deep_networks_for_digit_classification (Accessed: 8 September 2015).
[28] L. Bottou, "Large-scale machine learning with stochastic gradient descent", International Conference on Computational Statistics, pp. 177-187, 2010.
