ADAPTIVE PARALLEL ELM WITH CONVOLUTIONAL FEATURES FOR BIG STREAM DATA

Arif Budiman, M. Ivan Fanany, T. Basaruddin
Faculty of Computer Science, University of Indonesia
E-mail: [email protected]

Abstrak

The challenges of big stream data require special machine learning. Variety, variability, and complexity relate to the concept drift (CD) problem; volume and velocity relate to the scalability problem. We propose an integration approach of the convolutional neural network (CNN) with the extreme learning machine (ELM) that uses many CNNELMs in parallel. For CD, the first approach, the adaptive CNNELM (ACNNELM-1), uses a single ELM with multiple CNNs by applying the Adaptive Online Sequential ELM. The second approach (ACNNELM-2) uses a matrix-concatenation ensemble of multiple CNNELMs. For scalability, the distributed averaging (DA) CNNELM works with the MapReduce concept: the CNNELMs start from the same weight template and are then trained asynchronously on partitions of the training data; the final result is obtained by averaging the kernel weights and the ELM output weights. This saves training time compared with one CNNELM trained on the whole data. We study backpropagation training to improve accuracy through iterations. We verify the methods on the extended MNIST, Not-MNIST, and CIFAR10 data sets and simulate virtual drift, real drift, and hybrid drift. DA training splits the training data into several smaller partitions. The tools used are the Deep Learning toolbox with parallel CPU and Matconvnet with GPU. The drawback of the method is that it requires the selection of additional learning parameters and of the training data distribution.

Abstract

Big stream data challenges need special machine learning. Variety, variability, and complexity are related to the concept drift (CD) problem; volume and velocity are related to the scalability problem. We propose an integration approach of the Convolutional Neural Network (CNN) with the Extreme Learning Machine (ELM) that uses multiple parallel CNNELMs. For CD, the first approach, the Adaptive CNNELM (ACNNELM-1), uses a single ELM with multiple CNNs by employing the Adaptive Online Sequential ELM. The second approach (ACNNELM-2) uses a matrix-concatenation ensemble of multiple CNNELMs. For the scalability solution, the distributed averaging (DA) CNNELM works with the MapReduce concept: the CNNELMs start from the same weight template and are then trained asynchronously on partitions of the training data, and the final result is obtained by averaging all kernel and ELM output weights. This saves training time compared with a single CNNELM trained on the whole data. We study the backpropagation method to improve accuracy through iterations. We verify the methods using the extended MNIST, Not-MNIST, and CIFAR10 data sets, and simulate virtual drift, real drift, and hybrid drift. The DA training divides the training data set into several smaller partitions. The tools are the Deep Learning toolbox with CPU parallel enhancement and Matconvnet with GPU. The drawbacks are the need for additional learning parameters and for the selection of the training data distribution.

Keywords: big data; concept drift; convolutional; extreme learning machine; map reduce

1. Introduction

Machine learning design for big stream data must deal with infinite and rapid data streams that pose several challenges: variety of data formats, variability of data distributions, complexity of data sources, volume of data, and velocity of data processing [1]. We categorize the dynamic assumptions on variety, variability, or complexity as the concept drift (CD) problem. The volume and velocity of data, which are closely related to computation resources, are categorized as the computation scalability (CS) problem. CD refers to the statistical properties of the input attributes and target classes shifting over time [2]. CS is the ability of the machine to adapt to increased demands.

The CD and CS challenges are very difficult for a traditional machine learning method such as the Extreme Learning Machine (ELM) [3, 4] to handle. A promising method is deep learning (DL), known as feature representation learning [5]. The most popular DL model is the Convolutional Neural Network (CNN) [6]. CNN has better accuracy and scalability than ELM for big stream data, but it needs a longer time for iterations (epochs) and parameter tuning. To speed up, CNN exploits many parallel convolution operations, either on the many processors of a graphics processing unit (GPU) card (scale-up approach) or by distributing the load to many machines (scale-out approach) based on the MapReduce model [7, 8]. CNN uses resources more efficiently with single or lower precision, which improves scalability. However, CNN is not intended to address concept drift.

Based on our previous research [9, 10], the Adaptive Online Sequential ELM (AOS-ELM) is a simple solution to handle concept drift with many types of consecutive drifts. Different from CNN, ELM has a fast learning time with no iterative training scheme, using the pseudo inverse and fewer parameters. However, ELM has less scalability and consistency for big stream data. To the best of our knowledge, most of the literature focuses on only one of the issues: CD or CS. We implemented AOS-ELM for CD, but we cannot use ELM alone, as shallow learning, to answer the big stream data challenges. Most shallow learners are nonparametric machine learning algorithms [11], which have difficulties with big stream data. For that reason, CNN, as a parametric machine learning method, is the answer for big stream data. The integration of CNN and ELM is therefore our proposed solution for CD and CS, the focus problems of the big stream data challenges.

The aim of this research is to contribute a unified solution for CD and CS by integrating CNN with ELM (CNNELM). We focus on CNNELM models that work together in parallel asynchronously. We use a common CNN-with-ELM architecture [12, 13, 14] to gain the benefits of each method. We enhance the CNN as a hierarchical feature representation learner at the front, combined with the Elastic ELM (E2LM) or Parallel ELM [15, 16] as a parallel supervised classifier. We use single precision for both CNN and ELM.

We present the Adaptive Convolutional ELM (ACNNELM) scheme for CD, covering changes in the number of feature inputs, named virtual drift (VD), changes in the number of classes, named real drift (RD), and consecutive VD and RD at the same time, named hybrid drift (HD) [10]. We enhance ACNNELM in two ways: a single ELM with multiple CNNs at the classifier level (ACNNELM-1), which enhances E2LM with the AOS-ELM approach, and multiple CNNELM models combined by matrix concatenation at the ensemble level (ACNNELM-2). The multiple ACNNELM models start with different initial parameters and all work on the same training data.

Using a different approach from ACNNELM, we present a new CNNELM scale-out scheme based on MapReduce to overcome the scale-up limitation. The map process is the multiple CNNELMs training their own partitions of the data set asynchronously, starting from the same weight template. The reduce process is the averaging of all final CNNELM weights into one distributed averaging (DA) CNNELM. This approach can save a lot of training time compared with a single CNNELM trained on the whole training data. We employ the parallel stochastic gradient descent (SGD) algorithm [17] to fine-tune the weights.
To the best of our knowledge, the previous CNN and ELM integration literature [12, 13, 14] has not discussed iterative training, as it goes against the ELM tenet. In Section 1, we state the background problems of big stream data, followed by our research aims. In Section 2, we review the related literature on CNN, ELM, and CNNELM integration, focusing on two issues: CD and CS. In Section 3, we describe our proposed methods and explain how they solve each problem. In Section 4, we design the empirical experiments to prove the proposed methods for each problem. In Section 5, we present the performance results and our analysis. Section 6 discusses conclusions and some future plans to enhance our methods.

    2.

Notations We used symbol - for layer separation. Convolution layer denoted by "c". The number in the front of c is to show the number of feature map (channel). The number in the rear of c is to show the kernel size. Pooling layer denoted by "s". The number is the front of s is to show the pool size. Pooling layer used down sampling (down), maximum (max) or mean (avg) value of each selected pool. I.e.,”12c5-reLU-2s-avg” means 1 st convolution layer with 12 maps and 5x5 kernel size, following by reLU activation layer and the last layer is pooling layer with 2 pool size and use average function. The concatenation operator is |. Model 1|2 is Model 1 is concatenated with Model 2. The subscript font with parenthesis is to show the sequence number. The X (0) is the data at time k = 0 (initialization) following by X (1) ,X (2) ,...,X(k) in sequential series. The subscript font without parenthesis to show the concept number. The Xs is the data from concept (source or context) s. The concept drift event is using symbol

, where the "VD" shows the drift type.

Literature Review

1. Extreme Learning Machine (ELM)

ELM is a supervised machine learning algorithm based on a single hidden layer feedforward neural network (SLFN) architecture [3, 4]. It requires a training input matrix X_{d×N} with d attributes and N observations, and a target output matrix T_{N×m} with m classes. The SLFN architecture needs only the hidden layer matrix H_{N×L} with L hidden nodes. The matrix H is computed using a nonlinear activation function g of the summation of the bias vector b and the input weight vectors a of the randomly generated input weight matrix A. The training result is the output weight matrix β, approximated by β̂ = H†T, where H† is the Moore–Penrose pseudo inverse of H. The solution uses the orthogonal projection method with the positive ridge regression value 1/λ:

β̂ = (I/λ + H^T H)^{-1} H^T T   (1)
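A minimal NumPy sketch of the batch ELM solve in Eq. (1), assuming H has already been computed from the random input weights; the names are illustrative, not the authors' MATLAB implementation:

```python
import numpy as np

def elm_train(H, T, lam=1e3):
    """Solve the ridge-regularized ELM output weights: beta = (I/lam + H^T H)^(-1) H^T T."""
    L = H.shape[1]                    # number of hidden nodes
    U = H.T @ H + np.eye(L) / lam     # regularized normal matrix
    V = H.T @ T                       # projection of the targets
    return np.linalg.solve(U, V)      # output weight matrix beta (L x m)

def elm_predict(H, beta):
    return H @ beta                   # class scores; argmax over columns gives the class
```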

ELM training is fast and can avoid local minima because it is simply equivalent to the least-squares solution β̂ of the linear system Hβ = T while the hidden weights remain fixed. ELM is capable of sequential learning by solving H^T H using two methods:

a. Sequential series using block matrix inversion. The Online Sequential ELM (OS-ELM) [18] expresses β̂_(k) as a function of β̂_(k−1). If β̂_(0) is obtained from N_0 training data and N_1 is the next batch, then β̂_(1) is approximated by

β̂_(1) = β̂_(0) + K_(1)^{-1} H_(1)^T ( T_(1) − H_(1) β̂_(0) ),  where  K_(1) = K_(0) + H_(1)^T H_(1)   (2)
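A small NumPy sketch of the OS-ELM recursion in Eq. (2), assuming the chunks (H1, T1) arrive sequentially; this is an illustrative sketch, not the reference OS-ELM code:

```python
import numpy as np

class OSELM:
    """Sequential ELM output-weight update, Eq. (2): K and beta are refreshed per data chunk."""
    def __init__(self, H0, T0, lam=1e3):
        L = H0.shape[1]
        self.K = H0.T @ H0 + np.eye(L) / lam          # K_(0), regularized
        self.beta = np.linalg.solve(self.K, H0.T @ T0)

    def update(self, H1, T1):
        self.K += H1.T @ H1                            # K_(1) = K_(0) + H1^T H1
        residual = T1 - H1 @ self.beta                 # error of the old model on the new chunk
        self.beta += np.linalg.solve(self.K, H1.T @ residual)
        return self.beta
```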

b. Parallelization using the MapReduce framework. The Elastic ELM (E2LM) [15] or Parallel ELM [16] uses the MapReduce framework [7, 8]. Map transforms the intermediate matrix multiplications for each training data portion in parallel; Reduce aggregates (sums) the Map results. Simply, U = H^T H and V = H^T T are decomposable matrices:

U = Σ_i H_i^T H_i ,   V = Σ_i H_i^T T_i ,   and   β̂ = (I/λ + U)^{-1} V ,

where H_i and T_i are computed from the i-th training data portion.
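A minimal sketch of the E2LM map/reduce idea, assuming the hidden-layer chunks can be processed independently on different workers (illustrative only, not the E2LM reference implementation):

```python
import numpy as np

def e2lm_map(H_chunk, T_chunk):
    """Map step: compute the partial sums for one training data portion."""
    return H_chunk.T @ H_chunk, H_chunk.T @ T_chunk

def e2lm_reduce(partials, lam=1e3):
    """Reduce step: sum the partial U, V matrices and solve for beta as in Eq. (1)."""
    U = sum(p[0] for p in partials)
    V = sum(p[1] for p in partials)
    L = U.shape[0]
    return np.linalg.solve(np.eye(L) / lam + U, V)

# Usage: the partials could be produced on different workers and reduced on one node.
# beta = e2lm_reduce([e2lm_map(H1, T1), e2lm_map(H2, T2)])
```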

We categorize the above sequential-learning ELM methods as non-adaptive ELM.

2. Convolutional Neural Network (CNN)

A CNN uses a feed-forward multi-layer NN architecture that typically consists of three types of layers: convolutional layers, pooling layers, and fully connected layers [19] (see Fig. 1).

Figure 1: CNN architecture [19]

The input layer is designed to exploit the 2D structure of a d × d × r image, where r is the number of image channels. The convolutional layer has c filters (or kernels) of size k × k × q. The filters have a locally connected structure, and each is convolved with the image to produce feature maps. Each map is then pooled using down, avg, or max sampling over s × s contiguous regions. An additive bias and activation function (e.g., sigmoid, tanh, or reLU) can be applied to each feature map either before or after the pooling layer. At the end of the CNN, there may be densely connected NN layers for supervised learning [19]. CNN uses back propagation (BP) to propagate the learning errors and SGD to optimize the kernel weight and bias parameters. Iteratively, CNN works batch by batch over the training data until it converges to the acceptance criteria. CNN inherits some drawbacks from NN, i.e., slow learning speed, low generalization performance from limited training data, and trivial human-tuned parameters [20].

3. Concept Drift (CD)

CD is a sequential learning problem related to uncertainties in data. According to Bayesian decision theory for class c and incoming data X, CD has several types:

a. Real Drift (RD) [2, 21, 22] refers to changes in P(c|X) that may be caused by a change in the class boundary (the number of classes) or in the class conditional probabilities (likelihood) P(X|c). If the number of classes expands and data of different classes arrive alternately, this is known as a recurrent context.

b. Virtual Drift (VD) [2, 21, 22] refers to changes in the distribution of the incoming data (i.e., P(X) changes), which may be due to an incomplete or partial feature representation of the current data distribution. The trained model is built with additional data from the same environment without overlapping the true class boundaries.

c. Hybrid Drift (HD) refers to RD and VD occurring consecutively [10].

The aim of CD handling is to boost the generalization accuracy when the drift occurs [2]. To understand CD handling strategies, a versatile sequential learning algorithm must satisfy the following (desiderata for online classifiers [23]):

a. The training data are sequential observations with varying or fixed chunk length presented to the learning algorithm.
b. At any time, only the newly arrived single observation or chunk (instead of the entire past data) is seen and learned.
c. A single or a chunk of training observations is discarded as soon as the learning procedure for that particular observation(s) is completed.
d. The learning algorithm has no prior knowledge of how many training data will be presented.

Each drift type calls for a different strategy, and it is hard to combine many strategies simultaneously to solve many types of CD on a single simple platform. Ensemble learning is the common approach to handle CD [24, 25]; another simple approach uses a single classifier [10, 26, 27]. Common ELM approaches for CD handling are ensemble-based [28, 29]. However, they have difficulties with consecutive types of CD, e.g., HD either as recurrent drift or as context replacement, and may not follow the desiderata for online classifiers [23]. Most ELM ensembles work for a specific drift case, which may be impractical for other cases. We pursue a simple unified platform with the capability to handle consecutive drift types.

Some papers [10, 30, 26] discuss single-ELM approaches in adaptive environments. Van Schaik et al. [30] proposed the Online Pseudo Inverse Update Method (OPIUM), which only tackles the RD case with a discriminant function boundary shift in the streaming data. Mirza et al. [26] proposed an OS-ELM for imbalanced and concept-drifting data named meta-cognitive OS-ELM (MOS-ELM). MOS-ELM uses an additional weighting matrix to control the CD adaptivity; however, it works for RD with concept replacement only. Budiman et al. [10] proposed the Adaptive OS-ELM (AOS-ELM) as a simple solution for VD, RD, and HD using simple matrix adjustments. To keep the minimized square error consistent under drift, the learning model needs a transition map from the former space to the new space: the learning model β̂_1 needs a transition space before it converges to the new learning model β̂_2. Budiman et al. used two approaches: i) assign random coordinates in the new concept space; ii) assign the equivalent projection coordinates in the new design space. The first approach suits the VD scenario, in which the new random coordinates act as the new input weight parameters; the solution is to adjust the input weight and bias pair. The second approach suits the RD situation, by setting the equivalent projection coordinates in the new space (e.g., X1 in a 1-D coordinate has the corresponding 2-D projection coordinates (X1, 0)); the solution is to concatenate the output weight matrix with a zero block matrix, changing its dimension without changing its values. Both solutions can be combined to solve the HD situation because they are independent of each other. However, the single-classifier approach has some drawbacks: its accuracy may not exceed an adaptive ensemble or a full batch approach, because changes in the shared weights could impact all notions.

CNN implementation to address CD is still an emerging research area. The handling approaches are similar to other machine learning strategies, i.e., modifying the parameters in a single CNN [27], or expanding the CNN structure [31], much as one expands decision tree or ensemble structures, named Adaptive CNN (ACNN). ACNN uses global structure expansion until the average error criterion is met and local expansion to grow the network structure. According to Zhang et al. [31], there is no theory on how a CNN structure should be constructed, i.e., the number of layers and the number of feature maps per layer.

4. Computation Scalability (CS)

Improving CS is not only a matter of upgrading hardware; it concerns how to distribute the computation and how to use computation resources efficiently while keeping good accuracy.

The first approach is to distribute the process in parallel. Parallel computing is the simultaneous use of multiple computing resources by breaking a process down into simpler series of instructions that can be executed simultaneously on different processing units under an overall control management [32]. There are several forms of parallel computing, but in this paper we focus on task parallelism with asynchronous Single Instruction, Multiple Data (SIMD), i.e., the MapReduce model. Google introduced the parallel programming model named MapReduce [7] as a solution for horizontal scaling. MapReduce provides two essential functions: 1) the Map function processes each sub-problem on other nodes within the cluster; 2) the Reduce function organizes the results from each node into a cohesive solution [8]. MapReduce can also be applied to SGD; parallel implementation was already an interesting research topic in the earlier years of SGD development [33], before the MapReduce term was introduced. Zinkevich et al. [34] proposed a parallel model of SGD that is highly suitable for parallel and large-scale machine learning. In parallel SGD, the training data is accessed locally by each model and only communicated when training has finished. The idea of SGD weight averaging was developed by Polyak et al. [35]: averaged SGD is ordinary SGD that averages its weights over time, and when optimization finishes, the averaged weight replaces the ordinary SGD weight. CNN uses a lot of convolution operations that are inherently parallel, which benefit from a graphics processing unit (GPU) implementation [36]. However, the CNN size is still limited mainly by the amount of memory available on current GPUs [37]. Wang et al. [38] used MapReduce on the Hadoop platform to take advantage of multi-core CPUs; however, the number of CPU cores is far less than a GPU can provide.

The second approach is to limit the real-number precision in the algorithm and data representation. Larger precision means a larger number of bits to express a real number (closer to the exact value), but also larger storage and longer computation time; e.g., the single precision floating-point format occupies 4 bytes and double precision occupies 8 bytes. Rounding is the procedure for choosing the representation of a real number in a floating point number system. According to Higham [39], rounding errors are an unavoidable consequence of working in finite precision arithmetic. Some floating-point operations, e.g., the pseudo inverse (pinv), may give different results and ill-conditioning warnings in single precision. However, not all matrix operations are affected; e.g., convolution is based on multiplication and summation. CNN uses many convolutions that suit GPUs working in single precision; unfortunately, ELM needs double precision for the pseudo inverse.

5. CNN and ELM Integration

Guo et al. [12] introduced an integrated CNN-ELM model and applied it to handwritten digit recognition, using CNN as an automatic feature extractor and ELM to replace the original classification layer of the CNN. Pang et al. [13] implemented the deep convolutional ELM (DC-ELM), which uses CNN for high-level feature abstraction from input images.
The abstracted features are then classified by an ELM classifier; Pang et al. did not use a sequential learning approach. Huang et al. [14] explained that the ELM theories are valid not only for the fully connected ELM architecture but also for local connections, named local receptive fields (LRF), or kernels in CNN terms.

3. Proposed Methods

We use the common CNNELM integration architecture [12, 13, 14] in which the last convolution layer output is fed as the hidden node matrix H of the ELM. Compared with a regular ELM, we do not need the input weight and hidden node bias parameters (see Fig. 2). Our method applies an optimal tanh to the final H to obtain better generalization accuracy, and uses the parallel E2LM as the supervised classifier to naturally support online learning in a parallel process. We implement the CNN global expansion approach [31] to improve the accuracy.

Figure 2: CNNELM architecture: the last CNN layer output is submitted as H of E2LM

We also study BP to improve the performance. It is similar to BP for densely connected NNs, except that we propagate the ELM error on the l-th layer back to the CNN layer using the gradient of the squared error with respect to H:

∂E/∂H = (H β̂ − T) β̂^T   (3)
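A small NumPy sketch of this error propagation step, under the assumption that the squared-error objective E = ½‖Hβ − T‖² is used; the full kernel-weight updates are then handled by SGD through the CNN layers (illustrative only):

```python
import numpy as np

def elm_error_to_feature_grad(H, beta, T):
    """Gradient of E = 0.5 * ||H beta - T||^2 with respect to the CNN feature output H.

    This gradient is what gets propagated back into the last CNN layer before
    the usual convolution-layer backpropagation and SGD updates take over.
    """
    residual = H @ beta - T        # ELM prediction error
    return residual @ beta.T       # dE/dH, same shape as H
```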

This is followed by the SGD method to tune all the weight kernels of the convolution layers. We implement our approach for CD as the Adaptive CNNELM (ACNNELM) and for CS as the Distributed Averaging (DA) CNNELM.

1. Adaptive CNNELM (ACNNELM)

We developed ACNNELM as two models:

a. ACNNELM-1: ACNNELM-1 uses a single ELM classifier for multiple CNN features using the global expansion approach. It works at the single-classifier level based on the transition map of AOS-ELM (see Fig. 3). We developed ACNNELM-1 for the CD scenarios below.

Figure 3: ACNNELM-1 for concept drift handling at the classifier level

i. Virtual Drift (VD). Let X2 be additional features from a new concept, concatenated together with the last layers of all CNNs into a single H matrix to compute β. The new H_(k) has a larger column size than the previous H_(k−1). Consequently, we adjust the dimension of the previous square matrix U_(k−1) by padding a zero block matrix in its rows and columns so that it matches U_(k), and adjust the dimension of the previous V_(k−1) by padding a zero block matrix in its rows only so that it matches the rows of V_(k), because the rows of V are related to matrix H.

ii. Real Drift (RD). Let T2 be an additional output class expansion as a new concept. Consequently, we adjust the column dimension of matrix V with a zero block matrix, because the columns of V are related to matrix T.

iii. Hybrid Drift (HD). We adjust both the row and column dimensions of the square matrix U and of matrix V. The adjustment processes of matrix U and matrix V are independent of each other (see the sketch after this list).
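A minimal NumPy sketch of these zero-padding adjustments of the accumulated U and V matrices, assuming simple drift events in which only the hidden-node count (VD) or the class count (RD) grows; the names are illustrative, not the authors' implementation:

```python
import numpy as np

def adjust_for_vd(U, V, extra_hidden):
    """VD: new CNN features enlarge H, so pad U in rows+columns and V in rows with zeros."""
    L = U.shape[0]
    U_new = np.zeros((L + extra_hidden, L + extra_hidden))
    U_new[:L, :L] = U
    V_new = np.zeros((L + extra_hidden, V.shape[1]))
    V_new[:L, :] = V
    return U_new, V_new

def adjust_for_rd(V, extra_classes):
    """RD: new output classes enlarge T, so pad V with zero columns."""
    return np.hstack([V, np.zeros((V.shape[0], extra_classes))])

def adjust_for_hd(U, V, extra_hidden, extra_classes):
    """HD: apply both adjustments; they are independent of each other."""
    U, V = adjust_for_vd(U, V, extra_hidden)
    return U, adjust_for_rd(V, extra_classes)
```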

b. ACNNELM-2: ACNNELM-2 combines several CNNELMs aggregated by matrix concatenation at the ensemble level (see Fig. 4). The idea is basically similar to the decomposable characteristic of the matrix Y = Hβ. The training processes of the CNNELMs are asynchronous and may start from different learning parameters on different computing resources. When they are complete, we aggregate all CNNELMs into ACNNELM-2 to boost the performance beyond a single model. It needs at least two members and no additional multi-classifier strategies. We developed ACNNELM-2 for the CD scenarios below.

i. Virtual Drift (VD). Let X2 be additional features from a new concept, assigned to a new CNNELM. We then concatenate the new CNNELM with the existing CNNELM without disturbing the previous CNNELM (see Fig. 5a).

ii. Real Drift (RD). Let T2 be an additional output class expansion, incrementing the order of class numbers as a new concept, assigned to a new CNNELM until its training is complete. First, we modify the β1 of the previous CNNELM by concatenating it with a zero block matrix 0, so that it has the same class number and order as the β2 of the new CNNELM. We then concatenate the new CNNELM with the existing CNNELM (see Fig. 5b).

iii. Hybrid Drift (HD). Each drift is assigned to its own CNNELM until all trainings are complete; we then concatenate all CNNELMs into one ACNNELM-2 ensemble (see the sketch below).
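A minimal sketch of the matrix-concatenation ensemble, assuming each member CNNELM exposes its feature matrix H_i and output weights beta_i, and that older members' beta matrices are zero-padded up to the current class count; this is an illustrative sketch, not the authors' code:

```python
import numpy as np

def pad_beta(beta, num_classes):
    """Zero-pad the output weights of an older member up to the current class count (RD case)."""
    L, m = beta.shape
    padded = np.zeros((L, num_classes))
    padded[:, :m] = beta
    return padded

def acnnelm2_predict(H_list, beta_list, num_classes):
    """Concatenation ensemble: Y = [H1 | H2 | ...] [beta1; beta2; ...] = sum_i H_i beta_i."""
    Y = np.zeros((H_list[0].shape[0], num_classes))
    for H_i, beta_i in zip(H_list, beta_list):
        Y += H_i @ pad_beta(beta_i, num_classes)
    return np.argmax(Y, axis=1)
```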

2. Distributed Averaging (DA) CNNELM

Different from ACNNELM, DA CNNELM uses many CNNELMs with the same initial weight parameters, but each model works on a different partition of the training data set. After all training is complete, we average all CNN kernel weights (including biases) and the ELM output weights. The detailed procedure is explained in Algorithm 1.
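A minimal sketch of the distributed-averaging reduce step (not the paper's Algorithm 1): assume each worker returns its trained parameters as a dict of NumPy arrays, all having started from the same template, and the reducer averages them element-wise:

```python
import numpy as np

def average_models(models):
    """Element-wise average of CNN kernel weights, biases, and ELM output weights.

    `models` is a list of dicts with identical keys and shapes, e.g.
    {"conv1_kernel": ..., "conv1_bias": ..., "elm_beta": ...}, one dict per data partition.
    """
    averaged = {}
    for name in models[0]:
        averaged[name] = sum(m[name] for m in models) / len(models)
    return averaged
```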

Figure 4: The concatenation ensemble aggregation of ACNNELM-2 to boost performance and handle concept drift at the ensemble level

Figure 5: ACNNELM-2 virtual drift and real drift solutions. (a) Virtual drift. (b) Real drift.

4. Experiment Method

We used popular image data sets: MNIST [40], Not-MNIST [41], and CIFAR10 [42] (see Table 1). MNIST is a balanced data set of numeric handwriting (10 classes) with 28 × 28 gray-scale pixels. We added Histogram of Oriented Gradients (HOG) attributes of size 9 × 9. To simulate big stream data, we developed an extended MNIST by duplicating regular MNIST with additional random gaussian, salt & pepper, and poisson noise, so that it has the same distribution as regular MNIST. The Not-MNIST data set uses 28 × 28 gray-scale images with many randomly assigned foolish images, and therefore has an unknown data distribution (see Fig. 6); we selected only 20 classes (0-9 and A-J). CIFAR10 consists of RGB 32 × 32 pixel images across 10 completely mutually exclusive classes. We extended it by duplicating with 4 additional types of image noise: gaussian, salt & pepper, speckle, and poisson.
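A minimal sketch of how such an extended data set can be built by duplicating images with the noise types listed above, assuming the images are float arrays in [0, 1]; this is illustrative only (the authors used MATLAB toolboxes for this step):

```python
import numpy as np

def noisy_copies(images, rng=np.random.default_rng(0)):
    """Return gaussian, salt&pepper, speckle, and poisson-noised copies of `images` in [0, 1]."""
    gauss = np.clip(images + rng.normal(0, 0.05, images.shape), 0, 1)
    salt_pepper = images.copy()
    mask = rng.random(images.shape)
    salt_pepper[mask < 0.025] = 0.0    # pepper
    salt_pepper[mask > 0.975] = 1.0    # salt
    speckle = np.clip(images * (1 + rng.normal(0, 0.1, images.shape)), 0, 1)
    poisson = np.clip(rng.poisson(images * 255) / 255.0, 0, 1)
    return gauss, salt_pepper, speckle, poisson
```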

Figure 6: Examples of Not-MNIST for the A and B classes

We used the DL toolbox, enhanced with the Matlab Parallel Computing Toolbox (CPU), for simple CNN architectures, and Matconvnet with GPU for complex CNN architectures. We used an NVIDIA GeForce GTX 950 with 768 GPU cores and 2 GB GPU memory. Because we employed asynchronous parallelism, simultaneous parallel computers were not necessary in the experiments. We measured performance using the testing accuracy and Cohen's Kappa. We used single precision for CNN and ELM running in Matlab.

Table 1: Data set and evaluation method

Data Set         | Evaluation Method                          | Training | Testing
Extended MNIST   | Holdout (5× trials on different computers) | 240,000  | 40,000
Not-MNIST        | Holdout (5× trials on different computers) | 720,000  | 180,000
CIFAR10          | Cross Validation 5-Fold, Holdout           | 50,000   | 10,000
Extended CIFAR10 | Holdout                                    | 250,000  | 10,000

For the CD experiments, we designed simulated scenarios on the extended MNIST and Not-MNIST data sets, shown in Table 2.

Table 2: Concept Drift Simulation

(a) Data set concepts and quantity

Data Set Concept | Inputs | Outputs       | Data
MNIST1           | 784    | 10 (0-9)      | 240,000
MNIST2           | 865    | 10 (0-9)      | 240,000
MNIST3           | 784    | 6 (0-5)       | 150,000
MNIST4           | 784    | 4 (6-9)       | 100,000
NotMNIST1        | 784    | 10 (0-9)      | 360,000
NotMNIST2        | 784    | 10 (A-J)      | 540,000
NotMNIST3        | 865    | 10 (0-9)      | 360,000
NotMNIST4        | 865    | 20 (A-J, 0-9) | 900,000

(b) CD scenario

Scenario      | Drift Event
Virtual Drift | MNIST1 →VD MNIST2; NotMNIST1 →VD NotMNIST3
Real Drift    | MNIST3 →RD MNIST3,4; NotMNIST1 →RD NotMNIST1,2
Hybrid Drift  | MNIST3 →HD MNIST2; NotMNIST1 →HD NotMNIST4

For the DA experiment, we partitioned the training data set and assigned each partition to a different CNNELM that started from the same initial weights. We compared no partition against 2 partitions and 4 or 5 partitions on extended MNIST, Not-MNIST, and CIFAR10. Our objective is to prove the effectiveness of DA CNNELM for various numbers of training partitions. In a supporting experiment, we focused on the effect of error back propagation, the iterations, and which parameters affect them, using a complex architecture in Matconvnet.

5. Performance Verification

We used the testing accuracy of full batch training of CNN and CNNELM as the benchmark (see Table 3). The performance of CNNELM can be improved by inserting the optimal tanh function between CNN and ELM (see Table 4).

Table 3: The accuracy of the full batch training version of CNN and CNNELM

Data Set  | Model and CNN parameters                           | Testing Accuracy %
MNIST     | CNN 6c5-reLU-2s-down-12c3-reLU-2s-down, e = 50     | 90.32
Not-MNIST | CNN 12c5-reLU-2s-down-18c3-reLU-2s-down, e = 50    | 78.18
CIFAR10   | CNN 6c5-reLU-2s-down-12c3-reLU-2s-down, e = 50     | 36.13
MNIST     | CNNELM 6c5-reLU-2s-down-12c3-reLU-2s-down, e = 0   | 94.16
Not-MNIST | CNNELM 12c5-reLU-2s-down-18c3-reLU-2s-down, e = 0  | 81.18
CIFAR10   | CNNELM 6c5-reLU-2s-down-12c3-reLU-2s-down, e = 0   | 40.25

1. Concept Drift

Our objective is that there is no accuracy decrease after the drift event; we compare against the full batch (offline) version.

1. Virtual drift (VD) handling.
(a) ACNNELM-1. We used a single Model 1 trained with the MNIST1 concept. We built a new Model 2 for the HOG attributes. The last layers of all CNNs are combined to modify the ELM layer.
(b) ACNNELM-2. We trained CNNELM Model 1 and Model 2 on the MNIST1 concept. For the MNIST2 concept, we built the new CNNELM Model 3 and Model 4 (6c3-reLU-1s-down-12c3-1s, 300 hidden nodes to the ELM) using the additional HOG attributes. After training completed, we concatenated them all into the ACNNELM-2 ensemble (Table 6). The CNNELMs have no dependencies on one another (no shared parameters), whereas ACNNELM-1 shares the same ELM parameters.

2. Real drift (RD) handling.
(a) ACNNELM-1. First, we used a single Model 1 trained with the MNIST3 concept. For the same model, we continued with the next MNIST4 and MNIST3 concepts without building any new CNNs. We tested against the complete classes of the testing data set.
(b) ACNNELM-2. First, we built ACNNELM-2 Model 5 and Model 6 on the MNIST3 concept with only 6 classes (1 to 6). Following the MNIST4 concept, we built the new CNNELM Model 7 and Model 8 using 10 classes (continuing from 7 to 10). We concatenated all models into one concatenation ensemble (Table 9). In the matrix concatenation, we adjusted the β of Model 5 and Model 6 (which have only 6 columns) by padding with a zero block matrix to 10 columns, so that the matrix dimensions match; we could then concatenate with Model 7 and Model 8. We tested against the complete classes of the testing data set. ACNNELM-2 has better performance than ACNNELM-1.

3. Hybrid drift (HD) handling.
(a) ACNNELM-1. First, we used a single model trained with the MNIST3 concept. For the same model, we continued training with the next MNIST4 concept while at the same time building one CNN model for the additional attributes. We tested against the complete classes of the testing data set.
(b) ACNNELM-2. We simply combined the VD and HD models into one concatenation ensemble.

For both the ACNNELM-1 and ACNNELM-2 models, the performance improves after the HD event, and ACNNELM-2 performs better than ACNNELM-1.

Table 4: The effectiveness of the nonlinear activation function for H in performance improvement. 5× trials on extended MNIST with one 6c5-reLU-2s-down ACNNELM-1 model.

Model     | Function | Testing Accuracy % | Cohen Kappa %
ACNNELM-1 | No       | 91.32±0.52         | 90.36 (0.16)
ACNNELM-1 | Sigmoid  | 78.58±1.27         | 75.29 (1.61)
ACNNELM-1 | Softmax  | 90.38±0.97         | 89.25 (0.84)
ACNNELM-1 | tanh     | 91.46±0.37         | 90.52 (0.43)

Table 5: VD handling. ACNNELM-1 used Model 1 6c5-reLU-2s-down and Model 2 HOG 6c3-reLU-1s-down for MNIST, and Model 1 12c5-reLU-2s-down-18c3-reLU-2s-down and Model 2 HOG 12c3-reLU-1s-down for Not-MNIST numeric. 5× trials.

Model   | Concept    | Testing Accuracy % | Cohen Kappa %
Model 1 | MNIST1     | 91.46±0.37         | 90.14 (0.16)
Model 2 | MNIST2     | 94.54±0.20         | 93.10 (0.26)
Model 1 | Not-MNIST1 | 81.17±0.46         | 80.14 (0.54)
Model 2 | Not-MNIST3 | 82.95±0.42         | 81.06 (0.16)

Table 6: VD handling in ACNNELM-2 for extended MNIST. 5× trials.

Model         | Concept | Testing Accuracy % | Cohen Kappa %
Model 1       | MNIST1  | 93.19±0.24         | 92.58 (0.26)
Model 2       | MNIST1  | 91.65±0.56         | 90.45 (0.22)
Model 1|2     | MNIST1  | 93.77±0.28         | 92.78 (0.25)
Model 1|2|3   | MNIST2  | 95.29±0.17         | 93.55 (0.13)
Model 1|2|3|4 | MNIST2  | 95.57±0.73         | 94.01 (0.14)

Table 7: VD handling in ACNNELM-2 for Not-MNIST numeric. Model 1 and 2: 12c5-reLU-2s-down-18c3-reLU-2s-down; Model 3: 12c3-reLU-1s-down. 5× trials.

Model       | Concept    | Testing Accuracy % | Cohen Kappa %
Model 1     | Not-MNIST1 | 81.17±0.46         | 80.14 (0.54)
Model 1|2   | Not-MNIST1 | 83.47±0.27         | 81.28 (0.29)
Model 1|2|3 | Not-MNIST3 | 86.34±0.59         | 84.38 (0.27)

Table 8: RD handling. ACNNELM-1 used Model 1 6c5-reLU-2s-down for extended MNIST and Model 1 12c5-reLU-2s-down-18c3-reLU-2s-down for Not-MNIST. 5× trials.

Model   | Concept      | Testing Accuracy % | Cohen Kappa %
Model 1 | MNIST3       | 58.46±0.53         | 53.27 (0.17)
Model 1 | MNIST3,4     | 92.45±0.63         | 91.12 (0.25)
Model 1 | Not-MNIST1   | 34.46±1.23         | 30.42 (1.17)
Model 1 | Not-MNIST1,2 | 79.12±0.58         | 77.22 (0.45)

Table 9: RD handling. ACNNELM-2 used the 6c5-reLU-2s-down-12c3-reLU-2s-down model for extended MNIST. 5× trials.

Model         | Concept  | Testing Accuracy % | Cohen Kappa %
Model 5       | MNIST3   | 58.57±0.12         | 53.96 (0.27)
Model 6       | MNIST3   | 58.16±0.28         | 54.00 (0.27)
Model 5|6|7   | MNIST3,4 | 93.53±0.19         | 92.26 (0.25)
Model 5|6|7|8 | MNIST3,4 | 93.61±0.52         | 92.55 (0.23)

Table 10: RD handling. ACNNELM-2 used the 12c5-reLU-2s-down-18c3-reLU-2s-down model for Not-MNIST. 5× trials.

Model     | Concept      | Testing Accuracy % | Cohen Kappa %
Model 1   | Not-MNIST1   | 34.46±1.23         | 30.42 (1.17)
Model 1|7 | Not-MNIST1,2 | 81.99±0.47         | 81.01 (0.10)

Table 11: HD handling. ACNNELM-1 used Model 1 6c5-reLU-2s-down and Model 2 HOG 6c3-reLU-1s-down for MNIST, and Model 1 12c5-reLU-2s-down-18c3-reLU-2s-down and Model 2 HOG 12c3-reLU-1s-down for Not-MNIST. 5× trials.

Model   | Concept    | Testing Accuracy % | Cohen Kappa %
Model 1 | MNIST3     | 58.46±0.53         | 53.27 (0.17)
Model 2 | MNIST2     | 93.42±0.32         | 91.10 (0.16)
Model 1 | Not-MNIST1 | 34.46±1.23         | 30.42 (1.17)
Model 2 | Not-MNIST4 | 84.29±0.42         | 82.10 (0.36)

Table 12: HD handling. ACNNELM-2 used Model 5, 6, 7, 8: 6c5-reLU-2s-down-12c3-reLU-2s-down, and Model 3 and 4: 6c3-reLU-1s-down for extended MNIST. 5× trials.

Model             | Concept | Testing Accuracy % | Cohen Kappa %
Model 5           | MNIST3  | 58.57±0.12         | 53.96 (0.27)
Model 5|6|7|3     | MNIST2  | 95.30±0.26         | 94.78 (0.24)
Model 5|6|7|8|3|4 | MNIST2  | 95.94±0.17         | 95.49 (0.22)

Table 13: HD handling. ACNNELM-2 used Model 1: 12c5-reLU-2s-down-18c3-reLU-2s-down and Model 3: 12c3-reLU-1s-down for Not-MNIST. 5× trials.

Model     | Concept    | Testing Accuracy % | Cohen Kappa %
Model 1   | Not-MNIST1 | 34.46±1.23         | 30.42 (1.17)
Model 1|3 | Not-MNIST4 | 85.14±0.22         | 83.35 (0.29)

2. Distributed Averaging (DA)

Our objective for DA CNNELM is to reduce the training waiting time. To verify DA, we partitioned the training data and assigned each partition to a particular CNNELM, all starting with the same initial weights. We expect the DA CNNELM performance to be near the performance of a single CNNELM trained on the whole training data (CNNELM 1). We divided the Not-MNIST training data into 2 sets and 5 sets and compared the testing accuracy of CNNELM 1 with the accuracy of the averaged CNNELM 2-set model and the averaged CNNELM 5-set model (see Table 14). Our initial conjecture was that the performance of the averaged CNNELM would be around the average of the individual models; however, the experiment showed a different result: the performance of the DA CNNELM exceeds the average of the individual models, even though it is still lower than CNNELM 1.

The extended MNIST gave a different result (see Table 15). The extended MNIST was built from the same data distribution as regular MNIST with additional noise variants, and the performance of the averaged CNNELM model is very close to the average of the individual models. The goal of DA CNNELM remains reducing the training waiting time, because we can spread the computation of each CNNELM asynchronously in parallel.

We also examined DA CNNELM in the CIFAR10 experiments. We plotted the norm of the ELM output weights β for each class (see Fig. 7a) and compared the charts for the original, noised, and DA CNNELM versions. We observed some chart similarities between the DA CNNELM, original, and noised versions.

Table 14: Testing accuracy for 3c5-reLU-2s-down-9c3-reLU-2s-down, iteration e = 5, α = 5/e, batch = 75,000 on Not-MNIST. 5× trials.

Model            | Testing Accuracy % | Training Time (s)
CNNELM 1         | 73.72±1.32         | 1680.24±7.27
CNNELM 1/2       | 41.45±1.25         | 839.51±2.40
CNNELM 2/2       | 41.19±0.73         | 839.50±3.73
CNNELM Average 2 | 66.85±2.43         | N/A
CNNELM 1/5       | 20.56±0.24         | 336.83±0.63
CNNELM 2/5       | 20.09±0.96         | 336.02±0.79
CNNELM 3/5       | 21.22±0.86         | 336.65±0.24
CNNELM 4/5       | 31.71±0.52         | 336.15±0.45
CNNELM 5/5       | 31.70±0.52         | 336.76±0.88
CNNELM Average 5 | 59.59±0.24         | N/A

Table 15: Testing accuracy for 6c5-reLU-2s-down-12c3-reLU-2s-down at iteration e = 5, α = 1/e, batch = 60,000 on MNIST. 5× trials.

Model            | Testing Accuracy % | Training Time (s)
CNNELM 1         | 92.41±0.36         | 3535.68±20.55
CNNELM 1/2       | 92.40±0.25         | 1762.28±3.45
CNNELM 2/2       | 92.35±0.34         | 1765.26±4.20
CNNELM Average 2 | 92.49±0.35         | N/A
CNNELM 1/4       | 92.26±0.13         | 882.76±1.88
CNNELM 2/4       | 92.37±0.56         | 881.41±1.24
CNNELM 3/4       | 92.20±0.31         | 881.90±1.77
CNNELM 4/4       | 92.28±0.17         | 881.45±1.39
CNNELM Average 4 | 92.40±0.26         | N/A

Figure 7: Comparison charts of the norm of β for each class on CIFAR10. (a) Noise versions vs. DA vs. original. (b) Split versions vs. DA vs. original.

Figure 8: Testing accuracy for the complex CNNELM in Matconvnet. (a) Extended MNIST, 20c5-Bnorm-reLU-2s-max-50c5-reLU-2s-max-ELM-softmaxloss, batch = 10000, weight decay = 10^-5, learning rate α = 0.5/e. Accuracy 95.75%. (b) Not-MNIST, 20c5-Bnorm-reLU-2s-max-50c5-reLU-2s-max-ELM-softmaxloss, batch = 10000, weight decay = 10^-5, learning rate α = 0.5/e. Accuracy 81.88%.

Figure 9: Testing accuracy for CNNELM on CIFAR10. (a) CNNELM 64c5-2s-avg-reLU-64c5-reLU-3s-avg-96c3-reLU-2s-avg-96c3-reLU-128c1-Bnorm-zscore-tanh-ELM-softmax, batch = {500, 25000}, weight decay = 10^-5; the smaller batch size has better accuracy. (b) The same architecture with batch = 500; smaller and dynamic weight-update learning parameters have better accuracy.

The performance of CNNELM can be improved by using the back propagation algorithm. However, the improvement depends on appropriate learning parameters, i.e., the batch size and the number of iterations; wrong parameters could trap the training in local minima. To avoid that, we can use a dynamic learning rate. We studied the effect of the weight-update learning parameters (batch size, learning rate, and weight decay); see Fig. 8 and 9.
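A tiny sketch of the dynamic learning-rate schedule appearing in the figure captions above (α = α₀/e, where e is the epoch index), as one simple way to keep a fixed rate from trapping SGD in local minima; illustrative only:

```python
def dynamic_learning_rate(alpha0, epoch):
    """Decay the SGD learning rate with the epoch index, e.g. alpha = 0.5/e as in Fig. 8."""
    return alpha0 / max(epoch, 1)

# Example: alpha0 = 0.5 gives 0.5, 0.25, 0.166..., ... over epochs 1, 2, 3, ...
```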

6. Conclusions and Future Works

We have verified that our methods address two main issues in big stream data: concept drift and computation scalability. Using the common CNNELM integration architecture, we employ multiple CNNELM models working in parallel.

For CD, our ACNNELM-1 and ACNNELM-2 retain the performance accuracy after the drift events (VD, RD, and HD). ACNNELM-1 uses multiple CNN layers and a single ELM classifier based on the AOS-ELM enhancement with E2LM. We verified that the CNN structure expansion of ACNNELM-1, combining various CNN structure expansion models integrated into a single ELM, contributes to performance improvement and concept drift handling. ACNNELM-2 works by ELM matrix concatenation ensembles, using multiple CNNELM models trained on the same data set but starting from different learning parameters. ACNNELM-2 has better accuracy and flexibility than ACNNELM-1 because ACNNELM-2 has no shared parameters between its models.

For CS, our DA CNNELM uses multiple CNNELM models trained on different partitions of the training data but starting from the same learning parameters. After all training is complete, we aggregate all models by averaging the CNN kernel weights and the ELM output weights into an averaged CNNELM model. Thus, it can save a lot of training waiting time compared with a single model trained on the whole data set. The averaged CNNELM has better accuracy than the average accuracy of the individual CNNELM models. The result, however, depends on the distribution of the training data: if each partition has the same balanced class and data distribution, the averaged CNNELM model may have the same accuracy as the single model trained on the whole data set. CNNELM benefits from stochastic gradient descent error back propagation to improve the performance, but the learning parameters need to be selected carefully to avoid the local minima problem.

We plan future work using ACNNELM within a meta-cognitive framework [43], so that it has a self-regulatory learning mechanism, and we will study DA CNNELM for active learning, where it is allowed to choose the data from which it learns [44].

References

[1] D. Laney, 3D data management: Controlling data volume, velocity, and variety, Tech. rep., META Group (February 2001).
[2] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Comput. Surv. 46 (4) (2014) 44:1–44:37.
[3] G.-B. Huang, D. Wang, Y. Lan, Extreme learning machines: a survey, International Journal of Machine Learning and Cybernetics 2 (2) (2011) 107–122.
[4] G. Huang, G.-B. Huang, S. Song, K. You, Trends in extreme learning machines: A review, Neural Networks 61 (2015) 32–48.
[5] Y. Bengio, Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127.
[6] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Intelligent Signal Processing, IEEE Press, 2001, pp. 306–351.
[7] J. Dean, S. Ghemawat, MapReduce: Simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113.

[8] C. Tsang, K. Tsoi, J. H. Yeung, B. S. Kwan, A. P. Chan, C. C. Cheung, P. H. Leong, MapReduce as a programming model for custom computing machines, in: Field-Programmable Custom Computing Machines, Annual IEEE Symposium on, 2008, pp. 149–159.
[9] A. Budiman, M. I. Fanany, C. Basaruddin, Constructive, robust and adaptive OS-ELM in human action recognition, in: Industrial Automation, Information and Communications Technology (IAICT), 2014 International Conference on, 2014, pp. 39–45.
[10] A. Budiman, M. I. Fanany, C. Basaruddin, Adaptive online sequential ELM for concept drift tackling, Hindawi, 2016.
[11] S. J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 2nd Edition, Pearson Education, 2003.
[12] L. Guo, S. Ding, A hybrid deep learning CNN-ELM model and its application in handwritten numeral recognition, 2015, pp. 2673–2680.
[13] S. Pang, X. Yang, Deep convolutional extreme learning machine and its application in handwritten digit classification, Hindawi, 2016.
[14] G.-B. Huang, Z. Bai, L. L. C. Kasun, C. M. Vong, Local receptive fields based extreme learning machine, IEEE Computational Intelligence Magazine 10 (accepted).
[15] J. Xin, Z. Wang, L. Qu, G. Wang, Elastic extreme learning machine for big data classification, Neurocomputing 149, Part A (2015) 464–471.
[16] Q. He, T. Shang, F. Zhuang, Z. Shi, Parallel extreme learning machine for regression based on MapReduce, Neurocomputing 102 (2013) 52–58.
[17] M. Zinkevich, M. Weimer, L. Li, A. J. Smola, Parallelized stochastic gradient descent, in: Advances in Neural Information Processing Systems 23, Curran Associates, Inc., 2010, pp. 2595–2603.
[18] N.-Y. Liang, G.-B. Huang, P. Saratchandran, N. Sundararajan, A fast and accurate online sequential learning algorithm for feedforward networks, Neural Networks, IEEE Transactions on 17 (6) (2006) 1411–1423.
[19] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, Springer International Publishing, Cham, 2014, pp. 818–833.
[20] N. Martinel, C. Micheloni, G. L. Foresti, The evolution of neural learning systems: A novel architecture combining the strengths of NTs, CNNs, and ELMs, IEEE Systems, Man, and Cybernetics Magazine 1 (3) (2015) 17–26.
[21] T. Hoens, R. Polikar, N. Chawla, Learning from streaming data with concept drift and imbalance: an overview, Progress in Artificial Intelligence 1 (1) (2012) 89–101.

[22] R. Elwell, R. Polikar, Incremental learning of concept drift in nonstationary environments, Neural Networks, IEEE Transactions on 22 (10) (2011) 1517–1531.
[23] L. Kuncheva, Classifier ensembles for changing environments, in: Multiple Classifier Systems, Vol. 3077 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2004, pp. 1–15.
[24] I. Zliobaite, Learning under concept drift: an overview, Computing Research Repository abs/1010.4.
[25] A. Tsymbal, M. Pechenizkiy, P. Cunningham, S. Puuronen, Dynamic integration of classifiers for handling concept drift, Inf. Fusion 9 (1) (2008) 56–68.
[26] B. Mirza, Z. Lin, Meta-cognitive online sequential extreme learning machine for imbalanced and concept-drifting data classification, Neural Networks 80 (2016) 79–94.
[27] M. Grachten, C. E. Cancino Chacón, Strategies for conceptual change in convolutional neural networks, Tech. Rep. OFAI-TR-2015-04, Austrian Research Institute for Artificial Intelligence (October 2015).
[28] N. Liu, H. Wang, Ensemble based extreme learning machine, IEEE Signal Processing Letters 17 (8) (2010) 754–757.
[29] Z. Yu, L. Li, J. Liu, G. Han, Hybrid adaptive classifier ensemble, IEEE Transactions on Cybernetics 45 (2) (2015) 177–190.
[30] A. van Schaik, J. Tapson, Online and adaptive pseudoinverse solutions for ELM weights, Neurocomputing 149, Part A (2015) 233–238.
[31] Y. Zhang, D. Zhao, J. Sun, G. Zou, W. Li, Adaptive convolutional neural network and its application in face recognition, Neural Processing Letters 43 (2) (2016) 389–399.
[32] B. Barney, URL https://computing.llnl.gov/tutorials/parallel_comp/
[33] S. K. Foo, P. Saratchandran, N. Sundararajan, Parallel implementation of backpropagation neural networks on a heterogeneous array of transputers, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 27 (1) (1997) 118–126.
[34] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
[35] B. T. Polyak, A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM Journal on Control and Optimization 30 (4) (1992) 838–855.

[36] D. Scherer, H. Schulz, S. Behnke, Accelerating large-scale convolutional neural networks with parallel graphics multiprocessors, 2010, pp. 82–91.
[37] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012.
[38] Q. Wang, J. Zhao, D. Gong, Y. Shen, M. Li, Y. Lei, Parallelizing convolutional neural networks for action event recognition in surveillance videos, International Journal of Parallel Programming (2016) 1–26.
[39] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd Edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002.
[40] Y. LeCun, C. Cortes, MNIST handwritten digit database [online] (2010).
[41] Y. Bulatov, notMNIST dataset (September 2011). URL http://yaroslavvb.blogspot.co.id/2011/09/notmnist-dataset.html
[42] A. Krizhevsky, Learning multiple layers of features from tiny images, Master's thesis (2009). URL http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
[43] R. Savitha, S. Suresh, H. J. Kim, A meta-cognitive learning algorithm for an extreme learning machine classifier, Cognitive Computation 6 (2) (2014) 253–263.
[44] B. Settles, Active learning literature survey, Computer Sciences Technical Report 1648, University of Wisconsin–Madison (2009).