
A Deep Convolutional Neural Network Based Classification of Multi-Class Motor Imagery with Improved Generalization

Aupendu Kar#1, Sutanu Bera#2, S. P. K. Karri3, Sudipta Ghosh4, Manjunatha Mahadevappa5, Debdoot Sheet6

Abstract: Motor imagery (MI) based brain-computer interfaces (BCI) play a crucial role in scenarios ranging from post-traumatic rehabilitation to the control of prosthetics. Computer-aided interpretation of MI has supported these scenarios for decades but has failed to address interpersonal variability. Such variability escalates further in multi-class MI, which is now common practice. The failures due to interpersonal variability can be attributed to handcrafted features, which fail to generalize across subjects. The proposed approach employs a convolutional neural network (CNN) based model that performs both filtering (through axis shuffling) and feature extraction, allowing end-to-end training. Axis shuffling is adopted in the initial blocks of the model for 1D preprocessing and to reduce the number of parameters required. This practice avoids overfitting and results in a better generalized model. The publicly available BCI Competition-IV 2a dataset is used to evaluate the proposed model. The proposed model demonstrates the capability to identify subject-specific frequency bands, with an average and highest accuracy of 70.5% and 83.6% respectively. The proposed CNN model can classify in real time without relying on an accelerated computing device such as a GPU.

I. INTRODUCTION

A brain-computer interface is an integration of hardware and software that enables a subject to control their surroundings through signals originating from the brain. One of the major challenges in BCI is translating a brain signal into a control signal to govern an external device. This is needed for subjects diagnosed with neuromuscular conditions such as amyotrophic lateral sclerosis (ALS), brain stroke, spinal cord injury, or voluntary muscle paralysis, whose brain function remains unaffected. BCI equips these patients to control external devices such as computers, neural prostheses, speech synthesizers, and exoskeletons using their brain.

# These authors contributed equally to this work.
1 Aupendu Kar is an M.Tech student of the Department of Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India.
2 Sutanu Bera is an M.Tech student of School of Medical Science and Technology, Indian Institute of Technology Kharagpur, Kharagpur, India.
3 SPK Karri is a Post-doc of Department of Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India.
4 Sudipta Ghosh is a PhD scholar of School of Medical Science and Technology, Indian Institute of Technology Kharagpur, India.
5 Manjunatha Mahadevappa is Professor of School of Medical Science and Technology, Indian Institute of Technology Kharagpur, India.
6 Debdoot Sheet is Assistant Professor of Department of Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India.

978-1-5386-3646-6/18/$31.00 ©2018 IEEE

Motor imagery is a set of signals generated when the brain intends (imagines) to perform a physical activity without actually executing it. Motor imagery is decoded from changes in brain wave rhythms in non-invasive electroencephalography (EEG) recordings. Researchers [1] have found that imagining right-hand or left-hand movement results in attenuation of signal power in the α band (8-12 Hz) over the contralateral area of the brain and a power increase in the β band (12-30 Hz) over the ipsilateral area of the brain. The attenuation of the α band is known as event-related desynchronization (ERD), and the increase in the β band is known as event-related synchronization (ERS). The ERD signal can be interpreted as an electrophysiological correlate of activated cortical areas involved in the production of motor behavior; it appears just before the actual movement takes place or, in the case of motor imagery, just before the subject starts imagining the movement. However, the operational frequency bands for ERD and ERS are not uniform across subjects.

To address the problem of subject-specific frequency bands, researchers have proposed several approaches. Novi et al. [2] proposed an algorithm known as sub-band common spatial pattern (SBCSP), which decomposes the wideband EEG into several sub-bands and extracts common spatial pattern features from each. Another advanced algorithm, proposed by Ang et al. [3], uses the same concept of frequency filtering followed by spatial filtering and is termed filter bank common spatial pattern (FBCSP). Commonly used classifiers in BCI are linear discriminant analysis (LDA) and support vector machines (SVM). However, brain signals are more complex than man-made signals, which demands advanced machine learning algorithms for accurate decoding. Recent solutions for decoding motor imagery [4] therefore include a deep belief network based algorithm in which multiple single-channel weak classifiers are constructed and combined with the AdaBoost approach. Another approach [5] extracts frequency information from the raw EEG signal using a 1D CNN and classifies motor imagery through a stacked autoencoder.

In this paper, we propose a new algorithm for multi-class motor imagery classification using a deep convolutional neural network. The proposed network architecture is a hybridization of temporal convolution and 2D convolution to solve the subject-specific frequency band problem. We use motor imagery data from the BCI Competition-IV 2a dataset and classify all four motor imagery classes. Section II presents the CNN based proposed

network architecture. The training method of the deep CNN architecture is described in Section III. Section IV shows the results and comparison of the proposed method. Finally, we conclude the work along with a summary of its impact in Section V.

II. METHODOLOGY

The outline of the proposed architecture for four-class motor imagery classification is illustrated in Figure 1. Temporal convolution is used in the first layer of the architecture to obtain time-series features of the EEG data. After that, 2D convolution is used to extract features along a channel and across channels simultaneously. EEG signals with 22 channels are taken as input (536 × 22). In the first layer, 22 filters of size 25 × 1 are applied to each channel of the EEG signal, and the resulting output is of size 512 × 22 × 22. This operation is functionally equivalent to preprocessing each individual channel with a window size of 25. No padding is used, so the temporal length is reduced from 536 to 512 (536 − 25 + 1 = 512). The outcome is then shuffled to 22 × 22 × 512 and convolved with 32 spatial filters of size 22 × 22. This allows features to be extracted across channels at each time point. The responses are shuffled back to match the axis order of the input, so the resultant size is 512 × 32 × 1. These are the first two layers of the architecture, as shown in Figure 1. After that, conventional convolutional layers are used for extracting features and making decisions.

Fig. 1. Proposed deep convolutional neural architecture. The spatial resolution of the feature maps is indicated in the boxes. The underlying layer symbols are pointed to the left.

Fig. 2. Timing diagram.

Fig. 3. Electrode position.

The exponential linear unit (ELU) [6] is applied after each convolution block to introduce non-linearity in this architecture. The ELU is defined as:

$$f(x) = \max(0, x) + \min\left(0,\; k\,(e^{x} - 1)\right) \tag{1}$$

k is a hyper-parameter with k ≥ 0. In this architecture, k = 1. Batch normalization [7] is used after spatial convolution to compensate for covariate shift. The core formula of batch normalization is:

$$x^{*} = \frac{x - E[x]}{\sqrt{\operatorname{var}(x) + \epsilon}} \times \gamma + \beta \tag{2}$$

where x∗ is the new value of a single component, E[x] is its mean within a batch, and var(x) is its variance within a batch. ε is added for numerical stability, and γ and β are learnable parameters. Average pooling layers after the spatial convolutions are used to reduce variance and extract low-level features from the neighborhood. Two fully-connected blocks are used at the end of the architecture for classification. Dropout [8] of 20% is applied to the fully-connected layer as a regularizer, which leads to lower generalization error. Dropout is active only during the training phase, and batch normalization uses per-batch statistics only during training; at validation and test time dropout is disabled and batch normalization applies its accumulated running statistics. Both techniques are used to keep the network from overfitting.
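The shape bookkeeping of the first two layers and equations (1) and (2) can be checked with a short sketch in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names (`valid_conv_length`, `front_end_shapes`, `elu`, `batch_norm`) are hypothetical, the batch normalization uses the biased batch variance, and the default `eps=1e-5` is an assumed value that the paper does not state.

```python
import math


def valid_conv_length(n_samples, kernel_size):
    """Output length of a stride-1 convolution with no padding."""
    return n_samples - kernel_size + 1


def front_end_shapes(n_samples=536, n_channels=22, n_temporal=22,
                     kernel_size=25, n_spatial=32):
    """Feature-map sizes around the temporal and spatial convolutions."""
    t = valid_conv_length(n_samples, kernel_size)
    after_temporal = (t, n_channels, n_temporal)   # 512 x 22 x 22
    shuffled = (n_channels, n_temporal, t)         # 22 x 22 x 512
    # Each 22 x 22 spatial filter collapses the first two axes to 1 x 1,
    # leaving one response per time point and per spatial filter.
    after_spatial = (t, n_spatial, 1)              # 512 x 32 x 1
    return after_temporal, shuffled, after_spatial


def elu(x, k=1.0):
    """Equation (1): identity for x >= 0, k * (exp(x) - 1) otherwise."""
    return max(0.0, x) + min(0.0, k * (math.exp(x) - 1.0))


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Equation (2) applied to one component across a batch of values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) * gamma + beta for v in values]


if __name__ == "__main__":
    print(front_end_shapes())
    print(elu(2.0), elu(-1.0))
    print(batch_norm([1.0, 2.0, 3.0]))
```

With the default arguments, `front_end_shapes()` reproduces the sizes quoted in the text: 512 × 22 × 22 after the temporal convolution, 22 × 22 × 512 after shuffling, and 512 × 32 × 1 after the spatial convolution.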


A. Dataset

We used the publicly available BCI Competition-IV 2008 dataset IIa in this study. This dataset is available at http://www.bbci.de/competition/iv/. It is a cue-based four-class motor imagery dataset. The experimental paradigm consists of four motor imagery tasks: imagination of movement of the left hand (class 1), right hand (class 2), both feet (class 3), and tongue (class 4). The electrode positions used in the experiment and the timing diagram are given in Fig. 3 and Fig. 2 respectively. The dataset consists of 9 subjects, and each subject contributes around 200-280 trials. All experiments use 80% of the trials for training and 20% for testing for each subject, and 5-fold cross-validation is used to validate our results. We extracted an approximately 2 s long signal starting 0.5 s after the cue, i.e., from 2.5 s to 4.5 s after the start of each trial, and used it for training and evaluation of our model.

B. Loss functions

The CNN architecture is trained by optimizing the cross-entropy loss, which provides a probabilistic measure of similarity between the actual label and the predicted label of the network. The average cross-entropy loss penalizes the deviation of each class's estimated probability from its label. Classification accuracy is used to evaluate the performance of the network.

C. Optimisation

During training, the Adam [9] optimizer is used with an ε of 1.0, β1 of 0.9 and β2 of 0.999. Optimization is performed in mini-batches of size 10 for a total of 100 iterations. Training a deep CNN model requires a large number of samples, but this dataset offers very few trials per subject. To prevent overfitting due to the small number of samples, L2 regularization is applied through the weight decay of the Adam optimizer. We optimize the overall cost function given by the combination of the cross-entropy loss and an additional weight decay term for regularization:

$$\ell_{overall} = \ell_{cross\text{-}entropy} + \lambda \lVert W^{(\cdot)} \rVert_{2} \tag{3}$$

λ is the weight decay factor. A different weight decay factor is used for each subject for better generalization of the model. At the start of training, the learning rate is set to 10^-4. The learning rate is reduced by an order of magnitude if the validation loss does not decrease significantly over an interval of a few iterations. Random initialization of the weights leads to a non-performing CNN architecture. So, before training for a specific subject, the architecture is pre-trained using the other eight subjects. As a result, it can extract inter-subject common features from the EEG dataset, which gives a better initialization of the weights and leads to better results.

III. RESULTS

This section presents an experimental validation of our proposed technique and its performance in comparison to earlier methods. Table I and Figure 4 present a comparison between the BCI Competition IV winners and our results. Our approach has the highest mean accuracy (70.5%) among the compared methods. It also exhibits the minimum variation in performance over all 9 subjects, which indicates that our model has learned features that are prominent in every subject. This provides experimental evidence that CNN based features can be used for better analysis of EEG signals. In the notch plots, even though the central tendency is comparable with the winning solutions, the minimum variance of the proposed method indicates that its predictions are more consistent than the state of the art. The squeezed lower part of the notch plot shows that the lower bounds are consistent and that the proposed method has better lower bounds. In the BCI Competition-IV 2a dataset, subject 6 had the most noise-affected trials, which is why most algorithms fail to give good classification results for subject 6, whereas our model gives consistent accuracy for subject 6 as well. This suggests that the proposed CNN based approach can handle noise more effectively than the compared algorithms and can extract subject-specific relevant features even from noisy data. This is achieved by constraining the initial layers of the proposed architecture to perform preprocessing. Conventional practices involve untangling the preprocessing blocks from the pipeline and employing independently preprocessed signals.

TABLE I
CLASSIFICATION RESULT AND COMPARISON WITH WINNERS OF BCI COMPETITION-IV

Author         Mean    S1      S2      S3      S4      S5      S6      S7      S8      S9
Kai Keng       67.8%   76.0%   56.5%   81.3%   61.0%   55.0%   45.3%   82.8%   81.3%   70.8%
Liu Guangquan  63.6%   76.8%   50.5%   78.3%   58.0%   37.0%   40.8%   74.5%   79.8%   76.8%
Wei Song       48.3%   53.5%   38.5%   61.0%   49.8%   30.3%   35.5%   46.8%   61.8%   58.0%
Damien Coyle   47.8%   59.5%   43.8%   73.8%   48.3%   34.0%   30.3%   25.0%   59.5%   56.5%
Jin Wu         46.7%   55.8%   37.8%   54.3%   43.8%   29.5%   37.0%   50.5%   58.8%   52.8%
Proposed       70.5%   73.7%   64.0%   68.5%   62.5%   66.8%   68.0%   76.4%   71.1%   83.6%
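The statements about mean accuracy, variation and lower bounds can be checked directly from the per-subject accuracies reported in Table I. The sketch below uses only Python's standard library; the names `TABLE_I` and `summarize` are illustrative, and the sample standard deviation is one assumed measure of "variation" (the text itself refers to notch plots, which are not reproduced here).

```python
from statistics import mean, stdev

# Per-subject accuracies (%) for S1..S9, transcribed from Table I.
TABLE_I = {
    "Kai Keng":      [76.0, 56.5, 81.3, 61.0, 55.0, 45.3, 82.8, 81.3, 70.8],
    "Liu Guangquan": [76.8, 50.5, 78.3, 58.0, 37.0, 40.8, 74.5, 79.8, 76.8],
    "Wei Song":      [53.5, 38.5, 61.0, 49.8, 30.3, 35.5, 46.8, 61.8, 58.0],
    "Damien Coyle":  [59.5, 43.8, 73.8, 48.3, 34.0, 30.3, 25.0, 59.5, 56.5],
    "Jin Wu":        [55.8, 37.8, 54.3, 43.8, 29.5, 37.0, 50.5, 58.8, 52.8],
    "Proposed":      [73.7, 64.0, 68.5, 62.5, 66.8, 68.0, 76.4, 71.1, 83.6],
}


def summarize(table):
    """Mean, sample standard deviation and worst subject for each method."""
    return {
        name: {"mean": mean(acc), "stdev": stdev(acc), "min": min(acc)}
        for name, acc in table.items()
    }


if __name__ == "__main__":
    for name, s in summarize(TABLE_I).items():
        print(f"{name:14s} mean={s['mean']:5.1f} "
              f"stdev={s['stdev']:5.1f} min={s['min']:5.1f}")
```

Under these assumptions the proposed method has both the smallest standard deviation across subjects and the highest worst-case subject accuracy (62.5% on S4), which is consistent with the discussion of the notch plots.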
The results suggest that a model performs better when it carries out both filtering and feature extraction. This is because the preprocessing part removes not only the noise but also unwanted information, which enhances the performance of the feature extraction. The comparison between the proposed approach and recently proposed approaches for the same dataset is tabulated in Table II. The results indicate that our algorithm achieves one of the best classification accuracies using a new CNN based approach. Even though the proposed algorithm has shown better generalization, regularization during training contributes to boosting the generalization capability of the model. For this reason, the presented results need not be the upper bound of the proposed method. The upper bound of test accuracy can be raised further with the availability of more training data, which reflects the generalizability of the model.

Fig. 4. Comparison of multi-class motor imagery classification accuracy of different approaches on BCI Competition-IV 2a dataset.

TABLE II
COMPARISON WITH OTHER APPROACHES APPLIED FOR THE DATASET

Method                        Mean Accuracy (%)
Ang et al. [3]                67.7
Gouy-Pailler et al. [10]      62.5
Wang [11]                     68.5
Barachant et al. [12]         67.0
Wang et al. [13]              67.0
Kam et al. [14]               70.0
Asensio-Cubero et al. [15]    69.2
Asensio-Cubero et al. [16]    67.0
Proposed method               70.5
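Looking back at the training recipe from the Optimisation subsection (cross-entropy with a weight-decay term as in equation (3), and a learning rate cut by an order of magnitude when the validation loss plateaus), a minimal sketch in plain Python may make the moving parts concrete before the training cost is discussed. It is not the authors' code: the names `cross_entropy`, `overall_loss` and `PlateauSchedule` are hypothetical, the `patience` and `min_delta` values are assumptions because the paper only says "a few iterations" and "significantly", and the penalty is written as the weight norm of equation (3), whereas optimiser-level weight decay is usually tied to the squared norm.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def overall_loss(batch_logits, labels, weights, lam):
    """Average cross-entropy plus lam times the L2 norm of the weights."""
    ce = sum(cross_entropy(z, y) for z, y in zip(batch_logits, labels))
    ce /= len(labels)
    l2 = math.sqrt(sum(w * w for w in weights))
    return ce + lam * l2


class PlateauSchedule:
    """Divide the learning rate by 10 when validation loss stops improving."""

    def __init__(self, lr=1e-4, patience=5, min_delta=1e-3):
        self.lr = lr                # initial rate 1e-4, as stated in the paper
        self.patience = patience    # assumed number of non-improving checks
        self.min_delta = min_delta  # assumed threshold for "significant"
        self.best = math.inf
        self.stale = 0

    def step(self, val_loss):
        """Record one validation loss and return the current learning rate."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr /= 10.0
                self.stale = 0
        return self.lr


if __name__ == "__main__":
    # Four-class example with uniform logits: the loss equals log(4).
    print(overall_loss([[0.0, 0.0, 0.0, 0.0]], [1], [3.0, 4.0], lam=0.1))
    schedule = PlateauSchedule(patience=2)
    for loss in (1.0, 1.0, 1.0):
        print(schedule.step(loss))
```

In a PyTorch implementation the same roles would normally be played by the library's own loss, optimizer and scheduler objects; the plain-Python version is used here only so the arithmetic is visible.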

A. Training and Testing Complexity

Training and testing are performed on an Intel Core i5 machine with 24 GB of RAM. The code is implemented using PyTorch, a Python-based scientific computing package. The network is trained for 100 iterations, which takes approximately 20 minutes. The testing time on the CPU is approximately 0.1 seconds per motor imagery trial, and the model size is 34 MB. The model therefore has a small footprint, is fast to deploy, and can be used for real-time applications.

IV. CONCLUSIONS

In this paper, we have presented a method for multi-class motor imagery classification. The experiments indicate that it is a more generalized single model for this classification task. Deep CNN architectures have given good classification accuracy in other applications such as computer vision and image segmentation. In this paper, we have proposed a new method to use deep CNNs for EEG signal processing. Our experimental results show that the deep CNN model achieves better mean classification accuracy than the other machine learning approaches compared. Table I and Fig. 4 show that this method gives better accuracy with minimum variance, which supports the wider use of CNN based models in the field of EEG signal processing. In this architecture, the temporal filter is applied to extract

frequency information from the raw EEG signal, and spatial filtering is applied after that. This combination of spatio-spectral filtering is widely used in BCI for feature extraction, but with the help of a CNN we have extracted this type of feature more efficiently. A further important advantage of our model is that it does not require any pre-processing of the EEG data. Raw EEG data is fed directly to the network and classified automatically, which reduces the complexity of practical implementation. Currently, the performance of this method is limited by the small amount of data available. It can be improved with more training data and can then be used for practical purposes.

REFERENCES

[1] G. Pfurtscheller and F. L. Da Silva, “Event-related EEG/MEG synchronization and desynchronization: basic principles,” Clinical neurophysiology, vol. 110, no. 11, pp. 1842–1857, 1999.
[2] Q. Novi, C. Guan, T. H. Dat, and P. Xue, “Sub-band common spatial pattern (SBCSP) for brain-computer interface,” in Neural Engineering, 2007. CNE’07. 3rd International IEEE/EMBS Conference on. IEEE, 2007, pp. 204–207.
[3] K. K. Ang, Z. Y. Chin, C. Wang, C. Guan, and H. Zhang, “Filter bank common spatial pattern algorithm on BCI competition IV datasets 2a and 2b,” Frontiers in neuroscience, vol. 6, p. 39, 2012.
[4] X. An, D. Kuang, X. Guo, Y. Zhao, and L. He, “A deep learning method for classification of EEG data based on motor imagery,” in International Conference on Intelligent Computing. Springer, 2014, pp. 203–210.
[5] Y. R. Tabar and U. Halici, “A novel deep learning approach for classification of EEG motor imagery signals,” Journal of neural engineering, vol. 14, no. 1, p. 016003, 2016.
[6] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
[7] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, 2015, pp. 448–456.
[8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[9] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[10] C. Gouy-Pailler, M. Congedo, C. Brunner, C. Jutten, and G. Pfurtscheller, “Nonstationary brain source separation for multiclass motor imagery,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 2, pp. 469–478, 2010.
[11] H. Wang, “Multiclass filters by a weighted pairwise criterion for EEG single-trial classification,” IEEE Transactions on Biomedical Engineering, vol. 58, no. 5, pp. 1412–1420, 2011.
[12] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, “Multiclass brain–computer interface classification by Riemannian geometry,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 4, pp. 920–928, 2012.
[13] D. Wang, D. Miao, and G. Blohm, “Multi-class motor imagery EEG decoding for brain-computer interfaces,” Frontiers in neuroscience, vol. 6, p. 151, 2012.
[14] T.-E. Kam, H.-I. Suk, and S.-W. Lee, “Non-homogeneous spatial filter optimization for electroencephalogram (EEG)-based motor imagery classification,” Neurocomputing, vol. 108, pp. 58–68, 2013.
[15] J. Asensio-Cubero, J. Gan, and R. Palaniappan, “Multiresolution analysis over simple graphs for brain computer interfaces,” Journal of neural engineering, vol. 10, no. 4, p. 046014, 2013.
[16] J. Asensio-Cubero, J. Q. Gan, and R. Palaniappan, “Extracting optimal tempo-spatial features using local discriminant bases and common spatial patterns for brain computer interfacing,” Biomedical Signal Processing and Control, vol. 8, no. 6, pp. 772–778, 2013.
