Time Series Classification with the Shallow Learning Shepard Interpolation Neural Networks

Kaleb E. Smith (1) and Phillip Williams (2)

(1) Florida Institute of Technology, Melbourne, FL 32901, USA
(2) University of Ottawa, Ottawa, Canada

Abstract. Time series classification (TSC) has been an ongoing machine learning problem with countless proposed algorithms spanning a multitude of fields. Whole-series, interval, shapelet, dictionary-based, and model-based methods are all different approaches to solving TSC. There are also deep learning approaches that try to transfer the success demonstrated by neural network (NN) architectures in image classification to TSC. Deep learning typically requires vast amounts of training data and computational power to produce meaningful results. But what if there were a network inspired not by a biological brain, but by mathematics proven in theory? Better yet, what if that network were not as computationally expensive as deep learning networks, which have billions of parameters and need a surplus of training data? This desired network is exactly what the Shepard interpolation neural network (SINN) provides: a shallow learning approach that needs minimal training samples and is based on a statistical interpolation technique. These networks learn metric features which can be explained and understood mathematically. In this paper, we leverage the novel SINN architecture on a popular benchmark TSC data set, achieving state-of-the-art accuracy on several of its test sets while being competitive against the other established algorithms. We also demonstrate that even when training data is scarce, the SINN outperforms other deep learning algorithms.

1 INTRODUCTION

Time series data is one of the most practical kinds of information in today's world, allowing us to understand what is going on around us. Financial markets, ocean tides, astronomical events, communications, supply chain purchasing, hurricanes, and earthquakes are all different types of time series data. Understanding these events (and others) is a crucial step in making sense of a chaotic world observed over time. Time series classification (TSC) differs from ordinary data classification problems because of the additional chronological ordering of the data. This makes simple classification techniques, such as a plain nearest neighbors algorithm, a poor fit. A more sophisticated approach, such as machine learning (which tries to mathematically give computers the ability to see patterns in data that a human cannot), can instead be used with greater success. This field has enjoyed an extraordinary amount of popularity in computer science over the last decade due to its accomplishments. Different machine learning algorithms have been applied to TSC, each showing strengths on particular data sets and weaknesses on others. This is due to the complex nature of time series data, and to the fact that some algorithms are primed for a specific kind of complexity but fail on the simple cases. One family of algorithms that tries to perform well on all data is that of neural networks (NN), particularly in deep learning frameworks [6]. NNs are biologically inspired by the one thing computer scientists are striving to replicate: the brain. Deep learning, which stacks multiple layers of neurons in succession and trains them by back propagation, can simulate neurons firing in the brain and stimulating the correct neurons for remembering or learning. Deep learning struggles in two areas: first, the amount of data needed to train the algorithm to achieve successful results, and second, the amount of computational power needed to carry out that training. The latter has been eased by today's hardware trend of high performance computing and graphics processing units (GPUs), but this does not address the issue of minimal training data; it cannot be fixed if the data simply does not exist.

In this paper, a shallow learning method that compensates for the above issues is explored. This easily trained and computationally efficient Shepard Interpolation Neural Network (SINN) is tested on the University of California, Riverside (UCR) time series classification and clustering repository against several state-of-the-art algorithms [2, 15]. Our proposed algorithm establishes new benchmark accuracy on several of the data sets and is highly competitive against the elite TSC algorithms across the entire repository, with a mean error rate of 18.8%. In addition to accuracy, the SINN excels in overall simplicity compared to other TSC algorithms, in particular deep learning NNs, by achieving high accuracy on data sets that contain less training data than testing data. The remainder of this paper is organized as follows: first, a brief review of Shepard interpolation and its uses in machine learning; second, a mathematical look into the structure of the SINN; third, experimental results on the UCR data sets against competing algorithms; and finally, a conclusion looking at our contribution and future work.

2 RELATED WORKS

Shepard interpolation is a type of data interpolation which falls into the inverse distance weighting class [13]. This method takes a distance metric that describes how far away the original data points are from the functional data points being estimated. The interpolation function depends on the inverse weighting function

w_i(x) = \frac{1}{d(x, x_i)^p}   (1)

where d(x, x_i) is some distance function. Shepard interpolation is defined as

u(x) = \begin{cases} \dfrac{\sum_{i=1}^{N} w_i(x)\, u_i}{\sum_{i=1}^{N} w_i(x)} & \text{if } d(x, x_i) \neq 0 \text{ for all } i \\ u_i & \text{if } d(x, x_i) = 0 \end{cases}   (2)

The exponent p is typically set to two in Shepard's method, yielding something similar to the Euclidean distance. For statistical purposes, this algorithm is simple to implement and has very few parameters that need tweaking. The method not only works on data in N dimensions, it can also interpolate scattered data, and it came to fame because of its ability to compensate for data on any grid. It is a versatile and fast interpolation method as well. It falters, however, when the data set is large, since it is O(n) (where n is the number of samples), and it is a global method that overvalues outliers when fitting new points. A major drawback of classic Shepard interpolation is that all known data points share the same distance function, which is symmetric across all dimensions. For real world data this is certainly not appropriate; for example, imagine binary classification of images. The images can be represented as vectors of pixel values, with an associated value of 1 or 0 for true or false in the classification task. Changing a pixel in the corner and changing a pixel in the center of the image have different impacts, and the same change of pixel value at the same position might not represent an equivalent change when applied to two vastly different images.

Because it works so well on grids, Shepard interpolation has mainly been applied in machine learning for image tasks. For instance, Shepard interpolation was used in a convolutional neural network [11]; the proposed method adds a Shepard interpolation layer as an augmentation on architectures used for low-level image processing tasks. Park and Sandberg describe a radial basis function network that functions in much the same manner as the SINN, in that both are based on exact interpolation and use a single hidden layer [10]. The Shepard interpolation neural network architecture departs from radial basis function networks in implementation as well as in a few characteristics of the activation functions used. For example, radial basis function networks require that the activation function be radially symmetric, while Shepard interpolation neural networks place no such constraint on the metric and Shepard activation functions. Our proposed algorithm's architecture has only very recently been developed; as such, there are no other published attempts at improving this shallow learning approach, nor any attempts at using it for TSC. The architecture is reviewed in more detail in Section 3.
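To make the mechanics of equations (1) and (2) concrete, the following is a minimal Python/NumPy sketch of classic Shepard interpolation; the function name, the query/known-point layout, and the choice of Euclidean distance are illustrative assumptions rather than part of the original method description.

import numpy as np

def shepard_interpolate(x, known_points, known_values, p=2):
    # x            -- query point, shape (d,)
    # known_points -- N known points, shape (N, d)
    # known_values -- N known values u_i, shape (N,)
    # p            -- exponent of the inverse distance weights (eq. 1)

    # d(x, x_i): Euclidean distance from the query to every known point
    dists = np.linalg.norm(known_points - x, axis=1)

    # Second branch of eq. (2): if the query coincides with a known point,
    # return that point's value directly.
    exact = np.where(dists == 0)[0]
    if exact.size > 0:
        return known_values[exact[0]]

    # Eq. (1): w_i(x) = 1 / d(x, x_i)^p
    weights = 1.0 / dists**p

    # First branch of eq. (2): normalized weighted average of the known values
    return np.sum(weights * known_values) / np.sum(weights)

For instance, interpolating a surface sampled at four scattered points and querying a new location amounts to shepard_interpolate(np.array([0.5, 0.5]), pts, vals) with pts of shape (4, 2) and vals of shape (4,).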

3 SHEPARD INTERPOLATION NEURAL NETWORKS

By exploiting the Shepard interpolation method, the architecture of the network can be designed rather than found through exploration, requiring very little hyperparameter tuning and providing increased efficiency. To elaborate, several data points are selected from each output class and are used to deterministically initialize the weights of the two layers. The selection method for these data points has a large impact on the initial performance of the model as well as on the final classification accuracy after training. By substituting the distance function in equation (1) with a function that operates on vectors, the single point x can be replaced by a vector x. Intuitively, Shepard interpolation calculates the value of a surface defined by the known data points at an unknown point by taking a weighted average of the known points; the weights are determined by the distance between the known and unknown points. The neural network component of the SINN is based around two types of neurons: metric neurons and Shepard neurons. A single metric neuron encodes a known data point from the training set as well as a distance function unique to that data point. This uniqueness of the distance function helps fix the issue with classic Shepard interpolation raised in Section 2. The Shepard neuron encodes the values of the surface at all known data points for a single output function as well as the local curvature of the surface around each known data point. SINNs have different activation functions than typical neural networks. For the metric node, the activation function is defined as

\phi(x) = |\alpha(wx + b)|   (3)

A single metric node is a one-to-one function. Intuitively, it is a distance function between the input x and some known point, parameterized by α, b and w, along a single dimension. A collection of metric neurons can calculate the distance between a given vector and a known vector; since this paper focuses on time series data, it can be seen as the distance between one time series vector and another. In order to encode some n features against some m known values, the SINN needs n * m metric neurons. These metric neurons then "belong" to an inverse neuron, as shown in figure 1. For some input vector x = (x_1, x_2, ..., x_n), the inverse distance weighting activation function is defined as

w_i(\mathbf{x}) = \left[ \sum_j \phi(x_j) \right]^{-p}   (4)

Then, given a vector of Shepard node weights u = (u_1, u_2, ..., u_n), the Shepard activation function is

u(\mathbf{x}) = \frac{\sum_i u_i\, w_i(\mathbf{x})}{\sum_i w_i(\mathbf{x})}   (5)

Here, the normalization layer is combined with the Shepard layer. Figure 1 shows the normalization layer and the Shepard layer separated for high level understanding; in reality, the division by \sum_i w_i(\mathbf{x}) is the normalization part of the appropriate layer.

Fig. 1. Shepard Interpolation Neural Network Architecture

The Shepard neuron, on the other hand, combines all the weights into a single prediction. It combines the metrics for each of the n features and exponentiates them, yielding a multi-dimensional distance function; the neuron then combines the m multi-dimensional distance values into a single prediction. By allowing the neurons to learn distance functions that are non-symmetric across the dimensions (and unique across the metric neurons), the relative importance of features, conditioned on the feature values, can be learned. This allows more intelligent segmenting of the feature space for classification due to the additional inflection points interpolated by the preceding layers. Shepard neurons can also learn the local curvature of the surface, which is achieved by tuning the p parameter in equation (4).
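As a concrete illustration of equations (3)-(5), the following minimal NumPy sketch performs one forward pass through the metric, inverse-distance, and Shepard layers for a single input series; the array shapes and the parameter names alpha, w, b, u, and p are illustrative assumptions, and in the actual network these values are learned rather than fixed.

import numpy as np

def sinn_forward(x, alpha, w, b, u, p=2.0):
    # x           -- one time series, shape (n,)
    # alpha, w, b -- metric-neuron parameters, shape (m, n): one distance
    #                function per known point and per input dimension (eq. 3)
    # u           -- Shepard-node weights (surface values), shape (m, k)
    #                for k output classes
    # p           -- curvature / exponent parameter (eq. 4)

    # Metric layer: per-dimension distances phi = |alpha * (w*x + b)|, shape (m, n)
    phi = np.abs(alpha * (w * x + b))

    # Inverse-distance layer: w_i(x) = [sum_j phi_ij]^(-p), shape (m,)
    weights = np.sum(phi, axis=1) ** (-p)

    # Shepard layer with the normalization built in:
    # u(x) = sum_i u_i * w_i(x) / sum_i w_i(x), shape (k,)
    return weights @ u / np.sum(weights)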

4 EXPERIMENTAL RESULTS

Our algorithm was tested on the time series archive from the University of California, Riverside (UCR). This archive contains multiple test sets, some with multiple classes and some binary. There may be more training data than testing data (which is typical), but in some cases in the UCR archive there is less training data than testing data. Nine other state-of-the-art algorithms spanning different fields are compared with the SINN. These algorithms include: Baydogan et al.'s time series bag of features (TSBF), Schäfer's 1-NN Bag of SFA Symbols (BOSS), Lines and Bagnall's elastic ensemble (PROP), Cetin et al.'s shapelet ensemble (SEI), Bagnall et al.'s flat-COTE (COTE), Nanopoulos et al.'s multilayer perceptron (MLP), Wang et al.'s fully convolutional network (FCN), Karim et al.'s long short-term memory network (LSTM), and the benchmark, Berndt and Clifford's dynamic time warping (DTW) [1, 3–5, 7–9, 12, 14]. The above approaches (omitting MLP, FCN, and LSTM) rely heavily on algorithm-specific feature extraction or data preprocessing, which causes a discrepancy in the accuracy results for those algorithms. Also worth mentioning among the non-neural algorithms is COTE, which is a collective vote of 35 different classifiers.

Fig. 2. TSC error rate for 49 UCR test sets comparing neural only algorithms

4.1 Experimental Settings

For our experiments, no preprocessing is done to the data; the raw data is simply fed into the SINN. As for parameters, the number of nodes in the hidden layer is the only thing to set. In our experiments, we set the number of hidden-layer Shepard neurons to m (the number of classes in the data set). This means, following figure 1, there are also m inverse nodes, which own the metric nodes used for encoding. The SINN was coded in Keras, and the experiments were run on a Dell Inspiron desktop with an Intel Core i5 processor and an Nvidia Titan X graphics processing unit.
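Because the metric and Shepard activations are not standard Keras layers, one way the computation of equations (3)-(5) could be wrapped for training is sketched below; the layer name, initializers, output head, and training setup are our own illustrative assumptions and not the authors' exact implementation (in particular, the paper initializes the weights deterministically from selected training samples rather than generically).

import tensorflow as tf

class ShepardLayer(tf.keras.layers.Layer):
    # Hypothetical SINN block: metric neurons, inverse-distance weighting,
    # and a normalized Shepard output, following eqs. (3)-(5).
    def __init__(self, num_classes, p=2.0, **kwargs):
        super().__init__(**kwargs)
        self.m = num_classes   # one Shepard neuron per class
        self.p = p

    def build(self, input_shape):
        n = int(input_shape[-1])
        self.alpha = self.add_weight(name="alpha", shape=(self.m, n), initializer="ones")
        self.w = self.add_weight(name="w", shape=(self.m, n), initializer="ones")
        self.b = self.add_weight(name="b", shape=(self.m, n), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(self.m, self.m), initializer="identity")

    def call(self, x):
        x = tf.expand_dims(x, 1)                           # (batch, 1, n)
        phi = tf.abs(self.alpha * (self.w * x + self.b))   # eq. (3), (batch, m, n)
        wts = tf.reduce_sum(phi, axis=-1) ** (-self.p)     # eq. (4), (batch, m)
        num = tf.matmul(wts, self.u)                       # numerator of eq. (5)
        return num / tf.reduce_sum(wts, axis=-1, keepdims=True)  # eq. (5)

# Illustrative usage: m Shepard neurons for a 3-class problem on series of length 128.
model = tf.keras.Sequential([ShepardLayer(num_classes=3, input_shape=(128,))])
model.compile(optimizer="adam", loss="categorical_crossentropy")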

4.2 Results

Each test of the SINN was run 100 times and the results averaged. Table 1 presents, and figure 2 visualizes, the results of the SINN versus the other leading TSC algorithms. Our proposed algorithm shows extremely strong results, with a mean error rate of 18.8% over the entire 49 test sets. This beats the benchmark algorithm, DTW, as well as three other algorithms. As for individual test sets, the SINN achieved new benchmark performance on four: Coffee, ItalyPower, ProximalPhalanxTW, and UWaveX. These test sets span a wide variety of time series collections. Coffee, MiddlePhalanxOutlineAgeGroup, and ProximalPhalanx are all test sets whose input lies on a grid (spectrograms and images); knowing how well Shepard interpolation does on gridded data may give insight into why these achieved such strong results. COTE, being a combination of many different classifiers under a voting scheme, performs well on the UCR archive; the SINN outperforms COTE on 17 test sets spanning image, device, sensor, motion, and spectrogram data, showing the strength of our algorithm against an ensemble. The SINN architecture follows the neural network framework closely, and comparing the results with the other three neural network algorithms, the LSTM outperforms the other techniques. This fortifies the strength of deep learning techniques in the time series classification field and possibly stems from the success on the image-based time series explained above.

Where our shallow learning excels is when the training data is significantly smaller than the testing data. To challenge it further, we took several data sets and cut their training data down to fractions of the original: we downsized the training data by 50% and 75% and trained our algorithm again on the smaller training sets. The results can be seen in table 2. Note that we tried to keep the same number of samples per class, so when the samples do not divide evenly among the classes the reduction is only approximate. Our algorithm performs extremely well considering the training data decreased in each case. This shows the ability of shallow learning to compensate for the reduced discriminative information in the data. In addition, the ability of our Shepard interpolation to create and infer new points in the feature space helps strengthen our performance.
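A minimal sketch of the class-balanced downsizing described above is given below; the use of scikit-learn's stratified train_test_split is our own illustrative choice, since the paper does not specify the exact subsampling procedure.

from sklearn.model_selection import train_test_split

def downsize_training_set(X_train, y_train, keep_fraction, seed=0):
    # Keep roughly `keep_fraction` (e.g. 0.5 or 0.25) of the training data
    # while preserving the per-class proportions (stratified subsampling).
    X_kept, _, y_kept, _ = train_test_split(
        X_train, y_train,
        train_size=keep_fraction,
        stratify=y_train,       # keep approximately the same share of samples per class
        random_state=seed)
    return X_kept, y_kept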

5 CONCLUSION

Time series classification is a difficult machine learning task, not only for algorithms designed specifically around time series data but also for neurologically inspired classification algorithms. In this paper, a novel approach that combines a mathematical foundation with a neurologically inspired architecture is applied to the task of TSC. Our proposed algorithm proves effective on time series data and achieves new benchmark results on several test sets, while maintaining strong results when the training data is cut down. The SINN can be understood mathematically as learning the curvature of, and distances on, the surface it fits in the feature space.

Going forward, a combination of ideas could make the SINN better at TSC. As mentioned before, a wider network, or a network combining a stacked CNN with a SINN, could be of great benefit. With this in mind, a feature-based input would also be interesting to explore further, utilizing time-series-specific features such as wavelets or Fourier representations. The way the data points are used to initialize the Shepard neurons also needs improvement; an unsupervised clustering algorithm could group points together to initialize the neurons, or the centroid of each cluster could be used. Such clustering algorithms might increase computation time, but may improve the results on the test sets. There are also possible variants of the SINN that use different distance functions. The SINN has established state-of-the-art results based on a new, mathematically grounded, neurally inspired network. This architecture, paired with its theoretical mathematics, has the potential to become a great asset to the machine learning community working on time series classification.

Table 1. Testing error and mean error on the UCR archive: per-test-set error rates for the 49 UCR test sets, comparing DTW, TSBF, BOSS, PROP, SEI, COTE, MLP, FCN, LSTM, and SINN, together with the number of test sets on which each algorithm is best and each algorithm's total average error (0.188 for the SINN).

Table 2. SINN accuracy (%) with reduced training data

Data set          train/test    100% train   50% train   25% train
CBF               30/900        93.9         86.4        80.9
NonInvThorax1     1800/1965     92.5         87.6        78.3
StarlightCurves   1000/8236     94.0         84.2        78.8
Yoga              300/3000      82.5         76.9        70.4

References

1. Anthony Bagnall, Jason Lines, Jon Hills, and Aaron Bostrom. Time-series classification with COTE: the collective of transformation-based ensembles. IEEE Transactions on Knowledge and Data Engineering, 27(9):2522–2535, 2015.
2. Anthony Bagnall. The UEA & UCR time series classification repository. www.timeseriesclassification.com.
3. Mustafa Gokce Baydogan, George Runger, and Eugene Tuv. A bag-of-features framework to classify time series. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2796–2802, 2013.
4. Donald J. Berndt and James Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370. Seattle, WA, 1994.
5. Mustafa S. Cetin, Abdullah Mueen, and Vince D. Calhoun. Shapelet ensemble for multi-dimensional time series. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 307–315. SIAM, 2015.
6. Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
7. Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Shun Chen. LSTM fully convolutional networks for time series classification. arXiv preprint arXiv:1709.05206, 2017.
8. Jason Lines and Anthony Bagnall. Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery, 29(3):565–592, 2015.
9. Alex Nanopoulos, Rob Alcock, and Yannis Manolopoulos. Feature-based classification of time-series data. International Journal of Computer Research, 10(3):49–61, 2001.
10. Jooyoung Park and Irwin W. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3(2):246–257, 1991.
11. Jimmy SJ Ren, Li Xu, Qiong Yan, and Wenxiu Sun. Shepard convolutional neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2015.
12. Patrick Schäfer. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery, 29(6):1505–1530, 2015.
13. Donald Shepard. A two-dimensional interpolation function for irregularly-spaced data. In Proceedings of the 1968 23rd ACM National Conference, pages 517–524. ACM, 1968.
14. Zhiguang Wang, Weizhong Yan, and Tim Oates. Time series classification from scratch with deep neural networks: a strong baseline. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 1578–1585. IEEE, 2017.
15. Phillip Williams. SINN: Shepard interpolation neural networks. In International Symposium on Visual Computing, pages 349–358. Springer, 2016.