Neural Network Time Series Prediction With Matlab

By Thorolf Horn Tonjum, School of Computing and Technology, University of Sunderland, The Informatics Centre, St Peter's Campus, St Peter's Way, Sunderland, SR6 0DD, United Kingdom. Email: [email protected]

Introduction

This paper describes a neural network time series prediction project, applied to forecasting the American S&P 500 stock index. 679 weeks of raw data are preprocessed and used to train a neural network. The project is built with Matlab (MathWorks Inc.), which is used for both processing and preprocessing the data. A prediction error of 0.00446 (mean squared error) is achieved. One of the major goals of the project is to visualize how the network adapts to the real index course by approximation; this is achieved by training the network in series of 500 epochs each, showing the change of the approximation (green) after each training run. Remember to push the 'Train 500 Epochs' button at least 4 times to get good results and a feel for the training. You might have to restart the whole program several times before it 'lets loose' and achieves a good fit; roughly one out of 5 reruns produces a good fit. To run or rerun the program in Matlab, type:

>> preproc
>> tx

Dataset

679 weeks of American S&P 500 index data, with 14 basic forecasting variables:

1. S&P week highest index.
2. S&P week lowest index.
3. NYSE week volume.
4. NYSE advancing volume / declining volume.
5. NYSE advancing / declining issues.
6. NYSE new highs / new lows.
7. NASDAQ week volume.
8. NASDAQ advancing volume / declining volume.
9. NASDAQ advancing / declining issues.
10. NASDAQ new highs / new lows.
11. 3-month treasury bill.
12. 30-year treasury bond yield.
13. Gold price.
14. S&P weekly closing course.

These are all strong economic indicators. The indicators have not been subject to re-indexation or other alterations of the measurement procedures, so the dataset covers an unobstructed span from January 1980 to December 1992. Interest rates and inflation are not included, as they are reflected in the 30-year treasury bond and the price of gold. The dataset provides an ample model of the macro economy.

Preprocessing

The weekly change in the closing course is used as the output target for the network. The 14 basic variables are transformed into 54 features by taking the first 13 variables and producing:

I. The change since last week (delta).
II. The second power (x^2).
III. The third power (x^3).

Using the course change from last week as an input variable the week after gives 54 feature variables (the 14 original static variables included). All input variables are then standardized to a mean of zero and a standard deviation of 1. [Matlab command: prestd] The dimensionality of the data is then reduced to 28 variables by a principal component analysis with 0.001 as threshold. The threshold is set low since we want to preserve as much data as possible for the Elman network to work on. [Matlab command: prepca] We then scale the variables (including the target data) to fit the [-1,1] range, as we use tansig output functions. [Matlab command: premnmx] See the Matlab file 'Preproc.m' for further details.

Choice of Network Architecture and Algorithms

We are doing time series prediction, but we are forecasting a stock index and rely on current economic data just as much as on lagged data from the time series being forecasted; this gives us a wider spectrum of neural model options. Multilayer perceptron networks (MLP), tapped delay-line networks (TDNN), and recurrent network models can all be used. In our case, detecting cyclic patterns becomes a priority, together with good multivariate pattern approximation ability. The Elman network is selected for its ability to detect both temporal and spatial patterns. Choosing a recurrent network is favorable, as it accumulates historic data in its recurrent connections.
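The three preprocessing stages above can be sketched outside Matlab as well. The following Python/NumPy functions are illustrative stand-ins for prestd, prepca, and premnmx (the `_like` names are my own; Matlab's implementations differ in detail, e.g. prepca operates on the standardized data's covariance as assumed here):

```python
import numpy as np

def prestd_like(x):
    """Standardize each column to zero mean, unit std (prestd analogue)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)
    return (x - mean) / std, mean, std

def prepca_like(x, threshold=0.001):
    """Keep principal components whose variance fraction exceeds the
    threshold (prepca analogue). Expects standardized input."""
    cov = np.cov(x, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]          # largest variance first
    vals, vecs = vals[order], vecs[:, order]
    keep = vals / vals.sum() > threshold    # low threshold keeps most data
    return x @ vecs[:, keep], vecs[:, keep]

def premnmx_like(x):
    """Rescale each column to the [-1, 1] range (premnmx analogue)."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return 2.0 * (x - lo) / (hi - lo) - 1.0, lo, hi
```

With the paper's 0.001 threshold, nearly all variance is retained, which matches the stated goal of preserving as much data as possible for the Elman network.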
Using an Elman network for this problem domain demands a high number of hidden nodes; 35 is found to be the best trade-off in our example, whereas a normal MLP network would need only around 16. The Elman network needs more hidden nodes to respond to the complexity in the data, as well as to approximate both temporal and spatial patterns. We train the network with a gradient descent training algorithm, enhanced with momentum and an adaptive learning rate; this enables the network to climb, performance-wise, past points where gradient descent algorithms without an adaptive learning rate would get stuck.
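The Elman network's recurrent memory is what distinguishes it from a plain MLP: the hidden activations of the previous step are fed back in as context. A minimal forward-pass sketch in Python (not the Matlab implementation; shapes and the purelin-hidden/tansig-output arrangement follow the transfer-function choice described later in this paper):

```python
import numpy as np

def elman_forward(x_seq, W_in, W_rec, b_h, W_out, b_o):
    """Forward pass of a single-hidden-layer Elman network over a sequence.

    The context feedback (W_rec @ hidden) is how the network accumulates
    historic data. Hidden layer is linear (purelin), output is tanh
    (tansig), matching the target data scaled to [-1, 1].
    Shapes: W_in (hidden, inputs), W_rec (hidden, hidden),
    W_out (outputs, hidden).
    """
    hidden = np.zeros(W_rec.shape[0])       # context units start at zero
    outputs = []
    for x in x_seq:
        hidden = W_in @ x + W_rec @ hidden + b_h
        outputs.append(np.tanh(W_out @ hidden + b_o))
    return np.array(outputs)
```

Because each hidden state depends on the previous one, temporal patterns in the 28 principal-component inputs can influence predictions many weeks later.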

We use the Matlab learning function learnpn, as we need robustness to deal with some quite large outliers in the data. The maximum number of validation failures [net.trainParam.max_fail = 25] is set arbitrarily high, but this gives the learning algorithm a greater ability to escape local minima and continue to improve where it would otherwise get stuck. The momentum is also set high (0.95) to ensure a high impact of the previous weight change; this speeds up the gradient descent, helps keep us out of local minima, and resists memorization. The learning rate is initially set relatively high at 0.25; this is possible because of the high momentum, and because it is controlled by the adaptive learning rate rules of the Matlab training method traingdx. We choose purelin as the transfer function for the hidden layer, as this provided more approximation power, and tansig for the output layer, as we scaled the target data to fit the [-1,1] range. The weight initialization scheme initzero is used to start the weights off from zero; this provides the best end results, but heightens the trial-and-error factor, resulting in having to restart the program between 5 and 8 times to get a "lucky" fit. Once you have a "lucky" fit, training the network for 3 to 5 rounds of 500 epochs usually yields results in the 0.004 mse range. The maximum performance increase is set to 1.24, giving the algorithm some leeway to test out alternative routes before getting called back on the path. [net.trainParam.max_perf_inc = 1.24] With well over 400 training cases to work with, 35 hidden neurons, and 28 input variables, we get 980 hidden layer weights, which is well below the rule-of-thumb limit of 4000 (10 × cases). Results in the 0.004 mse range support the conclusion that the model choice was not the worst possible.
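The interplay of momentum, adaptive learning rate, and max_perf_inc can be sketched as a single weight-update step. This is an illustrative Python approximation of the traingdx-style rules, not Matlab's exact implementation; the increase/decrease factors lr_inc and lr_dec are assumed typical defaults, while mc = 0.95 and max_perf_inc = 1.24 are the paper's settings:

```python
import numpy as np

def traingdx_step(w, dw_prev, grad, lr, last_err, new_err,
                  mc=0.95, lr_inc=1.05, lr_dec=0.7, max_perf_inc=1.24):
    """One gradient-descent update with momentum and adaptive learning rate.

    * If the error grew by more than max_perf_inc, the step is rejected,
      momentum is reset, and the learning rate shrinks.
    * If the error fell, the step is accepted and the learning rate grows,
      letting the algorithm test alternative routes before being called back.
    """
    if new_err > last_err * max_perf_inc:
        return w, np.zeros_like(dw_prev), lr * lr_dec   # step rejected
    dw = mc * dw_prev - lr * grad                       # momentum-blended step
    if new_err < last_err:
        lr = lr * lr_inc
    return w + dw, dw, lr
```

With mc at 0.95, the previous step dominates the update, which is what carries the search through shallow local minima.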
Additional results could have come from adding lagged data, like exponentially smoothed averages over different time frames and with different smoothing factors, efficiently accumulating memory over large time scales. Integrating a tapped delay-line setup could also have been beneficial. But these alternatives would have added to the curse of dimensionality, probably without yielding great benefits in return, especially as long as the recurrent memory of the Elman network seemed to perform with ample sufficiency. The training set's 400 weeks were taken from the start of the data, followed by 140 weeks of test set and finally 139 weeks of validation data, in effect approximating data at the 0.004 mse level more than 5 years (279 weeks) into the future.
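Because this is a time series, the 400/140/139 split must be chronological rather than random, so the validation weeks genuinely lie in the future relative to training. A small sketch (the function name is my own):

```python
def chronological_split(data, n_train=400, n_test=140):
    """Split a time series chronologically: the earliest weeks form the
    training set, then the test set, and the remainder is validation
    (here 400/140/139 for 679 weeks). Order is preserved throughout."""
    train = data[:n_train]
    test = data[n_train:n_train + n_test]
    val = data[n_train + n_test:]
    return train, test, val
```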

Training & Visualization

The data is, as described above, divided in the classic 60/20/20 format into training set, testing set, and validation set. The approximation is visualized by plotting the actual course (blue) against the approximation (green). This is done for the training set, the testing set, and the validation set, and clearly demonstrates how the neural net approximates the data:

Errors are displayed as red bars at the bottom of the charts. The training is done by training 500 epochs, displaying the results, training another 500 epochs, and so forth. Seeing the approximation 'live' gives interesting insights into how the algorithm adapts, and how changes in the model affect adaptation.

Push the button to train another 500 epochs.

The effect of the adaptive learning rate is quite intriguing, specifically its effect on the performance.

Dynamic learning rate, controlled by adaptation rules.

Vivid performance change driven by the changing learning rate.

The correlation plot gives ample insight into how closely the model maps the data; to see it, push the 'correlation plot' button. The 'Sum(abs(errors))' figure displays the sum of the absolute values of all errors, as a steadfast and unfiltered measurement.
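The two fit measures behind these plots, together with the mse reported earlier, can be computed directly (a small Python sketch; the function name is my own):

```python
import numpy as np

def fit_metrics(target, prediction):
    """Correlation between target and prediction, the raw sum of absolute
    errors used as an unfiltered fit measure, and the mean squared error."""
    r = np.corrcoef(target, prediction)[0, 1]
    sae = np.abs(target - prediction).sum()
    mse = np.mean((target - prediction) ** 2)
    return r, sae, mse
```

Note that correlation is insensitive to a constant offset, which is exactly why the unfiltered Sum(abs(errors)) is worth watching alongside it.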

Bibliography

Valluru Rao (1993), "C++ Neural Networks".
Neural Network Toolbox User's Guide, 4th edition.

Appendix A. The Matlab code:

Preproc.m : Preprocessing the data.
Tx.m : Setting up the network.
Gui.m : Training and displaying the network.
