Feature Selection of Autoregressive Neural Network Inputs for Trend Time Series Forecasting

Sven F. Crone
Lancaster University Management School, Department of Management Science
Lancaster, United Kingdom
s.crone@lancaster.ac.uk

Stephan Häger*
iqast Intelligent Forecasting Systems, RSG Software GmbH
Hamburg, Germany
[email protected]
Abstract—The capability of artificial Neural Networks to forecast time series with trends has been a topic of dispute. While selected research following Zhang and Qi has indicated that prior removal of trends is required for a Multilayer Perceptron (MLP), others provide evidence that Neural Networks are capable of forecasting trends without data preprocessing, either by choosing input nodes with an adequate autoregressive lag-structure of lagged realisations or by adding explanatory variables with trends. This paper proposes a novel variable selection methodology of autoregressive lags for trended time series with and without seasonality, and assesses its efficacy using the dataset of the International Time Series Forecasting Competition conducted at WCCI 2016. Our experiments indicate that MLPs are capable of forecasting different trend forms, but that more than a single lag-structure is required to do so, making multiple input-lag variants and a robust model selection strategy necessary to achieve reliable forecast accuracy.

Keywords—forecasting, time series prediction, artificial neural networks, trend time series, seasonal time series, structural breaks

I. INTRODUCTION

Forecasting time series with trends, a non-stationary pattern of long-term changes in the conditional mean of a series, is a ubiquitous requirement for business decision making across domains, including forecasting demand for industrial products in business management, transportation volumes in civil engineering, or mid-term electricity load in electrical engineering [1]. Consequently, the capability to forecast trended time series with the popular class of Artificial Neural Networks (NN), and in particular with the most widely applied feedforward Multilayer Perceptron (MLP), is a prerequisite for their use in many real-life situations. However, although trends (together with seasonality) are an archetypical time series pattern [2], the ability of MLPs to forecast trended time series is still under discussion [3–5]. Most prominently, Zhang and Qi [6, 7] concluded after a series of empirical simulations that the most accurate approach to forecasting trended time series is to decompose the time series into its components, effectively removing trend and seasonality using statistical methods, forecasting the remainder with an MLP, and then recombining the forecasted components. Hill et al. [4] followed a similar approach, and Cogollo and Velásquez [8] also found evidence that MLPs are not capable of directly forecasting trended time series. These findings are surprising, as they ultimately question the MLP's capability to universally approximate any type of linear or nonlinear regression function to any desired degree of accuracy [9], and their applicability in practice. In contrast, other researchers conclude that prediction of time series with trends and seasonality is possible without data preprocessing. Crone and Dhawan [10] evaluate MLPs on forecasting a representative dataset of trend, seasonal, and trend-seasonal time series patterns and deduce that direct prediction is both feasible and accurate. Similarly, Franses and Draisma conclude that MLPs were able to directly forecast both trends and seasonality [10, 11], albeit using different and more extensive autoregressive input lag-structures. Despite this dispute in the literature, and its significant implications for the capabilities of MLPs in forecasting, only limited research has been conducted on forecasting trends, with little attention paid to the data conditions of the different functional trend forms (e.g. linear, degressive, sigmoid or exponential), trend slopes, or signal-to-noise ratios. More importantly, the autoregressive inputs used in past studies were not necessarily suited to capture all forms of trends.

The research has been carried out in collaboration with iqast staff located in Hamburg (www.iqast.de). The authors gratefully acknowledge the support and provision of iqast software from H. Kausch, T. Kempcke, J. Thyson and D. Vinko.


This paper seeks to remedy this omission by systematically evaluating the capabilities of MLPs in forecasting trended time series without data preprocessing, using a valid and reliable experimental design, applying an out-of-sample evaluation with multiple step-ahead forecasts across rolling time origins, and robust error metrics. We identify several crucial MLP meta-parameters from prior studies, e.g. input variable scaling, the lag-structure of the input nodes and the number of hidden nodes [6, 10], and systematically assess their accuracy across time series of different trend forms. As a semi-representative dataset we use the dataset of the International Time Series Forecasting Competition 2016 [12], which contains 72 time series with deterministic trend patterns, including linear, exponential, and sigmoid trends with and without seasonality, as well as structural breaks in trended time series. In addition to past studies, we introduce a novel input-lag specification of multiple autoregressive support points to capture a wide range of deterministic trend functions, and assess its empirical accuracy. This paper is organised as follows: Section II.A provides a brief literature review on forecasting trend time series, with the new proposal of autoregressive support points outlined in Section II.B. Section III outlines the experimental design, results are presented in Section IV, and conclusions and an outlook for future work follow in Section V.


II. TREND FORECASTING WITH NEURAL NETWORKS

A. Prior studies on trend forecasting with Neural Networks

Trends constitute an archetypical time series pattern of a long-term increase or decrease over time. They are prevalent in data from many application domains and, as such, constitute an important forecasting problem for future decision making; see, e.g., Fig. 1 for examples. More formally, let yt denote the value of the time series at time t with trend, yt = f(t) + nt, where f(t) represents a possibly nonlinear trend and nt is an autocorrelated error process. For a linear function f(t) = β0 + β1t, we set x1,t = t and define yt = β0 + β1x1,t + nt as a linear deterministic trend (i.e. without changes in trend slope over time). Choosing alternative f(t) allows nonlinear trend forms, as well as structural breaks and piecewise linear or nonlinear stochastic trends. Due to their prevalence across disciplines, the adequate specification of trends and other forms of non-stationarity has received substantial attention in statistics and econometrics. For time series prediction with NNs, and MLPs in particular, trended time series have received substantially less attention. In an early paper, Tang et al. [3] compared the accuracy of MLPs with Autoregressive Integrated Moving Average (ARIMA) models for three different seasonal-trend scenarios. Their experiments considered autoregressive input vectors of {t-1}, {t-1, …, t-6}, {t-1, …, t-12} and {t-1, …, t-24} input neurons, corresponding to possible monthly, semi-annual, annual and biennial seasonality in monthly data respectively. Although their analysis is limited in reliability, evaluating only four sets of lags and three time series, they confirm that MLPs are capable of approximating both trend and seasonality. In contrast, Zhang and Qi [6] found that MLPs are not capable of forecasting seasonality and trends directly, utilising MLPs with ten different lag-structures including t-1 to t-4, t-12 to t-14, t-24, t-25 and t-36. They reasoned that only data preprocessing in the form of detrending, deseasonalising or a combination of both leads to reasonable error levels. In another study [7] they showed that trend time series can be predicted with adequate error when lag-structures with lags t-1 to t-5 are used. However, the small number of 25 MLPs (resulting from combining 1 to 5 inputs with 1 to 5 hidden neurons) used to forecast the time series directly seems prone to failure when it comes to comparing best-practice models and statistical evaluations. In a similar study, Cogollo and Velásquez [8] concluded that MLPs are able to capture neither nonlinear nor nonstationary functions. Again, their experimental setup lacks rigour, investigating only a few types of MLPs, but seemingly confirming Zhang and Qi. In contrast, Crone and Dhawan [5, 10, 13] attempt a more systematic evaluation, in which the influence of several input vectors is systematically assessed. In their experiments, some MLPs are able to forecast different trend patterns with additive as well as multiplicative seasonality. Additionally, they make a convincing case that the training process needs to take place with multiple random initialisations in order to make valid statistical statements about the potential of different MLP designs. Their study selects lags between t-1 and t-3, t-6, t-9, t-12, t-13, t-15 and t-21, to capture a larger range of autocorrelations between consecutive months. Beside the contributions mentioned above, the literature review yielded few other papers, indicating a lack of studies evaluating different lag-structures on time series containing deterministic trends, and no research on trend time series including structural breaks.

Fig. 1: Exemplary time series with underlying trend-functions. Top-left: linear trend-function (ts3, MA 12). Top-right: exponential trend-function (ts6, MA 12). Bottom-left: sigmoid trend-function (ts9, MA 12). Bottom-right: linear trend-function with structural breaks (ts10, MA 12).
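To make the trend model yt = f(t) + nt from Section II.A concrete, the following minimal Python sketch simulates the four trend shapes shown in Fig. 1. The series length, slopes, break point and noise level are illustrative assumptions, not the competition's generating parameters.

```python
# Minimal sketch (not from the paper): simulate y_t = f(t) + n_t for the four
# trend shapes of Fig. 1. Length, slopes, break point and noise level are
# illustrative assumptions only.
import numpy as np

def make_series(kind, n=108, noise=1.0, seasonal=False, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(n, dtype=float)
    if kind == "linear":
        f = 50 + 0.5 * t                              # f(t) = b0 + c*t
    elif kind == "exponential":
        f = 50 * np.exp(0.02 * t)                     # expansive growth
    elif kind == "sigmoid":
        f = 50 + 60 / (1 + np.exp(-(t - n / 2) / 8))  # S-shaped saturation
    elif kind == "linear_break":
        f = 50 + 0.5 * t + 20 * (t > n // 2)          # level shift (structural break)
    else:
        raise ValueError(kind)
    if seasonal:
        f = f + 5 * np.sin(2 * np.pi * t / 12)        # additive annual seasonality
    return f + rng.normal(scale=noise, size=n)        # add the noise process n_t

series = {k: make_series(k) for k in ["linear", "exponential", "sigmoid", "linear_break"]}
```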

B. Novel input node specification for autoregressive lags

Focusing first on deterministic trend-functions, the NN should be capable of reconstructing and extending the temporal course if the input vector allows a unique temporal assignment of the requested output. Fig. 1 shows four trend types we identified within the data set. Considering the linear type, for instance, the NN is requested to learn a function described by f(t) = b0 + c·t. An analytic procedure needs two value pairs [f(t), t] to determine the parameters b0 and c, if the form of the equation is given. However, we neither presume that NNs are able to analytically solve systems of equations nor to perform Taylor series expansions without any further interpretation, but rather that NNs are able to numerically approximate time-dependent functions and extend them within a defined interval. Hence, the most obvious approach is to provide the value pair of time and target value, which is also done by Qi and Zhang, who add the time index to the input vector [6, 7]. Their findings, however, show that the learning process in MLPs differs from an analytical procedure. Therefore, we go one step further by neglecting the linear time course and assume that an MLP is able to adopt the trend-function without knowing the actual time stamp. This presupposes that the MLP can orient itself within different curvatures from a set of given inputs. In order to achieve adequate forecasting results, the trend-function should be sufficiently pronounced relative to the additional error terms. Otherwise, the identification of the trend-function is distorted, which prevents a correct extrapolation. An example of this is the exponential trend-function in Fig. 1: the first two thirds on the x-axis can easily be misinterpreted as a linear trend-function, which may cause issues for a training, validation and test partitioning structured along the time axis. In order to emphasise the temporal course, the input lags should span an interval sufficiently large to identify the correct function. In contrast to the procedures in the literature, we choose input vectors which do not contain the complete set of lags between t-1 and t-n, but instead use equidistant support points. Additionally, the input vector must be designed in a way that allows the adoption of trends as well as seasonality. In line with other researchers, we found that lags in steps of twelve, i.e. t-12, t-24, t-36 etc., are especially crucial to the modelling of annual seasonality [14, 15]. Therefore, we design our support points around these lags to generalise our procedure regarding the adoption of both trends and seasonality. Furthermore, we consider the width of our support points. A single support point contains one lag at each time segment (e.g. lags t-1, t-12), double support points contain two lags at each time segment (e.g. lags t-1, t-2, t-12, t-13), and so on, which enables the MLP to perform different mathematical operations. For example, the utilisation of double support points allows the consideration of gradients between adjacent lags. Input vectors consisting of triple support points allow two gradients to be formed as well as a more general estimate of the local average value, which is likely to result in a better adoption of trends when noise or seasonality hinder a clear detection. Of course, a precise selection of lags is not sufficient to mirror the inner workings of an MLP. Although significant evidence could not be found, the selection of lags is likely to guide the learning of an MLP in a certain mathematical direction; omitting certain lags from the input vector obviously excludes some mathematical operations, but may direct the MLP rather than confusing it with too many ways to go.

III. EXPERIMENTAL DESIGN

A. Experimental Data Set

We assess the empirical accuracy of the proposed approach using the data set of the International Time Series Forecasting Competition 2016. Forecasting competitions have a long tradition in providing empirical evidence, and provide a valid test-bed to objectively compare forecasting methodologies and techniques [16, 17]. The examined data set consists of 72 time series of monthly observations. The dataset is split by forecasting horizon, segmenting the time series into those to be forecast six or twelve months into the future respectively. For the sake of coherence in our argument, we ignore the forecasting horizons and assume each time series to demand a twelve-step forecast. The observed time series patterns include different forms of deterministic trends, including linear, exponential, damped and sigmoid trends, as well as seasonality and structural breaks overlaid with random noise. The underlying trend function can easily be discerned by plotting a moving average (MA); see Fig. 1 for some example series from the data set. A structured overview following the Pegels-Gardner classification of time series patterns [18, 19], extended by a novel type of sigmoid trend, is shown in TABLE I. Unfortunately, the number of elements in the related groups and the corresponding cell sizes of homogeneous patterns are small and mostly unbalanced, which prevents a valid comparison of average errors and standard deviations between groups in the tradition of a multifactorial ANOVA. As such, the results should not be generalised beyond this dataset without care.
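As a small illustration of the moving-average diagnostic mentioned above, the sketch below overlays a centred 12-month moving average on a simulated monthly series; the series itself is an illustrative assumption, and pandas is used only for convenience.

```python
# Sketch (assumption: monthly observations); a centred 12-month moving average
# smooths out annual seasonality and noise, exposing the trend shape, as in the
# "MA 12" curves of Fig. 1.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(108, dtype=float)
y = 50 * np.exp(0.02 * t) + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=2, size=t.size)

trend_estimate = pd.Series(y).rolling(window=12, center=True).mean()
# trend_estimate is NaN at the edges where the centred window is incomplete
```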

TABLE I. TREND STRUCTURES IN DATASET

                       No Structural Breaks                  Some Structural Breaks
Time Series Pattern    No Seasonality   Some Seasonality     No Seasonality   Some Seasonality
No Trend                     2                 2                   2                 -
Linear Trend                11                 6                  17                 8
Exponential Trend            9                 3                   -                 -
Damped Trend                 -                 -                   -                 -
Sigmoid Trend                6                 6                   -                 -

B. Experimental Setup

The aim of the experimental setup is to determine the best lag-structure for each time series, which may serve as a starting point for a systematic analysis of errors across groups of lags and time series properties. A detailed description of the functioning of NNs and MLPs is omitted here; instead we refer to [20–22] for the basic principles of neurons and the functioning of MLPs in the context of forecasting. In order to assess the prediction accuracy of MLP configurations on different time series structures, while at the same time limiting the experimental overhead, we split the set of degrees of freedom into two parts. The first set is fixed based on the knowledge of previous experiments and the literature. The second set of parameters is varied in order to evaluate different values; it contains the different lag-structures, obviously, as well as varying numbers of hidden nodes. As discussed in Section II.B, the input vector is designed in the form of support points, focusing the MLP in terms of mathematical operations. The arrangement of the lags utilised in our experiments is presented in TABLE II, which shows the 15 different input-lag selections examined. In addition to the discussion of the mathematical meaning of lag selections in the previous section, we also use the complete set of lags to capture possible autocorrelations, and to allow a comparison with prior studies.

TABLE II. AUTOREGRESSIVE INPUT LAG SELECTION

Lag-Structure   Lag-Set 1          Lag-Set 2                            Lag-Set 3                                          Lag-Set 4
Single (S)      {t-1}              {t-1, t-12}                          {t-1, t-12, t-24}                                  {t-1, t-12, t-24, t-35}
Double (D)      {t-1, t-2}         {t-1, t-2, t-12, t-13}               {t-1, t-2, t-12, t-13, t-24, t-25}                 {t-1, t-2, t-12, t-13, t-24, t-25, t-36, t-37}
Triple (T)      {t-1, t-2, t-3}    {t-1, t-2, t-3, t-11, t-12, t-13}    {t-1, t-2, t-3, t-11, t-12, t-13, t-23, t-24, t-25}   {t-1, t-2, t-3, t-11, t-12, t-13, t-23, t-24, t-25, t-35, t-36, t-37}
Complete (C)    -                  {t-1, …, t-12}                       {t-1, …, t-24}                                     {t-1, …, t-36}
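To illustrate how the support-point lag sets of TABLE II translate into a supervised learning problem, the following sketch (our illustration, not the authors' implementation) assembles the lagged input matrix and one-step-ahead target for a chosen lag set; the toy series is an assumption.

```python
# Sketch (not the paper's implementation): turn a univariate series into an
# (inputs, target) design matrix for a given autoregressive lag set, e.g. the
# support-point sets of TABLE II.
import numpy as np

LAG_SETS = {
    "S2": [1, 12],
    "D2": [1, 2, 12, 13],
    "T2": [1, 2, 3, 11, 12, 13],
    "C2": list(range(1, 13)),   # complete set {t-1, ..., t-12}
}

def lagged_design(y, lags):
    """X[i, j] = y[t - lags[j]] and target[i] = y[t] for every feasible origin t."""
    y = np.asarray(y, dtype=float)
    max_lag = max(lags)
    X = np.array([[y[t - l] for l in lags] for t in range(max_lag, len(y))])
    target = y[max_lag:]
    return X, target

y = 50 + 0.5 * np.arange(108) + np.random.default_rng(1).normal(size=108)  # toy linear-trend series
X, target = lagged_design(y, LAG_SETS["D2"])   # one row of X per forecast origin
```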

For each input vector, we assess twelve settings of the number of hidden nodes, with h ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20}. With respect to previous analyses, we focus our sensitivity analysis on 2 to 10 nodes, whereas 12, 15 and 20 nodes are used to verify the deteriorating trend when the number of hidden nodes increases. Additionally, we limit our study to network topologies of up to 700 degrees of freedom, which seems to be a sufficiently large search space in relation to a maximal time series length of 108 observation points. With regard to Crone and Dhawan [10], we use the logistic function for the hidden layer and the identity function for the output layer as activation functions. All forecasts are calculated as one-step-ahead forecasts ŷ(t+1), using one output neuron. Each MLP is initialised 50 times. We use a standard backpropagation algorithm with a variable learning rate of η = 0.8 without momentum, which is decreased by 0.01 per epoch for convergence into local minima. The data sampling takes place in random order with replacement. All input and output data are linearly scaled to the interval [-0.6, 0.6] in order to prevent saturation effects in the tails of the activation functions and vanishing gradients when nonstationary time series with trends are considered [13], which enables a generalisation regarding the calculation of several patterns. In total, 600 MLPs using the same input vector are created for each time series, and at most 43,200 MLPs (depending on the position of the last lag, which may conflict with short time series) are created per lag-structure. For the sake of this paper, we separate each time series into three partitions for training, validation and generalization in the ratio 0.6, 0.2 and 0.2. Furthermore, we select the best model of each lag-structure on each time series for further evaluation; the selection is based on the validation errors. The error metric is the symmetric mean absolute percentage error (sMAPE) as suggested by the CIF committee:

sMAPE = (2/n) · Σ_{i=1..n} |ŷ(t+1) − y(t+1)| / (|ŷ(t+1)| + |y(t+1)|)

All experiments presented within this paper are calculated with the software iqast - forecast desktop (www.iqast.de), developed by RSG Software GmbH. The average computation time per implemented MLP was below one second.
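As a reference for the metric, the scaling and the network configuration described above, a minimal sketch follows. It is our illustration using scikit-learn rather than the iqast software; the training-partition-only scaling bounds are our assumption, and the exact learning-rate decay of 0.01 per epoch is not reproduced.

```python
# Sketch of the error metric, the input scaling and one MLP configuration (our
# illustration using scikit-learn; the paper uses the iqast software, and its
# exact per-epoch learning-rate decay is not reproduced here).
import numpy as np
from sklearn.neural_network import MLPRegressor

def smape(y_hat, y):
    """sMAPE = (2/n) * sum(|y_hat - y| / (|y_hat| + |y|)), reported in percent."""
    y_hat, y = np.asarray(y_hat, float), np.asarray(y, float)
    return 100.0 * 2.0 * np.mean(np.abs(y_hat - y) / (np.abs(y_hat) + np.abs(y)))

def fit_minmax(y_train, lo=-0.6, hi=0.6):
    """Linear scaling to [-0.6, 0.6]; assumption: bounds from the training partition only."""
    y_min, y_max = float(np.min(y_train)), float(np.max(y_train))
    scale = (hi - lo) / (y_max - y_min)
    to_net = lambda v: lo + (np.asarray(v, float) - y_min) * scale
    from_net = lambda z: y_min + (np.asarray(z, float) - lo) / scale
    return to_net, from_net

def make_mlp(hidden_nodes, seed):
    # logistic hidden layer, identity (linear) output, plain SGD without momentum
    return MLPRegressor(hidden_layer_sizes=(hidden_nodes,), activation="logistic",
                        solver="sgd", learning_rate_init=0.8, momentum=0.0,
                        max_iter=500, random_state=seed)
```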

TABLE III. PERFORMANCE OF MODEL-CLASSES

                          sMAPE [pct.]                                                                          Rank of Median
                  Training (T)              Validation (V)            Generalization (G)                       sMAPE*            Number of
Model-Class       Mean   Std.Dev. Median    Mean   Std.Dev. Median    Mean   Std.Dev. Median                   T     V     G     Series†
Lag-S1            21.35  18.07    18.37     8.08   7.77     5.14      9.92   5.59     8.99                     19    17    17    72
Lag-S2            13.18  14.52     9.10     5.61   4.33     5.01      6.81   5.51     5.38                     14    13    13    72
Lag-S3            11.55  11.85     7.82     4.78   3.89     4.01      5.87   4.55     5.02                     10     9    10    69
Lag-S4             8.73   7.21     7.43     4.42   2.92     3.70      5.16   3.75     4.10                      8     8     2    66
Lag-D1            18.17  12.08    16.77     7.78   4.75     7.52      9.75   6.42     8.39                     18    16    16    72
Lag-D2            11.48  10.24     8.53     5.25   3.69     5.01      6.76   5.55     5.13                     13    12    11    72
Lag-D3             9.85  10.59     7.20     4.35   2.86     3.55      5.42   3.83     4.58                      6     6     8    69
Lag-D4             7.33   5.54     6.80     3.96   2.29     3.53      4.83   3.33     4.44                      4     4     6    66
Lag-T1            16.42  12.08    13.81     7.45   4.55     7.41      9.00   5.22     8.28                     17    15    15    72
Lag-T2            11.66  11.74     8.37     5.05   3.58     4.76      6.23   4.40     5.01                     11    11     9    72
Lag-T3            10.31  11.39     7.55     4.24   2.88     3.60      5.06   3.60     4.29                      9     7     5    69
Lag-T4             7.52   6.86     6.36     3.89   2.30     3.35      4.90   3.33     4.27                      3     3     4    66
Lag-C2            11.89  13.82     8.40     5.04   3.93     4.35      5.93   4.09     5.29                     12    10    12    72
Lag-C3             9.06   8.70     6.84     4.24   2.82     3.55      5.14   3.38     4.56                      5     5     7    69
Lag-C4             7.95  11.29     4.64     3.68   2.02     3.31      4.67   3.37     4.17                      1     2     3    66
Best-Selection     6.87   7.56     5.29     3.05   2.00     2.69      4.09   3.36     3.44                      2     1     1    72
Naïve RW          15.36  10.67    12.78    14.37  11.51    11.87     12.01   7.63    10.96                     16    19    19    72
Auto ETS          13.23  14.60     7.37    10.70  10.21     6.94     10.18   9.59     6.53                      7    14    14    72
SARIMA            13.32  14.24     8.80    13.69  14.91     9.54     12.81  13.33     9.52                     15    18    18    72

* Rank calculated by median sMAPE; † Number of time series on which the model was carried out.

IV. EXPERIMENTAL RESULTS

A. Analysis of Model-Classes

The outcome of the experimental setup is presented in TABLE III. The table shows the arithmetic mean (Mean), standard deviation (Std.Dev.) and median of the sMAPE, estimated across all time origins per series and then across series for the training, validation and generalization subsets, plus the rank of the final median sMAPE across all series. Reviewing the forecast accuracy of aggregate model selection, i.e. employing the same lag-structure for all time series, the best performing model-class is Lag-S4 with a median error of 4.10 in the generalization phase. The second best model is Lag-C4 with a median error of 4.17. The third to fifth places are taken by Lag-T4, Lag-T3 and Lag-D4 with median errors of 4.27, 4.29 and 4.44. This leads to the conclusion that especially those MLPs which span a wider range of equidistant lags outperform their competitors. However, no dominance of one lag-structure over another is found. To account for different trend types, functional forms and seasonality patterns, we conduct a quasi-out-of-sample model selection across lag-structures for each time series, selecting for each series the autoregressive input lag set with the lowest validation error. This best-selection outperforms all lag-structures under aggregate selection, with a median error of only 3.44 on the generalization data. The competitiveness of the best-selection is presented in Fig. 2, in which the time series are sorted by the median error that results when the best MLP of each lag-structure is selected. Beside the box plots, the figure shows the sMAPE of the best-selection. The average rank of the selected models calculated over the entire set of time series is 4.58, and the median rank of the selected models is 3.
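The per-series best-selection can be summarised in a few lines; the sketch below is our illustration with a hypothetical results structure, not the paper's code.

```python
# Sketch of the quasi-out-of-sample best-selection (our illustration; the
# nested dict `results` is a hypothetical structure, not the paper's format).
# results[series_id][lag_structure] holds the validation and generalization
# sMAPE of the best MLP found for that lag-structure on that series.
import numpy as np

def best_selection(results):
    chosen, gen_errors = {}, []
    for series_id, by_lag in results.items():
        # pick the lag-structure with the lowest validation error for this series
        best_lag = min(by_lag, key=lambda lag: by_lag[lag]["val_smape"])
        chosen[series_id] = best_lag
        gen_errors.append(by_lag[best_lag]["gen_smape"])
    return chosen, float(np.median(gen_errors))

# toy usage with two series and three lag-structures
results = {
    "ts3":  {"Lag-S4": {"val_smape": 3.2, "gen_smape": 4.0},
             "Lag-C4": {"val_smape": 2.9, "gen_smape": 4.3},
             "Lag-T4": {"val_smape": 3.5, "gen_smape": 4.1}},
    "ts21": {"Lag-S4": {"val_smape": 4.8, "gen_smape": 5.2},
             "Lag-C4": {"val_smape": 3.1, "gen_smape": 3.6},
             "Lag-T4": {"val_smape": 3.0, "gen_smape": 3.9}},
}
chosen, median_gen = best_selection(results)
```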


Fig. 2: Error distribution of the best model candidate within each model-class per time series, sorted by median error.

Furthermore, TABLE III shows the training errors for every model-class as well as the training error of the best-selection. The median training error of the best-selection of 5.29 seems quite high in comparison to the validation and generalization errors, which indicates overfitting. In order to assess this hypothesis, we additionally calculated the median sMAPE excluding time series with structural breaks. The median training error decreases to 4.82, whereas the median validation error stays almost constant at 2.71, as does the median generalization error with a value of 3.42. Therefore, we conclude that overfitting is an issue for time series with structural breaks rather than an issue of our approach in general.

In order to evaluate our approach, we forecast the set of time series with reference methods, namely a naïve random walk (RW), exponential smoothing and ARIMA; a minimal sketch of such reference forecasts is given after this paragraph. The results of these reference experiments are also presented in TABLE III. The median sMAPE of the naïve RW during generalization is 6.56, which results in the last place. Furthermore, we define exponential smoothing models with automatic parameter optimisation for level, damped trend, and damped trend with weak as well as moderate seasonality, using three initialisations per structure type; we then select the best exponential smoothing model for each time series, which results in a median error of 6.53. Finally, the median error of a best-selection between ARIMA and SARIMA with automatic parameter identification is calculated; the outcome is a sMAPE of 9.52, equal to place eighteen. The performance of the reference methods is also depicted in Fig. 3. An analysis of how often each lag-structure is represented in the best-selection underlines our finding that the selection differs individually per series. TABLE IV discloses, for instance, that lag-set 1 (only one support point) is not selected once. This is to be expected for time series containing seasonality. That these methods also perform worse in terms of trend-functions leads again to the conclusion that the adoption of complex structures requires spanning a wider range of equidistant lags. This is supported by the accumulation of selections in lag-set 4. However, it is surprising that although Lag-S4 leads in terms of median sMAPE calculated over all time series, it is selected barely 3 times, whereas Lag-D4, Lag-T4 and in particular Lag-C4 perform best on 11, 15 and 25 time series respectively.
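For the flavour of the reference forecasts, a minimal sketch follows. It is our illustration, not the paper's exact configurations; statsmodels >= 0.12 is assumed for the damped_trend argument, and the toy series is an assumption.

```python
# Sketch of simple reference forecasts (our illustration, not the paper's exact
# setup). Assumption: statsmodels >= 0.12 for the damped_trend argument.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(2)
t = np.arange(108, dtype=float)
y = 50 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=2, size=t.size)
train, test = y[:-12], y[-12:]

rw_forecast = np.repeat(train[-1], 12)          # naive random walk: repeat last value
ets_forecast = ExponentialSmoothing(train, trend="add", damped_trend=True,
                                    seasonal="add", seasonal_periods=12).fit().forecast(12)

smape = lambda f, a: 200.0 * np.mean(np.abs(f - a) / (np.abs(f) + np.abs(a)))
print(smape(rw_forecast, test), smape(ets_forecast, test))
```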

Fig. 3: Box-plots of sMAPEs for aggregate selection and individual selection of autoregressive input lags across the whole set of time series.


Fig. 4: Forecasting results for exemplary time series including the twelve-step horizon and the forecast confidence; last row: Lag-C4 utilised on a sigmoid trend-function without seasonality (time series 32, left) and with seasonality (time series 21, right); original time series (blue), 12-step forecasts (red), forecast confidence.



Fig. 5: Box-plots of sMAPE in pct. dependent on the number of hidden neurons for Lag-S4, Lag-C4, Lag-T3, Lag-D2 calculated over the complete set of time series after excluding non-converged models.

TABLE IV. NUMBER OF OCCURRENCES IN BEST INDIVIDUAL SELECTION

Lag-Structure   Lag-Set 1   Lag-Set 2   Lag-Set 3   Lag-Set 4
Single (S)          -           3           1           3
Double (D)          -           5           2          11
Triple (T)          -           -           2          15
Complete (C)        -           -           5          25

B. Further Experimental Examples

TABLE V presents the performance of the best-selection depending on the time series pattern. The table contains the median sMAPE achieved by the best-selection for the different time series patterns during the generalization period. The median sMAPE differs according to whether time series with or without seasonality are considered. The results generally behave as might be expected. For example, it seems intuitive that level time series are less complex and therefore less challenging to forecast than linear trend-functions, which in turn are easier to forecast than exponential and sigmoid trends. This assumption is supported by a median sMAPE of 1.69 for linear trend-functions and of 3.41 and 5.81 for exponential and sigmoid trends. That a higher error occurs for the no-trend time series is certainly explainable by the small number of time series within this group. Furthermore, time series without seasonality tend to be predicted better by MLPs than time series with seasonality, as indicated by median errors of 4.07 vs. 5.83, 1.69 vs. 4.15 and 3.41 vs. 5.68 for no-trend, linear-trend and exponential-trend series without structural breaks, and 3.59 vs. 5.10 for linear trends with structural breaks. However, the error related to the sigmoid trend-function tends to be lower for time series with seasonality. Further research is needed to prove or disprove these conclusions; we assume that this behaviour results from the individual noise character of the considered time series in combination with a partially poor adoption of the trend curve. An in-depth analysis discloses that the sigmoid functions are susceptible to overfitting, because the beginning and the end of the temporal course resemble a function without trend. If there is additional noise, the slope cannot be reproduced properly.

TABLE V. ERROR RESULTS BY TREND STRUCTURES

                       Validation Median sMAPE [pct.]
                       No Structural Breaks                  Some Structural Breaks
Time Series Pattern    No Seasonality   Some Seasonality     No Seasonality   Some Seasonality
No Trend                   4.07              5.83                0.04               -
Linear Trend               1.69              4.15                3.59              5.10
Exponential Trend          3.41              5.68                 -                 -
Damped Trend                -                 -                   -                 -
Sigmoid Trend              5.81              3.63                 -                 -

Fig. 4 presents, inter alia, this behaviour for a Lag-C4 structure with seven hidden neurons (bottom left) and with six hidden neurons (bottom right), selected as best models for time series 32 and time series 21. The plots contain the twelve-step horizon as well as the 50, 75 and 95 percent quantiles. We assume that the high outliers of time series 32 confuse the MLP in terms of adopting the temporal course in the first two thirds. However, although the training error is high, the saturation phase, in which the trend-function is constant again, is properly adopted. In contrast, in this case the consistent seasonality of time series 21 enables learning of the underlying sigmoid trend-function, resulting in significantly narrower confidence intervals. The analysis of the effect of different numbers of hidden neurons is presented in Fig. 5. As assumed in the theoretical section, the sMAPE tends to increase with the number of hidden neurons.

V. CONCLUSION AND FUTURE RESEARCH

Although almost every company has to deal in some way with forecasting, there is still a lack of use of sophisticated statistical forecasting methods. The research community contributes to this status by not conveying a clear picture of the potential of different methods. In this paper, we evaluate several lag-structures designed as support points on the set of 72 time series presented by the International Time Series Forecasting Competition at WCCI 2016. In general, we found evidence that MLPs are capable of forecasting time series showing several different patterns, such as seasonality, deterministic trends and structural breaks, with an adequate error level in comparison to reference methods, e.g. SARIMA or Auto ETS. Despite utilising similar kinds of lag-structures, we refute the findings of Zhang and Qi [6], who state that MLPs are not able to forecast seasonality and trends on non-preprocessed data, by setting up a systematic experiment. Furthermore, we conclude from our experiments that utilising only one type of lag-structure, even if the considered lag-structure leads to the best results in comparison to its competitors, can be outperformed by a best-selection from the complete set. We showed that a selection based on the minimal validation sMAPE significantly improves forecast accuracy. In addition, the experiments disclose that the performance of MLPs differs with the time series pattern. The results mostly follow, qualitatively, the intuitive assumptions about the connection between structural complexity and achievable forecast accuracy. However, we showed that overfitting is an issue for sigmoid trend-functions and structural breaks. Further research effort is needed to depict a clearer picture of how model selection can take place in order to be optimised in complex structural environments.


VI. REFERENCES

[1] S. Hylleberg, "General introduction," in Modelling Seasonality: Advanced Texts in Econometrics, S. Hylleberg, Ed., Oxford: Oxford University Press, 1992, pp. 3–14.
[2] J. A. Miron, The Economics of Seasonal Cycles. Cambridge: The MIT Press, 1996.
[3] Z. Tang, C. d. Almeida, and P. A. Fishwick, "Time series forecasting using neural networks vs. Box-Jenkins methodology," Simulation, vol. 57, no. 5, pp. 303–310, 1991.
[4] T. Hill, M. O'Connor, and W. Remus, "Neural Network Models for Time Series Forecasts," Management Science, vol. 42, no. 7, pp. 1082–1092, 1996.
[5] S. F. Crone and N. Kourentzes, "Naive Support Vector Regression and Multilayer Perceptron Benchmarks for the 2010 Neural Network Grand Competition (NNGC) on Time Series Prediction," The 2010 International Joint Conference on Neural Networks (IJCNN), 2010.
[6] G. Zhang and M. Qi, "Neural network forecasting for seasonal and trend time series," European Journal of Operational Research, vol. 160, no. 2, pp. 501–514, 2005.
[7] M. Qi and G. P. Zhang, "Trend time-series modeling and forecasting with neural networks," IEEE Transactions on Neural Networks, vol. 19, no. 5, pp. 808–816, 2008.
[8] M. R. Cogollo and J. D. Velásquez, "Are Neural Networks Able To Forecast Nonlinear Time Series With Moving Average Components?," IEEE Latin America Transactions, vol. 13, no. 7, 2015.
[9] K. Hornik, M. Stinchcombe, and H. White, "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, vol. 2, pp. 359–366, 1989.
[10] S. F. Crone and R. Dhawan, "Forecasting Seasonal Time Series with Neural Networks: A Sensitivity Analysis of Architecture Parameters," Proceedings of the International Joint Conference on Neural Networks, pp. 2099–2104, 2007.
[11] P. H. Franses and G. Draisma, "Recognizing changing seasonal patterns using artificial neural networks," Journal of Econometrics, vol. 81, no. 1, pp. 273–280, 1997.
[12] M. Stepnicka and M. Burda, International Time Series Forecasting Competition: Computational Intelligence in Forecasting. Available: http://irafm.osu.cz/cif/main.php (2016, Jan. 10).
[13] S. F. Crone, J. Guajardo, and R. Weber, "The Impact of Preprocessing on Support Vector Regression and Neural Networks in Time Series Prediction," World Congress on Computational Intelligence, WCCI'06, 2006.
[14] D. N. Bao, N. Vy, and D. T. Anh, "A Hybrid Method for Forecasting Trend and Seasonal Time Series," IEEE RIVF International Conference on Computing & Communication Technologies - Research, Innovation and Vision for the Future, 2013.
[15] M. Ghiassi, H. Saidane, and D. K. Zimbra, "A dynamic artificial neural network model for forecasting time series events," International Journal of Forecasting, vol. 21, no. 2, pp. 341–362, 2005.
[16] S. G. Makridakis, A. Andersen, R. Carbone, R. Fildes, M. Hibon, R. Lewandowski, et al., "The accuracy of extrapolation (time-series) methods - results of a forecasting competition," Journal of Forecasting, vol. 1, pp. 111–153, 1982.
[17] S. F. Crone, M. Hibon, and K. Nikolopoulos, "Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction," International Journal of Forecasting, vol. 27, no. 3, pp. 635–660, 2011.
[18] C. C. Pegels, "Exponential forecasting: Some new variations," Management Science, vol. 15, pp. 311–315, 1969.
[19] E. S. Gardner, "Exponential Smoothing: The State of the Art," Journal of Forecasting, vol. 4, no. 1, pp. 1–28, 1985.
[20] D. S. Levine, Introduction to Neural and Cognitive Modeling, 2nd ed. Mahwah, New Jersey: Lawrence Erlbaum Associates, 2000.
[21] S. F. Crone, "Stepwise selection of artificial neural network models for time series prediction," Journal of Intelligent Systems, vol. 14, no. 2-3, pp. 99–122, 2005.
[22] N. Kourentzes, D. K. Barrow, and S. F. Crone, "Neural Network Ensemble Operators for Time Series Forecasting," Expert Systems with Applications, vol. 41, no. 9, pp. 4235–4244, 2014.
