
Int. J. Big Data Intelligence, Vol. 3, No. 1, 2016

Computer network traffic prediction: a comparison between traditional and deep learning neural networks

Tiago Prado Oliveira*, Jamil Salem Barbar and Alexsandro Santos Soares
Federal University of Uberlândia (UFU), Faculty of Computer Science (FACOM), Uberlândia, MG, Brazil
Email: tiago [email protected]
Email: [email protected]
Email: [email protected]
*Corresponding author

Abstract: This paper compares four different artificial neural network approaches for computer network traffic forecasting: 1) multilayer perceptron (MLP) using backpropagation as the training algorithm; 2) MLP with resilient backpropagation (Rprop); 3) recurrent neural network (RNN); 4) deep learning stacked autoencoder (SAE). The computer network traffic is sampled from the traffic of network devices that are connected to the internet. It is shown herein how simpler neural network models, such as the RNN and MLP, can work even better than a more complex model, such as the SAE. Internet traffic prediction is an important task for many applications, such as adaptive applications, congestion control, admission control, anomaly detection and bandwidth allocation. In addition, efficient methods of resource management, such as bandwidth management, can be used to gain performance and reduce costs, improving the quality of service (QoS). The popularity of the newest deep learning methods has been increasing in several areas, but there is a lack of studies concerning time series prediction, such as internet traffic.

Keywords: deep learning; internet traffic; neural network; prediction; stacked autoencoder; SAE; time series.

Reference to this paper should be made as follows: Oliveira, T.P., Barbar, J.S. and Soares, A.S. (2016) 'Computer network traffic prediction: a comparison between traditional and deep learning neural networks', Int. J. Big Data Intelligence, Vol. 3, No. 1, pp.28–37.

Biographical notes: Tiago Prado Oliveira graduated in Computer Science from the Federal University of Uberlândia (UFU), Uberlândia, Brazil in 2012. He received his Masters in Computer Science from UFU in 2014. He has experience in computer science in the following areas: computer systems, computer network management, artificial intelligence, artificial neural networks, deep learning and data prediction.

Jamil Salem Barbar graduated in Electrical Engineering from the Federal University of Uberlândia in 1985. He received his MSc in Electronics and Computer Engineering from the Instituto Tecnológico de Aeronáutica (ITA) in 1990 and his PhD in Electrical and Computer Engineering from ITA in 1998. Currently, he is a Professor at UFU. He has experience in computer science, acting on the following subjects: computer networks, QoS, wavelets, multimedia systems, computer forensics and peer-to-peer. He is a member of the group of the subject area informatics of the project Tuning Latin America, which consists of 12 countries of Latin America.

Alexsandro Santos Soares graduated in Electrical Engineering from the Federal University of Uberlândia in 1996. He received his Masters in Computer Science and Computational Mathematics from the University of São Paulo (USP) in 2001 and his Doctorate in Electrical Engineering from UFU in 2007. Currently, he is a Professor at UFU. He has experience in artificial intelligence, mainly in the following areas: natural computing, bioinformatics, artificial neural networks and evolutionary computation.

Copyright © 2016 Inderscience Enterprises Ltd.


This paper is a revised and expanded version of a paper entitled 'Multilayer perceptron and stacked autoencoder for internet traffic prediction' presented at the 11th IFIP International Conference on Network and Parallel Computing (NPC 2014), Ilan, Taiwan, September 18–20, 2014.

1 Introduction

Using past observations to predict future network traffic is an important step towards understanding and controlling a computer network. Computer network traffic prediction can be crucial to network providers and to computer network management in general. It is of significant interest in several domains, such as adaptive applications, congestion control, admission control and bandwidth allocation. There are many studies that focus on adaptive and dynamic applications. They usually present algorithms that use the traffic load to dynamically adapt the bandwidth of a certain network component (Han, 2014; Zhao et al., 2012; Liang and Han, 2007) and improve the quality of service (QoS) (Nguyen et al., 2009). Several works have been developed using artificial neural networks (ANNs) and they have shown that ANNs are a competitive model, overcoming classical regression methods such as the autoregressive integrated moving average (ARIMA) (Cortez et al., 2012; Hallas and Dorffner, 1998; Ding et al., 1995; Feng and Shu, 2005). Thus, there are works that combine these two factors, producing a predictive neural network that dynamically allocates bandwidth for real-time video streams (Liang and Han, 2007).

Network traffic is a time series, which is a sequence of data regularly measured at uniform time intervals. For network traffic, these sequential data are the bits transmitted by some network device at a certain period in time. A time series can be a stochastic process or a deterministic one. To predict a time series, it is necessary to use mathematical models that truly represent the statistical characteristics of the sampled traffic. For adaptive applications that require real-time processing, the choice of the prediction method must take into account the prediction horizon, computational cost, prediction error and response time.

This paper analyses four prediction methods that are based on ANNs. Evaluations were made comparing the multilayer perceptron (MLP), the recurrent neural network (RNN) and the stacked autoencoder (SAE). The MLP is a feed-forward neural network with multiple layers that uses supervised training. The SAE is a deep learning neural network that uses a greedy algorithm for unsupervised training. For the MLP, two different training algorithms were compared: the standard backpropagation and the resilient backpropagation (Rprop). These models were selected with the objective of confirming how competitive the simpler approaches (RNN and MLP) are compared to the more complex ones (SAE and deeper MLP). The analysis focuses on short-term and real-time prediction, and the tests were made using samples of internet traffic time series, which were obtained from the DataMarket database (http://datamarket.com).

2 Artificial neural networks

ANNs are simple processing structures, which are separated into strongly connected units called artificial neurons (nodes). Neurons are organised into layers; one layer has multiple neurons, and a neural network can have one or more layers, which are defined by the network topology and vary among different network models (Haykin, 1998). Neurons are capable of working in parallel to process data, store experimental knowledge and use this knowledge to infer new data. Each neuron has a synaptic weight, which is responsible for storing the acquired knowledge. Network knowledge is acquired through learning processes (learning algorithm or network training) (Haykin, 1998). In the learning process, the neural network is trained to recognise and differentiate the data from a finite set. After learning, the ANN is ready to recognise the patterns in a time series, for example. During the learning process the synaptic weights are modified in an ordered manner until they reach the desired learning. A neural network offers functionality analogous to neurons in a human brain for resolving complex problems, such as nonlinearity, high parallelism, robustness, fault and noise tolerance, adaptability, learning and generalisation (Cortez et al., 2012; Haykin, 1998).

Historically, the use of neural networks was limited in relation to the number of hidden layers. Neural networks made up of many layers were not used due to the difficulty in training them (Bengio, 2009). However, in 2006, Hinton presented the deep belief networks (DBN), with an efficient training method based on a greedy learning algorithm that trains one layer at a time (Hinton et al., 2006). Since then, studies have reported several good results regarding the use of deep learning neural networks. Based on these findings, this study has as its objective to use the deep learning concept in traffic prediction.

Deep learning refers to a machine learning method that is based on a neural network model with multiple levels of data representation. Hierarchical levels of representation are organised by abstractions, features or concepts. The higher levels are defined by the lower levels, where the representation of the low levels may define several different features of the high levels; this makes the data representation more abstract and nonlinear for the higher levels (Bengio, 2009; Hinton et al., 2006). These hierarchical levels are represented by the layers of the ANN. Deep learning ANNs allow a significant complexity to be added to the prediction model. This complexity is proportional to the number of layers that the neural network has; the abstraction of the features is more complex for deeper layers. In this way, the neural network depth concerns the number of composition levels of nonlinear operations learned from the training data, i.e., the more layers, the more nonlinear and the deeper the ANN.

The main difficulty in using deep neural networks relates to the training phase. Conventional algorithms, like backpropagation, do not perform well when the neural network has more than three hidden layers (Erhan et al., 2009). Furthermore, these conventional algorithms do not take advantage of the extra layers and do not distinguish the data characteristics hierarchically, i.e., a neural network with many layers does not obtain a better result than a neural network with few layers, e.g., a shallow neural network with two or three layers (de Villiers and Barnard, 1993; Hornik et al., 1989).

3 Review of literature

Several types of ANN have been studied for network traffic prediction. There are several studies on feed-forward neural networks, such as the MLP (Oliveira et al., 2014; Cortez et al., 2012; Ding et al., 1995), but many studies focus on RNNs (Hallas and Dorffner, 1998) because of their internal memory cycles, which facilitate learning temporal and sequential dynamical behaviour and make them a good model for time series learning.

An advantage of ANNs is the response time, i.e., how fast the prediction of future values is made. After the learning process, which is the slowest step in the use of an ANN, the neural network is ready for use, obtaining results very quickly compared to other, more complex prediction models such as ARIMA (Cortez et al., 2012; Feng and Shu, 2005). Therefore, ANNs are very good for online prediction, obtaining satisfactory results regarding both prediction accuracy and response time (Oliveira et al., 2014; Cortez et al., 2012).

3.1 MLP and backpropagation

One of the most common architectures for neural networks is the MLP. This kind of ANN has one input layer, one or more hidden layers, and an output layer. Best practice suggests one or two hidden layers (de Villiers and Barnard, 1993). This is due to the fact that the same result can be obtained by raising the number of neurons in the hidden layer, rather than by increasing the number of hidden layers (Hornik et al., 1989). MLPs are feed-forward networks, where all neurons in the same layer are connected to all neurons of the next layer, yet the neurons in the same layer are not connected to each other. It is called feed-forward because the flow of information goes from the input layer to the output layer.

The training algorithm used for the MLP is backpropagation, which is a supervised learning algorithm, where the MLP learns a desired output from various input data. Backpropagation usually suffers from a problem with the magnitude of the partial derivatives, which can become too large or too small. This is a problem because the learning process can go through many fluctuations, slowing the convergence or making the network become stuck in a local minimum. To help avoid this problem, the Rprop was created, which has a dynamic learning rate, i.e., it updates the learning rate for every neuron connection, reducing the error for each neuron separately.
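The paper does not reproduce the Rprop update rule; for reference, the standard formulation adapts a per-connection step size Δ_ij from the sign of successive gradients (the factors η+ = 1.2 and η- = 0.5 below are the usual defaults, not values reported by the authors):

\Delta_{ij}^{(t)} =
\begin{cases}
\eta^{+}\,\Delta_{ij}^{(t-1)} & \text{if } \tfrac{\partial E}{\partial w_{ij}}^{(t)} \cdot \tfrac{\partial E}{\partial w_{ij}}^{(t-1)} > 0\\
\eta^{-}\,\Delta_{ij}^{(t-1)} & \text{if } \tfrac{\partial E}{\partial w_{ij}}^{(t)} \cdot \tfrac{\partial E}{\partial w_{ij}}^{(t-1)} < 0\\
\Delta_{ij}^{(t-1)} & \text{otherwise,}
\end{cases}
\qquad
\Delta w_{ij}^{(t)} = -\operatorname{sign}\!\left(\tfrac{\partial E}{\partial w_{ij}}^{(t)}\right)\Delta_{ij}^{(t)},
\quad 0 < \eta^{-} < 1 < \eta^{+}.

Because the step size depends only on the sign history of the gradient and not on its magnitude, the oscillation problem described above is largely avoided.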

3.2 Recurrent neural network

RNNs are neural networks that have one or more connections between neurons that form cycles. These cycles are responsible for storing and passing the feedback of one neuron to another, creating an internal memory that facilitates learning of sequential data (Hallas and Dorffner, 1998; Haykin, 1998). The cycles can be used anywhere in the neural network and in any direction, e.g., there can be a delayed feedback from the output to the input layer, a feedback loop from one hidden layer to another layer or to the same layer, and any combination of these (Haykin, 1998).

One of the objectives of this work is to compare the simple neural network models with the more complex ones. For that purpose the Jordan neural network (JNN) was chosen, since it is a simple recurrent network (SRN) and is usually used for prediction. The JNN has a context layer that holds the previous output from the output layer; this context layer is responsible for receiving the feedback from the previous iteration and transmitting it to the hidden layer, allowing a simple short-term memory (Hallas and Dorffner, 1998).
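Written as equations, the context-layer feedback described above amounts to the usual Jordan formulation (a hedged sketch; the weight-matrix names are chosen here for illustration and σ denotes the sigmoid activation):

c_t = y_{t-1}, \qquad
h_t = \sigma(W_{xh}\,x_t + W_{ch}\,c_t + b_h), \qquad
y_t = \sigma(W_{hy}\,h_t + b_y),

where x_t is the input vector, h_t the hidden layer, c_t the context layer holding the previous output, and y_t the predicted value at time t.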

3.3 SAE for deep learning

The SAE is a deep learning neural network built with multiple layers of sparse autoencoders, in which the output of each layer is connected to the input of the next layer. SAE learning is based on greedy layer-wise unsupervised training, which trains each layer independently (Vincent et al., 2008; Ufldl, 2013; Bengio et al., 2007). The SAE uses all the benefits of a deep neural network and has a high classification power. Therefore, an SAE can learn useful data concerning hierarchical grouping and part-whole decomposition of the input (Ufldl, 2013).

The key idea behind deep neural networks is that multiple layers represent various abstraction levels of the data. Consequently, deeper networks have superior learning compared to a shallow neural network (Bengio, 2009). The strength of deep learning is based on the representations learned by the greedy layer-wise unsupervised training algorithm. Furthermore, after a good data representation in each layer is found, the acquired neural network can be used to initialise a new ANN with new synaptic weights. This newly initialised neural network can be an MLP, e.g., to start a supervised training if necessary (Bengio, 2009). Many papers emphasise the benefits of the greedy layer-wise unsupervised training for deep network initialisation (Bengio, 2009; Hinton et al., 2006; Bengio et al., 2007; Larochelle et al., 2009; Ranzato et al., 2007). Therefore, one of the goals of this paper is to verify whether the unsupervised training of deep learning actually brings advantages over the simpler ANN models.

4 Experiments and results

The time series used were compiled by R.J. Hyndman and are available under the 'Time Series Data Library' at the DataMarket web portal (http://data.is/TSDLdemo). The experiments were performed on data collected daily, hourly and at five minute intervals. Altogether, six time series were used: 'A-1d', 'A-1h', 'A-5m', 'B-1d', 'B-1h' and 'B-5m'.

The 'A' time series are composed of internet traffic (in bits) from a private internet service provider (ISP) with centres in 11 European cities. The data corresponds to a transatlantic link and was collected from 06:57 hours on 7 June 2005 to 11:17 hours on 31 July 2005. This series was collected at different intervals, resulting in three different time series: 'A-1d' is a time series with daily data; 'A-1h' is hourly data; and 'A-5m' contains data collected every five minutes. The remaining time series are composed of internet traffic from an ISP, collected in an academic network backbone in the UK. They were collected from 19 November 2004, at 09:30 hours, to 27 January 2005, at 11:11 hours. In the same way, this series was divided into three different time series: 'B-1d' is daily data; 'B-1h' is hourly data; and 'B-5m' contains data collected at five minute intervals.

The conducted experiments used the DeepLearn Toolbox (Palm, 2014) and the Encog machine learning framework (Encog, 2014). Both are open source libraries that cover several machine learning and artificial intelligence techniques. They were chosen because they are widespread and used in research, they supply our needs (MLP, RNN and SAE), they are open source and they are easy to access and use. The DeepLearn Toolbox is a MATLAB library set that covers a variety of deep learning techniques such as ANN, DBN, convolutional neural networks (CNN), convolutional autoencoders (CAE) and SAE. Encog is a machine learning framework that supports several algorithms, like genetic algorithms, Bayesian networks, hidden Markov models and neural networks. For Encog, the Java 3.2.0 release was used, but it is also available for .Net, C and C++.

The Encog framework was used for prediction and training of the multilayer perceptron with backpropagation (MLP-BP), the MLP with resilient backpropagation (MLP-RP) and the RNN. The SAE was developed using MATLAB's DeepLearn Toolbox. The experiments were carried out on a Dell Vostro 3550: Intel Core i5-2430M processor with a clock rate of 2.40 GHz and 3 MB cache; 6 GB DDR3 RAM; Windows 7 Home Basic 64-bit operating system. The installed Java platform is the Enterprise Edition with JDK 7 and the MATLAB version is R2013b.


4.1 Data normalisation

Before training the neural network, it is important to normalise the data (Feng and Shu, 2005), in this case the time series. Hence, to decrease the time series scale, min-max normalisation was used to limit the data to the interval [0.1, 1]. This interval was set so that the prediction output values stay inside the range of the sigmoid activation function used by the neural network. Moreover, 0.1 was selected as the minimum value instead of 0, to avoid a division by zero when calculating the normalised errors. The original time series is normalised, generating a new normalised time series, which is used for the training.

The size of the six time series used varies from 51 values (for the smallest time series, with daily data) to 19,888 values (for the largest time series, with data collected at five minute intervals). During the experiments the data range for the training set varied greatly, from 25 values (for the smallest time series) to 9,944 values (for the largest time series). The training set was chosen as the first half of the time series; the other half of the time series is the test set for evaluating the prediction accuracy. The size of each dataset can be seen in Table 1.

Table 1 The time interval and size of each time series

Dataset   Time interval   Time series total size   Training set size
A-1d      1 day           51                       25
A-1h      1 h             1,231                    615
A-5m      5 min           14,772                   7,386
B-1d      1 day           69                       34
B-1h      1 h             1,657                    828
B-5m      5 min           19,888                   9,944
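As an illustration of the normalisation step described above, the following is a minimal, self-contained Java sketch (not taken from the authors' code; class and variable names are assumptions) that rescales a series into [0.1, 1] with min-max normalisation:

// Minimal sketch (not from the paper): min-max normalisation of a time series
// into the interval [0.1, 1], as described in Section 4.1.
public final class MinMaxNormaliser {

    /** Scales the series linearly so that its minimum maps to lo and its maximum maps to hi. */
    public static double[] normalise(double[] series, double lo, double hi) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : series) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[series.length];
        for (int i = 0; i < series.length; i++) {
            out[i] = lo + (series[i] - min) * (hi - lo) / (max - min);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] traffic = {1.2e9, 3.4e9, 2.8e9, 5.1e9};   // bits transmitted (example values)
        double[] scaled = normalise(traffic, 0.1, 1.0);     // lower bound 0.1 avoids division by zero later
        for (double v : scaled) System.out.println(v);
    }
}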

4.2 Neural network architecture and topology

For the MLP-BP, a sigmoid activation function was used, with a learning rate of 0.01 and backpropagation as the training algorithm. A higher learning rate accelerates the training, but may generate many oscillations in it, making it harder to reach a low error. On the other hand, a lower learning rate leads to steadier training, but is much slower. Different values for the learning rate were also tested, such as 0.5, 0.25 and 0.1, yet as expected, the lowest errors were obtained using 0.01 as the learning rate. For the deep learning neural network the SAE was used, also with a sigmoid activation function and a learning rate of 0.01. For the MLP-RP and the RNN, a sigmoid activation function was also used. The training algorithm of the RNN was Rprop and the topology was that of a JNN, with one context layer responsible for storing previous iterations, allowing a short-term memory.

The training algorithm of the SAE is a greedy algorithm that gives similar weights to similar inputs. Each autoencoder is trained separately, in a greedy fashion, and then stacked onto those already trained, thereby producing an SAE with multiple layers. The SAE has an unsupervised training algorithm, i.e., it does not train the data considering an expected output. Thus, the SAE is used as a pre-training stage to initialise the neural network weights. After that, the fine-tuning step is initiated, which used the backpropagation algorithm for supervised training (Bengio, 2009; Erhan et al., 2009; Ufldl, 2013; Bengio et al., 2007).

Several tests were carried out varying the ANN topology, both in the number of neurons per layer and in the number of layers. The tests consisted of creating several neural networks with different topologies and training each one of them. The number of neurons was varied in steps of five for the input layer and five for the hidden layers, with one output neuron for the prediction data. Additionally, the number of hidden layers ranged over [2, 5] for the MLP-BP and MLP-RP and [3, 7] for the SAE, with just one hidden layer for the RNN plus one context layer. The tests were stopped when the validation error began to decrease. All result comparisons were made according to the validation errors for each analysed ANN type (MLP-BP, MLP-RP, RNN and SAE).

For the MLP-BP and MLP-RP, the best performances were obtained with four layers, around 15 input neurons, one output neuron, and 45 and 35 neurons in the hidden layers, respectively, as shown in Figure 1. It was found that increasing the number of neurons or the number of layers did not result in better performance; to some extent the average normalised root mean square error (NRMSE) was similar for the same time series.

Figure 1 MLP architecture showing the layers, number of neurons of each layer and the information feed-forward flow
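The paper reports the neuron counts but not how the series is presented to the network; a common windowing scheme that is consistent with those counts (assumed here, not taken from the paper) feeds the last w observations into the w input neurons to predict the next value:

// Hedged sketch: building (input, target) pairs from the normalised series with a
// sliding window, so that `window` past values (the input neurons) predict the
// next value (the single output neuron).
public final class WindowedDataset {

    /** Returns {inputs, targets}: inputs is [samples][window], targets is [samples][1]. */
    public static double[][][] build(double[] series, int window) {
        int samples = series.length - window;
        double[][] inputs = new double[samples][window];
        double[][] targets = new double[samples][1];
        for (int i = 0; i < samples; i++) {
            System.arraycopy(series, i, inputs[i], 0, window); // last `window` observations
            targets[i][0] = series[i + window];                // value to be predicted
        }
        return new double[][][] { inputs, targets };
    }

    public static void main(String[] args) {
        double[] series = {0.1, 0.2, 0.4, 0.3, 0.5, 0.7};      // already-normalised example values
        double[][][] data = build(series, 3);
        System.out.println(java.util.Arrays.toString(data[0][0]) + " -> " + data[1][0][0]);
    }
}

With a window of 15, the first half of each normalised series would yield the training pairs used in the experiments below.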

The MLP-BP, MLP-RP and RNN had a similar average NRMSE, but the RNN had a lower average, even with fewer neurons. The best performances of the RNN were obtained with three layers (excluding the JNN's context layer), around ten input neurons, one output neuron and 45 neurons in the hidden layer, as shown in Figure 2. In the same way, increasing the number of neurons did not result in better performance; in fact, overly increasing the number of neurons was detrimental to performance. The results began to worsen from 40 neurons in the input layer and 120 neurons in the entire neural network.

Figure 2 JNN architecture showing the layers, number of neurons of each layer, the information flow and the context layer with the feedback from the output layer to the hidden layer

For the SAE, the best results were found with six layers, with 20 input neurons, one output neuron and 80, 60, 60 and 40 neurons in each of the hidden layers, respectively. Increasing the number of neurons in the layers of the SAE did not produce better results; on average the NRMSE was very similar. Similar results were also found with four layers, like the MLP, whereas deeper SAEs achieved slightly better results. A comparison of the NRMSE of each prediction model is shown in Table 3. The influence of the number of neurons on the error is better seen in Figure 3. It is noticed that the mean squared error (MSE) increased significantly beyond 120 neurons for the MLP-RP and RNN. For a smaller number of neurons, fewer than 120, the error did not change very much. Besides, the more neurons in the network, the harder and longer the training will be. For the SAE, the MSE was steadier even for deeper architectures.

4.3 Neural network training

The neural network training was carried out in a single batch. In this way, all input data of the training set are trained in a single training epoch, adjusting the weights of the neural network for the entire batch. Tests with more batches (less input data for each training epoch) were also performed and similar error rates were found. Nevertheless, for smaller batches the training took more time to converge, because a smaller amount of data is trained at each epoch. The MLP-BP, MLP-RP and RNN training lasted 1,000 epochs. The SAE training is separated into two steps. The first one is the unsupervised pre-training, which lasted 900 epochs. The second step is the fine-tuning, which uses supervised training and lasted 100 epochs.
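For concreteness, the following is a hedged sketch of how the MLP-RP described above could be assembled and trained with Encog 3.x in Java (the class names follow the commonly documented Encog 3 API; the topology and epoch count are those reported in Sections 4.2 and 4.3, while the data arrays here are placeholders):

import org.encog.engine.network.activation.ActivationSigmoid;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.layers.BasicLayer;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

public final class TrafficMlpRp {

    public static void main(String[] args) {
        // Windowed, normalised training data (see the earlier sketches); placeholders here.
        double[][] inputs = new double[100][15];
        double[][] targets = new double[100][1];

        // MLP-RP topology reported in Section 4.2: 15 input, 45 and 35 hidden, 1 output, sigmoid activations.
        BasicNetwork network = new BasicNetwork();
        network.addLayer(new BasicLayer(null, true, 15));                     // input layer
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 45));  // first hidden layer
        network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 35));  // second hidden layer
        network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));  // output layer
        network.getStructure().finalizeStructure();
        network.reset();

        // Single-batch training with Rprop for 1,000 epochs, as in Section 4.3.
        MLDataSet trainingSet = new BasicMLDataSet(inputs, targets);
        ResilientPropagation train = new ResilientPropagation(network, trainingSet);
        for (int epoch = 0; epoch < 1000; epoch++) {
            train.iteration();
        }
        train.finishTraining();
        System.out.println("Final training MSE: " + train.getError());
    }
}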

Figure 3 Proportion between the complexity of the neural network, measured in total number of neurons, and the respective MSE for the prediction of the B-5m time series

The training time is mainly affected by the size of the training set and by the number of neurons of the neural network. The larger the training set and the higher the number of neurons, the more time is necessary to train the neural network. Table 2 shows the average training time obtained for each time series and prediction model.

Table 2 Comparison of the average training time (milliseconds)

Dataset   MLP-BP    MLP-RP   RNN      SAE
A-1d      268       79       67       8,874
A-1h      2,373     1,482    1,473    585,761
A-5m      86,296    56,078   33,981   6,724,641
B-1d      558       326      108      17,280
B-1h      7,322     7,181    2,838    837,567
B-5m      117,652   78,078   45,968   8,691,876

The only difference between the MLP-BP and the MLP-RP is the training algorithm. The dynamic learning rate of Rprop makes it one of the fastest training algorithms for an ANN, and that is clear in our results: the MLP-RP was faster than the MLP-BP. The RNN with Rprop used fewer neurons and fewer layers, three layers in total, instead of four layers like the other methods. Thereby, the fastest training of all the neural network models was that of the RNN.

The initial 50 training epochs and errors are shown in Figure 4, which compares the fine-tuning training of the SAE with the MLP-BP training. It is possible to observe that, because of the SAE pre-training, the SAE training converges faster than the MLP-BP training. However, with more training epochs they obtain very similar error rates. Figure 5 shows the first 50 training epochs and errors for the MLP-RP and RNN; as both use Rprop as the training algorithm, the errors change drastically at the beginning of training, since there is no fixed learning rate. Due to this, the error keeps varying until it finds a better update value for the training.

Figures 6 and 7 show the time series prediction results for the MLP-RP and SAE, respectively. The fitted values for the MLP-BP and RNN were very similar to those of the MLP-RP; due to the low-scale image it would be difficult to see a difference between them, so only the MLP-RP is shown. It is noted that the MLP-RP, MLP-BP and RNN (see Figure 6) best fit the actual data; nevertheless, the SAE fared well in data generalisation. All the prediction models, the MLP-BP, MLP-RP, RNN and SAE, learned the time series features and used these features to predict data that are not known a priori. An important detail is that the RNN used fewer neurons and just one hidden layer. Therefore, the training phase of the RNN is faster than that of the MLP-BP, MLP-RP and SAE.

4.4 Main results

The key idea of deep learning is that the depth of the neural network allows learning complex and nonlinear data (Bengio, 2009). However, the use of the SAE for time series prediction was not beneficial, i.e., the pre-training did not bring significant benefits to prediction and the SAE was outperformed by the other methods. The results with the respective NRMSE are shown in Table 3. The NRMSE is the normalised version of the root of the MSE, that is:

e_t = y_t - \hat{y}_t    (1a)

MSE = \frac{1}{n} \sum_{i=1}^{n} e_i^2    (1b)

NRMSE = \frac{\sqrt{MSE}}{y_{max} - y_{min}}    (1c)

where e_t is the prediction error at time t; y_t is the actual value observed at time t; \hat{y}_t is the predicted value at time t; n is the number of predictions; y_{max} is the maximum observed value and y_{min} is the minimum observed value.
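Translated into code, equations (1a)-(1c) amount to the following short Java routine (a sketch written for this comparison, not the authors' implementation; array names are illustrative):

// Hedged sketch: the error measures of equations (1a)-(1c) in plain Java.
// `actual` holds the test-set values and `predicted` the network outputs.
public final class PredictionError {

    public static double nrmse(double[] actual, double[] predicted) {
        double sumSq = 0.0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (int t = 0; t < actual.length; t++) {
            double e = actual[t] - predicted[t];   // equation (1a)
            sumSq += e * e;
            min = Math.min(min, actual[t]);
            max = Math.max(max, actual[t]);
        }
        double mse = sumSq / actual.length;        // equation (1b)
        return Math.sqrt(mse) / (max - min);       // equation (1c)
    }
}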


Figure 4 A MSE comparison of the SAE and MLP-BP at each training epoch, for the B-5m time series

Figure 5 A MSE comparison of the MLP-RP and the RNN at each training epoch, for the B-5m time series

Figure 6 A comparison of the actual (the original) time series (represented in grey) and the predicted network traffic (represented in black) using the MLP-RP with two hidden layers, for the B-5m time series

Notes: Since the actual and the predicted plot lines are very similar, it is difficult to see the difference in a low-scale image. Yet, it is possible to see that the predicted values fit the actual values very well. Another observation is that the actual and predicted plots of network traffic for the MLP-BP and RNN were almost identical to those of the MLP-RP.

Figure 7 A prediction comparison of the SAE with four hidden layers, at each training epoch, for the B-5m time series

Notes: It shows the actual (the original) time series (represented in grey) and the predicted network traffic (represented in black) using the SAE. It is noted that the predicted values did not fit well in the period from 1 × 10^4 to 1.4 × 10^4, but for the rest of the series the predicted values fit the actual values well.

Table 3 Comparison of NRMSE results (normalised root mean squared error)

Dataset   MLP-BP    MLP-RP    RNN       SAE
A-1d      0.19985   0.20227   0.19724   0.36600
A-1h      0.05524   0.04145   0.04197   0.09399
A-5m      0.01939   0.01657   0.01649   0.02226
B-1d      0.12668   0.14606   0.11604   0.21552
B-1h      0.04793   0.02927   0.02704   0.06967
B-5m      0.01306   0.01008   0.00994   0.01949

In time series prediction, the SAE method has more complexity than the other ones, since it has the extra unsupervised training phase, which initialises the neural network weights for the fine-tuning stage. Even with the additional complexity, the SAE was inferior. Due to this fact, this approach is not recommended for time series prediction. There were no major differences between the MLP-BP and MLP-RP, but in general the MLP-RP had better results, both in accuracy and in training time.

Ultimately, comparing all the results in Table 3, the RNN approach produced the best results with the smallest NRMSE. Moreover, the RNN used fewer neurons and performed better than the other methods, making the training much faster, as shown in Table 2. In addition, Table 4 shows that the RNN can be up to 66.87% faster, and 8.4% more accurate, compared to the second best method. The comparison in Table 4 shows the percentage of how fast the RNN is, as well as the percentage of how accurate the RNN is, when comparing it to the second best prediction method in each category. For the training time, the RNN is compared with the MLP-RP, which obtained the second fastest training time (see Table 2). For the NRMSE, the MLP-RP method obtained the best (smallest) value for the 'A-1h' series and the RNN was the second best, which is why that percentage is negative. Still in the NRMSE comparison, the RNN was the best for the remaining series; the MLP-BP was the second best for the 'A-1d' and 'B-1d' series and the MLP-RP was the second best for the 'A-5m', 'B-1h' and 'B-5m' series (see Table 3).

Table 4 Percentage of how fast and accurate the RNN is, compared to the second best prediction method

Dataset   Training time   NRMSE
A-1d      15.19%          1.3%
A-1h      0.4%            -1.25%
A-5m      39.4%           0.48%
B-1d      66.87%          8.4%
B-1h      60.47%          7.62%
B-5m      41.12%          1.39%
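The paper does not state the formula behind Table 4 explicitly, but the reported percentages are consistent with the relative difference with respect to the second best method. For example, for the 'B-1d' series (RNN: 108 ms training time and 0.11604 NRMSE; second best: MLP-RP at 326 ms and MLP-BP at 0.12668):

\frac{326 - 108}{326} \approx 0.6687 = 66.87\%, \qquad
\frac{0.12668 - 0.11604}{0.12668} \approx 0.0840 = 8.4\%.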

There are works in the pattern recognition field where the use of autoencoders is advantageous (Ranzato et al., 2007), as they are based on unlabelled data. On the other hand, there are works in energy load forecasting showing that the autoencoder approach is worse than classical neural networks (Busseti et al., 2012), like the MLP and recurrent ANNs. Each of these problems has a method better suited to solving it, so it is important to analyse the input data type before choosing the most appropriate method to be used.


5 Conclusions

All types of ANN studied have proven that they are capable of adjusting to and predicting network traffic accurately. However, the initialisation of the neural network weights through the unsupervised pre-training did not bring an improvement for time series prediction. The results show that the MLP and RNN are better than the SAE for internet traffic prediction. In addition, the SAE deep neural network approach results in more computational complexity during the training, so the choice of the MLP or RNN is more advantageous.

In theory, of all the ANNs studied in this work, the best prediction method would be the RNN, since it is the one that uses previous observations as feedback in the learning of newer observations, facilitating the learning of temporal and sequential data. Accordingly, the experiments carried out showed that the best results, in both accuracy and computation time, were obtained with the JNN, an SRN. Therefore, of all the methods used, the best prediction method for short-term or real-time prediction is the RNN with Rprop as the training algorithm, as it obtained the smallest errors in significantly less time, as shown in Table 4.

The use and importance of deep neural networks is increasing and very good results have been achieved in image, audio and video pattern recognition (Larochelle et al., 2009; Ranzato et al., 2007; Arel et al., 2010; Chao et al., 2011). However, the main learning algorithms for this kind of neural network are unsupervised training algorithms, which use unlabelled data for their training. In contrast, network traffic and time series in general are labelled data, requiring an unsupervised pre-training before the actual supervised training as fine-tuning. Yet, as shown in Chao et al. (2011), the DBN and restricted Boltzmann machine (RBM), which are deep learning methods, can be modified to work better with labelled data, i.e., time series datasets.

Future works will focus on other deep learning techniques, like the DBN and continuous restricted Boltzmann machine (CRBM). Other future works will use the network traffic prediction to create an adaptive bandwidth management tool. This adaptive management tool will first focus on congestion control through dynamic bandwidth allocation, based on the predicted traffic. The goal is to ensure a better QoS and a fair share of bandwidth allocation for the network devices in a dynamic and adaptive management application.

References

Arel, I., Rose, D. and Karnowski, T. (2010) 'Deep machine learning – a new frontier in artificial intelligence research [research frontier]', Computational Intelligence Magazine, IEEE, Vol. 5, No. 4, pp.13–18.

Bengio, Y., Lamblin, P., Popovici, D. and Larochelle, H. (2007) 'Greedy layer-wise training of deep networks', in Advances in Neural Information Processing Systems 19 (NIPS '06), pp.153–160.

Bengio, Y. (2009) 'Learning deep architectures for AI', Found. Trends Mach. Learn., Vol. 2, No. 1, pp.1–127.

Busseti, E., Osband, I. and Wong, S. (2012) Deep Learning for Time Series Modeling, CS 229: Machine Learning, Stanford.

Chao, J., Shen, F. and Zhao, J. (2011) 'Forecasting exchange rate with deep belief networks', in Neural Networks (IJCNN), The International Joint Conference on, pp.1259–1266.

Cortez, P., Rio, M., Rocha, M. and Sousa, P. (2012) 'Multi-scale internet traffic forecasting using neural networks and time series methods', Expert Systems, Vol. 29, No. 2, pp.143–155.

de Villiers, J. and Barnard, E. (1993) 'Backpropagation neural nets with one and two hidden layers', Neural Networks, IEEE Transactions on, Vol. 4, No. 1, pp.136–141.

Ding, X., Canu, S., Denoeux, T., Rue, T. and Pernant, F. (1995) 'Neural network based models for forecasting', in Proceedings of ADT '95, pp.243–252, Wiley and Sons.

Encog (2014) 'Encog artificial intelligence framework for Java and DotNet' [online] http://www.heatonresearch.com/encog (accessed 3 November 2014).

Erhan, D., Manzagol, P-A., Bengio, Y., Bengio, S. and Vincent, P. (2009) 'The difficulty of training deep architectures and the effect of unsupervised pre-training', in 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pp.153–160.

Feng, H. and Shu, Y. (2005) 'Study on network traffic prediction techniques', in Wireless Communications, Networking and Mobile Computing, Proceedings, International Conference on, Vol. 2, pp.1041–1044.

Hallas, M. and Dorffner, G. (1998) 'A comparative study on feedforward and recurrent neural networks in time series prediction using gradient descent learning', in Proc. 14th European Meet. Cybernetics Systems Research, Vol. 2, pp.644–647.

Han, M-S. (2014) 'Dynamic bandwidth allocation with high utilization for XG-PON', in Advanced Communication Technology (ICACT), 16th International Conference on, pp.994–997.

Haykin, S. (1998) Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall PTR, Upper Saddle River, NJ, USA.

Hinton, G.E., Osindero, S. and Teh, Y-W. (2006) 'A fast learning algorithm for deep belief nets', Neural Comput., Vol. 18, No. 7, pp.1527–1554.

Hornik, K., Stinchcombe, M. and White, H. (1989) 'Multilayer feedforward networks are universal approximators', Neural Netw., Vol. 2, No. 5, pp.359–366.

Larochelle, H., Erhan, D. and Vincent, P. (2009) 'Deep learning using robust interdependent codes', in D.V. Dyk and M. Welling (Eds.): Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 5, pp.312–319, Journal of Machine Learning Research – Proceedings Track.

Liang, Y. and Han, M. (2007) 'Dynamic bandwidth allocation based on online traffic prediction for real-time MPEG-4 video streams', EURASIP J. Appl. Signal Process., No. 1, pp.51–51.

Nguyen, T., Eido, T. and Atmaca, T. (2009) 'An enhanced QoS-enabled dynamic bandwidth allocation mechanism for Ethernet PON', in Emerging Network Intelligence, First International Conference on, pp.135–140.

Oliveira, T.P., Barbar, J.S. and Soares, A.S. (2014) 'Multilayer perceptron and stacked autoencoder for internet traffic prediction', in C-H. Hsu, X. Shi and V. Salapura (Eds.): Network and Parallel Computing – 11th IFIP WG 10.3 International Conference, NPC, Lecture Notes in Computer Science, Ilan, Taiwan, September 18–20, Vol. 8707, pp.61–71, Springer.

Palm, R.B. (2014) 'DeepLearnToolbox, a MATLAB toolbox for deep learning' [online] https://github.com/rasmusbergpalm/DeepLearnToolbox (accessed 3 November 2014).

Ranzato, M., Boureau, Y. and Cun, Y.L. (2007) 'Sparse feature learning for deep belief networks', in J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.): Advances in Neural Information Processing Systems, Vol. 20, pp.1185–1192, MIT Press, Cambridge, MA.

Ufldl (2013) Unsupervised Feature Learning and Deep Learning, Stanford's Online Wiki, Stacked Autoencoders.

Vincent, P., Larochelle, H., Bengio, Y. and Manzagol, P-A. (2008) 'Extracting and composing robust features with denoising autoencoders', in Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp.1096–1103, ACM, New York, NY, USA.

Zhao, H., Niu, W., Qin, Y., Ci, S., Tang, H. and Lin, T. (2012) 'Traffic load-based dynamic bandwidth allocation for balancing the packet loss in DiffServ network', in Computer and Information Science (ICIS), IEEE/ACIS 11th International Conference on, pp.99–104.