Email Size Prediction: An Artificial Neural Networks Approach

Orlando Mezquita¹, Carlos Calero², Freddie Salado², Mauricio Cabrera-Ríos¹

¹ Department of Industrial Engineering, University of Puerto Rico, Mayagüez Campus
² Department of Civil Engineering and Surveying, University of Puerto Rico, Mayagüez Campus

Abstract

In this study, Artificial Neural Networks are proposed to approach two important aspects of emails: (i) predicting the size of an email that is about to be sent, and (ii) fitting the sending patterns of received emails using a set of predictors such as sender, date, hour, total attachment size, and the amount of text in the email. At varying degrees of approximation, the modeling approach is deemed plausible and scalable for characterizing email size and traffic behavior.

1. Introduction

One of the common hassles of email services is storage limitation. On one hand, senders need to be aware of the size of the emails being sent, either because of their own sending limitations, like the ones present on most corporate email servers, or to avoid overloading the inboxes of the recipients. Surprisingly, even though many requests to provide the email size beforehand have been made to email service providers, none of them offers the capability to obtain this information before sending an email. On the other hand, since recipients do not know the size of the emails they will receive, they need to frequently archive or delete emails to keep enough space in their mailboxes for new ones.

In spite of the importance of knowing the size of outgoing and incoming emails, this problem seems to have been overlooked; an extensive search of the literature found research regarding the management and avoidance of email overload rather than the prediction of email size; see Szóstek [1], Dabbish and Kraut [2], and Jackson et al. [3].

In this article, a simple approach is presented to predict the size of an email that is about to be sent and to model the patterns of the emails received. Artificial Neural Networks were used for both modeling objectives due to their well-known flexibility. The applications provide evidence of the feasibility of the approach for tackling even more complicated email characterization tasks.

This work is structured as follows: Section 2 provides a brief explanation of what Artificial Neural Networks are and how they are trained; Section 3 contains the steps followed to gather and process the data before the modeling phase; finally, Section 4 describes the architecture of the fitted neural networks and the results.

2. Artificial Neural Networks

Artificial Neural Networks (ANNs) are mathematical constructs that mimic the learning ability and massively parallel structure of biological neural networks. ANNs are widely used because of their flexibility, simplicity, ability to model intricate and nonlinear relationships, and demonstrated success in a variety of empirical applications (see Kuan [4] and White [5]). Applications of ANNs include time series modeling (White [5]), optical character recognition, fraud detection, credit scoring, spam filtering, classification, and financial applications (see Dorsey et al. [6], Swanson and White [7, 8], and Birgul [9]).

A typical application of ANNs is to predict the value of one or more response variables $\vec{y}$ given the values of a set of predictors $\vec{x} = [x_1, \dots, x_n]$. The most common architecture of a neural network used for prediction purposes is a feed-forward, multi-layered neural network like the one shown in Figure 1, where the neurons in the input layer represent the predictors, the neurons in the hidden layer process the information coming from the input layer, and the neuron(s) in the output layer process the outputs of the hidden layer and produce the estimate of the response variable $\hat{\vec{y}}$. An interesting property of feed-forward multi-layered neural networks is that they have been proved to be universal approximators of analytical functions when enough hidden neurons are provided (see Hornik et al. [10]). Since the prediction capability and the training time of an ANN depend on its architecture, among other factors (Huang et al. [11]), several approaches have been developed to define the "optimal" configuration of the ANN, including trial and error, used for example by Mohan Raju et al. [12], and a design-of-experiments-based approach introduced by Salazar-Aguilar et al. [13].

Figure 1. Multi-layered feed-forward neural network (input layer, hidden layer, output layer)

The strength of the inter-neuron connection between neuron $j$ of layer $(k+1)$ and neuron $i$ of layer $k$ is measured by the weight $w_{ij}^{(k)}$. Training the neural network means finding the values of the weights that minimize the difference between the real value of the response variable $\vec{y}$ and the prediction of the response variable $\hat{\vec{y}}$. A possible way of minimizing this difference is to solve the following nonlinear unconstrained optimization problem:


$$\min_{\vec{w}} \; \mathrm{MSE}(\vec{w}) = \frac{1}{N-p}\left[\vec{y} - \hat{\vec{y}}(\vec{w})\right]^{T}\left[\vec{y} - \hat{\vec{y}}(\vec{w})\right] \qquad (1)$$

where $N$ is the number of samples in the training set and $p$ is the number of parameters being estimated (weights and biases).

The processing of the information in each neuron is performed in two steps:

Combination: the inputs of the neuron are combined into a single value $v$, typically by calculating $v = \sum_i w_i x_i + b$, where $b$ is a constant called the bias that plays a role similar to the intercept in a multiple linear regression model. Besides the linear combination, other types of functions can be applied at this stage, such as radial basis functions (see Zhao et al. [14], Gordon and Berry [15]).

Transfer: the output of the neuron is computed by transforming the single value $v$ by means of a transfer function. The selection of the transfer function is application dependent; some common choices are presented next:

a) Step function (McCulloch and Pitts model [16]):

$$f(v) = \begin{cases} 1 & \text{if } v \ge \theta \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $\theta$ is the classification threshold.

b) Linear function:

$$f(v) = a\,v + c \qquad (3)$$

For $a = 1$ and $c = 0$, $f(v)$ turns into the mirror (identity) function $f(v) = v$.

c) S-shaped functions

Logistic model:

$$f(v) = \frac{1}{1 + e^{-v}} \qquad (4)$$

Hyperbolic tangent model:

$$f(v) = \tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}} \qquad (5)$$

which maps $v$ to the interval $(-1, 1)$.
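To make these two processing steps concrete, the following is a minimal numpy sketch of the combination and transfer steps together with the transfer functions of Equations (2) to (5) and the error measure of Equation (1); the parameter names theta, a, and c are illustrative, not taken from the paper.

```python
import numpy as np

# Transfer functions from Equations (2)-(5); theta, a, and c are
# illustrative parameter names, not taken from the paper.
def step(v, theta=0.0):          # Eq. (2): McCulloch-Pitts step function
    return np.where(v >= theta, 1.0, 0.0)

def linear(v, a=1.0, c=0.0):     # Eq. (3): reduces to f(v) = v when a=1, c=0
    return a * v + c

def logistic(v):                 # Eq. (4): S-shaped, output in (0, 1)
    return 1.0 / (1.0 + np.exp(-v))

def tanh(v):                     # Eq. (5): S-shaped, output in (-1, 1)
    return (np.exp(v) - np.exp(-v)) / (np.exp(v) + np.exp(-v))

# Single-neuron processing: combination followed by transfer.
def neuron(x, w, b, transfer=tanh):
    v = np.dot(w, x) + b         # combination step: weighted sum plus bias
    return transfer(v)           # transfer step

# Degrees-of-freedom-corrected mean square error as in Equation (1),
# where p is the number of estimated parameters (weights and biases).
def mse(y, y_hat, p):
    e = y - y_hat
    return (e @ e) / (len(y) - p)
```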


3. Data Collection and Preprocessing

In order to collect the information needed to fit the models, a Visual Basic application was developed that automatically extracts a defined set of properties from each email in a selected MS Outlook folder. The properties extracted were: email size (measured in bytes), sender, date, hour, number of characters in the body of the email, and total attachment size (measured in bytes), defined as

$$A_j = \sum_{i=1}^{n_j} s_{ij}$$

where $n_j$ is the number of attachments in email $j$ and $s_{ij}$ is the size of attachment $i$ of email $j$.
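The paper's extraction tool was written in Visual Basic against MS Outlook and is not reproduced here. Purely as an illustration of the same idea, the sketch below extracts analogous properties from an mbox export using Python's standard mailbox module; the function name and the mbox input format are assumptions, not part of the original application.

```python
import mailbox

def extract_properties(mbox_path):
    """Extract per-email properties analogous to those collected in the paper:
    size, sender, date, body character count, and total attachment size."""
    records = []
    for msg in mailbox.mbox(mbox_path):
        body_chars = 0
        attach_bytes = 0
        for part in msg.walk():
            payload = part.get_payload(decode=True) or b""
            if part.get_filename():                  # treat named parts as attachments
                attach_bytes += len(payload)         # A_j: sum of attachment sizes
            elif part.get_content_type() == "text/plain":
                body_chars += len(payload)
        records.append({
            "sender": msg.get("From", ""),
            "date": msg.get("Date", ""),
            "size_bytes": len(msg.as_bytes()),       # total email size in bytes
            "body_chars": body_chars,
            "attach_bytes": attach_bytes,
        })
    return records
```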

The properties of 6,571 emails were collected and stored in a comma-separated file. After collecting the information, 818 emails were found with missing information or text in numeric fields; it turned out that these emails came from a corrupted folder in MS Outlook. Since a considerable amount of data was available, it was decided to eliminate these anomalies from the data instead of using imputation methods to complete the missing information.

After the anomalies were removed, categorical predictors were coded and continuous predictors scaled. When fitting models with categorical predictors (which lack a natural scale), the modeler must take care not to impose artificial scales on these variables. To deal with this type of variable, it is recommended to assign a set of levels to account for the effect that these predictors may have on the response (Thomas [17]). Hence, the correct coding of a categorical predictor with t levels is to use t-1 indicator (binary) variables. The only categorical variable among the ones collected was the sender, with 289 possible values, which would require 288 binary variables. Since this is a rather large number of variables, it was decided to classify senders into clusters.

To define the clusters, a dot plot with jitter of the number of emails from each sender was used. As can be seen in Figure 2, there are basically two types of senders: frequent (higher number of emails per sender) and non-frequent (lower number of emails per sender). Studying Figure 2, a threshold of 100 emails was defined for classifying a sender as frequent or non-frequent. Instead of assigning a different level to each sender in the data, a level was assigned to each member of the frequent-senders cluster and a single level was used to represent all non-frequent senders. As a result, the number of levels of the sender variable was significantly reduced from 289 to 7 (a code sketch of this clustering step follows Figure 2).

Figure 2. Number of emails sent by each sender, classified into sending-frequency clusters
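Assuming the records produced by the previous sketch have been loaded into a pandas DataFrame, the frequency-based clustering just described reduces to a value count and a threshold; the column names here are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with one row per email and a 'sender' column.
df = pd.DataFrame(records)

counts = df["sender"].value_counts()
frequent = counts[counts >= 100].index    # threshold of 100 emails, as in the paper

# Keep each frequent sender as its own level; pool the rest into one level.
df["sender_level"] = df["sender"].where(df["sender"].isin(frequent), "non-frequent")
```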


These 7 levels require the use of 6 binary variables ($Z_1, \dots, Z_6$), coded as follows:

Sender level         Z1  Z2  Z3  Z4  Z5  Z6
Frequent Sender 1     1   0   0   0   0   0
Frequent Sender 2     0   1   0   0   0   0
Frequent Sender 3     0   0   1   0   0   0
Frequent Sender 4     0   0   0   1   0   0
Frequent Sender 5     0   0   0   0   1   0
Frequent Sender 6     0   0   0   0   0   1
Non-frequent          0   0   0   0   0   0

After coding the categorical predictor, the continuous variables were scaled. To avoid dimensionality effects when training the neural network, it is recommended to scale the predictors to the same range. A linear transformation was applied to the continuous predictors to scale them to the range [0, 1], so that they share the scale of the dichotomous variables used to represent the levels of the categorical predictor. The linear transformation applied to each continuous variable $x$ was:

$$x^{*} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

In summary, the dependent (Y) and independent (X) variables used to fit the models were:

Y*   = email size (B), scaled to [0, 1]
X1*  = total attachment size (B), scaled to [0, 1]
X2   = Frequent Sender 1 {0, 1}
X3   = Frequent Sender 2 {0, 1}
X4   = Frequent Sender 3 {0, 1}
X5   = Frequent Sender 4 {0, 1}
X6   = Frequent Sender 5 {0, 1}
X7   = Frequent Sender 6 {0, 1}
X8*  = day of the week, scaled to [0, 1]
X9*  = time, scaled to [0, 1]
X10* = number of characters in the body of the email, scaled to [0, 1]
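A minimal pandas sketch of the coding and scaling steps is shown below. The column names are hypothetical, and the day-of-week and time-of-day columns are assumed to have been derived from the raw date field; dropping the non-frequent column makes that cluster the all-zeros baseline, matching the t-1 coding above.

```python
import pandas as pd

# t-1 indicator coding: 7 sender levels -> 6 binary columns, with the
# non-frequent cluster dropped so it becomes the all-zeros baseline.
dummies = (pd.get_dummies(df["sender_level"], prefix="sender")
             .drop(columns="sender_non-frequent"))

# Min-max scaling of each continuous variable to [0, 1].
def minmax(col):
    return (col - col.min()) / (col.max() - col.min())

# 'day_of_week' and 'time_of_day' are assumed to be derived from 'date'.
continuous = ["attach_bytes", "body_chars", "day_of_week", "time_of_day", "size_bytes"]
scaled = df[continuous].apply(minmax)

X = pd.concat([scaled.drop(columns="size_bytes"), dummies], axis=1)
y = scaled["size_bytes"]    # Y*: email size scaled to [0, 1]
```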

4. Development of the Models

After cleaning and preprocessing the data, a neural network was trained for each modeling objective (future email size and about-to-send email size). From the data collected, the logical predictors to include in the model for predicting the size of future emails were the indicator variables for each sender, the day, and the time; for predicting the size of about-to-send emails, it was decided to include the total attachment size and the number of characters in the body of the email.

For both models, feed-forward multi-layered neural networks were trained using the hyperbolic tangent (Equation 5) as the transfer function in the hidden layer and a linear transfer function (Equation 3) in the output layer. The architecture of both models can be seen in Figure 3.

Figure 3. Architecture of the ANN models. Left: future-emails model; right: about-to-send model

Both models were trained using the optimization toolbox in MATLAB R2011a, which uses the Levenberg-Marquardt backpropagation training algorithm described by Hagan and Menhaj (1994), Marquardt [18], and Rumelhart et al. [19]. The original data set was divided into training and validation sets, assigning 70% of the data to the training set. The mean square error (Equation 1) and the coefficient of correlation between the response variable $y$ and its prediction $\hat{y}$ are presented in Table 1 for both models.

Table 1. Performance measures for the future and about-to-send models

                              Training Set            Validation Set
Model                         MSE        R(y, ŷ)      MSE        R(y, ŷ)
About-to-send Email Size      1.87E-6    0.999        2.24E-6    ~1.000
Future Email Size             0.016      0.607        0.014      0.700
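As a rough stand-in for the MATLAB setup, the sketch below fits a comparable network with scikit-learn, continuing from the preprocessing sketch above. Scikit-learn has no Levenberg-Marquardt solver, so L-BFGS is substituted, the hidden-layer size is illustrative, and the numbers it produces would not match Table 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# 70/30 training/validation split, as in the paper.
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7,
                                                  random_state=0)

# Feed-forward network: tanh hidden layer, identity (linear) output layer.
# The hidden-layer size is illustrative; L-BFGS stands in for the paper's
# Levenberg-Marquardt training algorithm, which scikit-learn does not offer.
model = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                     solver="lbfgs", max_iter=5000, random_state=0)
model.fit(X_train, y_train)

# Report MSE and the correlation between response and prediction, as in Table 1.
for name, Xs, ys in [("training", X_train, y_train), ("validation", X_val, y_val)]:
    pred = model.predict(Xs)
    mse = np.mean((ys - pred) ** 2)
    r = np.corrcoef(ys, pred)[0, 1]
    print(f"{name}: MSE={mse:.3g}, R(y, y_hat)={r:.3f}")
```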

As can be seen in Table 1 and Figure 4, the model for predicting the size of an about-to-send email has excellent performance. Due to the randomness in the sending patterns of the senders in the data analyzed, however, the performance for future emails is not as good, even when training the ANN on a smaller subset of the data with less variability. A plot of predicted vs. real values is presented in Figure 4 for each model.


Figure 4. Predicted vs. real response. Left: future emails; right: about-to-send emails

5. Conclusion

Knowing the size of future incoming emails may help email users better manage their storage limitations, while knowing the size of an outgoing email beforehand can help the user avoid sending errors due to their own limitations or the recipient's server limitations. An approach based on artificial neural networks is proposed to model the size of an email that is about to be sent and to predict the size of future emails coming into an inbox. The modeling approach involves data collection, data cleansing, categorical predictor coding, continuous predictor scaling, and artificial neural network training.

A total of 5,753 emails were analyzed using this modeling approach, and a model with very competitive prediction capability (R(y, ŷ) = 0.999) was obtained for predicting the size of an email that is about to be sent. Due to the absence of identifiable patterns in the emails received, the model for predicting future emails was not able to accurately predict incoming email sizes; however, this approach can be applied when senders are known to have sending patterns. The results show the capability of artificial neural networks to model a signal. Future work in this line will include statistically determining whether a pattern exists, in order to decide whether a modeling task should be pursued.


6. References

[1] A. M. Szóstek, "Dealing with My Emails: Latent user needs in email management," Computers in Human Behavior, vol. 27, pp. 723–729, 2011.
[2] L. A. Dabbish and R. E. Kraut, "Email Overload at Work: An Analysis of Factors Associated with Email Strain," CSCW '06, pp. 431–440, 2006.
[3] T. Jackson, R. Dawson, and D. Wilson, "Reducing the effect of email interruptions on employees," International Journal of Information Management, vol. 23, pp. 55–65, 2003.
[4] C.-M. Kuan and H. White, "Artificial Neural Networks: An Econometric Perspective," Econometric Reviews, vol. 13, no. 1, pp. 1–91, 1994.
[5] H. White, "Learning in Artificial Neural Networks: A Statistical Perspective," Neural Computation, vol. 1, no. 4, pp. 425–464, 1989.
[6] R. E. Dorsey, J. D. Johnson, and M. V. Boening, "The Use of Artificial Neural Networks for Estimation of Decision Surfaces in First Price Sealed Bid Auctions," in New Directions in Computational Economics, 1994.
[7] N. R. Swanson and H. White, "A Model-Selection Approach to Assessing the Information in the Term Structure Using Linear Models and Artificial Neural Networks," Journal of Business & Economic Statistics, vol. 13, no. 3, pp. 265–275, 1995.
[8] N. R. Swanson and H. White, "A Model Selection Approach to Real-Time Macroeconomic Forecasting Using Linear Models and Artificial Neural Networks," Review of Economics and Statistics, vol. 79, no. 4, pp. 540–550, 1997.
[9] A. Birgul Egeli, "Stock Market Prediction Using Artificial Neural Networks," Decision Support Systems, vol. 22, pp. 171–185.
[10] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[11] L. Huang, Y. Cui, D. Zhang, and S. Wu, "Impact of Noise Structure and Network Topology on Tracking Speed of Neural Networks," Neural Networks: The Official Journal of the International Neural Network Society, vol. 24, no. 10, 2011.
[12] M. Mohan Raju, R. K. Srivastava, D. C. S. Bisht, H. C. Sharma, and A. Kumar, "Development of Artificial Neural-Networks-Based Models for the Simulation of Spring Discharge," Advances in Artificial Intelligence, vol. 2011, 2011.
[13] M. A. Salazar-Aguilar, M. G. Villarreal-Marroquín, G. J. Moreno-Rodríguez, J. A. Rodríguez-Sarasty, and M. Cabrera-Ríos, "Generating Multiple Time Series Forecasts with Artificial Neural Networks in a Telecommunications Company," International Journal of Industrial Engineering, vol. 18, no. 11, pp. 591–598, 2011.
[14] Q. Zhao, G. Bai, W. Song, and Q. Zhao, "Research and Application on Meshless Method of Radial Basis Functions," in Mechanic Automation and Control Engineering (MACE), Second International Conference on, pp. 2437–2440, 2011.
[15] G. S. Linoff and M. J. Berry, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 3rd ed., Wiley Computer Publishing, 2011.
[16] W. S. McCulloch and W. H. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115–133, 1943.
[17] T. P. Ryan, Modern Regression Methods, John Wiley & Sons, 2008.
[18] D. W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," Journal of the Society for Industrial and Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.
[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing, vol. 1, MIT Press, Cambridge, MA, 1986.
