Open Weekend and Rating Prediction Based on Visualization Techniques

Elverton Fazzion, Pedro Las Casas∗, Glauber Gonçalves∗, Raquel Melo-Minardi∗ and Wagner Meira Jr.∗

Abstract: Predicting gross revenue for movies is an important problem for the movie industry. Several studies [3] in economics, marketing, statistics and computer science have tried to solve this problem. In this work, we propose an approach, based on a visualization technique, to accurately predict the opening weekend box office (OW) and the rating of an upcoming movie. The visualization uses a regression model built with data from past movies of the target movie's cast.

1 INTRODUCTION

Movie prediction is not new in the cinematography field. A movie's producer wants to estimate profit before making the movie [4]. As can be observed in recent successful movies, most viewers are motivated to watch a movie based on its actors and director [2]. Considering that, we modeled the past box office incomes and ratings (from the IMDB system) of the actors and director as important features to predict the opening weekend income (OW) and rating of a movie. A great number of actors and directors do not have a regular OW and rating during their working lives: some present great careers with many good movies and only a few bad references, while others are the opposite. We noticed that the average is a good measure to summarize the OW and rating of the careers of actors/actresses and directors.

In this work, we propose a visualization to support the prediction of the opening weekend box office (OW) and the rating of an upcoming movie. The visualization uses a regression model [1] that predicts the OW and rating of a movie given the pairing of the OW and rating averages of its director and main actor/actress. We also used tweets to analyze the sentiments of viewers in terms of their expectations for a movie; this analysis was employed to adjust the prediction made by the model.

This paper is organized as follows. Section 2 describes the method we propose to support the predictions of our visualization. Section 3 shows some results we obtained using our visualization. Finally, conclusions are presented in Section 4.

∗ is with UFMG.
• Elverton C. Fazzion is with UFMG. E-mail: [email protected]

2 DESCRIPTION

This section describes the method used to predict the OW and the rating of a movie in our visualization. First, we introduce our rationale and then we describe our actual approach. The first step of the method is to compute the average OW/rating of the director. Next, we compute the average OW/rating for each of the most important actors and actresses. However, as we need a single value to represent our prediction, we compute the overall average among the individual averages. It is important to notice that when we mention a director's or actor's/actress's average, we mean the average computed over all retrieved OW/rating values since their early films. If the director or actor/actress does not have a previous OW or rating, we consider their average to be zero.

Figure 1 shows the prediction for the movie "The Big Wedding", using the method described above. The red line is the prediction and the blue line is the real value. Although it is a simple prediction method, the results are quite reasonable for some movies, as the example provided in Figure 1 shows.

Fig. 1: Simple predictions for the movie "The Big Wedding".

The simple method was not able to precisely predict the OW for some movies. As an example, we show the results for the movie "Conjuring" in Figure 2: the OW prediction was $15M against a real value of $42M. Some actors/actresses and directors are not good prediction evidences, i.e., they have OW averages very different from the actual OW. Figure 2 shows that Lili Taylor, Ron Livingston and Vera Farmiga have averages of about $8M while the actual OW of the movie is $42M. Such evidences may significantly affect the final result and decrease the accuracy of the OW prediction, although they do not decrease the accuracy of the rating prediction.
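As a minimal sketch, the simple averaging method of Section 2 could be implemented as follows; the function names and the career data are hypothetical stand-ins for the values the system retrieves from IMDB:

```python
# Sketch of the simple averaging predictor: a person's career average is
# the mean of the OWs (in $M) of their past movies; the movie prediction
# is the overall average among the career averages of the director and
# the main cast. All data below is hypothetical.

def career_average(past_values):
    """Average OW (or rating) over a person's past movies; zero if none."""
    if not past_values:
        return 0.0
    return sum(past_values) / len(past_values)

def simple_prediction(director_past, cast_past_lists):
    """Overall average among the director's and each cast member's averages."""
    averages = [career_average(director_past)]
    averages += [career_average(p) for p in cast_past_lists]
    return sum(averages) / len(averages)

# Hypothetical example: a director with three past OWs and two main actors.
ow = simple_prediction([20.0, 30.0, 25.0], [[10.0, 14.0], [8.0]])
print(round(ow, 2))  # mean of 25.0, 12.0 and 8.0 -> 15.0
```

The same two functions serve for ratings: only the input lists change.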

Fig. 2: Simple predictions for the movie "Conjuring".

Given the observation above, we decided to improve the accuracy of our model's prediction by using only the most important features. To discover such features, we computed the correlation between each feature and the actual OW/rating. Then, we selected the two most correlated features, which are the director's average and the main actor's/actress's average. We observed that the director is the most correlated feature, but we can increase the correlation for OW by pairing the first and second most correlated features, i.e., we computed the average of the director's and main actor's/actress's averages. Table 1 shows the Pearson correlation between each feature and the actual values for OW/rating. As we can see, the director and actor/actress pairing is the most correlated feature for OW and still keeps a good correlation for rating (the second highest).

Feature                            OW Correl.   Rating Correl.
Director and Main Actor/Actress     0.760        0.640
Director                            0.694        0.669
First Main Actor/Actress            0.280        0.274
Second Main Actor/Actress           0.274        0.099
Third Main Actor/Actress           -0.076        0.059
Quantity of Tweets                  0.054       -0.012

Table 1: Pearson correlation between features and actual values.
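The feature-selection step behind Table 1 can be sketched as follows; the per-movie feature vectors below are hypothetical toy numbers, not the values behind the table:

```python
# Sketch of the feature-selection step: rank candidate features by their
# Pearson correlation with the actual OW. All numbers are hypothetical.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical actual OWs ($M) and per-movie feature values.
actual_ow = [42.0, 15.0, 30.0, 8.0]
features = {
    "director_avg":   [20.0, 18.0, 28.0, 10.0],
    "main_actor_avg": [40.0,  8.0, 20.0,  4.0],
}
# The pairing feature is the mean of the director and main actor averages.
features["pairing"] = [(d + a) / 2
                       for d, a in zip(features["director_avg"],
                                       features["main_actor_avg"])]

ranked = sorted(features, key=lambda f: pearson(features[f], actual_ow),
                reverse=True)
print(ranked[0])  # the pairing correlates best in this toy data
```

In this toy data, as in Table 1, the pairing of the two individual averages correlates with the actual OW better than either average alone.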

We used the best feature above to build a regression model that estimates OW and rating given the director and actor/actress pairing. Figure 3 shows the correlation between the predictors and the estimated variables. The 21 blue points represent the movies used to build our model. For each movie, the x axis represents the average of the director's average and the main actor's/actress's average (i.e., the director and actor/actress pairing) and the y axis represents the actual value for OW or rating. The black lines show the distance between the director's and actor's/actress's averages. As we can observe, most movies have a large distance between the two averages, so we obtain the highest correlation when we consider their pairing. The red dashed line depicts the predictions of the proposed model. Notice that the points are close to the predictions, so the model provides quite accurate results. Furthermore, the regression model built for OW and rating predictions is the simplest one: it has only a slope coefficient, which is 1.49 for OW and 0.98 for rating, and a constant coefficient of zero.
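Since the model has only a slope and a zero intercept, the least-squares fit through the origin reduces to slope = Σxy / Σx². A sketch with hypothetical training points (not the paper's 21 movies) follows:

```python
# Sketch of the through-the-origin regression used for prediction:
# y = slope * x, where x is the director/actor pairing average and y the
# actual OW (or rating). With a zero intercept, least squares gives
# slope = sum(x*y) / sum(x*x). Training points below are hypothetical.

def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict(slope, pairing_avg):
    return slope * pairing_avg

# Hypothetical (pairing average, actual OW) training points in $M.
xs = [10.0, 20.0, 30.0]
ys = [15.0, 29.0, 46.0]
slope = fit_slope(xs, ys)
print(round(slope, 3))            # 2110 / 1400 -> 1.507
print(round(predict(slope, 25.0), 1))  # -> 37.7
```

The paper's reported coefficients (1.49 for OW, 0.98 for rating) would come out of `fit_slope` applied to the real 21-movie data.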

Fig. 3: Regression model for OW and rating.

We also explored how Twitter data could be used to improve the results. Table 1 shows that the amount of tweets for a given movie is not well correlated with the actual OW and rating and, accordingly, is not a good evidence. Hence, we analyzed our tweet dataset using Semantria1. For each tweet in the dataset, the tool gives a score between -2 and 2: positive scores mean the message has a positive meaning, negative scores the opposite, and zero is impartial, neither good nor bad. We used Semantria to obtain a score for each tweet about a target movie. The goal was to compute an error for our model's prediction, using the following equation:

    ε = (∑ positive values) / |all tweets| + (∑ negative values) / |all tweets|        (1)

Equation 1 provides the sum of the relative positive and negative values. We used the error ε as a ratio to increase or decrease the value of our prediction for OW or rating. We divided ε by 2 to obtain a ratio between -1 and 1, and we multiplied this ratio by the value of the prediction to show, in our visualization, the maximum value that can be added to or subtracted from the prediction. Ultimately, the error we computed is supported by tweet analysis and provides extra information about the prediction. In the following section we show some examples of our visualization to clarify its utilization.

3 RESULTS

Using the final visualization2, which is composed of the regression model and the tweet analysis, we performed the tests shown below. As parameters, we used the regression coefficients presented in the previous section: 1.49 for OW and 0.98 for rating. Figure 4 shows a snapshot of the proposed visualization-based prediction for the movie "Conjuring". The visual patterns that support our predictions are the red line (mean line), which gives us the prediction, and the shaded area, which shows the sentiment analysis from tweets (increasing or decreasing the prediction). We compare the results of our simple approach (our initial idea, in Figure 2) to the results of our last and more sophisticated approach in Figure 4. As we can see, the latter provides the most accurate predictions.

Fig. 4: Predictions for the movie "Conjuring" with the improvements made.

To emphasize how tweet analysis can help in making such predictions, we also show a snapshot with the results for the movie "R.I.P.D.". The shaded area under the average line means the viewers, in general, are criticizing the movie and we have to lower the prediction. When we compare that with the real value (blue line), the need to lower the prediction is confirmed.

Fig. 5: Results for the movie "R.I.P.D.".

1 https://semantria.com/
2 http://tinyurl.com/elverton-mc1
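The tweet-based adjustment of Section 2 (Equation 1) can be sketched as follows, with hypothetical per-tweet scores standing in for Semantria's output in [-2, 2]:

```python
# Sketch of the tweet-based adjustment: epsilon (Equation 1) sums the
# positive and the negative scores, each relative to the number of tweets;
# halving it gives a ratio in [-1, 1], and that ratio times the prediction
# is the maximum adjustment shown in the visualization.
# The scores below are hypothetical stand-ins for Semantria output.

def sentiment_epsilon(scores):
    """Equation 1: relative sum of positive plus relative sum of negative scores."""
    n = len(scores)
    pos = sum(s for s in scores if s > 0)
    neg = sum(s for s in scores if s < 0)
    return pos / n + neg / n

def max_adjustment(prediction, scores):
    """Ratio in [-1, 1] (epsilon / 2) times the prediction value."""
    return (sentiment_epsilon(scores) / 2) * prediction

scores = [1.5, -1.0, 0.0, 2.0, -0.5]  # hypothetical per-tweet scores in [-2, 2]
print(round(sentiment_epsilon(scores), 2))    # (3.5 - 1.5) / 5 -> 0.4
print(round(max_adjustment(30.0, scores), 2)) # 0.2 * 30.0 -> 6.0
```

A net-positive tweet set raises the shown bound above the prediction; a net-negative one lowers it, as in the "R.I.P.D." example below.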

Our visualization works quite well for the majority of the analysed movies, and we observed a few for which we did not obtain good predictions. For instance, for the movie "To Do List", the tweet analysis indicates an increment to the prediction, but the actual OW and rating are lower than the prediction. In this case, we verified that the tweet messages have many more positive comments, and the incorrect prediction may be related to our director and actor/actress pairing predictor.

4 CONCLUSION AND FUTURE WORK

In this paper we built a model to estimate OW and rating for movies. We presented a visualization based on this regression model that makes it possible to accurately predict the OW and rating of a movie given the pairing of the OW and rating averages of its director and actor/actress. Furthermore, we used sentiment analysis of tweets to adjust the predictions produced by the model. The proposed visualization provided reasonable results for several of the movies tested. As future work, we intend to improve the accuracy of our model's predictions. First, we can consider more features that could impact the visualization, such as the producer of the movie; we can also build different regression models by movie genre (action, comedy, horror). In this way, we can combine director, main actor/actress and producer features in such models to show a more accurate prediction in our visualization. Besides, we would like to investigate other visualization techniques that may be appropriate to present trends, patterns and exceptions in this new multivariate model.

5 ACKNOWLEDGEMENTS

This work was supported by CAPES, CNPq, FAPEMIG, FINEP, InWeb and PRPq-UFMG.

REFERENCES

[1] H. Bommaganti. k-split based approach to predict movie rating frequency.
[2] A. Elberse. The Power of Stars: Do Star Actors Drive the Success of Movies? Journal of Marketing, 71:102-120, Oct. 2007.
[3] M. Joshi, D. Das, K. Gimpel, and N. A. Smith. Movie reviews and revenues: an experiment in text regression. In HLT '10, pages 293-296, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[4] J. S. Simonoff and I. R. Sparrow. Predicting movie grosses: Winners and losers, blockbusters and sleepers. Chance, 13(3):15-24, 2000.