2010 Ninth International Conference on Grid and Cloud Computing

Data Mining of Mass Storage Based on Cloud Computing

Jianzong Wang, Jiguang Wan*, Zhuo Liu, Peng Wang
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Wuhan National Laboratory for Optoelectronics, Wuhan 430073, China
*Corresponding author: [email protected]


Abstract—Cloud computing is an elastic computing model in which users can lease resources from a rentable infrastructure. Cloud computing is gaining popularity due to its lower cost, high reliability and huge availability. To utilize the powerful and huge capability of cloud computing, this paper imports it into the data mining and machine learning field. As one of the most influential open competitions in the machine learning area, the Netflix Prize, with its mass storage, drove thousands of teams across the world to attack the problem; the final winner was BellKor's Pragmatic Chaos, a team who bested Netflix's own algorithm for predicting ratings by 10%. Their solution is an ensemble of a large number of models, each of which specializes in addressing a different aspect of the data. Among these different models, k-nearest neighbors (KNN) and the Restricted Boltzmann Machine (RBM) are reported to be the two most important and successful. As a result, we build two predictors based on these two models respectively, in order to test their performance on cloud computing platforms. The results show that KNN can achieve a root mean squared error (RMSE) of 0.9468 after Global Effect (GE) data preprocessing, which is better than Cinematch's performance of 0.9514. The RMSE of the RBM algorithm is about 0.9670 on the raw dataset, which can be further improved by the KNN model.

Keywords-Cloud Computing; Mass Storage; Data Mining

I. INTRODUCTION

Cloud computing is Internet-based computing, whereby shared resources, software and information are provided to computers and other devices on demand, and a user can obtain whatever resources he wants. Recently, many companies, such as Amazon, Google and Microsoft, have launched their cloud service businesses. Many online services, like YouTube (purchased by Google) and Amazon, rely on recommendation systems to predict customers' interests based on their previous interactions with the system. By recommending more relevant content to customers, online e-commerce services not only improve the users' experience, but also largely increase the traffic to their websites as well as the potential quantity of business transactions. For example, the Cinematch recommendation system of Netflix, an online movie subscription rental service, analyzes accumulated movie ratings and uses them to make millions of personalized predictions based on users' individual tastes.


To challenge scientists in the computer science area and to seek better strategies for its own system, Netflix held an open competition, the Netflix Prize, which is quite challenging for several reasons. Firstly, the Netflix dataset is large: there are over 480,000 customers and 17,000 movies, both identified by unique integer ids, and the dataset contains over 100 million ratings from Oct. 1998 to Dec. 2005. Thus, how to effectively load and process data of this scale on a laptop must be considered. The second challenge lies in the fact that almost 99% of potential user-movie pairs have no rating, so traditional recommendation methods that rely on a complete dataset do not work well. Finally, both movies and users show their own characteristics, which makes the problem more difficult. For example, some users rate movies actively while others do not; some popular movies attract large fractions of users, while most movies are rated by only a small percentage of users. Even a specific user may show temporal variability, i.e., gradual or sudden drifts related to his mood. The final approach that won the prize is an ensemble of a large number of models, among which two kinds, k-nearest neighbors (KNN) and the Restricted Boltzmann Machine (RBM), proved to be both successful and simple.

In this paper, we do not intend to design a better algorithm that outperforms their blended algorithm in a short time; instead, we choose these two typical models and evaluate their performance utilizing cloud computing. Our KNN model runs on the residual dataset of global effects (GE), which are simple models capturing statistical corrections applied on the user and item sides. By using GE, we can remove various kinds of deviations from the dataset. Our GE model includes 10 stages, i.e., overall mean, movie effect, user effect, etc. After the GE model, our algorithm reduces the RMSE on the original probe set from 1.296 to 0.9659, which is near the performance of Cinematch. Our entire GE model runs on a database system, and the residuals are written to disk, serving as the dataset for KNN. Due to the sparsity of overlapping votes among users, we select movie-based KNN instead of the more common user-based KNN. We divide the algorithm into two stages, i.e., the computation of Pearson correlations and the training of parameters. For the computation of Pearson correlations, we use an innovative method instead of a brute-force approach. For the parameter learning, we use APT2, which runs much faster than gradient descent. Our algorithm converges after about 100 epochs and reduces the RMSE from 0.9579 to 0.9468, which is better than 0.9514, the result of Cinematch. Finally, we implement the RBM algorithm, which is much more complicated than the other two. In our model, each user together with his votes corresponds to an RBM, which has 88850 visible states and 100 hidden states. During each epoch, we consider every 100 users as a group, create the RBM for each user in the group, and update the states of the visible and hidden units for that user. For each group, we update the various parameters of the RBM by combining gradient ascent with Gibbs sampling. The results show that RBM on the raw dataset reduces the RMSE from 1.296 to 0.9670, which can be further improved by our KNN algorithm.

The rest of this paper is organized as follows. We specify our three models, GE, KNN and RBM, in Section II. The details of our evaluation are described in Section III. Finally, related work and conclusions are given in Section IV and Section V respectively.



II. DESIGN RATIONALE

Our approach combines three different algorithms, i.e., global effects (GE), k-nearest neighbors (KNN), and restricted Boltzmann machines (RBM), all of which are described in detail in this section. Since all three methods use an automatic parameter tuner (APT) to learn their parameters, we first describe the APT algorithms in Section II-A. Before the illustration, we make the following assumption for our approach: the users in Netflix are rational, and their feedback honestly reflects their interests in the movies in the system. They may occasionally deviate from their common interests, but they do not intend to attack or compromise the recommendation system.

A. Automatic Parameter Tuner

In Global Effects and k-nearest neighbors, automatic tuning of the involved meta parameters is needed in order to minimize the RMSE. Instead of gradient descent, we use two algorithms proposed by BellKor, denoted APT1 and APT2. Both are simple direct search methods, sensitive to the initialization values, and not guaranteed to find a global optimum. In this work, we implemented both to tune the parameters. The basic idea of both algorithms is to randomly change one parameter, check whether the RMSE improves, and keep the new value if so. The difference lies in the fact that APT1 is much more conservative: each time, one of the parameters $p_i$ is updated based on a normal distribution centered at $p_i$. On the contrary, APT2 probes the parameters over a larger interval each time, the drawback of which is that the search can get stuck in a local optimum in the N-dimensional search space. However, compared with APT1, APT2 runs much faster, and $10^2$ to $10^3$ epochs are generally enough to set ten free parameters. The pseudocode of both algorithms is as follows:

Algorithm 1: APT1
    while not converged do
        randomly select $p_i$ from $\{p_1, \ldots, p_N\}$;
        $p_i^{new} \leftarrow \mathrm{Norm}(p_i, \max\{0.1 \cdot |p_i|, 0.001\})$;
        if $E^{new} < E$ then $p_i \leftarrow p_i^{new}$;

Algorithm 2: APT2
    initialize $e_i \leftarrow 0.8$ for all $1 \le i \le N$;
    for ($i \leftarrow 1$; not converged; $i \leftarrow (i + 1) \bmod N$) do
        for $j \leftarrow 1$ to $5$ do
            $p_i^{new} \leftarrow p_i \cdot e_i$ or $p_i \cdot e_i^{-1}$;
            go in the direction for which $E^{new} < E$;
            if $E$ increases then $e_i^{new} \leftarrow e_i^{0.9}$;
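As a concrete illustration, the following is a minimal Python sketch of APT1 as described above; the error function (e.g., RMSE on the probe set) and the initial parameter list are placeholders supplied by the caller, and the convergence test is simplified to a fixed epoch budget.

    import random

    def apt1(params, error_fn, max_epochs=1000):
        # Minimal sketch of APT1: perturb one randomly chosen parameter with
        # Gaussian noise and keep the new value only if the error improves.
        best_err = error_fn(params)
        for _ in range(max_epochs):
            i = random.randrange(len(params))           # pick p_i at random
            scale = max(0.1 * abs(params[i]), 0.001)    # std. dev. of the step
            candidate = random.gauss(params[i], scale)  # p_i^new ~ Norm(p_i, scale)
            trial = params[:i] + [candidate] + params[i + 1:]
            err = error_fn(trial)                       # E^new
            if err < best_err:                          # keep only improvements
                params, best_err = trial, err
        return params, best_err

APT2 differs in that it sweeps the parameters cyclically and multiplies each by a per-parameter factor $e_i$ (or its inverse), pushing $e_i$ toward 1 when a move fails, which lets it take much larger steps than APT1's local perturbation.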

B. Global Effects

Global effects are simple models capturing statistical corrections applied on the user and item sides. Although many other prediction methods may be more powerful, there are several reasons to precede those methods with global effects. First, there may be large user and item effects: for example, tough users are more likely to give lower ratings, and movies with very famous actors are more likely to receive higher ratings. Second, one may have access to information about either the items or the users that can benefit the model; although other models may detect it too, global effects capture it more effectively. Third, there may be characteristics of specific ratings, such as the date of the rating, which can explain some of the variation in ratings. For example, a particular user's ratings may change over time due to a change in his movie taste, or ratings for some movies may fall with time after their initial release dates. In our design, we apply global effects before KNN, and it turns out to be very helpful. The strategy is to estimate one global effect at a time, on both the item side and the user side. We do not focus on the ratings themselves, but rather on the residuals, i.e., the ratings minus an effect. At each step, the residuals from the previous step are used as the dependent variable for the current step. For the rest of this section, we describe the methods for estimating user-specific parameters; the method for items is very similar.




We denote by $r_{ui}$ the rating of movie $i$ by user $u$. At the first step, $r_{ui}$ refers to the raw ratings; afterwards, it refers to the residuals from the previous step. We denote by $x_{ui}$ the explanatory variable of interest corresponding to user $u$ and movie $i$; hence, $x_{ui}$ has a different interpretation for each effect. For user main effects, $x_{ui} = 1$. For other user effects, we center $x_{ui}$ for each user, i.e., we replace $x_{ui}$ by $x_{ui} - \mathrm{avg}_i(x_{ui})$ for a given $u$. The prediction model for the user effects is given by


$$r_{ui} = \theta_u x_{ui}.$$

With sufficient ratings for user $u$, the effect $\theta_u$ can be estimated using the unbiased estimator


$$\hat{\theta}_u = \frac{\sum_i r_{ui} x_{ui}}{\sum_i x_{ui}^2},$$


where the summation is over all movies rated by user $u$. The problem with this approach is that in many cases the sample sizes are not large enough, which results in unreliable estimates. To compensate, BellKor use the concept of a posterior mean, which can be derived by making normality assumptions. We assume that the true $\theta_u$ are independent random variables drawn from a normal distribution,



$$\theta_u \sim N(\mu, \tau^2)$$

for known $\mu$ and $\tau^2$, while

$$\hat{\theta}_u \mid \theta_u \sim N(\theta_u, \sigma_u^2)$$

for known $\sigma_u^2$. Using these two, the estimator for $\theta_u$ is re-estimated by its posterior mean as

$$E(\theta_u \mid \hat{\theta}_u) = \frac{\tau^2 \hat{\theta}_u + \sigma_u^2 \mu}{\tau^2 + \sigma_u^2}.$$

To simplify, BellKor assume $\mu = 0$ and that $\sigma_u^2$ is proportional to $1/n_u$, where $n_u$ is the number of ratings by user $u$. After simplification, $\theta_u$ is given by

$$\theta_u = \frac{n_u \hat{\theta}_u}{n_u + \alpha},$$

where $\alpha$ is a constant determined by cross-validation.
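To make the estimation concrete, here is a minimal Python sketch of a single GE stage, the user main effect ($x_{ui} = 1$), under our reading of the formulas above; in our system this step actually runs inside a database, so the in-memory triples and the function name are illustrative only.

    from collections import defaultdict

    def user_main_effect(residuals, alpha):
        # One GE stage (user main effect, x_ui = 1): theta_hat_u is the mean
        # residual of user u, shrunk to theta_u = n_u * theta_hat_u / (n_u + alpha).
        sums = defaultdict(float)
        counts = defaultdict(int)
        for user, movie, r in residuals:   # residuals: (user, movie, value) triples
            sums[user] += r
            counts[user] += 1
        # n_u * (sum / n_u) / (n_u + alpha) simplifies to sum / (n_u + alpha)
        theta = {u: sums[u] / (counts[u] + alpha) for u in sums}
        # the residuals of this stage feed the next stage (or, finally, KNN)
        return [(u, m, r - theta[u]) for u, m, r in residuals], theta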

C. KNN-based Collaborative Filtering

K-nearest neighbors (KNN) is one of the most fundamental approaches for classifying objects based on the closest training examples in the feature space. It is a powerful non-parametric method applicable to a broad class of datasets where no prior knowledge about the distribution of the data is available. In KNN, an object is classified by a majority vote of its neighbors, the object being assigned to the class most common among its k nearest neighbors. Since it was introduced by Fix and Hodges in 1951, KNN has been widely used for pattern classification, and especially in collaborative filtering.

The properties of simplicity and robustness make KNN a suitable approach for the Netflix problem, and numerous groups in this contest have incorporated KNN as an indispensable component of their various hybrid models. Intuitively, considering the record format of the dataset, there are two ways to implement KNN, based on users and on movies respectively. User-based KNN makes predictions based on ratings of the same item by similar users. However, several properties of the Netflix dataset make this approach impractical. Firstly, compared with the number of movies, the number of users is much larger (480189), which makes the average number of votes per user quite small. Secondly, as shown in Figure 1, the number of votes per user is heavily imbalanced: almost 60% of users have fewer than 100 votes, which would make the correlations among such users, and between such users and the remaining users, inaccurate. Finally, the relatively small memory of a PC (less than 4 GB) makes the storage, computation and search over user correlations almost impossible. Fortunately, the movie-based approach successfully addresses the cruxes of the problems faced by the user-based approach, as Figure 1 also shows. The total number of movies in the dataset is 17770, only about 1/30th of the number of users, and the total number of correlations for the dataset is only about $1.5 \times 10^8$, which makes it possible to load them into the main memory of a PC. Secondly, the distribution of movies' votes is more balanced, and the curve of the movies' CDF is not as steep as that of the users'. Thirdly, only 4% of movies have fewer than 100 votes, which makes the computation of correlations more reasonable. The only drawback is that for users whose voting number is less than 100, the number of neighbors for movie-based KNN is somewhat small.

[Figure 1. The CDF of users' and movies' votes: cumulative probability versus the number of votes (log scale), with one curve for users and one for movies.]
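A quick back-of-the-envelope check of the memory argument (our arithmetic, assuming one 4-byte float per movie pair):

    n_movies = 17770
    pairs = n_movies * (n_movies - 1) // 2   # unordered movie pairs
    print(pairs)                             # 157877565, roughly 1.5 * 10^8
    print(pairs * 4 / 2 ** 30)               # ~0.59 GiB at 4 bytes per correlation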


In the rest of this paper, we use the movie-based approach as our implementation of KNN, which identifies pairs of movies that tend to be rated similarly in order to predict ratings for an unrated item based on ratings of similar neighboring items by the same user [7]. The central idea of movie-based KNN is to predict the score of a user with respect to a movie from the K movies rated by this user, weighted by the pairwise relationships between those movies and the target movie. In BellKor's Pragmatic Chaos final solution, there are over ten different KNN models, each a different extension of the basic KNN model [4], [5], [8]. Among them, we select KNNMovieV3 by BigChaos, which yields great results on the residuals of RBMs [8]. This model is much more complicated than common KNN, since it takes more factors into account, including the time of each vote, the number of overlapping votes between movies, and the normalization parameters, all of which give our KNNMovieV3 much better results.
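The basic movie-based prediction rule can be sketched as follows; this is a minimal correlation-weighted average, not the full KNNMovieV3 (which additionally weights by vote time, overlap support and normalization terms), and the dictionary layouts are our own assumptions.

    def knn_predict(user_ratings, correlations, target, k=20):
        # Correlation-weighted average of the user's ratings on the k movies
        # most similar to the target movie.
        scored = []
        for movie, rating in user_ratings.items():      # {movie_id: rating}
            key = (min(movie, target), max(movie, target))
            rho = correlations.get(key, 0.0)            # {(a, b): pearson}, a < b
            if rho > 0:
                scored.append((rho, rating))
        scored.sort(reverse=True)                       # strongest neighbors first
        top = scored[:k]
        if not top:
            return None                                 # caller falls back, e.g. to a mean
        return sum(rho * r for rho, r in top) / sum(rho for rho, _ in top)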

D. Restricted Boltzmann Machines

As a kind of stochastic recurrent neural network, a Boltzmann machine can be seen as the stochastic, generative counterpart of a Hopfield net, where an "energy" is defined over the network of units. However, learning in general Boltzmann machines is impractical, and an improved version, the Restricted Boltzmann Machine (RBM), is more efficient for training on practical problems. In nature, an RBM is a stochastic neural network where each neuron exhibits some random behavior when activated. There are two layers in an RBM, the visible layer and the hidden layer. Each layer has a certain number of neuron units, determined by the specific problem. Neuron units in different layers are connected to each other, with weights in [0, 1]; such connections are bidirectional and symmetric, and the weights are the same in both directions during training. Meanwhile, units in the same layer are not connected. In both layers there is a special neuron unit named the bias. When an RBM is applied to a practical problem, the states in the visible layer are initialized with the real data, and then the hidden states, visible states and parameters are updated alternately. During this process, predictions can be obtained from the hidden states and parameters.

In the Netflix Prize problem, we treat each user together with his votes as an individual input, and we create an RBM for each user; hence there are as many as 480,189 RBMs, and our goal is to use this large number of RBMs to train the same weight parameters. Since the sets of voted movies of any two users differ with high probability, the original RBM must be modified slightly for our problem; the modified RBM is illustrated in Figure ??. There are F hidden states, where F is determined by the specific problem. In [6], Salakhutdinov et al. state that they used 100 hidden units to model the RBM, and we accordingly use F = 100. The value of each hidden state can only be 1 or 0.

For the visible layer, since there are 17770 movies and each movie can be rated with 5 scores, there are $m = 17770 \times 5 = 88850$ visible states. Since a user generally does not vote for all the movies, we treat the movies not voted by a user as inactive (missing) in this user's RBM. There are connections, with values in [0, 1], between every hidden state and every activated visible state. For simplicity, we use $h$, $V$ and $W$ to denote the hidden states, visible states and connection weights. The hidden and visible states may differ between the RBMs of different users, but the weights must be the same for the same connections; in other words, if two users have rated the same movie, their two RBMs must use the same weights between the visible units for that movie and the hidden units. For a user with $m$ ratings, the $V$ of this user can be expressed as a $K \times m$ observed binary indicator matrix with $v_i^k = 1$ if the user rated movie $i$ as $k$, and 0 otherwise. Given the visible states $V$ and connection weights $W$, the hidden states can be estimated by the following formula:

$$p(h_j = 1 \mid V) = \sigma\left(b_j + \sum_{i=1}^{m} \sum_{k=1}^{K} v_i^k W_{ij}^k\right) \qquad (1)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function, $W_{ij}^k$ is the symmetric weight between rating $k$ of movie $i$ and feature $j$, and $b_j$ is the bias of feature $j$.
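A minimal NumPy sketch of Eq. (1) for one user's RBM follows; the array shapes are our assumptions (the shared weights stored as a 17770 x 5 x F tensor), and only the user's rated movies contribute to the sum, since the other visible units are missing from his RBM.

    import numpy as np

    def hidden_probabilities(ratings, W, b):
        # Eq. (1): p(h_j = 1 | V) = sigma(b_j + sum_ik v_i^k W_ij^k).
        # ratings: {movie_id: score in 1..5} for this user
        # W: shared weights, shape (17770, 5, F); b: hidden biases, shape (F,)
        activation = b.copy()
        for movie, score in ratings.items():
            # v_i^k is one-hot over the 5 scores, so it selects one weight row;
            # movies the user never rated contribute nothing
            activation += W[movie, score - 1]
        return 1.0 / (1.0 + np.exp(-activation))

With F = 100, the shared parameters amount to 17770 * 5 * 100 = 8,885,000 weights plus the biases, which is small enough to hold in memory while streaming over the 480,189 per-user RBMs.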

III. EVALUATION

This section describes the evaluation of the different algorithms for the Netflix Prize problem on public clouds. Like the other research groups, we use root mean squared error (RMSE) as the metric, in which large errors have more impact on the final result. Furthermore, we use the probe set as the test set and report our results only on the probe set, where the baseline is 0.9514, achieved by Netflix's proprietary CF system, Cinematch. The experiments on global effects (GE), KNN and restricted Boltzmann machines (RBM) are performed on cloud computing platforms, using the S3 [2] and EC2 [1] services of Amazon Web Services. The hardware and software details are given in Table I.

Table I
HARDWARE AND SOFTWARE DETAILS OF CLOUD COMPUTING

    Name     Description
    Region   US-EAST
    Memory   1.7 GB
    Disks    160 GB
    CPU      1.0-1.2 GHz 2007 Opteron or Xeon processor
    OS       Ubuntu 9.10 Karmic Koala
    AMI      ami-bb709dd
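For reference, the metric itself is straightforward; a small sketch:

    import math

    def rmse(predictions, actuals):
        # Root mean squared error: squaring gives large errors more impact.
        se = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
        return math.sqrt(se / len(actuals))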

A. Global Effect

We removed the 10 global effects introduced in the previous section, one by one.


[Figure 2. The result of Global Effect: RMSE versus the stages of GE (1-11), for our validation run and for BellKor's.]

[Figure 3. The distribution of Pearson correlation: CDF versus Pearson correlation.]

After removing each one, we compared the learned parameter α and the resulting RMSE with BellKor's, as shown in Figure 2. We can see that the RMSE keeps decreasing with the removal of global effects. Among these 10, not surprisingly, the largest improvements in RMSE are associated with the two main effects, the movie effect and the user effect, but in general they all contribute to the improvement of the RMSE from 1.1296 to 0.9654. We can also see that our results are somewhat different from BellKor's: the values of the meta parameter α we learned differ from what BellKor published, which results in a different RMSE. The good news is that our results beat theirs slightly.

[Figure 4. The convergence of K with the epoch number: K versus running epochs.]

B. K-NN

We run KNN on the residuals of Global Effects (GE) and compute the RMSE on the probe set, which includes 1,408,394 records. We divide the KNN algorithm into two stages, i.e., the Pearson correlation computation and the parameter learning. All the settings are the same as those in [6], some of which have already been described in the previous section. Figure 3 describes the distribution of the Pearson correlations among movies by the cumulative distribution function (CDF). As shown in the graph, the CDF is approximately linear in the Pearson correlation: less than 7% of the correlations are below 0.1, and almost half of the correlations are larger than 0.5. Figure 4 shows the convergence of K, the number of neighbors, in our parameter learning algorithm under APT2. After 110 epochs, K converges to about 20. Note that APT2 trains multiple parameters at the same time, in the manner of Gibbs sampling, so the curve stays horizontal while APT2 trains the other parameters. Since APT2 is a probing algorithm, K bounces every two epochs: APT2 tries two K values at a time and selects the one with the lower RMSE.


Figure 5 shows how the RMSE on the probe set changes during the training stage. The red line corresponds to the RMSE of Cinematch, which is 0.9514. The blue line is the result of the KNN algorithm, which achieves a better result than Cinematch: after the first epoch, the RMSE of KNN is 0.9579, and the final result is 0.9468. Here we plot only the declining RMSE values; the RMSE also fluctuates with the selection of the various parameters.

[Figure 5. The convergence of RMSE with the epoch number: KNN versus the Cinematch baseline.]

C. Restricted Boltzmann Machines

In this paper, we implemented the basic RBM. Our final RMSE from running RBM on the raw data is 0.9670; Figure 6 shows the performance of RBM. It is a pity that we could not reproduce BellKor's result, which was claimed to be less than 0.92; we still do not understand why, since we used the same model with the same settings.

[Figure 6. The result of RBM: RMSE versus running epochs, against the Cinematch baseline.]

IV. RELATED WORK

Companies like Amazon [1], Google, Microsoft, NetApp and EMC provide new features for their public or proprietary solutions; we run several data mining algorithms on public clouds. B. Sarwar et al. [7] analyzed different item-based recommendation generation algorithms, looking into different techniques for computing item-item similarities and for obtaining recommendations from them. R. Salakhutdinov et al. [6] proposed a class of two-layer undirected graphical models, called Restricted Boltzmann Machines (RBMs); they presented efficient learning and inference procedures for this class of models and demonstrated that RBMs can be successfully applied to large datasets, including the Netflix dataset, with very good performance. R. Bell et al. [3] enhanced the neighborhood-based approach, leading to a substantial improvement in prediction accuracy without a meaningful increase in running time. A. Töscher et al. [9] proposed a way to calculate similarities by formulating a regression problem, which enables the similarities to be extracted from the data in a problem-specific way; they also presented a neighborhood-aware matrix factorization algorithm which efficiently includes neighborhood information in a regularized matrix factorization (RMF) model.

V. CONCLUSION AND FUTURE WORK

Many organizations have started to use cloud services for different kinds of applications. As cloud computing becomes more and more popular, people are coming up with many new applications; for example, researchers have recently been trying to use the cloud to build large-scale emulated network experiments, since traditional testbeds face scalability problems. More and more applications will be deployed in the cloud. This paper is an attempt to deploy data mining and machine learning applications over cloud computing. We implemented Global Effects, KNN and Restricted Boltzmann Machines, using exactly the models that BellKor published. Global Effects and KNN work very well, and the result of running KNN on the residuals of Global Effects actually beats Cinematch slightly. Unfortunately, RBM did not perform as well as we expected, so our future work is to find better models that yield better results.

ACKNOWLEDGMENT

This work is supported by the National 863 Plan under Grant No. 2009AA01A402, the National Natural Science Foundation of P. R. China under Grant No. 60933002, the Chenguang Plan of Wuhan of China (No. 201050231073), and the Innovation Plan of WNLO.

REFERENCES

[1] "Amazon Elastic Compute Cloud," http://aws.amazon.com/ec2.

[2] "Amazon Simple Storage Service," http://aws.amazon.com/s3.

[3] R. M. Bell and Y. Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In ICDM, pages 43–52, 2007.

[4] Y. Koren. The BellKor Solution to the Netflix Grand Prize. 2009.

[5] M. Piotte and M. Chabbert. The Pragmatic Theory Solution to the Netflix Grand Prize. 2009.

[6] R. Salakhutdinov, A. Mnih, and G. E. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, pages 791–798, 2007.

[7] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In WWW, pages 285–295, 2001.

[8] A. Töscher, M. Jahrer, and R. M. Bell. The BigChaos Solution to the Netflix Grand Prize. 2009.

[9] A. Töscher, M. Jahrer, and R. Legenstein. Improved neighborhood-based algorithms for large-scale recommender systems. 2008.