Nonlinear Dyn (2010) 61: 347–361 DOI 10.1007/s11071-010-9653-2

CAS based clustering algorithm for Web users

Miao Wan · Lixiang Li · Jinghua Xiao · Yixian Yang · Cong Wang · Xiaolei Guo

Received: 5 July 2009 / Accepted: 3 January 2010 / Published online: 22 January 2010 © Springer Science+Business Media B.V. 2010

Abstract This article devises a clustering technique for detecting groups of Web users from Web access logs. In this technique, Web users are clustered by a new clustering algorithm based on the mechanism of the chaotic ant swarm (CAS). This CAS based clustering algorithm, called CAS-C, solves clustering problems from the perspective of chaotic optimization. The performance of CAS-C for detecting Web user clusters is compared with the popular k-means clustering algorithm. Clustering quality is evaluated by calculating the average intra-cluster and inter-cluster distances. Experimental results demonstrate that CAS-C is an effective clustering technique with smaller average intra-cluster distance and larger average inter-cluster distance than the k-means algorithm. A statistical analysis of the resulting distances also indicates that the CAS-C based Web user clustering algorithm has better stability. To show its utility, the proposed approach is applied to a pre-fetching task that predicts user requests, with encouraging results.

Keywords Clustering · Chaotic ant swarm (CAS) · Web access logs · Web user clustering

M. Wan · L. Li · J. Xiao · Y. Yang · C. Wang · X. Guo
Information Security Center, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, P.O. Box 145, Beijing 100876, China

M. Wan · L. Li · Y. Yang · C. Wang · X. Guo
Key Laboratory of Network and Information Attack & Defence Technology of MOE, Beijing University of Posts and Telecommunications, Beijing 100876, China

M. Wan · L. Li · Y. Yang · C. Wang · X. Guo
National Engineering Laboratory for Disaster Backup and Recovery, Beijing University of Posts and Telecommunications, Beijing 100876, China

J. Xiao
School of Science, Beijing University of Posts and Telecommunications, Beijing 100876, China

1 Introduction

The problem of detecting clusters or communities in networks has been studied by mathematicians, computer scientists, and more recently by physicists. A large number of clustering algorithms have been developed to find a reasonably good partition of a complex network, from the perspectives of dynamics [1–3], statistical mechanics [4, 5] and data mining [6, 7]. These studies are concerned with the detection of structural network properties which have been observed in many different topologies, including metabolic networks, banking networks, and most notably social networks. Yet the World Wide Web (WWW), the most widely and frequently used topological system, has received little attention, and many of its intrinsic properties are still unknown to physicists.


The continuous growth in the size and use of the WWW has made access behaviors play an important role. Web access logs, which hold considerable information about frequently returning users, are an ideal tool for studying the characteristics and properties of the Web. In physics, several works have recently centered on capturing user behaviors by analyzing Web logs and traffic information [8–10], but some fundamental issues still need to be further understood. Detecting clusters of Web users from their navigation behaviors is one of these issues, since clustering Web users can capture users' task-oriented behavior patterns, which can be used for building user profiles and for other advanced Web applications such as Web recommendation or personalization [11, 12], Web caching and pre-fetching [13, 14], improvement of Web design [15, 16] and e-commerce [17].

This raises a new question: how can Web users be clustered from access logs? To address this problem, some standard data mining techniques such as the k-means algorithm [18] have been introduced, partitioning n users into k clusters by minimizing the sum of the squared distances to the cluster centers. But traditional clustering methods are sensitive to the initial values and may easily get trapped in a local optimum. To overcome this drawback, chaotic optimization techniques, which use chaotic variables to search the entire space, are employed for finding globally optimal solutions. Thus far, many theoretical methods of chaotic optimization have been proposed, such as chaotic neural networks [19], chaotic simulated annealing [20], chaotic particle swarm optimization [21], and improved algorithms [22, 23]; however, few of them have been used in clustering problems. Chaotic ant swarm (CAS) [24] is a recently proposed chaotic optimization algorithm inspired by the chaotic behavior of ants. It has since been applied in many practical systems [25–27].

Different from other existing chaotic optimization methods, CAS is a deterministic process which integrates the chaotic behavior of the individual ant with the self-organizing foraging activities achieved by ant colonies. In this paper, we recast the clustering task as a chaotic optimization problem and propose, for the first time, a CAS based clustering algorithm (CAS-C) to group Web users. The performance of the CAS-C based Web clustering approach is compared with the classic k-means clustering algorithm in terms of average intra-cluster distance and average inter-cluster distance. Experimental results show that the proposed algorithm combines the following three advantages: (i) it finds a globally optimal clustering result; (ii) no centroid or center needs to be selected in the initial step; and (iii) it has good stability. To demonstrate these advantages more clearly, we also apply our algorithm to a real pre-fetching task, predicting the future requests of clustered users from their common pages, which takes advantage of the spatial locality of grouped users.

2 CAS based clustering (CAS-C) algorithm

In this section, we give the formal mathematical model and the algorithm process of CAS-C for solving the general clustering problem.

2.1 Overview of chaotic ant swarm (CAS) algorithm

Chaotic dynamics has been studied by many researchers, and its effectiveness has been shown by encouraging results in optimization applications over the past 20 years. Several search strategies based on chaos have been found to have good hill-climbing capabilities, to escape from local optima, and to be more effective than random search. In 1991, Cole pointed out that the ant colony exhibits a periodic behavior while a single ant shows low-dimensional deterministic chaotic activity patterns [28]. But the problem of how the chaotic behavior of a single ant relates to the self-organization and foraging behaviors of the ant colony has received little attention. From the perspective of dynamics, there are interactions between the two kinds of behaviors. These interactions help ants to find food and survive, and they can be adapted to the solution of optimization problems. Consequently, inspired by the chaotic and self-organization behaviors of ants, the chaotic ant swarm (CAS) [24] was developed to solve optimization problems; it incorporates the chaotic dynamics of ants, swarm organization and optimization principles.

In CAS, an ant colony composed of l ants is considered. These ants are located in a D-dimensional search space S and they try to minimize a function f. The ant colony undergoes two successive phases, a chaotic phase and an organization phase. To achieve self-organization from the chaotic state, a successively decreasing organization variable $y_i$ is introduced into CAS. The influence of the organization variable on the ant's behavior is very weak in the first phase, and the behavior of a single ant is chaotic. The motion of the ants is approximately governed by the following equation:

$$x_{id}(t) = x_{id}(t-1)\, e^{3 - \psi_d x_{id}(t-1)}. \tag{1}$$

Equation (1) is a chaotic map suggested by Solé et al. [29]. As $y_i$ changes continually by small amounts over time, the influence of the organization on the behavior of the individual ant becomes stronger and stronger. When the effect of the organization is sufficiently large, the chaotic behavior of the individual ant disappears. The ants then do some further searches and move to the best position that they can find in the search space. Throughout the whole process, they exchange information with their neighbors continually. Mathematically, the position update of ant i can be described as [24]:

$$\begin{cases} y_i(t) = y_i(t-1)^{(1+r_i)}, \\ x_{id}(t) = (x_{id}(t-1) + V_{id})\, e^{(1 - e^{-a y_i(t)})(3 - \psi_d (x_{id}(t-1) + V_{id}))} \\ \qquad\qquad + (p_{id}(t-1) - x_{id}(t-1))\, e^{(-2 a y_i(t) + b)} - V_{id}, \end{cases} \tag{2}$$

where t is the current iteration step and (t − 1) is the previous iteration step; $y_i(0) = 0.999$; and $x_{id}(t)$ is the current state of the dth dimension of ant i. In order to achieve the information exchange between individuals and the movement to the new site with the best fitness value, CAS introduces the term $(p_{id} - x_{id})$, where $p_{id}(t-1)$ is the best position found by the ith ant and its neighbors within (t − 1) steps. $V_{id}$ ($0 < V_{id} < 1$) determines the coordinate transformation of the search region for ant i; a is a sufficiently large positive constant which is used to enlarge the effect of $y_i(t)$, and a = 2000 is large enough; b is a constant with $0 \le b \le 2/3$, which keeps the system convergent.

In addition, $r_i$ and $\psi_d$ are two important parameters. $r_i$ is the organization factor of ant i, which directly affects the convergence speed of CAS: the larger $r_i$ is, the faster the system converges. The form of $r_i$ can be designed according to the concrete problem and runtime. Each ant could have a different $r_i$, such as $r_i = 0.1 + 0.2 \times \mathrm{rand}(1)$. $\psi_d$ affects the search range of CAS. If the search interval is $[-\omega_d/2, \omega_d/2]$, we obtain the approximate formula $\omega_d \approx 7.5/\psi_d$, and $V_{id} = \omega_d/2$ ($0 < V_{id} < 1$) will shift the search interval to $[0, \omega_d]$. The impact on the optimization result of adjusting each parameter in (2) is fully discussed in [24].

2.2 CAS based clustering (CAS-C) algorithm

Clustering is a data mining technique which classifies objects into groups (clusters) without any prior knowledge. Given a sample data set $S = \{x_1, x_2, \ldots, x_n\}$, the main goal of clustering is to find k clusters $C_1, C_2, \ldots, C_k$ which satisfy:

$$\begin{cases} \bigcup_{i=1}^{k} C_i = S; \\ C_i \cap C_j = \emptyset, & i, j = 1, 2, \ldots, k;\ i \ne j; \\ C_i \ne \emptyset, & i = 1, 2, \ldots, k. \end{cases} \tag{3}$$

The criterion for this classification is that the data in a cluster are similar (or related) to one another and different from (or unrelated to) the objects in other clusters. From a mathematical viewpoint, cluster $C_i$ can be determined by

$$\begin{cases} C_i = \{x_j \mid \|x_j - z_i\| \le \|x_j - z_p\|,\ x_j \in S\}, \\ \qquad p \ne i,\ p = 1, 2, \ldots, k, \\ z_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j, \quad i = 1, 2, \ldots, k, \end{cases} \tag{4}$$

where $\|\cdot\|$ denotes the distance between any two data points in the sample set, and $z_i$ is the center of cluster $C_i$, represented by the average (mean) of all the points in the cluster. We can see from (4) that $C_i$ is composed of the data items nearest to $z_i$. So, the task of clustering can be seen as a process of determining the k centers of $\{C_1, C_2, \ldots, C_k\}$. The common criterion for evaluating clustering results is the Sum of Squared Error (SSE):

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - z_i\|^2. \tag{5}$$
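As an illustration of (4) and (5), the following minimal Python sketch assigns each point to its nearest center, recomputes each center as the mean of its cluster, and evaluates the SSE. The function names and the four-point toy data set are hypothetical and chosen only for this example; Euclidean distance is assumed.

```python
import numpy as np

def assign_to_centers(data, centers):
    """First line of (4): label each point with the index of its nearest center."""
    # dists[j, p] is the Euclidean distance between point j and center p.
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def cluster_means(data, labels, k):
    """Last line of (4): each center is the mean of the points in its cluster."""
    # Assumes every cluster is non-empty, as required by (3).
    return np.array([data[labels == i].mean(axis=0) for i in range(k)])

def sse(data, labels, centers):
    """Eq. (5): sum of squared distances from each point to its cluster center."""
    return float(((data - centers[labels]) ** 2).sum())

# Toy data: two well separated pairs of points (illustrative only).
data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
centers = np.array([[0.0, 0.0], [4.0, 4.0]])
labels = assign_to_centers(data, centers)
centers = cluster_means(data, labels, k=2)
total = sse(data, labels, centers)
print(labels, centers, total)
```

Each of the four toy points lies at distance 0.5 from its cluster mean, so the SSE of this example is 4 × 0.25 = 1.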

For each data point in the given set, the error is the distance to the nearest cluster center. Experiments show that the smaller the SSE is, the better the clustering results are. Thus, the clustering problem is converted into a process of searching for k centers $z_1, z_2, \ldots, z_k$ which minimize the sum of the distances between every sample $x_j$ and its closest center. This can be considered as a function optimization problem, and the objective function takes the form

$$f = \min_{z_1, z_2, \ldots, z_k} \sum_{j=1}^{n} \min_{1 \le p \le k} \|x_j - z_p\|^2. \tag{6}$$
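The quantity that (6) asks to minimize can be evaluated directly for any candidate set of centers, which is what an optimization-based method needs as a fitness value. The sketch below is an illustration only: the function name `objective` and the toy data are hypothetical, and squared Euclidean distance is assumed.

```python
import numpy as np

def objective(centers, data):
    """Value of the sum in (6) for fixed centers: squared distance to the nearest center."""
    # sq[j, p] is the squared Euclidean distance between point j and center p.
    sq = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(sq.min(axis=1).sum())

data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
good = np.array([[0.0, 0.5], [4.0, 4.5]])   # one center per pair of points
bad = np.array([[0.0, 0.5], [0.0, 0.6]])    # both centers near the first pair
print(objective(good, data), objective(bad, data))
```

A candidate that places one center in each group of points yields a much smaller value than one that leaves a group without a nearby center, which is the behavior a search over centers relies on.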

Different definitions of distance lead to different optimization models. It is easy to see that (6) tries to make the results of clustering more compact and independent. As data clustering can be seen as an optimization problem of seeking a globally optimal solution to (6), in this paper we present a CAS based clustering approach, called the CAS-C algorithm, to solve this problem.

In CAS-C, clustering is regarded as a process of ant foraging, and the centers or centroids can be seen as the goal (food) of the search. There is no initial centroid to be preset in our algorithm. In the initial step, a number of data points in the sample set are randomly selected as the positions of the ants. After a number of iterations, the ants move and converge to points that are considered as the centers of the clusters in the data space.

Generally, there are two ways to stop the iteration of an optimization-based algorithm. One is to preset, from experience with the properties of the algorithm, a number of iterations large enough for convergence. The other is to monitor the value of the objective function and stop when the value is stable. We choose the former throughout this paper and preset Istep as the maximum number of iteration steps. When the iteration count reaches Istep, the clustering process stops and outputs the results. Based on (2), the iteration equation for the CAS-C algorithm is given as

$$\begin{cases} y_i(t) = y_i(t-1)^{(1+r_i)}, \\ z_{pid}(t) = (z_{pid}(t-1) + V_{id})\, e^{(1 - e^{-a y_i(t)})(3 - \psi_d (z_{pid}(t-1) + V_{id}))} \\ \qquad\qquad + (\mathit{zbest}_{pid}(t-1) - z_{pid}(t-1))\, e^{(-2 a y_i(t) + b)} - V_{id}, \end{cases} \tag{7}$$

where
(1) t is the current iteration step, and (t − 1) is the previous iteration step;
(2) $z_{pid}(t)$ is the current state of the dth dimension (d = 1, 2, ..., D) of ant i for the pth desired center $z_p$ (p = 1, 2, ..., k, where k is the desired cluster number, which must be assigned before the algorithm starts);
(3) $\mathit{zbest}_{pid}(t-1)$ is the best position of the dth dimension found by all the ants within (t − 1) steps for ant i;
(4) the other parameters have the same meaning as in (2).

We now give a work flow explaining how the CAS-C algorithm is implemented. Given the desired cluster number k, the CAS-C algorithm is carried out as follows:

Step 1: Initialization. Several parameters must be assigned before the iteration starts in CAS-C. Initialize $\psi_d$ for the scope of the search in the data space, the transformation scale $V_{id}$ of the search region, the ant number antnum, the maximum number of generations Istep, and the organization factor $r_i$. Then set t = 1 and generate the positions of the antnum ants randomly from the data space for each center. In CAS-C, no centroid or center needs to be selected in the initial step.

Step 2: Iteration process. At the tth step, the best position found by all ants within (t − 1) steps is picked out as $\mathit{zbest}_{pi}(t-1)$ for ant i. Then every individual ant moves in the data space according to (7). After moving, select the proper $\mathit{zbest}_{pi}(t)$ for each ant by minimizing (6), and keep it for the next iteration.

Step 3: Set t = t + 1, and return to Step 2. The iteration cycle terminates when t equals the maximum iteration step Istep.

Step 4: Mark final centers. All the ants converge to some points in the search space after the iteration process. The final positions of the ants are considered as the required centers.

Step 5: Perform clustering on the data set. Allocate all the data according to (4) into the clusters represented by the final centers obtained from the iteration process. Mark every data object with its corresponding label.
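The work flow above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the values psi = 7.5, v = 0.5 and b = 0.1, the scaling of the toy data into [-0.5, 0.5], the use of one global best position shared by all ants, and returning that best position as the final centers are all choices made for this sketch. The names `objective` and `cas_c` are hypothetical.

```python
import numpy as np

def objective(centers, data):
    """Sum in (6): total squared distance from each point to its nearest center."""
    sq = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(sq.min(axis=1).sum())

def cas_c(data, k, antnum=20, istep=800, psi=7.5, v=0.5, a=2000.0, b=0.1, seed=0):
    """Illustrative CAS-C loop following (7); psi, v and b are assumed values."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # Step 1: every ant holds k candidate centers drawn from the data points.
    z = np.stack([data[rng.choice(n, size=k, replace=False)] for _ in range(antnum)])
    y = np.full(antnum, 0.999)              # organization variables, y_i(0) = 0.999
    r = 0.5 + 0.1 * rng.random(antnum)      # organization factors, r_i = 0.5 + rand * 0.1
    fit = np.array([objective(zi, data) for zi in z])
    best, best_f = z[fit.argmin()].copy(), float(fit.min())
    for _ in range(istep):                  # Steps 2 and 3
        y = y ** (1.0 + r)                  # first line of (7)
        ay = (a * y)[:, None, None]         # broadcast over centers and dimensions
        u = z + v
        chaotic = u * np.exp((1.0 - np.exp(-ay)) * (3.0 - psi * u))
        organizing = (best[None, :, :] - z) * np.exp(-2.0 * ay + b)
        z = chaotic + organizing - v        # second line of (7)
        fit = np.array([objective(zi, data) for zi in z])
        if fit.min() < best_f:              # remember the best position found so far
            best, best_f = z[fit.argmin()].copy(), float(fit.min())
    # Steps 4 and 5: use the best centers found and label each point as in (4).
    sq = ((data[:, None, :] - best[None, :, :]) ** 2).sum(axis=2)
    return best, sq.argmin(axis=1), best_f

# Toy data: two small groups of points inside [-0.5, 0.5] (illustrative only).
toy_rng = np.random.default_rng(1)
toy = np.vstack([toy_rng.normal(-0.3, 0.03, (30, 2)), toy_rng.normal(0.3, 0.03, (30, 2))])
toy_centers, toy_labels, toy_f = cas_c(toy, k=2, istep=100)
print(toy_centers, toy_f)
```

In this sketch each ant carries a full set of k candidate centers, so one evaluation of (6) per ant serves as its fitness. Other readings of Step 2, for example neighbor-based best positions as in (2), would change only the line that selects `best`.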

3 Web user clustering based on CAS-C

Detecting user clusters of a Web site is quite different from traditional clustering tasks because of the inherent difference between Web user behaviors and ordinary numerical data. In this section, we give a full description of the proposed Web user clustering scheme using CAS-C.


3.1 Preprocessing the web logs

The first part of Web user cluster detection, called preprocessing, is usually complex and demanding, and depends on the specific circumstances. Generally, it comprises three domain-dependent tasks: data cleaning, user identification, and session identification.

3.1.1 Data cleaning

Depending on the application and task, Web access logs may need to be cleaned of irrelevant request entries, keeping only useful and relevant information. For the purpose of user clustering, data tracked in the Web logs that are not content pages or documents, such as graphical page content (e.g., jpg and gif files) and dynamic pages (those with filename suffixes such as jsp, asp or php), need to be removed. In general, a user does not explicitly request all of the graphics on a Web page; they are downloaded automatically because of the HTML tags. Since the main intent of Web Usage Mining is to get a picture of the user's behavior, it does not make sense to include file requests that the user did not explicitly request [30]. Duplicated requests are also filtered in this step, leaving only one entry per page request.

3.1.2 User identification

Identifying different users is an important issue in data preprocessing. There are several ways to distinguish individual visitors in Web log data, which are collected from three main sources: Web servers, proxy servers and Web clients. The most obvious assumption is that a single user in Web logs acquired from the server and proxy sides is identified by the same IP address. However, this is not very accurate because, for example, a visitor may access the Web from different computers, or many users may use the same IP address (if a proxy is used). This problem can be partially solved by the use of cookies [31], URL rewriting [11], or the requirement for user registration [32]. User identification from client trace logs is much easier because these logs are traced via different user IDs.

3.1.3 Session identification

After individual users are identified, the next step is dividing each user's click stream into different segments, which are called sessions. Most session identification approaches identify user sessions by a maximum timeout. Many commercial products use 30 minutes as a default timeout [30]. Besides, Web browsers may also request content at regular intervals if the page asks for it. For example, www.cnn.com uses the "http-equiv" meta tag to indicate that the page should be refreshed every 30 minutes [32]. Reference [33] established a timeout of 25.5 minutes (or near values) based on empirical data. If the time between page requests exceeds a certain limit of access time, we assume that the user is starting a new session.
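The cleaning and session identification steps can be sketched as follows. This is a minimal illustration, assuming that each log record has already been reduced to a `(user, timestamp_in_seconds, url)` tuple, that users are identified by their user ID, and that a duplicate is an entry identical in all three fields. The function names and sample records are hypothetical.

```python
from collections import defaultdict

# Suffixes named in Sect. 3.1.1 as non-content requests.
NON_CONTENT_SUFFIXES = (".jpg", ".gif", ".jsp", ".asp", ".php")

def clean(records):
    """Keep content-page requests only and drop exact duplicate entries."""
    seen, kept = set(), []
    for rec in records:                      # rec = (user, timestamp_seconds, url)
        if rec[2].lower().endswith(NON_CONTENT_SUFFIXES) or rec in seen:
            continue
        seen.add(rec)
        kept.append(rec)
    return kept

def sessionize(records, timeout=30 * 60):
    """Split each user's click stream wherever the gap exceeds `timeout` seconds."""
    sessions = defaultdict(list)             # user -> list of sessions (lists of URLs)
    last_seen = {}
    for user, ts, url in sorted(records, key=lambda rec: (rec[0], rec[1])):
        if user not in last_seen or ts - last_seen[user] > timeout:
            sessions[user].append([])        # start a new session for this user
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return dict(sessions)

log = [
    ("u1", 0, "/courses/Home.html"),
    ("u1", 5, "/images/logo.gif"),           # graphic: removed by clean()
    ("u1", 0, "/courses/Home.html"),         # exact duplicate: removed by clean()
    ("u1", 600, "/faculty/Home.html"),
    ("u1", 4000, "/courses/Home.html"),      # gap of 3400 s > 1800 s: new session
    ("u2", 50, "/pointers/Home.html"),
]
sessions = sessionize(clean(log))
print(sessions)
```

The 30 minute default follows the timeout quoted above; other thresholds, such as the 25.5 minutes of [33], can be passed through the `timeout` argument.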

3.2 Extraction of feature vectors and user modeling

Once all the Web logs are preprocessed, the log data should be further analyzed in order to find common user features and create a proper user model for user clustering. The resulting user model matrix will be the input of the Web user clustering algorithm.

3.2.1 Label vector selection

Based on the results of user identification, every user in the access logs has visited a set of Web pages. Some pages may be requested by a user only within a very short period, such as one session, and never visited again. These pages just represent temporary interests of users, so they are filtered out. Pages or URLs requested in at least 2 sessions by one user, which to some extent reflect the stable interests of this user, are selected as user interested pages. After this process, each user holds a 'user interested page set' $S_j$ (j = 1, 2, ..., n, where n is the total number of identified users) which captures the user's behavior features and is more suitable for clustering analysis. Since pages with very low hit rates in the log file reflect only the personal interest of individual users, these pages should be removed from $S_j$ when they are requested by fewer than a pre-set number of users or hosts. After this filtering of low support pages, we get a label vector L = {URL1, URL2, ..., URLm} composed of the remaining m URLs. Each element in the label vector L has been visited by at least the pre-set number of users.


The whole process to extract the label vector is shown below:

Procedure: Label vector selection
Input: L = null (initialization of label vector)
       Sall = {S1, S2, ..., Sn}
for (each Sj in Sall, j = 1, 2, ..., n)
    for (each URLp in Sj)
        if (URLp is requested in only one session)
            Sj.remove(URLp)
        else
            Sj.keep(URLp)
        end if
    end for
end for
S = {S1, S2, ..., Sn} (S is the user interested page set)
for (each URLq in S)
    if (URLq is requested by at least the pre-set number of users)
        L.add(URLq)
    end if
end for
Output: L = {URL1, URL2, ..., URLm}

3.2.2 User model matrix creation

The output vector L = {URL1, URL2, ..., URLm} of the label vector selection step is the basis for extracting user access features. For each user j (j = 1, 2, ..., n, where n is the total number of users), based on the label vector L, we create a feature vector $A_j = \{R_1, R_2, \ldots, R_m\}$ which represents the access pattern of the jth user. The element $R_i$ (i = 1, 2, ..., m) in vector $A_j$ is a binary variable: 1 means that this user requested URLi, while 0 means that the user did not request this URL. Equation (8) describes the feature vector extraction for each user:

$$R_i = \begin{cases} 1, & \text{if } \mathrm{URL}_i \text{ is requested}, \\ 0, & \text{if } \mathrm{URL}_i \text{ is not requested}. \end{cases} \tag{8}$$

Feature vectors to some extent represent the common interests of users. Gathering the feature vectors $A_j$ (j = 1, 2, ..., n) of all users yields an n × m matrix $A = \{A_1, A_2, \ldots, A_n\}^T$. This matrix is called the user model matrix in this paper. Each row of the user model matrix represents one user, and is taken as one input data item for the CAS-C based Web user clustering algorithm.
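The two steps of Sect. 3.2 can be sketched in Python as follows. This is an illustration only: the per-user sessions are assumed to be lists of URL lists, the thresholds `min_sessions` and `min_users` are exposed as parameters because they are pre-set values, and an entry of the matrix is set to 1 when the URL is in the user's interested page set (one possible reading of (8)). All names and the sample data are hypothetical.

```python
from collections import Counter

def interested_pages(user_sessions, min_sessions=2):
    """URLs that one user requested in at least `min_sessions` distinct sessions."""
    counts = Counter(url for session in user_sessions for url in set(session))
    return {url for url, c in counts.items() if c >= min_sessions}

def label_vector(page_sets, min_users=2):
    """URLs found in the interested page sets of at least `min_users` users."""
    support = Counter(url for pages in page_sets.values() for url in pages)
    return sorted(url for url, c in support.items() if c >= min_users)

def user_model_matrix(page_sets, labels):
    """Binary n x m matrix as in (8): one row per user, one column per label URL."""
    users = sorted(page_sets)
    return users, [[1 if url in page_sets[u] else 0 for url in labels] for u in users]

sessions = {
    "u1": [["/a", "/b"], ["/a"], ["/b", "/c"]],
    "u2": [["/a"], ["/a", "/c"], ["/c"]],
    "u3": [["/b"], ["/b", "/d"]],
}
page_sets = {u: interested_pages(s) for u, s in sessions.items()}
labels = label_vector(page_sets)
users, matrix = user_model_matrix(page_sets, labels)
print(labels, users, matrix)
```

Here "/c" survives for user u2 only and "/d" appears in a single session, so neither reaches the label vector, and the three users are described by their requests for "/a" and "/b".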


3.3 Web user clustering using CAS-C algorithm

It follows from the definition of clustering that a good clustering method should produce clusters with high intra-class similarity and low inter-class similarity. The similarity is expressed in terms of a distance function, which usually differs considerably between applications. The Euclidean distance is chosen to measure the distance among data points in our CAS-C based Web user clustering algorithm. So, clustering results are presented and measured by calculating the average intra-cluster distance and the average inter-cluster distance. As mentioned in Sect. 2, CAS-C converts the clustering task into a global optimization problem. The procedure of Web user clustering using the CAS-C algorithm is illustrated in Fig. 1.
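The text does not spell out the formulas for the two quality measures, so the following sketch shows only one plausible definition, stated here as an assumption: the intra-cluster distance is the mean Euclidean distance of members to their cluster center, averaged over clusters, and the inter-cluster distance is the mean Euclidean distance over all pairs of centers. The function names are hypothetical, and the values reported in Sect. 4 may rest on a different normalization.

```python
import numpy as np

def avg_intra_cluster_distance(data, labels, centers):
    """Mean over clusters of the mean Euclidean distance of members to their center."""
    per_cluster = [
        np.linalg.norm(data[labels == i] - c, axis=1).mean()
        for i, c in enumerate(centers)
        if np.any(labels == i)               # skip clusters without members
    ]
    return float(np.mean(per_cluster))

def avg_inter_cluster_distance(centers):
    """Mean Euclidean distance over all pairs of distinct cluster centers."""
    k = len(centers)
    pairs = [np.linalg.norm(centers[i] - centers[j])
             for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(pairs))

data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 0.5], [4.0, 4.5]])
intra = avg_intra_cluster_distance(data, labels, centers)
inter = avg_inter_cluster_distance(centers)
print(intra, inter)
```

Under these definitions a good clustering gives a small first value and a large second value, matching the criterion of high intra-class and low inter-class similarity.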

4 Experiments

In this section, we discuss the whole process of the CAS-C based Web user clustering approach, and present several simulation experiments on the Matlab platform to illustrate in detail the advantages of the proposed algorithm. The quality of our algorithm is evaluated through experimental comparisons with the k-means algorithm.

4.1 Preprocessing of data source

The data source for applying our proposed algorithm is the Web site access log of the Boston University Computer Science department [34]. It was collected by the Oceans Research Group (http://cs-www.bu.edu/groups/oceans/Home.html) at Boston University. The log file is available at The Internet Traffic Archive [35] sponsored by ACM SIGCOMM. It contains a total of 1,143,839 requests for data transfer, representing a population of 762 different users. The raw data in the access log has the following format:

machinename, timestamp, userid, requestedURL, sizeofdocument, bytessentinreply

Because of the large file size, we choose the part of the logs covering January and February as our data source. In session identification, we set the maximum elapsed time to 30 minutes, which is used in many commercial applications. According to the userid item in the log data, we select 50 users in the step of user identification. After access log preprocessing, we get 527 sessions from a total of 50 users. The user IDs are renumbered, and each of them is assigned an identification number between 1 and 50.

Fig. 1 Working flow of Web user clustering approach using CAS-C algorithm

4.2 Feature vectors and user model matrix

During the process of label vector selection, we set the minimum hit number to 2. Content pages requested by fewer than 2 users are filtered out. After user interested page selection and low support page filtering, there are 13 requested URLs left, which compose the label vector L (as shown in Appendix A). Then we construct a 50 × 13 matrix $A = \{A_1, A_2, \ldots, A_{50}\}^T$ as the user model matrix (as shown in Appendix B), and take it as the input of our Web user clustering algorithm.

4.3 Performance and analysis of experiment results

To show the quality and effectiveness of the proposed CAS-C based Web user clustering algorithm, we compare the clustering results with the well-known k-means clustering algorithm in terms of the average intra-cluster distance in each cluster and the average inter-cluster distance between the clusters, choosing the cluster number k = 4, 5 and 6. In CAS-C, we set antnum = 20, Istep = 800, ri = 0.5 + rand(1) ∗ 0.1. For this clustering task, 20 ants are enough for searching the global optimal solution to (6). Each experiment is run 10 times for each of the 3 values of k. Figures 2 and 3 present the intra-cluster and inter-cluster distance values obtained in every experiment. From the distance comparisons in Figs. 2 and 3 we find that: (i) Whatever the value of k, the proposed CAS-C algorithm groups users into clusters with smaller intra-cluster distance and larger inter-cluster distance. That is to say, the CAS-C based Web clustering algorithm makes the data (users) in one cluster more separated from other clusters and closer to the data (users) in the same cluster. (ii) CAS-C is a clustering algorithm with better stability that is not sensitive to the initial values. The distance values obtained by the CAS-C algorithm always differ only slightly, while large changes are shown in the distance curves obtained by the k-means algorithm.

Fig. 2 Intra-cluster distances acquired by both k-means and CAS-C algorithm for various k values: (a) k = 4; (b) k = 5; (c) k = 6

Fig. 3 Inter-cluster distances acquired by both k-means and CAS-C algorithm for various k values: (a) k = 4; (b) k = 5; (c) k = 6

In order to show the effectiveness and stability of the proposed CAS-C based Web user clustering algorithm more clearly, we analyze the results of the above

experiments by calculating and comparing the means and variances of the resulting data. Table 1 gives an overview of the clustering performance comparison between the k-means algorithm and our Web user clustering algorithm based on CAS-C. The resulting distances in Table 1 are average values (means) obtained from 10 runs of each experiment. Figure 4 gives the variance comparisons of intra-cluster distance and inter-cluster distance between the k-means algorithm and our Web user clustering algorithm based on CAS-C.

Table 1 Average intra-cluster and inter-cluster distance of various k values obtained from k-means and CAS-C algorithm

k    Avg. intra-cluster distance    Avg. inter-cluster distance
     k-means     CAS-C              k-means     CAS-C
4    0.7605      0.6600             4.9435      6.2459
5    0.7000      0.5729             6.8919      7.8266
6    0.7264      0.6124             8.6063      9.7368

Fig. 4 Variance comparisons for various k values obtained from k-means and CAS-C algorithm

The results of Table 1 and Fig. 4 can be summarized as follows: (1) The CAS-C based Web user clustering algorithm has a smaller average intra-cluster distance and a larger average inter-cluster distance than the k-means algorithm for k = 4, 5 and 6. This indicates that the Web user clustering algorithm based on CAS-C is more effective than the k-means algorithm for the tested values of k. (2) The variances of the intra-cluster distances obtained by the CAS-C based Web user clustering approach are smaller than those of the k-means algorithm at each k, especially at k = 5. The same holds for the variances of the inter-cluster distances. In statistics, the smaller the variance of a variable, the less the output deviates from its expected value (mean). These comparisons show that the results of the CAS-C based Web clustering algorithm change less across experiments, and that CAS-C is the more stable clustering technique.

5 Application: pre-fetching

5.1 Scheme

The results produced by our Web user clustering algorithm can be used for a pre-fetching and caching task, which pre-fetches URL objects before users request them and loads them into the Web server cache. An effective pre-fetching scheme needs an efficient method to predict users' requests and to make proper pre-fetching and caching strategies. Web caching and Web pre-fetching are two important techniques used to reduce the response time perceived by users [36]. The caching technique exploits temporal locality, whereas the pre-fetching technique utilizes the spatial locality of Web objects. An efficient caching and pre-fetching scheme effectively reduces the load and response time of Web servers. Various techniques, including Web mining approaches [37–39], have been used to improve the accuracy of predicting user access patterns from Web access logs, making the pre-fetching of Web objects more effective. But most of these techniques focus on predicting requests for a single user [40, 41].


The interests of user groups have received little attention, and predicting the requests that the users of a group have in common is a new issue in the area of pre-fetching. Our pre-fetch task exploits the spatial locality within grouped users. We first cluster users with the proposed CAS-C based Web user clustering algorithm, and then take the pages requested by most users in each cluster as the pre-fetching objects. We use the Web access logs of January and February for user clustering and request prediction. The accuracy of our pre-fetching scheme is verified by comparing the predicted URLs with the access logs of March.

The pre-fetch rule is defined as follows. For each cluster, let $P = \{p_1, p_2, \ldots, p_m\}$ be the set of Web pages on the Web server. In this paper, the pre-fetch rule is defined as an implication of the form $\{p_1, p_2, \ldots, p_i\} \xrightarrow{c} \{q_1, q_2, \ldots, q_j\}$, where $P_1 = \{p_1, p_2, \ldots, p_i\}$ is the set of pages that users requested in January and February, $P_2 = \{q_1, q_2, \ldots, q_j\}$ is the set of pages to be pre-fetched in March, $P_2 \subseteq P_1 \subseteq P$, and $c$ is the proportion of users who have requested $P_2$ in January and February. After parameter investigation, we select $c = 0.5$ for our pre-fetch task, which means that pages requested by at least 50% of the users of a cluster in January and February are pre-fetched for March.

5.2 Experiments

After the user clustering comparison experiments in Sect. 4, we choose the cluster number as 5 because our algorithm has the smallest average intra-cluster distance at k = 5. Two new parameters are used to verify the performance of our pre-fetching task: (1) hits, the number of pre-fetched URLs that are actually requested, and (2) accuracy, the ratio of hits to the number of pre-fetched URLs.

Table 2 shows the clustering results we obtained. Not all users are assigned to the five clusters: 3 users do not belong to any of them because of the sparseness of the log data. From Table 2, we can conclude that: (1) different clusters of users can be identified by the common teachers or courses they selected, such as Clusters 2, 3, and 4; (2) some groups of users are clustered by their common interest, such as Cluster 5; (3) many users only visit the home page, which is the entry of this Web site, to check information, such as Cluster 1.

Table 2 Clustering results obtained from CAS-C based Web user clustering algorithm

| Cluster | Members | Common user requests |
|---|---|---|
| 1 | 2, 3, 7, 10, 13, 14, 17, 22, 24, 25, 27, 29, 30, 31, 33, 34, 36, 38, 39, 40, 41, 43, 44, 45, 47, 48, 49, 50 | cs-www.bu.edu/courses/Home.html |
| 2 | 8, 11, 19, 32 | cs-www.bu.edu/pointers/Home.html, cs-www.bu.edu/students/grads/Home.html, cs-www.bu.edu/students/grads/oira/Home.html, cs-www.bu.edu/students/grads/oira/cs112/hmwrk1.html |
| 3 | 1, 9, 12, 26 | cs-www.bu.edu/courses/Home.html, cs-www.bu.edu/faculty/gacs/courses/cs410/Home.html |
| 4 | 15, 18, 20, 37, 46 | cs-www.bu.edu/faculty/Home.html, cs-www.bu.edu/faculty/crovella/Home.html |
| 5 | 4, 16, 21, 28, 35, 42 | cs-www.bu.edu/courses/Home.html, cs-www.bu.edu/faculty/heddaya/CS103/Home.html, cs-www.bu.edu/pointers/Home.html |

Based on the above clustering results, we pre-fetch URL requests for each user in all five clusters. To verify the performance of our pre-fetch scheme, we calculate the accuracy of pre-fetch hits by comparing the predicted URLs with the access logs of March. The pre-fetching results are shown in Table 3.

Table 3 Pre-fetching results based on common profiles from results of CAS-C based Web clustering algorithm

| Cluster | Number of URLs pre-fetched | Member | Requested URLs | Hits | Accuracy (%) |
|---|---|---|---|---|---|
| 1 | 1 | 2 | 1 | 1 | 100 |
| 1 | 1 | 3 | 48 | 1 | 100 |
| 1 | 1 | 7 | 3 | 1 | 100 |
| 1 | 1 | 10 | 1 | 1 | 100 |
| 1 | 1 | 13 | 243 | 1 | 100 |
| 1 | 1 | 14 | 17 | 1 | 100 |
| 1 | 1 | 17 | 45 | 1 | 100 |
| 1 | 1 | 22 | 1 | 1 | 100 |
| 1 | 1 | 24 | 1 | 1 | 100 |
| 1 | 1 | 25 | 1 | 1 | 100 |
| 1 | 1 | 27 | 1 | 1 | 100 |
| 1 | 1 | 29 | 5 | 1 | 100 |
| 1 | 1 | 30 | 1 | 1 | 100 |
| 1 | 1 | 31 | 10 | 1 | 100 |
| 1 | 1 | 33 | 11 | 1 | 100 |
| 1 | 1 | 34 | 32 | 1 | 100 |
| 1 | 1 | 36 | 1 | 1 | 100 |
| 1 | 1 | 38 | 1 | 1 | 100 |
| 1 | 1 | 39 | 10 | 1 | 100 |
| 1 | 1 | 40 | 1 | 1 | 100 |
| 1 | 1 | 41 | 3 | 1 | 100 |
| 1 | 1 | 43 | 38 | 1 | 100 |
| 1 | 1 | 44 | 45 | 1 | 100 |
| 1 | 1 | 45 | 1 | 1 | 100 |
| 1 | 1 | 47 | 1 | 1 | 100 |
| 1 | 1 | 48 | 10 | 1 | 100 |
| 1 | 1 | 49 | 76 | 1 | 100 |
| 1 | 1 | 50 | 1 | 1 | 100 |
| 2 | 4 | 8 | 64 | 4 | 100 |
| 2 | 4 | 11 | 76 | 3 | 75 |
| 2 | 4 | 19 | 7 | 3 | 75 |
| 2 | 4 | 32 | 2 | 2 | 50 |
| 3 | 2 | 1 | 4 | 2 | 100 |
| 3 | 2 | 9 | 5 | 2 | 100 |
| 3 | 2 | 12 | 15 | 2 | 100 |
| 3 | 2 | 26 | 58 | 2 | 100 |
| 4 | 2 | 15 | 3 | 2 | 100 |
| 4 | 2 | 18 | 8 | 1 | 50 |
| 4 | 2 | 20 | 21 | 2 | 100 |
| 4 | 2 | 37 | 13 | 1 | 50 |
| 4 | 2 | 46 | 16 | 2 | 100 |
| 5 | 3 | 4 | 28 | 2 | 66.7 |
| 5 | 3 | 16 | 29 | 2 | 66.7 |
| 5 | 3 | 21 | 22 | 3 | 100 |
| 5 | 3 | 28 | 13 | 3 | 100 |
| 5 | 3 | 35 | 3 | 2 | 66.7 |
| 5 | 3 | 42 | 4 | 2 | 66.7 |

The results in Table 3 show that our CAS-C based pre-fetch approach achieves accuracies ranging from 50% to 100%. In Clusters 1 and 3, every user requested all of the predicted URLs, which means that the users in these two groups kept to their earlier interests. With an average accuracy of 86.56%, our pre-fetching technique can provide a high-quality user request prediction service.
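The accuracy figures can be rechecked from the hit counts in Table 3. The following is a minimal sketch, not the authors' code: the function and variable names are illustrative, and the averaging scheme is an assumption. Taking the mean of the five per-cluster mean accuracies reproduces the reported 86.56%, whereas a plain average over all 47 users would give roughly 92.9%.

```python
# Number of pre-fetched URLs and per-user hit counts, copied from Table 3.
# Layout: cluster -> (number of URLs pre-fetched, hits of each member).
table3 = {
    1: (1, [1] * 28),
    2: (4, [4, 3, 3, 2]),
    3: (2, [2, 2, 2, 2]),
    4: (2, [2, 1, 2, 1, 2]),
    5: (3, [2, 2, 3, 3, 2, 2]),
}


def user_accuracy(hits, prefetched):
    """Accuracy in percent: hits divided by the number of pre-fetched URLs."""
    return 100.0 * hits / prefetched


def cluster_accuracy(prefetched, hits_per_user):
    """Mean accuracy over the members of one cluster."""
    accuracies = [user_accuracy(h, prefetched) for h in hits_per_user]
    return sum(accuracies) / len(accuracies)


# Assumption: the overall figure is the mean of the per-cluster means.
cluster_means = {c: cluster_accuracy(n, hits) for c, (n, hits) in table3.items()}
overall = sum(cluster_means.values()) / len(cluster_means)

for cluster, mean in cluster_means.items():
    print(f"Cluster {cluster}: {mean:.2f}%")
print(f"Average over clusters: {overall:.2f}%")
```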

6 Conclusion and future work

This paper focuses on detecting clusters of Web users according to their activity patterns acquired from access logs. We present an effective CAS based clustering algorithm (CAS-C) to detect clusters of Web users. The clustering problem is converted into that of seeking the center of each cluster by optimizing the objective function. Numerical simulation experiments show that the proposed CAS-C based Web user clustering approach achieves a large average inter-cluster distance and a small average intra-cluster distance on Web access log data. According to the common profiles of the detected clusters, our approach can predict and pre-fetch user requests with encouraging results.

Our work leaves further research issues open. Since predicting user needs can improve the usability and user retention of a Web site, we plan to establish a Web personalization system that can predict users’ behavior and provide users with what they want or need without them having to ask for it explicitly [42]. With the help of this system, a Web site could raise its quality of service (QoS) and, furthermore, maximize its profits.

Acknowledgements This work is supported by the National Natural Science Foundation of China (Grant No. 60805043 and 60821001), the Fund for Ph.D. students, Ministry of Education (20060013007), the National Basic Research Program of China (973 Program) (Grant No. 2007CB311203), Beijing Natural Science Foundation, China (Grant No. 4092029) and the National Key Technology R&D Program (Grant No. 2007BAH05B02-04).

Appendix A: Label vector (L)

L = {
‘http://cs-www.bu.edu/courses/Home.html’
‘http://cs-www.bu.edu/faculty/Home.html’
‘http://cs-www.bu.edu/faculty/crovella/Home.html’
‘http://cs-www.bu.edu/faculty/crovella/courses/cs210/reading.html’
‘http://cs-www.bu.edu/faculty/gacs/courses/cs410/Home.html’
‘http://cs-www.bu.edu/faculty/heddaya/CS103/Home.html’
‘http://cs-www.bu.edu/faculty/lnd/main_menu.html’
‘http://cs-www.bu.edu/pointers/Home.html’
‘http://cs-www.bu.edu/students/grads/Home.html’
‘http://cs-www.bu.edu/students/grads/oira/Home.html’
‘http://cs-www.bu.edu/students/grads/oira/cs112/hmwrk1.html’
‘http://nearnet.gnn.com/gnn/GNNhome.html’
‘http://web.bu.edu/pagetwo.html’
}


Appendix B: User model matrix [1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
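To illustrate how the pre-fetch rule of Sect. 5.1 with c = 0.5 could be applied to this matrix, the sketch below uses the four members of Cluster 3. It is an illustration under stated assumptions rather than the authors' implementation: row r of the matrix is assumed to correspond to the r-th URL of the label vector L, column u to user u, and the entries to requests made in January and February. The helper name `prefetch_set` is hypothetical.

```python
from collections import Counter

BASE = "http://cs-www.bu.edu/"
COURSES = BASE + "courses/Home.html"
CS410 = BASE + "faculty/gacs/courses/cs410/Home.html"
POINTERS = BASE + "pointers/Home.html"
GNN = "http://nearnet.gnn.com/gnn/GNNhome.html"
PAGETWO = "http://web.bu.edu/pagetwo.html"

# Columns 1, 9, 12 and 26 of the matrix (the members of Cluster 3 in Table 2),
# read under the assumption that row r corresponds to the r-th URL of L.
cluster3_requests = {
    1: {COURSES, CS410},
    9: {COURSES, CS410},
    12: {COURSES, CS410},
    26: {COURSES, CS410, POINTERS, GNN, PAGETWO},
}


def prefetch_set(requests_by_user, c=0.5):
    """Pages requested by at least a fraction c of the users of one cluster."""
    n_users = len(requests_by_user)
    counts = Counter(page for pages in requests_by_user.values() for page in pages)
    return {page for page, k in counts.items() if k / n_users >= c}


to_prefetch = prefetch_set(cluster3_requests)
print(sorted(to_prefetch))
```

Under these assumptions the selected set contains the two pages listed for Cluster 3 in Table 2. The three pages requested only by user 26 reach a proportion of 0.25 and are therefore not pre-fetched.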


References

1. Miyano, T., Tsutsui, T.: Data synchronization in a network of coupled phase oscillators. Phys. Rev. Lett. 98, 024102 (2007)
2. Ye, Z., Hu, S., Yu, J.: Adaptive clustering algorithm for community detection in complex networks. Phys. Rev. E 78, 046115 (2008)
3. Feldt, S., Waddell, J., Hetrick, V.L., Berke, J.D., Żochowski, M.: Functional clustering algorithm for the analysis of dynamic network data. Phys. Rev. E 79, 056104 (2009)
4. Reichardt, J., Leone, M.: (Un)detectable cluster structure in sparse networks. Phys. Rev. Lett. 101, 078701 (2008)
5. Gfeller, D., Chappelier, J.C., DeLosRios, P.: Finding instabilities in the community structure of complex networks. Phys. Rev. E 72, 056135 (2005)
6. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74, 036104 (2006)
7. Hu, Y., Li, M., Zhang, P., Fan, Y., Di, Z.: Community detection by signaling on complex networks. Phys. Rev. E 78, 016115 (2008)
8. Vázquez, A., Oliveira, J.G., Dezsö, Z., Goh, K.I., Kondor, I., Barabási, A.L.: Modeling bursts and heavy tails in human dynamics. Phys. Rev. E 73, 036127 (2006)
9. Gonçalves, B., Ramasco, J.J.: Human dynamics revealed through Web analytics. Phys. Rev. E 78, 026123 (2008)
10. Meiss, M.R., Menczer, F., Fortunato, S., Flammini, A., Vespignani, A.: Ranking Web Sites with Real User Traffic. In: Proc. WSDM ’08, California, vol. 1, pp. 65–76. ACM, New York (2008)
11. Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization based on web usage mining. Commun. ACM 43, 142–151 (2000)
12. Paik, H.Y., Benatallah, B., Hamadi, R.: Dynamic restructuring of e-catalog communities based on user interaction patterns. World Wide Web 5, 325–366 (2002)
13. IBM, SurfAid Analytics, http://surfaid.dfw.ibm.com (2003)
14. Padmanabhan, V.N., Mogul, J.C.: Using predictive prefetching to improve world wide web latency. ACM Comput. Commun. Rev. 3, 23–36 (1996)
15. Berendt, B.: Using site semantics to analyze, visualize, and support navigation. Data Mining Knowl. Discov. 6, 37–59 (2002)
16. Fu, Y., Creado, M., Ju, C.: Reorganizing web sites based on user access patterns. In: Proc. 10th Int. Conf. on Information and Knowledge Management, Georgia, USA, vol. 1, pp. 583–585. ACM, New York (2001)
17. Ansari, S., Kohavi, R., Mason, L., Zheng, Z.: Integrating ecommerce and data mining: architecture and challenges. In: Proc. 2001 IEEE Int. Conf. on Data Mining (ICDM 2001), San Mateo, USA, vol. 1, pp. 27–34. IEEE Computer Society, Washington (2000)
18. MacQueen, J.: Some Methods for Classification and Analysis of Multivariate Observations. In: Proc. 5th Berkeley Symp. on Math. Statist. and Prob., Berkeley, vol. 1, pp. 281–297. University of California Press, Berkeley (1967)
19. Aihara, K., Takabe, T., Toyoda, M.: Chaotic neural networks. Phys. Lett. A 144, 333–340 (1990)
20. Chen, L., Aihara, K.: Chaotic simulated annealing by a neural network model with transient chaos. Neural Netw. 8, 915 (1995)
21. Cai, J., Ma, X., Li, L., Peng, H.: Chaotic particle swarm optimization for economic dispatch considering the generator constraints. Energy Convers. Manag. 48, 645 (2007)
22. Tokuda, I., Aihara, K., Nagashima, T.: Adaptive annealing for chaotic optimization. Phys. Rev. E 58, 5157 (1998)
23. Yang, D., Li, G., Cheng, G.: On the efficiency of chaos optimization algorithms for global optimization. Chaos Solitons Fractals 34, 1366–1375 (2007)
24. Li, L., Yang, Y., Peng, H., Wang, X.: An optimization method inspired by chaotic ant behavior. Int. J. Bifurc. Chaos 16, 2351–2364 (2006)
25. Li, L., Yang, Y., Peng, H., Wang, X.: Parameters identification of chaotic systems via chaotic ant swarm. Chaos Solitons Fractals 28, 1204–1211 (2006)
26. Cai, J., Ma, X., Li, L., Yang, Y., Peng, H., Wang, X.: Chaotic ant swarm optimization to economic dispatch. Electric Power Syst. Res. 77, 1373–1380 (2007)
27. Li, L., Yang, Y., Peng, H.: Fuzzy system identification via chaotic ant swarm. Chaos Solitons Fractals 40, 1399–1407 (2009)
28. Cole, B.J.: Is animal behavior chaotic? Evidence from the activity of ants. Proc. R. Soc. Lond. B, Biol. Sci. 244, 253–259 (1991)
29. Solé, R.V., Miramontes, O., Goodwin, B.C.: Oscillations and chaos in ant societies. J. Theor. Biol. 161, 343–357 (1993)
30. Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining world wide web browsing patterns. Knowl. Inf. Syst. 1, 5–32 (1999)
31. Cooley, R.: Web usage mining: discovery and application of interesting patterns from web data. PhD thesis, University of Minnesota (2000)
32. Anderson, C.R.: A machine learning approach to web personalization. PhD thesis, University of Washington (2002)
33. Catledge, L.D., Pitkow, J.E.: Characterizing browsing strategies in the World-Wide Web. Comput. Netw. ISDN Syst. 27, 1065–1073 (1995)
34. Cunha, C.A., Bestavros, A., Crovella, M.E.: Characteristics of WWW client traces. Boston University Department of Computer Science, Technical Report TR95-010, April 1995. http://ita.ee.lbl.gov/html/contrib/BUWeb-Client.html
35. The Internet Traffic Archive. http://ita.ee.lbl.gov/index.html
36. Teng, W., Chang, C., Chen, M.: Integrating web caching and web prefetching in client-side proxies. IEEE Trans. Parallel Distrib. Syst. 16, 444–455 (2005)
37. Lan, B., Bressan, S., Ooi, B.C., Tan, K.: Rule-assisted prefetching in web server caching. In: Proc. 2000 ACM Int. Conf. on Information and Knowledge Management, Virginia, USA, vol. 1, pp. 504–511. ACM, New York (2000)
38. Nanopoulos, A., Katsaros, D., Manolopoulos, Y.: Effective prediction of web-user accesses: a data mining approach. In: Proc. Workshop Web Usage Analysis and User Profiling (WebKDD’01), San Francisco, USA. ACM, New York (2001)
39. Pitkow, J., Pirolli, P.: Mining longest repeating subsequence to predict world wide web surfing. In: Proc. 2nd USENIX Symp. Internet Technologies and Systems (USENIX, 1999), Colorado, USA, vol. 1, pp. 139–150
40. Tian, W., Choi, B., Phoha, V.V.: An adaptive web cache access predictor using neural network. In: Proc. 15th Int. Conf. on IEA/AIE, Cairns, Australia, vol. 2358, pp. 450–459. Springer, Berlin (2002)
41. Wu, Y., Chen, A.: Prediction of web page accesses by proxy server log. World Wide Web 5, 67–88 (2002)
42. Mulvenna, M.D., Anand, S.S., Buchner, A.G.: Personalization on the net using web mining. Commun. ACM 43, 123–125 (2000)