A Hierarchical Algorithm for Clustering Extremist Web Pages

2010 International Conference on Advances in Social Networks Analysis and Mining

Xingqin Qi, Kyle Christensen, Robert Duval, Edgar Fuller, Member, IEEE, Arian Spahiu, Qin Wu, Cun-Quan Zhang

Xingqin Qi, Edgar Fuller, Qin Wu, and Cun-Quan Zhang are with the Department of Mathematics, West Virginia University, Morgantown, WV 26506-6310, USA; email: [email protected]. Robert Duval and Arian Spahiu are with the Department of Political Science, West Virginia University. Kyle Christensen is with the Department of Political Science, Columbus State University.

Abstract—Extremist political movements have proliferated on the web in recent years due to the advent of minimal publication costs coupled with near-universal access, resulting in what appears to be an abundance of groups that hover on the fringe of many socially divisive issues. Whether white-supremacist, neo-Nazi, anti-abortion, black separatist, radical Christian, animal rights, or violent environmentalist, all have found a home (and voice) on the Web. These groups form social networks whose ties are predicated primarily on shared political goals. Little is known about these groups, their interconnections, their animosities, and, most importantly, their growth and development; studies such as the Dark Web Project, while considering domestic extremists, have focused primarily on international terrorist groups. Yet here in the US there has been a complex social dynamic unfolding as well: while left-wing radicalism declined throughout the 80s and 90s, right-wing hate groups began to flourish. Today the web offers a place for any brand of extremism, but little is understood about their current growth and development. While there is much to gain from in-depth studies of the content provided by these sites, there is also a surprising amount of information contained in their online network structure, as manifested in the links between and among these web sites. Our research follows the idea that much can be known about you by the company you keep. In this paper, we propose an approach to measure the intrinsic relationships (i.e., similarities) of a set of extremist web pages. In this model, the web presence of a group is thought of as a node in a social network, and the links between pages are the ties between groups. This approach takes the bidirectional hyperlink structure of the web pages and, based on similarity scores, applies an effective multi-membership clustering algorithm known as the quasi-clique merger method to cluster these web pages using a derived hierarchical tree. The experimental results show that this new similarity measurement and hierarchical clustering algorithm give an improvement over traditional link-based clustering methods.

I. INTRODUCTION

Social network analysis is a powerful tool for understanding the relationships among a group of interacting individuals or organizations. In this work we consider networks of extremist political movements through their presence on the internet. The web page of a group is treated as a node, and the ties between groups are determined by the hyperlinks connecting their web pages. In this way we are able to mine substantial quantities of social network data from the internet by crawling a specific set of web pages of known extremist movements. These pages in turn yield secondary and tertiary groups as more structure is developed and stored. Over time, this data is archived, and we then have the opportunity to analyze the connectivity found in the data, as represented by the bidirectional hyperlink structure of the stored web pages, in order to determine hidden or previously unknown affiliations within the network.

A. Clustering Websites

The grouping of websites according to identifiable traits, qualities, and properties is a common application of machine learning techniques. Website clustering based upon linkage data appears in many contexts, but it is an area that still has many unanswered questions in the context of social network analysis, and this body of work needs further assessment and refinement for social science research. Among the tools and techniques used to group websites, most tend to focus on some type of common structure, such as popularity among other websites or content. Once the data is pre-processed into an appropriate form, such as a group of vectors in a feature space or a graph structure, clustering methods such as k-means become attractive.

Clustering techniques have been used extensively in information retrieval studies of large documents [1], [2], [3]. These authors have examined different approaches to document clustering. Cutting et al. [1] show how clustering can be an effective tool in information retrieval. Iwayama and Tokunaga [2] use Hierarchical Bayesian Clustering (HBC) to assess the probability of different texts clustering together, and Jones et al. [3] use a non-hierarchic document clustering algorithm to assess the probability of two documents clustering together. The basic aim of these clustering algorithms is to identify groups, or clusters, with similar values and properties. These algorithms provide an excellent way of organizing cases together, but they have been based on the analysis of text. As we have noted, there have been many attempts in information retrieval studies to apply clustering algorithms to the likelihood of documents clustering together; this approach, however, is largely absent from web structure analysis. We will show that such an approach yields a more cost-effective way of grouping different websites together.

In website structure studies, several scholars have begun to use clustering algorithms to organize websites based on distinguishing traits [4], [5]. For example, Ricca et al. [4] cluster websites using keyword extraction, again relying on text: the authors reduce the content of a website to a set of weighted keywords using Natural Language Processing (NLP) techniques and then apply their clustering algorithm. Additionally, Xavier Polanco [5] uses clusters, graph theory,



and social network analysis to analyze website structure. Polanco generates clusters by co-site analysis, starting from clusters and then using graph theory together with SNA to recognize networks. Our approach (discussed in the methods section below) instead uses individual nodes as the unit of analysis and proceeds by clustering in order to recognize networks. We argue that this is a more effective way of grouping websites based on their identifiable linkage properties.

In his seminal article "The physics of the Web," Barabasi [6] introduces a new way to think about networks of websites. Instead of assuming that the number of nodes in a network is fixed and that links are distributed randomly, Barabasi [6] describes a phenomenon called "preferential attachment": new nodes are more likely to link to nodes that are already well connected. Based on these two principles (growth and preferential attachment), he develops his model of network analysis, adding one node at each step. This work observed that in such a context the number of nodes of degree k obeys a power-law distribution. When one restricts attention to a specific subset of websites grouped by common traits, it is possible that new connectivity phenomena will appear. In particular, websites with a strong affinity for certain topics, such as radicalism, will not establish connections with other websites in quite the same way that random websites do in the generic world wide web.

A significant amount of research has been done on identifying and collecting terrorist group websites on the internet for analysis (e.g., [7], [8], [9], [10]). There is, however, relatively little research that attempts to assess the sub-groups that might be observed in large collections of organizations labeled as terrorist groups. One of the more visible groups studying network analysis and a range of violent actors is the Dark Web Portal at the University of Arizona [8]. This large-scale open source project has produced a large body of work on the use of the Internet by extremist groups and terrorist organizations, both domestic and international. The Dark Web project does point to the utility of network analysis in ascertaining community structures or groupings: analysis of linkages between sites has produced network clusterings that match substantive expertise on the topic. For example, Qin et al. [9] employ a four-step process of website collection that starts by identifying terrorist websites and their URLs and proceeds by extracting their linkages for content analysis. Qin et al. [9] provide an excellent example of this strategy via the use of software tools and an analyst to locate, collect, and rate the technical sophistication of Middle Eastern extremist websites. In addition, research has begun using Social Network Analysis as a tool for website collection and analysis. Zhou and Chen [10] have utilized web collection software, or spiders, to collect website information by tracing linkage data. These sites were then coded and classified using the clustering techniques of SNA. In their attempt to construct a comprehensive collection of terrorist websites on the internet, in order to mitigate the problems of website disappearance and hacking, Zhou et al. [10] propose a recursive procedure

that accounts for problems with website disappearance through manual filtering.

B. Higher Order Linkage Information

In the current work, we attempt to provide a methodology that considers not only adjacency information, such as the high-incidence linkage found in direct connections within the graph, but also the higher-order linkages determined by secondary and tertiary link associations, in order to determine the similarity of web sites and thereby improve the accuracy of clustering. As with general clustering of web pages, the key point for effective clustering of extremist web pages is to find intrinsic relationships (i.e., similarities) between these pages. Many properties of web pages, such as page content, hyperlinks, and usage data (server log files), have been used for this purpose. Hyperlinks have the advantage that in most cases the author of a web page creates a link to another page with the idea that the linked page is similar to the linking page. It is therefore relatively accurate to assess web page similarity based on hyperlinks, and results from link-based clustering compare well to clustering based on the synonymy and polysemy of the words in the web pages. Early successful applications of hyperlink analysis can be found in many web-related areas, such as page ranking in the search engine Google [11], [12], web page community construction [13], [14], [15], and relevant page finding [14], [16]. These works reveal that hyperlinks convey semantics among web pages and can be used in many areas.

In this report we focus on the hyperlinks of web pages. We propose an approach that measures the intrinsic relationships (i.e., similarities) of a set of extremist web pages. This approach takes the bidirectional hyperlink structure of the web pages and constructs similarity scores. The quasi-clique merger method (QCM) [17], a multi-membership hierarchical clustering algorithm, is then applied to cluster these web pages. We then propose an improved method for constructing communities from the QCM-derived hierarchical tree. The experimental results show that this similarity measurement and the hierarchical clustering algorithm are very effective.

This report is organized as follows. In Section II, we introduce how to define the similarity score between any two web pages; in Section III, we describe the QCM clustering algorithm briefly and show how to choose the clusters from the hierarchical tree; the experimental results are shown in Section IV, and conclusions are given in Section V.

II. SIMILARITY MEASUREMENT

Let T be the set of web pages to be clustered into communities. For a more accurate result, the set T is expanded to a superset T ∪ R: the set R acts as a reference set consisting of web pages that link to or are linked from some member of T. In this section, we first establish such a reference set R, and then use the hyperlinks within T ∪ R to propose a new similarity measurement.


A. The construction of the reference set R

The following construction of the reference set R was proposed in [18], and our approach is similar to it.

1) For each page t in T, select up to r1 pages that link to page t and whose domain names differ from that of t, and add them to the set R1.
2) For each page t in T, select up to r2 pages that are linked from page t and whose domain names differ from that of t, and add them to the set R2.
3) The reference page set is R = R1 ∪ R2.

We also retain the original linking information between T and R, as well as the links within T.
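This construction can be summarized in a short sketch. The following Python fragment is illustrative only: the helper functions inlinks(t) and outlinks(t), which return the pages linking to and linked from a page (e.g., from a crawl database), are hypothetical and not part of the paper, and URLs are assumed to be absolute so that their domains can be extracted.

```python
from urllib.parse import urlparse

def domain(url):
    """Extract the domain name of an absolute URL."""
    return urlparse(url).netloc

def build_reference_set(T, inlinks, outlinks, r1, r2):
    """Build the reference set R = R1 | R2 for a target page set T.

    inlinks(t) / outlinks(t) are assumed to return iterables of URLs
    that link to / are linked from page t.
    """
    R1, R2 = set(), set()
    for t in T:
        # Step 1: up to r1 pages that link to t from a different domain.
        R1.update([p for p in inlinks(t) if domain(p) != domain(t)][:r1])
        # Step 2: up to r2 pages that t links to on a different domain.
        R2.update([p for p in outlinks(t) if domain(p) != domain(t)][:r2])
    return R1 | R2
```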

For convenience, we assume there are n web pages in T ∪ R, and we label all n web pages with the numbers 1 to n. All the linking information of these n web pages can be described by a directed graph D = (V, A). Each web page corresponds to one vertex, and there is an arc (i, j) (from vertex i to vertex j) if and only if web page i links to web page j directly. Denote the adjacency matrix of D by A_D = (x_{i,j}), where

$$x_{i,j} = \begin{cases} 1, & \text{if } (i,j) \in A;\\ 0, & \text{otherwise.} \end{cases}$$

B. Page similarity score

Our measure of the hyperlink similarity between two web pages i and j, SM_{i,j}, takes into account three aspects that imply semantic relations: the direct link between i and j, the number of common web pages that both i and j link to (co-linking vertices), and the number of common web pages that link to both i and j (co-linked vertices).

Co-linking vertices. The estimation of the similarity between two web pages i and j includes the number of co-linking vertices. The analogy comes from bibliographic citations: when articles a_1, a_2, ..., a_n cite some set of common articles b_1, b_2, ..., b_n, there is a semantic relation among the a_i's. Let r_i and r_j be the i-th and j-th rows of the adjacency matrix A_D. Equation (1) gives the partial hyperlink similarity score based on co-linking vertices:

$$S^{linking}_{i,j} = \sum_{k=1}^{n} r_{i,k} \cdot r_{j,k}. \qquad (1)$$

Co-linked vertices. The estimation of the similarity between two web pages i and j also includes the number of co-linked vertices. This intuition likewise comes from bibliographic citations: when articles a_1, a_2, ..., a_n cite some set of common articles b_1, b_2, ..., b_n, there is a semantic relation among the b_i's. Let c_i and c_j be the i-th and j-th columns of A_D. Equation (2) gives the partial score based on co-linked vertices:

$$S^{linked}_{i,j} = \sum_{k=1}^{n} c_{k,i} \cdot c_{k,j}. \qquad (2)$$

Direct linking. If direct links exist between two pages, then in most cases there is a mutual semantic relationship between them, based on the assumption that the author of one web page creates a link to another page with the idea that the linked page is similar to the linking page. Equation (3) gives the partial score for direct linking between i and j:

$$S^{0}_{i,j} = \theta \cdot \xi, \qquad (3)$$

where

$$\theta = \begin{cases} 0, & \text{if neither of the arcs } (i,j) \text{ and } (j,i) \text{ exists in } D;\\ 1, & \text{if exactly one of them exists;}\\ 2, & \text{if both of them exist,} \end{cases}$$

and ξ is a user-specified parameter expressing the importance of a direct link between i and j.

Complete hyperlink similarity. The complete hyperlink similarity between any two web pages i and j is defined by Equation (4):

$$SM_{i,j} = S^{0}_{i,j} + S^{linking}_{i,j} + S^{linked}_{i,j}. \qquad (4)$$
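Where the adjacency matrix of D is available, Equations (1)–(4) reduce to a few matrix operations. The following Python sketch is illustrative only and assumes the full 0/1 adjacency matrix X over T ∪ R is held as a NumPy array; restricting the result to the rows and columns indexed by T yields the weights w(i, j) used in Section III.

```python
import numpy as np

def similarity_matrix(X, xi=3.0):
    """Complete hyperlink similarity SM for a 0/1 adjacency matrix X,
    where X[i, j] = 1 iff page i links directly to page j.

    xi is the user-specified weight of a direct link (Equation 3).
    """
    X = np.asarray(X, dtype=float)
    S_linking = X @ X.T        # Eq. (1): common pages linked *by* both i and j
    S_linked = X.T @ X         # Eq. (2): common pages linking *to* both i and j
    theta = X + X.T            # 0, 1, or 2, depending on the arcs between i and j
    S_direct = theta * xi      # Eq. (3)
    SM = S_direct + S_linking + S_linked   # Eq. (4)
    np.fill_diagonal(SM, 0.0)  # self-similarity is not used
    return SM
```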

III. CLUSTERING WEB PAGES

From Section II we obtain a weighted graph G = (T, w), where T is the set of web pages to be clustered and the weight w(i, j) = SM_{i,j}. Based on the similarities between any two web pages of T, we want to cluster the web pages in T. A number of clustering algorithms have been presented in the literature [19], such as k-means [20], hierarchical clustering [21], graph partitioning [22], and spectral clustering [23]. So far, methods such as agglomerative hierarchical clustering (AHC), k-means, and some graph partitioning methods have been the most popular for web data clustering. In this paper we introduce another hierarchical algorithm, the Quasi Clique Merger (abbreviated QCM), to cluster the web pages based on the similarity measurement above. We use this clustering algorithm because it has two important features. First, it constructs a much smaller hierarchical tree compared to other clustering algorithms, which clearly highlights meaningful clusters. Second, it supports overlapping clustering, or multi-membership: a web page is allowed to belong to more than one cluster, which more appropriately reflects the real-world situation. For completeness, we describe the algorithm briefly below; a detailed discussion can be found in [17].



A. QCM algorithm

First, we introduce some definitions from graph theory. A subgraph H of an un-weighted graph is a clique if every pair of vertices of H is joined by an edge. It is well known that finding a clique with the maximum number of vertices in a graph is an NP-complete problem. For a subgraph C of a weighted graph, we define the density of C by

$$d(C) = \frac{2\sum_{e \in E(C)} w(e)}{|V(C)|\,(|V(C)|-1)}.$$

As seen above, if d(C) = 1 and w(e) = 1 for every edge e of C, then C induces a clique. For a weighted graph, a subgraph C is called a Δ-quasi-clique if d(C) ≥ Δ for some positive real number Δ.

The core of the algorithm is deciding whether or not to add a vertex to an already selected dense subgraph C. For a vertex v ∉ V(C), we define the contribution of v to C by

$$c(v, C) = \frac{\sum_{u \in V(C)} w(uv)}{|V(C)|}.$$

A vertex v is added to C if c(v, C) > α d(C), where α is a function of some user-specified parameters. The QCM clustering algorithm is a process that detects all dense subgraphs of G and constructs a hierarchically nested system to illustrate their inclusion relations. Subgraphs at lower hierarchical levels have higher density. We summarize the main part of QCM as follows.

Instance: G = (V, E) is a graph with w : E(G) → R⁺ ∪ {0}.

Task: Detect Δ-quasi-cliques in G at various levels of Δ, and construct a hierarchically nested system to illustrate their inclusion relations.

Algorithm

Step 0. ℓ ← 1, where ℓ is the indicator of the levels in the hierarchical system. w₀ ← γ max{w(e) : e ∈ E(G)}, where γ (0 < γ < 1) is a user-specified parameter.

Step 1. (The initial step) Sort the edge set {e ∈ E(G) : w(e) ≥ w₀} as a sequence S = e₁, ..., e_m such that w(e₁) ≥ w(e₂) ≥ ... ≥ w(e_m). µ ← 1, p ← 0, and L_ℓ ← ∅.

Step 2. (Starting a new search) p ← p + 1, C_p ← V(e_µ), L_ℓ ← L_ℓ ∪ {C_p}.

Step 3. (Grow)
Substep 3.1. If V(G) − V(C_p) = ∅, go to Step 4; otherwise continue: pick v ∈ V(G) − V(C_p) such that c(v, C_p) is maximum. If c(v, C_p) ≥ α_n d(C_p), where n = |V(C_p)| and

$$\alpha_n = 1 - \frac{1}{2\lambda(n+t)}$$

with λ ≥ 1 and t ≥ 1 user-specified parameters, then C_p ← C_p ∪ {v} and go back to Substep 3.1.
Substep 3.2. µ ← µ + 1. If µ > m, go to Step 4.
Substep 3.3. Suppose e_µ = xy. If at least one of x, y ∉ ∪_{i=1}^{p−1} V(C_i), go to Step 2; otherwise go to Substep 3.2.

Step 4. (Merge)
Substep 4.1. List all members of L_ℓ as a sequence C₁, ..., C_s such that |V(C₁)| ≥ |V(C₂)| ≥ ... ≥ |V(C_s)|, where s ← |L_ℓ|. h ← 2, j ← 1.
Substep 4.2. If |C_j ∩ C_h| > β min(|C_j|, |C_h|), where β (0 < β < 1) is a user-specified parameter, then C_{s+1} ← C_j ∪ C_h and the sequence L_ℓ is rearranged as follows:

C₁, ..., C_{s−1} ← deleting C_j, C_h from C₁, ..., C_{s+1},

s ← s − 1, h ← max{h − 2, 1}, and go to Substep 4.4.
Substep 4.3. j ← j + 1. If j < h, go to Substep 4.2.
Substep 4.4. h ← h + 1 and j ← 1. If h ≤ s, go to Substep 4.2.

Step 5. (Contract) Contract each C_p ∈ L_ℓ to a vertex:

$$V(G) \leftarrow \Big[V(G) - \bigcup_{p=1}^{s} V(C_p)\Big] \cup \{C_1, \dots, C_s\},$$

$$w(uv) \leftarrow w(C_{i'}, C_{i''}) = \frac{\sum_{e \in E_{C_{i'},C_{i''}}} w(e)}{|E_{C_{i'},C_{i''}}|}$$

if the vertex u is obtained by contracting C_{i'} and v is obtained by contracting C_{i''}, where E_{C_{i'},C_{i''}} is the set of crossing edges, defined as E_{C_{i'},C_{i''}} = {xy : x ∈ C_{i'}, y ∈ C_{i''}, x ≠ y}. For t ∈ V(G) − {C₁, ..., C_s}, define w(t, C_{i'}) = w({t}, C_{i'}); other cases are defined similarly. If |V(G)| ≥ 2, go to Step 6; otherwise, go to END.

Step 6. ℓ ← ℓ + 1, L_ℓ ← ∅,

$$w_0 \leftarrow \gamma \max\{w(e) : e \in E(G)\},$$

where γ (0 < γ < 1) is a user-specified parameter, and go to Step 1 (to start a new search at a higher level of the hierarchical system).

END.
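To make the grow step concrete, the following Python sketch implements the density d(C), the contribution c(v, C), and Substep 3.1 only; the seed selection, merge, and contraction steps are omitted, and the symmetric weight matrix W with zero diagonal is an assumed input format, not something prescribed by [17].

```python
import numpy as np

def density(W, C):
    """d(C) = 2 * (sum of edge weights inside C) / (|C| * (|C| - 1))."""
    C = list(C)
    k = len(C)
    if k < 2:
        return float("inf")  # a single-vertex community has infinite density
    inside = W[np.ix_(C, C)].sum() / 2.0  # W is symmetric: each edge counted twice
    return 2.0 * inside / (k * (k - 1))

def contribution(W, v, C):
    """c(v, C) = average weight from vertex v into the subgraph C."""
    C = list(C)
    return W[v, C].sum() / len(C)

def grow(W, seed_edge, lam=2.0, t=2.0):
    """Grow a dense subgraph from a seed edge (Substep 3.1): repeatedly add
    the outside vertex of maximum contribution while c(v, C) >= alpha_n * d(C)."""
    C = set(seed_edge)
    while True:
        outside = [v for v in range(W.shape[0]) if v not in C]
        if not outside:
            return C
        v = max(outside, key=lambda u: contribution(W, u, C))
        n = len(C)
        alpha_n = 1.0 - 1.0 / (2.0 * lam * (n + t))
        if contribution(W, v, C) >= alpha_n * density(W, C):
            C.add(v)
        else:
            return C
```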


B. Selection of communities (clusters)

In the hierarchical tree generated by QCM, each vertex represents a community (possibly a community consisting of only one vertex). It is then necessary to choose cut points in the tree in order to determine the desired clusters. Other authors have suggested that communities be selected along one layer of the hierarchical tree: the branches of the tree are cut horizontally, and all vertices immediately below the cut line represent the collection of communities. This heuristic approach has been criticized as less meaningful and potentially inaccurate.

We propose a cut based on the changing density from level to level as the tree branches. Each group has a density, and the density of a community with a single node is defined to be infinite. Note that the larger the density of a subgraph, the more similar its vertices are. Let ρ be a real-valued parameter (which can be user-specified or learned) that serves as a lower bound on community density in the hierarchical tree. More specifically, moving from a lower level of the tree to the level immediately above, we identify all communities whose density is at least ρ but whose parent's density is less than ρ. These communities are the ones we select. In this way we identify cut points in the tree below which all communities have density at least ρ, while the density of each parent node is less than ρ.
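A minimal sketch of this cut rule follows, assuming the QCM tree is available as nodes carrying a density value and a list of children (a hypothetical representation; the paper does not prescribe a data structure).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TreeNode:
    density: float                               # density of the induced subgraph
    children: List["TreeNode"] = field(default_factory=list)

def select_communities(root: TreeNode, rho: float) -> List[TreeNode]:
    """Return all tree nodes whose density is at least rho while the
    density of their parent falls below rho (the proposed cut rule)."""
    selected = []

    def walk(node: TreeNode, parent_density: float) -> None:
        if node.density >= rho and parent_density < rho:
            selected.append(node)  # a cut point: keep this community
        for child in node.children:
            walk(child, node.density)

    # the root has no parent; treat its parent density as 0 so it can be
    # selected when the whole graph is already dense enough
    walk(root, 0.0)
    return selected
```

Because density increases toward the leaves, every descendant of a selected node has a parent with density at least ρ, so no descendant of a selected community is selected again.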

IV. EXPERIMENTAL RESULT

In this work we restrict our attention to a subset T of the collected data in order to show the utility of our similarity measurement and the QCM clustering method; a more comprehensive analysis of the larger data set is underway and will be published in subsequent work. The goal is to cluster the following 18 extremist web sites, which focus on six known topics; see Table I for the known categories. Guided by expert collaborators from political science, this goal reflects an intelligence analyst's need for a hierarchically nested system illustrating inclusion relations, such that web sites in the same subgroup at lower hierarchical levels have higher similarity.

T = {1. catholictradition.org, 2. americanborderpatrol.com, 3. americanpatrol.com, 4. armyofgod.com, 5. aryan-nations.org, 6. aryanunion.org, 7. bloodandhonourusa.com, 8. borderguardians.org, 9. ccir.net, 10. culturewars.com, 11. finalstandrecords.com, 12. kkkk.net, 13. fatima.org, 14. animalliberationfront.com, 15. rescuewithoutborders.org, 16. saveourstate.org, 17. cfnews.org, 18. theoccidentalquarterly.com}

The entire reference web page set R consists of 213 extremist web pages. Following the definition of similarity in Section II, we obtain a weighted graph G = (T, W), where w(i, j) = SM_{i,j} for any two vertices i, j of T; see Figure 1.

Fig. 1. Weighted graph reflecting the similarity of nodes in T

After applying the QCM clustering algorithm to G (with parameters λ = t = 2, γ = β = 0.5, and ξ = 3), we generate the hierarchical tree shown in Figure 2, where the real number above each circle is the density of the subgraph induced by the vertices in the circle. If we choose ρ = 3, then according to the rule of Subsection III-B we cut the hierarchical tree as shown by the dotted lines in Figure 2, and T is clustered into six subgroups; see Table I (right column). Note that all similarity estimation is based only on the web linkage information; no content analysis was involved. Given this limited information, we do not expect 100% accuracy.

Experts manually checked each web page to be clustered and assigned the "real" clusters according to their judgement; this information is shown as the known categories in Table I. We chose this classification as a reasonable "ground truth" against which to benchmark our algorithm. We notice some misgrouped items; for example, web #18, theoccidentalquarterly.com, should be on the topic of White nationalism, but we group it with the anti-immigration web pages.

Fig. 2. The hierarchical tree generated by the QCM clustering algorithm

TABLE I. KNOWN EXPERT CATEGORIES AND CLUSTERED RESULTS

Religious — Expert: 1. catholictradition.org, 10. culturewars.com, 13. fatima.org, 17. cfnews.org. Clustered: 1. catholictradition.org, 10. culturewars.com, 13. fatima.org, 17. cfnews.org.
Anti-immigration — Expert: 2. americanborderpatrol.com, 3. americanpatrol.com, 8. borderguardians.org, 9. ccir.net, 15. rescuewithoutborders.org, 16. saveourstate.org. Clustered: 2. americanborderpatrol.com, 3. americanpatrol.com, 8. borderguardians.org, 9. ccir.net, 15. rescuewithoutborders.org, 16. saveourstate.org, 18. theoccidentalquarterly.com.
White nationalist — Expert: 7. bloodandhonourusa.com, 11. finalstandrecords.com, 12. kkkk.net, 18. theoccidentalquarterly.com. Clustered: 7. bloodandhonourusa.com, 11. finalstandrecords.com, 12. kkkk.net.
Neo-nazi — Expert: 5. aryan-nations.org, 6. aryanunion.org. Clustered: 5. aryan-nations.org, 6. aryanunion.org.
Animal rights — Expert: 14. animalliberationfront.com. Clustered: 14. animalliberationfront.com.
Anti-abortion — Expert: 4. armyofgod.com. Clustered: 4. armyofgod.com.
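As a rough illustration of how the clustered subgroups in Table I can be checked against the expert categories, the following sketch counts items that fall outside the best-matching expert category for their cluster; the matching rule is our own simplification, not a procedure from the paper.

```python
def misplaced(expert, clustered):
    """Count items whose cluster does not match the best-overlapping
    expert category (a crude accuracy check)."""
    errors = 0
    for cluster in clustered:
        # match each cluster to the expert category it overlaps most
        best = max(expert, key=lambda cat: len(set(cat) & set(cluster)))
        errors += len(set(cluster) - set(best))
    return errors

expert = [[1, 10, 13, 17], [2, 3, 8, 9, 15, 16], [7, 11, 12, 18],
          [5, 6], [14], [4]]
clusters = [[1, 10, 13, 17], [2, 3, 8, 9, 15, 16, 18], [7, 11, 12],
            [5, 6], [14], [4]]
print(misplaced(expert, clusters))  # -> 1 (web #18)
```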

For comparison, we apply the popular agglomerative hierarchical clustering (AHC) method to the same data set. Figure 3 shows the hierarchical tree constructed by AHC.

Fig. 3. The hierarchical tree generated by the AHC clustering algorithm



The outputs from AHC and QCM are identical in the sense that both have only one misplaced item (web #18). However, we note that the hierarchical tree from QCM has only 11 nodes as potential communities, while AHC produces 16 potential communities, which is roughly the size of our population. As the size of the input data grows, the QCM method constructs a much smaller tree, which significantly reduces the human involvement in community selection. Specifically, AHC exhibits ambiguity in its various tree-splitting rules, which causes incorrect clustering to increase with the number of nodes, and one must run multiple trials in order to verify results, which becomes computationally expensive. QCM has the advantage that its cut points in the tree are well-defined. This application also does not exercise another feature of QCM: overlapping (multi-membership) clustering. Overlapping clusters tend to appear in most applications, and they will be the subject of future experiments, along with comparisons of the QCM technique with other methods on very large data sets.
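The paper does not specify which AHC variant was used. As an illustration only, a standard average-linkage AHC run over the same similarity matrix might look like the following sketch, which converts similarities to distances before calling SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_clusters(SM, n_clusters=6):
    """Agglomerative hierarchical clustering on a symmetric similarity
    matrix SM, using a simple similarity-to-distance conversion."""
    D = SM.max() - SM                    # larger similarity -> smaller distance
    np.fill_diagonal(D, 0.0)             # zero self-distance
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```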

V. CONCLUSION

In this paper, we propose an approach to measure the intrinsic relationships of a set of extremist web pages that takes the bidirectional hyperlink structure of the web pages into account. Based on the similarity scores, an effective hierarchical algorithm, the quasi-clique merger method, is applied to cluster these web pages. The experimental results show that the new similarity measurement and the hierarchical clustering algorithm provide an accurate clustering of these web sites according to their intrinsic topical nature, one that agrees well with an expert classification of the sites. This similarity measurement and clustering method has the potential for application to other web-based research areas, including semantic web analysis and social network data mining.

ACKNOWLEDGMENT

The authors would like to thank the three referees for their valuable suggestions.

REFERENCES

[1] D. Cutting, D. Karger, et al., "Scatter/gather: A cluster-based approach to browsing large document collections," ACM, New York, NY, USA, 1992.
[2] M. Iwayama and T. Tokunaga, "Cluster-based text categorization: a comparison of category search strategies," ACM, New York, NY, USA, 1995.
[3] G. Jones, A. Robertson, et al., "Non-hierarchic document clustering using a genetic algorithm," Information Research 1(1), 1995.
[4] F. Ricca, P. Tonella, et al., "An empirical study on keyword-based web site clustering," 2004.
[5] X. Polanco, "Clusters, graphs, and networks for analyzing Internet-Web-supported communication within a virtual community," Advances in Knowledge Organization 8, pp. 364–371, 2002.
[6] A. Barabasi, "The physics of the Web," Physics World 14(7), pp. 33–38, 2001.
[7] Y. Zhou, E. Reid, et al., "US domestic extremist groups on the Web: link and content analysis," IEEE Intelligent Systems, pp. 44–51, 2005.
[8] H. Chen, J. Quin, E. Reid, W. Chung, Y. Zhou, W. Xi, G. Lai, A. Bonillas, and M. Sageman, "The Dark Web Portal: Collecting and analyzing the presence of domestic and international terrorist groups on the Web," Proceedings of the 7th International Conference on Intelligent Transportation Systems (ITSC), Washington, D.C., 2004.
[9] J. Qin, Y. Zhou, et al., "Analyzing terror campaigns on the internet: Technical sophistication, content richness, and Web interactivity," International Journal of Human-Computer Studies 65(1), pp. 71–84, 2007.
[10] H. Chen, W. Chung, et al., "Uncovering the dark Web: A case study of Jihad on the Web," Journal of the American Society for Information Science and Technology 59(8), pp. 1347–1359, 2008.
[11] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Proceedings of the Seventh International Conference on World Wide Web, Brisbane, Australia, April 1998, pp. 107–117.
[12] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the Web," Stanford Digital Libraries Working Paper, 1998.
[13] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, January 1998, pp. 668–677.
[14] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian, "The connectivity server: fast access to linkage information on the Web," Proceedings of the Seventh International Conference on World Wide Web, Brisbane, Australia, April 1998, pp. 469–477.
[15] J. Hou and Y. Zhang, "Constructing good quality web page communities," Proceedings of the Thirteenth Australasian Database Conference, Melbourne, Victoria, Australia, January 2002, pp. 65–74.
[16] J. Dean and M. R. Henzinger, "Finding related pages in the World Wide Web," Proceedings of the Eighth International Conference on World Wide Web, Toronto, Canada, May 1999, pp. 1467–1479.
[17] Y. Ou and C.-Q. Zhang, "A new multi-membership clustering method," Journal of Industrial and Management Optimization 3(4), pp. 619–624, 2007.
[18] J. Hou and Y. Zhang, "Utilizing hyperlink transitivity to improve web page clustering," Proceedings of the 14th Australasian Database Conference, Adelaide, Australia, 2003, pp. 49–57.
[19] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys 31(3), pp. 264–323, 1999.
[20] J. McQueen, "Some methods for classification and analysis of multivariate observations," Fifth Berkeley Symposium on Mathematical Statistics and Probability, Statistical Laboratory of the University of California, Berkeley, 1967, pp. 281–297.
[21] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Boston: Addison-Wesley, 1989.
[22] C. T. Zahn, "Graph-theoretical methods for detecting and describing gestalt structures," IEEE Transactions on Computers C-20, pp. 68–86, 1971.
[23] X. He, H. Zha, C. H. Q. Ding, and H. D. Simon, "Web document clustering using hyperlink structures," Computational Statistics and Data Analysis 41(1), pp. 19–45, 2002.
