
International Journal on Advances in Computing and Communication Technologies Volume 2, Issue 1, 2013

Application of CURE Data Clustering Algorithm to Batangas State University Student Database

Nguyen Thi Linh
Department of Information Technology
ICT University – Thai Nguyen University
Thai Nguyen, Vietnam

Christopher Chua
Department of Informatics and Computing Sciences
Batangas State University
Batangas City, Philippines

Abstract—Clustering is regarded as one of the most complex, well-known and widely studied problems in data mining theory. Data clustering is the process of grouping data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. The increasing enrolment of students at Batangas State University (BatStateU) enlarges the student database, which can be mined to discover patterns. The extracted patterns can be converted into understandable information that is useful to the organization. A popular data clustering algorithm known as Clustering Using Representatives (CURE) was implemented in the C# programming language to cluster the student database of Batangas State University.

Keywords—CURE algorithm, data clustering, data mining

I. INTRODUCTION

Data mining is one of the main steps in the process of knowledge discovery. It is considered a complex process in which intelligent methods are applied to extract data patterns [1]. It integrates techniques from multiple disciplines such as database and data warehouse technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial or temporal data analysis.

The investigation of data mining methods remains a main and essential subject for researchers and scientists. Because information resources are vast and diverse, discovering a single general method for data mining is impossible: each kind of information resource or database has its own methods that are appropriate for mining it. The researchers' main objective is therefore to find effective data mining methods for each case.

One of the most complex, well-known and widely studied problems in data mining theory is clustering. The term refers to the process of grouping data into classes or clusters so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters [2]. As mentioned by Ma and Wu [3], dissimilarities are assessed based on the attribute values describing the objects, and distance measures are usually used.

CURE is an agglomerative algorithm of the hierarchical family, which builds clusters gradually. It identifies clusters by using c representative points per cluster, which are created by choosing well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction α [4]. The parameter α can also be used to control the shapes of clusters. A smaller value of α shrinks the scattered points very little and thus favors elongated clusters. On the other hand, with larger values of α, the scattered points move closer to the mean, and clusters tend to be more compact [4]. During each iteration, the two clusters with the closest pair of representative points are merged, until the desired number of clusters is reached. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers.

In this paper, the following objectives are attained:
1. Characterize the student database of BatStateU.
2. Develop a database clustering application using the CURE algorithm.
3. Utilize the developed application to cluster the student database of BatStateU.

II. METHODOLOGY

This paper used the constructive research method to develop a data clustering application. The constructive research method deals with building an artifact (practical, theoretical or both) that solves a domain-specific problem in order to create knowledge about how the problem can be solved (or understood, explained or modeled) in principle [5]. The C# object-oriented programming language was used to design the interface and to implement the CURE algorithm and the functions of the application. SQL Server 2005 was used as a tool for pre-processing data, designing data tables and implementing connections, queries, and stored procedures to ensure interaction between the user and the application, as well as between the application and the database system.
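The two CURE steps that the application implements, namely building c representative points shrunk toward the cluster mean by α and comparing clusters by their closest pair of representatives, can be illustrated with a short sketch. The Python code below is an illustration only and is not the authors' C# implementation; the function names (centroid, dist, representatives, cluster_distance) and the choice of Euclidean distance are assumptions made for this example.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dist(p, q):
    """Euclidean distance between two points (assumed distance measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def representatives(points, c, alpha):
    """Choose up to c well-scattered points, then shrink them toward the mean.

    The first point chosen is the one farthest from the mean; each later
    point is the one farthest from the points already chosen.
    """
    mean = centroid(points)
    scattered = []
    while len(scattered) < min(c, len(points)):
        candidates = [p for p in points if p not in scattered]
        if not candidates:  # only duplicates of chosen points remain
            break
        if scattered:
            pick = max(candidates,
                       key=lambda p: min(dist(p, s) for s in scattered))
        else:
            pick = max(candidates, key=lambda p: dist(p, mean))
        scattered.append(pick)
    # Shrink each chosen point toward the mean by the fraction alpha.
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean))
            for p in scattered]


def cluster_distance(reps_u, reps_v):
    """Distance between two clusters: their closest pair of representatives."""
    return min(dist(p, q) for p in reps_u for q in reps_v)


if __name__ == "__main__":
    square = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
    print(representatives(square, c=2, alpha=0.5))  # [(1.0, 1.0), (3.0, 3.0)]
```

With α = 0 the representatives stay on the original scattered points, and with α = 1 they all collapse onto the mean, which matches the description of small α favoring elongated clusters and large α favoring compact ones.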

Pre-processing of the raw data to prepare them for further processing is needed in data mining [6]. In this context, the BatStateU student database was pre-processed manually and by queries using select, update and insert procedures in SQL. Each table is represented by an entity in the SQL Server database and a class in the developed application. A total of nine tables were created, namely tblGrade, tblStudent, tblDepartment, tblCourse, tblSubject, tblInstructor, tblYear, tblGradeFilter and tblGradeCluster.

In the design of the interface, six menus were created, namely BSU Data, User Data, Parameters, View, Window, and Help. Under the BSU Data menu, the application has a main form called frmBsuData, which presents all information filtered by the user. Under this form, seven sub-forms can be found. The forms frmGrade, frmStudent, frmCourse, frmSubject, frmInstructor and frmYear are used to input information about students, subjects, courses, grades, instructors and school years. The form frmGradeCluster allows the user to choose the object to cluster.

Under the User Data menu, the application has a form called frmNewData, which allows the user to input data for clustering. For the Parameters menu, the form named DialogSetAll is defined in order to set the clustering parameters. The application has a main clustering form called frmOpenData. The data to be clustered and the clustering results are presented in this form. The View and Window menus support the display of the toolbar, status bar and windows of the application. Under the Help menu, the application has a form called frmHelp, which provides instructions on how the application can be used.

CURE and its functions were implemented using an object-oriented structure of classes and objects. In addition, two data structures, HEAP and KD_TREE, were used and applied to CURE. The main functions of the application allow managing information about grades, students, subjects, and courses; filtering the BatStateU data to be clustered; setting clustering parameters; executing data clustering; presenting the result; and saving or printing results. In addition, the application also allows mined data to be input directly or from files for clustering.

The database of the BatStateU – Graduate School (GS) students was used for clustering students, subjects, and courses. Data clustering involves steps such as filtering the data to be clustered, transforming the filtered data into mined data, choosing the clustering object, setting the clustering parameters of the CURE algorithm, and executing the clustering [2].

In order to choose the best results, the clustering of specific data is repeated many times using different clustering parameters of the CURE algorithm. These include the number of clusters (k), the number of representative points (c) and the shrink coefficient (α). The clustering result changes when the value of any one of these parameters changes.

III. RESULTS AND DISCUSSION

A. Characteristics of Student Database of BatStateU

The BatStateU student database is described by means of the following parameters: SRCODE for student, CODE for subject, COURSE CODE for course and FG for grade. The SRCODE was used to identify each student. It is a unique field in which students are described by their enrolment year followed by five digits (e.g. 2002-00979). The subject code was called the CODE field and is described by letters, which are the initial letters of the course name or subject name, followed by numbers (e.g. IT 512). Courses in the BatStateU database are represented by the COURSE CODE field and are coded using the initial letters of the course name (e.g. DPENGLS). The final grade was called the FG field. If a student received an incomplete grade in a certain subject, the developed application assigns a value of zero to the FG. Possible values for this field are 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 4.00, 5.00 or 0.00.

B. Development of the Data Clustering Application Using CURE Algorithm

In this paper, the developed clustering application includes many classes and objects, which were structured as a two-tier architecture in a window-based application. The two tiers are the presentation tier and the implementation tier.

The graphical user interface of the application is designed at the presentation tier. This tier contains the window forms used for presenting data and accepting input from the users.

The implementation tier contains the business logic, validations and calculations related to the data. This tier contains classes that support data access and the clustering process of the CURE algorithm. The classes intended to support data access are labeled clsTblCourse, clsTblSubject, clsTblStudent, clsTblGrade, clsTblYear, clsTblInstructor, clsTblGradeCluster and FILES.

The classes that support the clustering process of the CURE algorithm are named POINT, CLUSTER, HEAP, KD_TREE and CLUSTERING_CURE. Specific methods for each class were created.

The clustering process using the CURE algorithm is implemented in the CLUSTERING_CURE class. Its main procedure is named Cluster(). The procedure needs a set of points and the number of desired clusters k. The result of this procedure is a set of desired clusters. The CLUSTERING_CURE class is defined as follows:

    class CLUSTERING_CURE
    {
        // This property contains the set of initialized points
        private POINT[] points;

        // Initialize a CLUSTERING_CURE object from a set of points
        public CLUSTERING_CURE(POINT[] points)
        {
            this.points = points;
        }

        public CLUSTER[] Cluster()
        {
            // Initialize heap and kd-tree
            CLUSTER[] resultArray;
            CLUSTER[] clusters = TOOLS.Points2Clusters(points);
            KD_TREE T = new KD_TREE(points);
            HEAP Q = new HEAP(clusters);

            // Clustering loop
            while (Q.size() > TOOLS.k)
            {
                CLUSTER u = Q.DeleteMin();
                CLUSTER v = u.Closest;
                Q.Delete(v);
                CLUSTER w = u.merge(v);

                // Delete u.Rep and v.Rep from the tree T and insert w.Rep
                bool deleteOK = true;
                foreach (POINT p in u.Rep) T.Delete(p, 0, ref deleteOK);
                foreach (POINT p in v.Rep) T.Delete(p, 0, ref deleteOK);
                foreach (POINT p in w.Rep) T.Insert(p);

                // Initialize Closest for w
                w.Closest = Q.Data[0];
                w.DistCloset = w.distCluster(w.Closest);

                // Search for w.Closest and update Closest for the other clusters in Q
                for (int i = 0; i <= Q.Last; i++)
                {
                    CLUSTER x = Q.Data[i];

                    // Find w.Closest
                    if (w.distCluster(x) < w.distCluster(w.Closest))
                    {
                        w.Closest = x;
                        w.DistCloset = w.distCluster(w.Closest);
                    }

                    // Find x.Closest
                    if (TOOLS.Equals(x.Closest, u) || TOOLS.Equals(x.Closest, v))
                    {
                        if (x.distCluster(x.Closest) < x.distCluster(w))
                        {
                            CLUSTER closest = T.Closest_Cluster(x, x.distCluster(w), Q);
                            if (!TOOLS.Equals(x, closest))
                            {
                                x.Closest = closest;
                                x.DistCloset = x.distCluster(x.Closest);
                            }
                        }
                        else if (!TOOLS.Equals(x, w))
                        {
                            x.Closest = w;
                            x.DistCloset = x.distCluster(x.Closest);
                        }
                        Q.Relocate(i);
                    }
                    else if (x.distCluster(x.Closest) > x.distCluster(w))
                    {
                        x.Closest = w;
                        x.DistCloset = x.distCluster(x.Closest);
                        Q.Relocate(i);
                    }
                }
                Q.Insert(w);
            } // End while

            // Return the clustering result
            resultArray = new CLUSTER[Q.size()];
            for (int i = 0; i <= Q.Last; i++)
                resultArray[i] = Q.Data[i];
            return resultArray;
        }
    }

Finally, to support interaction between the presentation tier and the implementation tier, the TOOLS class was designed. Properties and methods supporting the clustering process in the implementation tier are created in this class.

C. Clustering the Student Database of BatStateU

Specifically, the BatStateU – GS database for the 2008-2009 and 2009-2010 academic years, which includes 529 students, was used for clustering. The steps performed in the clustering process were filtering data, transforming the filtered data into mined data, choosing the clustering object, setting the clustering parameters and executing the clustering. Clustering of students was based on the statistical ratio of grades that each student achieved in all subjects. Table 1 shows the computation of grade of a student.

TABLE I. COMPUTATION OF GRADE OF A STUDENT

The statistical ratio of each kind of grade is calculated by dividing the frequency of that grade by the total frequency of all kinds of grade. Hence, the statistical ratio of grade 1.25 is computed as 6/13 = 0.4615. The student has statistical ratios for the 0.00, 1.00, 1.25, 1.50, 1.75 and 2.00 grades of 0.0000, 0.0000, 0.4615, 0.4615, 0.0769 and 0.0000, respectively. The application automatically calculates the statistical ratio of each kind of grade and creates a data point (e.g. G2008-00148, 0.0000, 0.0000, 0.4615, 0.4615, 0.0769, 0.0000). This data point is saved in a table called tblGradeCluster, which resides in the SQL Server database. All data saved in the tblGradeCluster table are used for clustering.

In using the developed application, the following steps are performed:
1. Open the grade table of students from the BSU Data menu.
2. Transform the data into mined data by clicking the 'Transform into mined data' button.
3. Click the 'Yes' button to cluster the data.
4. Choose the desired clustering object (e.g. students).
5. Set the parameters (e.g. shrink coefficient = 0.7, no. of representatives = 4, no. of clusters = 15).
6. Click the 'OK' button.

Fig. 1 shows the student data clustering result. The clustering process of 526 students is equivalent to 526 data points. The number of clusters (k) can be set as desired from 1 to 526. The number of representative points (c) varies from 1 to 10 and the shrink coefficient (α) from 0.1 to 0.9. After several student clustering experiments with various parameters, the researchers found that using k = 15, c = 4 and α = 0.7 gave the best clustering result. Under these settings, most students are seen to have grades of 1.5 and are rated as good in terms of their performance.

Figure 1. Student Data Clustering Result

There is no method to select the number of clusters, the number of representative points or the shrink coefficient that will give the best clustering result. It is only known that the greater the similarity coefficient, the more similar are the two data points of the two clusters [3]. The authors therefore performed many experiments and used empirical evaluation to choose the optimal results. The clustering result is considered the best when the data points in a cluster have the highest similarity to each other.

For clustering of courses, the developed application clustered 25 courses, which correspond to 25 data points. The course clustering process is similar to the steps executed in student clustering, except that the 'course object' is selected when choosing the desired clustering object. After many course clustering experiments with various parameters, the authors found that using k = 8, c = 3 and α = 0.6 gave the best result.

Fig. 2 shows the course clustering result. The statistical ratios of grades in each course were obtained. The frequency of each grade for each course is counted. The statistical ratio of each kind of grade was calculated by dividing the frequency of that grade by the total frequency of all kinds of grade.

Figure 2. Course Clustering Result

Hence, the statistical ratio of grade 0.00 was computed as 36/405 = 0.0888. The MSINTEC course had statistical ratios for the 0.00, 1.00, 1.25, 1.50, 1.75 and 2.00 grades of 0.0888, 0.0123, 0.2716, 0.4567, 0.1753 and 0.0024, respectively. The application automatically calculated the statistical ratio of each kind of grade and created a data point (e.g. MSINTEC, 0.0888, 0.0123, 0.2716, 0.4567, 0.1753, 0.0024). The computation is illustrated in Table 2.

TABLE II. OBTAINED GRADES OF STUDENTS IN COURSE MSINTEC

For subject clustering, the subject code and FG fields were used. The clustering result was used to evaluate the level of difficulty of each subject. The clustering process of subjects was based on the statistical ratio of grades of each subject. The statistical ratio of grades in the subject IT 509 - E-learning and Related Technology is presented in Table 3.

TABLE III. GRADES OBTAINED BY STUDENTS IN THE SUBJECT IT 509

The application automatically calculates the statistical ratio of each kind of grade and creates a data point (e.g. IT 509, 0.1569, 0.0000, 0.2745, 0.3529, 0.2157, 0.0000). This data point is saved in the tblGradeCluster table in the SQL Server database. After several subject clustering experiments with various parameters, the authors found that using k = 10, c = 4 and α = 0.7 achieved the best result. Fig. 3 shows the subject clustering result.
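The statistical-ratio data points used for students, courses and subjects in this section can be reproduced with a few lines of code. The Python sketch below is an illustration rather than the application's C# implementation. The function name grade_ratios and the constant GRADE_LEVELS are hypothetical, and the grade counts (6, 6 and 1 out of 13) are inferred from the student ratios quoted earlier, since the table itself is not reproduced here. Values are rounded to four decimals, which may differ in the last digit from some figures printed in the paper.

```python
from collections import Counter

# The six grade levels that appear in the example data points.
GRADE_LEVELS = ["0.00", "1.00", "1.25", "1.50", "1.75", "2.00"]


def grade_ratios(final_grades, levels=GRADE_LEVELS):
    """Return the share of each grade level among all final grades given."""
    total = len(final_grades)
    if total == 0:
        return [0.0] * len(levels)
    counts = Counter(final_grades)
    return [round(counts[level] / total, 4) for level in levels]


if __name__ == "__main__":
    # Assumed counts: six grades of 1.25, six of 1.50 and one of 1.75.
    student_grades = ["1.25"] * 6 + ["1.50"] * 6 + ["1.75"]
    print(grade_ratios(student_grades))
    # [0.0, 0.0, 0.4615, 0.4615, 0.0769, 0.0]
```

Each returned list, prefixed with the student, course or subject identifier, corresponds to one data point of the kind stored in tblGradeCluster.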

Figure 3. Subject Clustering Result

IV. CONCLUSION AND FUTURE WORK

In this paper, the student database of BatStateU is described. The database uses the SRCODE field for student identification, the CODE field for subject name, the COURSE CODE field for course name and the FG field for grading. A database clustering application using the CURE algorithm was successfully developed using C# and SQL Server 2005. The application is based on a two-tier architecture in which several classes with specific methods were created to support data access and the clustering process of the CURE algorithm. With regard to the clustering of the database, the best clustering results for students, courses, and subjects are achieved with (k = 15, c = 4, α = 0.7), (k = 8, c = 3, α = 0.6) and (k = 10, c = 4, α = 0.7), respectively. Further analysis revealed that students are performing well and that subjects in BatStateU – GS are moderately difficult. The application developed in this paper can be extended with a method for selecting the number of clusters k, the number of representatives c, or the shrink coefficient α that gives the best clustering result using the CURE algorithm. Similar studies can be conducted using improved versions of CURE and applying them to complex databases that have mixed types of data, such as weather, business and geographical databases.

REFERENCES
[1] M.J.A. Berry and G.S. Linoff, Mining the Web: Transforming Customer Data, John Wiley & Sons, New York, 2002.
[2] J. R. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1998.
[3] G.G. Ma and J. Wu, Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM Series on Statistics and Applied Probability, 2007.
[4] S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, ACM 089791~996.5/98/006, 1998.
[5] G.D. Crnkovic, "Constructive Research and Info-Computational Knowledge Generation," Model-Based Reasoning in Science and Technology, Studies in Computational Intelligence, Vol. 314, 2010, pp. 359-380.
[6] I.B. Gul and A. Nosheen, MFP: A Mechanism for Determining Associated Patterns of Stock, Proceedings of the 6th International Conference on Frontiers of Information Technology, ISBN: 978-1-60558-642-7, 2009.