From Data to Knowledge: an Integrated Rule-Based Data Mining System

Chien-Chung Chan and Zhicheng Su
Department of Computer Science
University of Akron
Akron, OH 44325-4003
[email protected]

Abstract

This paper presents an integrated rule-based data mining system that is capable of creating rule-based classifiers with web-based user interfaces from data sets provided by end users. It provides a streamlined integration of three technologies, namely database systems, machine learning systems, and rule-based systems. Rules generated by the system are based on rough set theory; thus, it is capable of dealing with uncertain rules. The system generates user interfaces dynamically, so end users are relieved of the burden of programming. The generated classifiers are stored in a database system and are delivered as web-based applications; therefore, they can be managed and accessed easily using web browsers.

1. Introduction

The development of computer technologies has provided many useful and efficient tools to produce and store huge amounts of data in various forms. The raw data stored in databases or computer files have become important assets of modern enterprises. Therefore, it has become a common challenge for enterprises to make the best use of these data. The task of extracting useful information from data is not a new one; it has been a common interest in research areas such as statistical data analysis, machine learning, and pattern recognition. An emerging research area called Knowledge Discovery in Databases (KDD) is concerned mainly with developing new approaches and techniques that enable efficient extraction of useful information from databases. In general, KDD refers to the overall process of finding and interpreting useful patterns from data. A typical KDD process may consist of six steps: (1) select an application domain and create a target data set, (2) preprocess and clean the data set, (3) transform the data set into a proper form, (4) choose the functions and algorithms of data mining, (5) validate and verify discovered patterns, and (6) apply the discovered patterns [1, 2].

Research in KDD has contributed to the recent development of commercial systems. Major vendors of database systems such as Oracle, IBM, and Microsoft have extended their systems to support some basic functionalities of KDD. These systems may be used to create data mining models based on supported clustering and classification algorithms. Since they are rooted in database technologies, no inference or reasoning tools are supported.

On the other hand, research in AI has contributed to the development of expert systems since the early 1980s [3]. Most successful expert systems have used if-then rules to represent experts' knowledge; hence, they are also called rule-based systems. The basic structure of a rule-based expert system includes a rule base, which is a set of if-then rules, and an inference engine, which can be used to infer answers to given inputs by using the rules in the rule base. One major challenge in building expert systems is how to construct the rule base. Another issue is whether the system is capable of dealing with uncertain rules. Expert system development environments such as CLIPS [4] and JESS [5] provide tools for end users to program or enter rules into the rule base. However, they do not provide machine learning tools to generate rules from data sets. In addition, their inference engines may not support reasoning with uncertain rules.

Most data mining tools are based on results of machine learning research, which is a well-established area in AI [6, 7], and there have been many successful applications. One of the well-known programs is Quinlan's C4.5 [8], whose revised commercial version is called C5.0; it can be used to generate classifier programs from collections of training examples. The generated classifier can be represented as a decision tree or a set of production rules. The C4.5 system can also generate an inference engine that will apply the generated decision trees or rules to produce answers for queries entered interactively. In C5.0, the system can generate C source code of classifiers for developing embedded applications. Another well-known open-source data mining package is WEKA [7], which is implemented in Java. The WEKA package is quite comprehensive, and it provides tools for pre-processing and for typical data mining tasks such as classification, clustering, and association rule mining. However, no inference engine for using the generated rules is provided. Because it is implemented in Java, it is easy to embed WEKA's tools in Java applications. The capability to support embedded applications is quite attractive; however, it requires some programming to provide an end-user interface. In addition, there is one issue that has not been addressed by most data mining packages, namely, how to manage the classifiers generated by these systems.

In the proposed system, our focus is more on the perspective of end users who do not have programming skills. End users are required only to provide data sets and the knowledge needed to pre-process the data. From a pre-processed data set, our system generates a metadata file containing attribute information of the data and a set of if-then rules with uncertainties. The rules are generated using the BLEM2 learning program [9]. The resulting metadata file and rule file are used by a web-based classifier generating system to generate a web-based user interface for running the target classifier. For each target classifier, the corresponding metadata file and rule file are stored in a database system to provide manageability.

The organization of the paper is as follows. In Section 2, we provide a brief overview of rule-based classifiers and the components and workflow of our system. Section 3 presents the underlying database and component design of the system. Implementation is covered in Section 4. Section 5 gives conclusions, and references are listed in Section 6.

2. System Overview

2.1. Rule-Based Classifier

A rule-based classifier system is a set of if-then rules that implements a classifier. In general, the inference step in a rule-based classifier is quite simple: if the input data match the conditions on the Left Hand Side (LHS) of a rule, the rule is called firable, and firing the rule returns the decision on the Right Hand Side (RHS) of the rule. Because the conditions of different rules are not necessarily mutually exclusive, it is possible that multiple rules are firable for a given input. When the RHSs returned by the firable rules differ, the answer returned by the system is not unique; we call this condition rule conflict. One way to resolve rule conflict is the majority voting method, i.e., to return the value produced by most of the firable rules and break ties arbitrarily. It is also possible to use methods based on theories such as fuzzy sets [10, 11] and rough sets [12, 13]. In our system, a maximum sum (MaxSum) approach based on rough set theory has been implemented; it is presented in Section 2.3.
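As a brief illustration of the difference (with hypothetical numbers), suppose an input fires three rules: r1 and r2 conclude class A with (certainty, strength) pairs (0.9, 0.2) and (0.7, 0.1), while r3 concludes class B with (1.0, 0.4). Majority voting returns A (two votes against one). The maximum sum approach instead compares the sum of certainty * strength per decision value: A scores 0.9 * 0.2 + 0.7 * 0.1 = 0.25, while B scores 1.0 * 0.4 = 0.40, so B is returned.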

2.2. System Components and Workflow

As shown in Figure 1, there are three components in the proposed system: Data Preprocessing, Rule Generator, and RBC Generator.

[Figure 1: Input Data Set -> Data Preprocessing -> Rule Generator -> RBC Generator]
Figure 1. System components and workflow.

Input Data Set: The required format for an input data set is a text file of comma-separated values (CSV), which can be created using the MS Excel program. It is assumed that there are N columns of values corresponding to N attributes or variables, which may take real or symbolic values. The first N - 1 attributes are treated as condition attributes and the last one as the decision attribute.

Data Preprocessing: This component is used to discretize the domains of real-valued attributes into a finite number of intervals. The discretized data file is then used to generate a metadata (attribute information) file and a training data file.

Rule Generator: The symbolic learning program BLEM2 [9] is used to generate rules with uncertainty from the discretized training data file and the corresponding attribute file.
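For illustration (the attribute names and values are invented, and the rule syntax shown is only schematic), an input CSV file with two condition attributes and one decision attribute might look like:

Temperature,Humidity,Play
71.5,high,no
68.0,normal,yes
64.2,normal,yes

After Data Preprocessing discretizes the real-valued Temperature into intervals, BLEM2 generates if-then rules annotated with the measures used later for conflict resolution (support, certainty, strength, and coverage), for example (illustrative values):

IF Temperature = [60, 70) AND Humidity = normal THEN Play = yes
   (support = 2, certainty = 1.0, strength = 0.67, coverage = 1.0)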

RBC Generator: This component is used to generate a web-based Rule-Based Classifier (RBC) from a rule file and a metadata file. The idea was first introduced in [14]. For each pair of metadata and rule files, the user interface for running the classifier is generated dynamically. It includes an inference engine shared by all target classifiers. In addition, relevant information about target classifiers and user interfaces is stored in a database for manageability.

The architecture of the RBC generator is a multi-tier client-server system, shown in Figure 2. Using a web browser, clients interact with the backend SQL server through services provided by the application server in the middle tier. The application server is responsible for dispatching requests to the intended backend server, receiving responses from the backend server, and presenting the final results back to the clients. The detailed design of the middle-tier components and the database of the RBC generator is presented in Section 3.

[Figure 2: Client <-> Middle Tier (application server) <-> SQL DB server, exchanging requests and responses]

Figure 2. Architecture of RBC generator.

2.3. Workflow of RBC Generator

The workflow of the RBC generator is shown in Figure 3.

[Figure 3: Rule set file + Metadata file -> RBC Generator -> SQL rule table + Rule table definition]

Figure 3. Workflow of RBC generator.

As shown in Figure 3, the generator takes a rule set file and a metadata file as its inputs, and it dynamically generates the SQL statements for creating tables in the database. Rules of different target classifiers are stored separately in dynamically generated relational tables.

The inference engine that is shared by all target classifiers is implemented as dynamic SQL statements generated from user inputs. It uses the pattern matching ability of the SQL processor to determine which rules are firable. Rule conflict resolutions are implemented using stored procedures. For instance, to implement the MaxSum approach based on rough set theory, the SQL statement can be constructed as follows (angle brackets denote placeholders filled in at run time):

SELECT TOP 1 <decision attribute> AS decision,
       SUM(certainty * strength) AS certainty
FROM <rule set table>
WHERE <selected condition attribute> = <user selected value>
GROUP BY <decision attribute>
ORDER BY certainty DESC

This SQL statement returns the decision-attribute value with the maximum sum of certainty * strength over the rules whose condition attributes match the user-selected values. The majority voting method can be implemented as:

SELECT TOP 1 <decision attribute>, COUNT(*) AS MAX_NUM
FROM <rule set table>
WHERE <selected condition attribute> = <user selected value>
   OR <not selected condition attribute> IS NULL
GROUP BY <decision attribute>
ORDER BY MAX_NUM DESC

This SQL statement returns a decision-attribute value with the maximum number of matching rules, where a rule matches whenever the user-selected values meet the condition-attribute values stored in the database or the not-selected condition attributes have NULL values in the database. Conflict resolution approaches are stored in a relational table and retrieved dynamically based on the user's preference during the execution of an RBC.
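To make the dynamically generated SQL concrete, the following sketch instantiates the MaxSum query for the hypothetical example of Section 2.2 (the table, attribute, and value names are invented for illustration):

SELECT TOP 1 Play AS decision,
       SUM(certainty * strength) AS certainty
FROM Weather
WHERE Humidity = 'normal'
GROUP BY Play
ORDER BY certainty DESC

Here Weather is assumed to be the rule set table of the domain, Temperature and Humidity are condition attributes, Play is the decision attribute, and the user has selected only Humidity = 'normal'.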

3. Design of the RBC Generator

3.1. Backend Database Design

3.1.1. Domain Information Tables

The information retrieved from the metadata file is stored in three tables: the Domain, Attribute, and Lookup tables. The Domain table contains the domain name and description, and it guarantees a unique domain ID for each domain. The Attribute table contains the information about each attribute and its mapping to the corresponding rule set table, such as the actual data type and size used when creating the rule set table. Each attribute refers to several rows in the Lookup table, and each row contains one possible value for that attribute. The relationships among these tables are shown in Figure 4, where the endpoints of a line indicate whether the relationship is one-to-one or one-to-many. If a relationship has a key at one endpoint and a figure-eight at the other, it is a one-to-many relationship. If a relationship has a key at each endpoint, it is a one-to-one relationship. A relationship line indicates that a foreign key relationship exists between one table and another. For a one-to-many relationship, the foreign key table is the table near the line's figure-eight symbol. If both endpoints of the line attach to the same table, the relationship is a reflexive relationship.

Figure 4. Structure of domain information tables.

Domain table
From the metadata file, the domain name and description are extracted into the Domain table, and a unique domain ID is generated for each new domain.

Attribute table
The attribute information is extracted line by line from the metadata file. The attribute name and number of values are extracted directly from the metadata file. The data type and size fields are computed from the attribute category. For example, if the attribute category is 'C', the mapped data type is 'VARCHAR(127)'; if the attribute category is 'D', the mapped data type is 'DECIMAL(8,4)'; and if the attribute category is 'I', the mapped data type is 'INTEGER'. A type field is used to record whether an attribute is a condition or a decision attribute.

Lookup table
The Lookup table stores the actual values of the attributes. It is also extracted from the metadata file line by line.

Rule Set table
The rule set file is parsed and stored in a rule set table. Each domain has its own rule set table, whose name is the same as the domain name. The rule set table is created dynamically according to the information in the Attribute table, together with four extra fields generated by the BLEM2 learning program: support, certainty, strength, and coverage. These fields are used in conflict resolution. A sketch of these tables is given below.
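As a minimal sketch of the backend schema (column names, key choices, and the types of the measure columns are assumptions made for illustration; only the table roles, the category-to-type mappings, and the four rule measures are described above), the domain information tables and a dynamically generated rule set table for a hypothetical Weather domain might look roughly like this:

-- Domain information tables (illustrative column names and keys).
CREATE TABLE Domain (
    DomainID    INTEGER PRIMARY KEY,       -- unique ID generated per domain
    DomainName  VARCHAR(127),
    Description VARCHAR(255)
);

CREATE TABLE Attribute (
    AttributeID INTEGER PRIMARY KEY,
    DomainID    INTEGER REFERENCES Domain(DomainID),  -- one domain, many attributes
    AttrName    VARCHAR(127),
    Category    CHAR(1),                   -- 'C', 'D', or 'I'
    DataType    VARCHAR(32),               -- e.g. 'VARCHAR(127)', 'DECIMAL(8,4)', 'INTEGER'
    NumValues   INTEGER,
    AttrType    CHAR(1)                    -- condition or decision attribute
);

CREATE TABLE Lookup (
    LookupID    INTEGER PRIMARY KEY,
    AttributeID INTEGER REFERENCES Attribute(AttributeID),  -- one attribute, many values
    AttrValue   VARCHAR(127)
);

-- Dynamically generated rule set table for a hypothetical domain "Weather";
-- the attribute column types come from the Attribute table, and the last four
-- columns are the rule measures produced by BLEM2 (types shown are assumptions).
CREATE TABLE Weather (
    Temperature VARCHAR(127),
    Humidity    VARCHAR(127),
    Play        VARCHAR(127),
    support     INTEGER,
    certainty   DECIMAL(8,4),
    strength    DECIMAL(8,4),
    coverage    DECIMAL(8,4)
);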

3.1.2. Management Tables

For flexibility and ease of maintenance, menu items are stored in the database and retrieved dynamically.

Figure 5. Structure of management tables.

Users of the system are grouped into different user roles, such as Administrator, Author, and Operator. These user roles are used to enforce permission control on menu items; in other words, different users are mapped to different sets of menu items. The structure of the management tables used in the RBC generator is shown in Figure 5.

3.1.3. Miscellaneous Tables

There are two miscellaneous tables used in the system. The tblApproach table is used to map a conflict resolution approach to the stored procedure that implements it; the approach names are retrieved dynamically when the application runs. The tblMetaFile table is a temporary place to store the file uploaded from the client side.

Figure 6. Miscellaneous tables.
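As a small sketch of the lookup-and-dispatch idea (the column names of tblApproach and the @where parameter of the approach procedure are assumptions; only the table's role as a mapping from approach names to stored procedures is stated above):

-- Look up the stored procedure that implements the user's chosen approach.
DECLARE @proc sysname
SELECT @proc = ProcName
FROM tblApproach
WHERE ApproachName = 'MaxSum'

-- Run it, passing the dynamically built condition string.
EXEC @proc @where = 'Humidity = ''normal'''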

3.2. Middle-Tier Components

3.2.1. Database Access

For ease of maintenance and reusability, we wrap all the necessary database operations in one component. The component hides the details of database connection, pooling, and permission control; it provides many overloaded functions to run stored procedures and SQL statements; and it can return either a single value or a disconnected dataset. It can be reused in any other .NET application without changes. The connection string, which contains the database name and login information, is read from the standard XML-based web application configuration file, web.config.

3.2.2. Account Management

User role based permission management: Users are grouped into different roles, and resource permissions such as menu items are assigned to the roles rather than to a specific user. Currently there are three roles: Administrator, Author, and Operator.

Dynamic menu retrieval: Menu items are not hard coded in the system. They are stored in the database table tblSideMenuItem. The database administrator can insert a new item into this table and create a new row in the table tblUserSideMenuItemAssoc, which contains the permission mapping between user roles and menu items. Once done, the new menu items are available to all users with the proper roles. This is accomplished by a web user control component, which takes a user role as input and retrieves the menu items dynamically from the table tblUserSideMenuItemAssoc.

Authentication: Authentication is enforced by the web configuration stored in the web.config file. Access to any page of this application is redirected to a logon page. It is possible to protect only some of the pages by changing the default policy in the configuration file.

Session Control: After the user logs in, some user information such as the user ID and user name is kept throughout the session in order to customize the appearance for the specific user and to track the user's activities. This information is deleted after the user logs out or the session times out.

Add and Delete Account: Adding or deleting an account involves two tables, tblPerson and tblUser. We use a transaction to guarantee data integrity, as sketched below.
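A minimal sketch of the idea (the column names and values are hypothetical; only the two table names and the use of a transaction are stated above):

-- Hypothetical columns; the real schema may differ.
BEGIN TRANSACTION

INSERT INTO tblPerson (PersonID, Name) VALUES (42, 'Jane Doe')
IF @@ERROR <> 0
BEGIN
    ROLLBACK TRANSACTION   -- undo the partial change
    RETURN
END

INSERT INTO tblUser (UserID, PersonID, UserRole) VALUES (7, 42, 'Author')
IF @@ERROR <> 0
BEGIN
    ROLLBACK TRANSACTION
    RETURN
END

COMMIT TRANSACTION         -- both rows are committed together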

3.2.3. Domain Operations

Domain operations are encapsulated in a Domain class. When creating or deleting a domain, several tables need to be handled in the correct order: the Domain, Attribute, Lookup, and rule set tables. They have referential relationships, and data integrity must be maintained. For instance, when deleting a domain, the rule set table needs to be dropped, and the corresponding information should be deleted first from the Lookup table, then from the Attribute table, and finally from the Domain table. All these operations must be done in one transaction.

Conflict resolutions are implemented using stored procedures. The conflict resolution approaches are stored in the tblApproach table and retrieved dynamically during application execution. This enables users to add new approaches dynamically, using our stored procedure builder, to meet their requirements.

Stored procedure builder: We have developed a stored procedure builder tool for dynamically creating conflict resolution approaches. The builder first generates a stored procedure according to the SQL statement input by the user and then adds a new record to the tblApproach table to map the approach name to the stored procedure. All of this work is itself done in a stored procedure; in effect, a stored procedure is used to generate other stored procedures. A rough sketch is given below.
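A rough, simplified sketch of the idea, not the actual implementation: the procedure and parameter names, the tblApproach columns, and the /*WHERE*/ placeholder convention are all assumptions; the paper only states that a stored procedure generates the approach procedure and registers it in tblApproach.

-- Hypothetical builder: creates an approach procedure from a user-supplied
-- SQL template and registers it in tblApproach.
CREATE PROCEDURE spBuildApproach
    @ApproachName VARCHAR(64),
    @Template     VARCHAR(4000)   -- user's SQL with a /*WHERE*/ token marking
                                  -- where the run-time condition is spliced in
AS
BEGIN
    DECLARE @procName sysname, @body VARCHAR(4000), @ddl VARCHAR(8000)
    SET @procName = 'spApproach_' + @ApproachName
    SET @body = REPLACE(@Template, '''', '''''')   -- escape quotes for embedding

    -- Generate the approach procedure: at run time it substitutes the token
    -- with the WHERE clause built from the user's selections and executes it.
    SET @ddl = 'CREATE PROCEDURE ' + @procName + ' @where VARCHAR(4000) AS '
             + 'DECLARE @sql VARCHAR(8000) '
             + 'SET @sql = REPLACE(''' + @body + ''', ''/*WHERE*/'', @where) '
             + 'EXEC(@sql)'
    EXEC (@ddl)

    -- Record the mapping from approach name to its stored procedure.
    INSERT INTO tblApproach (ApproachName, ProcName)
    VALUES (@ApproachName, @procName)
END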

Dynamic classifier generator: Users can run a target classifier at any time by choosing the "Run Application" menu. After the domain name is chosen, the attributes change accordingly. The attribute information is retrieved from the database dynamically, as is the construction of the WHERE statement. The WHERE statement is sent to the corresponding approach stored procedure, and the decision and the fired rules are returned.

4. Implementation

The proposed system has been implemented using the Microsoft .NET framework and MS SQL Server 2000 [15]. It has been deployed as a .NET application on an IIS web server running on a Pentium 4/700 MHz machine since summer 2003. Due to space limitations, a running example is omitted; it will be presented at the conference presentation.

5. Conclusions

This paper presents a web-based system for the automated generation of rule-based classifiers from data sets provided by end users. No programming is required from end users, since user interfaces for running classifiers are generated automatically by the system. Generated classifiers and related user interfaces are stored in relational tables, so they are quite extensible and manageable. Created classifiers are web-based, so they are easily accessible by thin clients.

6. References

[1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge Discovery, Vol. 1, Issue 1, 1997.
[2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an overview," in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.

[3] Buchanan, B.G. and E.H. Shortliffe, eds., Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Reading, MA, Addison-Wesley, 1984.
[4] Giarratano, J. and G. Riley, Expert Systems: Principles and Programming, 3rd ed., PWS Publishing Company, 1998.
[5] Friedman-Hill, E., Java Expert System Shell, Sandia National Laboratories, Livermore, CA, http://herzberg.ca.sandia.gov/jess
[6] Mitchell, T.M., Machine Learning, The McGraw-Hill Companies, Inc., 1997.
[7] Witten, I.H. and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, San Francisco, Morgan Kaufmann, 2000.
[8] Quinlan, J.R., C4.5: Programs for Machine Learning, San Francisco, Morgan Kaufmann, 1993.
[9] Chan, C.-C. and S. Santhosh, "BLEM2: Learning Bayes' rules from examples using rough sets," Proc. NAFIPS 2003, 22nd Int. Conf. of the North American Fuzzy Information Processing Society, July 24-26, 2003, Chicago, Illinois, pp. 187-190.
[10] Zadeh, L.A., "Fuzzy sets," Information and Control, 8:338-353, 1965.
[11] Gupta, M.M., R.K. Ragade, and R.R. Yager (Eds.), Advances in Fuzzy Set Theory and Applications, North-Holland, Amsterdam, 1979.
[12] Pawlak, Z., "Rough sets," Int. J. of Computer and Information Sciences, 11, pp. 341-356, 1982.
[13] Pawlak, Z., J. Grzymala-Busse, R. Slowinski, and W. Ziarko, "Rough sets," Communications of the ACM, Vol. 38, No. 11, November 1995, pp. 89-95.
[14] Khasawneh, N. and C.-C. Chan, "Servlet-based implementation for rule-based classifiers," Proc. 12th Midwest Artificial Intelligence and Cognitive Science Conference MAICS 2001, March 31 - April 1, 2001, Miami University, Oxford, Ohio, pp. 70-74.
[15] Su, Zhicheng, "Dot Net implementation for rule-based classifiers," Master's research report, Department of Computer Science, University of Akron, June 2003.