
IEEE Int. Conf. Neural Networks & Signal Processing, Nanjing, China, December 14-17, 2003

COMBINING MULTIPLE NEURAL NETWORKS FOR CLASSIFICATION BASED ON ROUGH SET REDUCTION

Daren Yu, Qinghua Hu and Wen Bao
Harbin Institute of Technology, Harbin, Heilongjiang Province, 150001, China

ABSTRACT

Generalization ability is a key measure of the performance of neural networks. Combining a set of neural networks into a multiple-classifier system is used to achieve high pattern recognition performance. In our work, rough set theory is introduced to reduce high-dimensional data and obtain multiple concise representations (reducts) of a single sample set. Multiple neural network classifiers are built based on different reducts. An average strategy and a majority voting strategy are introduced to combine the outputs from the different classifiers. The experimental results show that the combined system outperforms a single classifier.

1. INTRODUCTION

Recognition accuracy is one of the primary objectives of research in the field of pattern classification. A great number of recognition schemes have been developed for different applications, such as K-nearest neighbors, decision trees, neural networks, rough sets and support vector machines. Multiple classifier systems based on the combination of several different classifiers are currently used to achieve high pattern recognition performance. It has been observed that the sets of patterns misclassified by different classifiers do not necessarily overlap. This suggests that different classifiers potentially offer complementary information about the patterns to be recognized, which can be harnessed to improve the performance of the selected classifiers. The idea is not to rely on a single classifier; instead, the decisions from several classifying systems are combined with a decision fusion scheme to obtain high performance. For pattern recognition, Shaker [1] pointed out that, as the number of combined classifiers increases, the performance will improve as long as the accuracy of each classifier is over 50%.

There are two issues of importance to this research: how to build multiple classifiers with finite samples and how to combine the decisions from different classifiers. The first problem can be addressed by using different feature subsets [2,3,4] as well as different training samples [5,6], and each classifier can be built from different classifying models [7] or a single model [2,3,4,5,8]. Another interesting issue in the research on multi-classifier fusion is the scheme by which the classifiers are combined. Linear combination, majority voting, fuzzy fusion, genetic algorithms and dynamic classifier selection have been developed. Hashem [9] developed a way to improve model accuracy using optimal linear combinations of trained neural networks. Kittler [10] gave a theoretical framework for several combination strategies. Giacinto [11] proposed another approach, so-called "dynamic classifier selection" (DCS). DCS methods aim to select, for each test sample, the classifier that is most likely to classify it correctly.

Rough set methodology is an emerging theory in machine learning and data mining. In the last decade we have witnessed a rapid growth of interest in rough set theory and its applications. Feature extraction and feature selection are among the most fundamental steps in pattern recognition, and reduction of pattern dimensionality through feature selection is one of the main applications of rough set theory. According to the theory, reducts are the minimal subsets of attributes that preserve the discernibility between objects. Reduction of high-dimensional samples decreases the difficulty of building a classifying model and improves its performance. It is worth remarking that there is usually more than one reduct; that is, for a high-dimensional sample set we can obtain several minimal subsets of attributes that preserve the same discerning power between objects as all of the attributes do.

In this paper we present an approach that integrates rough set theory with neural networks for classification. Rough set methodology is introduced to reduce the high-dimensional samples, so that several different feature subsets are obtained. Some of these subsets are selected to train neural networks respectively. The outputs from the different neural networks are then combined with a decision fusion strategy. Tests show that the fusion system outperforms a single classifier.

The paper is organized as follows. In section 2 we introduce rough set theory and give a reduction algorithm. In section 3 we propose a multiple classifier combination method, in which neural networks are selected as the classifier model and two combination strategies are introduced. Experimental results based on the Australian credit approval data set are given in section 4.
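To make the fusion step concrete, the sketch below illustrates the two combination rules considered later in the paper: averaging of class probabilities and majority voting over predicted labels. It is only an illustrative Python sketch; the array shapes, function names and tie-breaking rule are assumptions, not part of the paper.

```python
import numpy as np

def average_fusion(prob_outputs):
    """Average the class-probability outputs of several classifiers.

    prob_outputs: list of arrays of shape (n_samples, n_classes), e.g. the
    softmax outputs of networks trained on different reducts.
    Returns one predicted class index per sample.
    """
    mean_probs = np.mean(np.stack(prob_outputs, axis=0), axis=0)
    return np.argmax(mean_probs, axis=1)

def majority_voting_fusion(label_outputs):
    """Combine hard class labels by majority vote.

    label_outputs: list of arrays of shape (n_samples,) holding class indices.
    Ties are broken in favour of the smallest class index.
    """
    votes = np.stack(label_outputs, axis=0)              # (n_classifiers, n_samples)
    n_classes = int(votes.max()) + 1
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=n_classes), 0, votes)
    return np.argmax(counts, axis=0)                      # (n_samples,)
```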



2. ROUGH SET METHODOLOGY

Rough set theory, a new mathematical approach to data analysis, was introduced by Zdzislaw Pawlak [12] to deal with imprecise or vague concepts. The basic idea of the theory hinges on classifying objects into similarity classes containing objects that are indiscernible with respect to some features [13]. It has been developed for machine learning and knowledge discovery in databases.

2.1 Basic concepts of rough set theory

Given two finite, non-empty sets U and A, U is the universe (a finite set of objects) and A is the set of attributes (features or variables). Each attribute a ∈ A has a set of values V_a, V being the domain of values of A, which defines an information function f_a : U → V_a. We call the 4-tuple IS = <U, A, V, f> an information system. a(x) denotes the value of attribute a for object x. Any subset B ⊆ A determines a binary relation Ind(B) on U, called the indiscernibility relation:

(x, y) ∈ Ind(B) if and only if a(x) = a(y) for every a ∈ B.
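As an illustration of these definitions (the toy table and attribute names are invented, not taken from the paper), an information system can be stored as a list of attribute-value records, and U/B can be materialised by grouping the objects that agree on every attribute in B:

```python
from collections import defaultdict

# Toy information system: U = {0, ..., 4}, attributes 'a', 'b', 'c' (invented values).
table = [
    {'a': 1, 'b': 0, 'c': 'x'},
    {'a': 1, 'b': 0, 'c': 'y'},
    {'a': 0, 'b': 1, 'c': 'x'},
    {'a': 0, 'b': 1, 'c': 'x'},
    {'a': 1, 'b': 1, 'c': 'y'},
]

def partition(table, B):
    """Return U/B, the family of equivalence classes of Ind(B).

    Two objects fall into the same class iff they take the same value
    on every attribute in B.
    """
    classes = defaultdict(set)
    for x, row in enumerate(table):
        classes[tuple(row[a] for a in B)].add(x)
    return list(classes.values())

print(partition(table, ['a', 'b']))   # [{0, 1}, {2, 3}, {4}]
```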

The family of all equivalence classes of Ind(B), namely the partition determined by B, will be denoted by U/B. The equivalence class of Ind(B) containing x will be denoted by B(x). If (x, y) ∈ Ind(B) we say that x and y are B-indiscernible. Equivalence classes of the relation Ind(B) are referred to as B-elementary sets. The elementary sets are the basic blocks of our knowledge about reality, sometimes called concepts.

Given an object subset X ⊆ U, we call B̲(X) and B̄(X) the B-lower and B-upper approximation of X, respectively. They are defined as follows:

B̲(X) = {x ∈ U : B(x) ⊆ X},   B̄(X) = {x ∈ U : B(x) ∩ X ≠ ∅}.
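A matching sketch of the two approximations, reusing the hypothetical `partition` helper and toy `table` from the previous example:

```python
def lower_approximation(table, B, X):
    """B-lower approximation of X: union of the B-elementary sets contained in X."""
    return {x for cls in partition(table, B) if cls <= X for x in cls}

def upper_approximation(table, B, X):
    """B-upper approximation of X: union of the B-elementary sets that intersect X."""
    return {x for cls in partition(table, B) if cls & X for x in cls}

X = {0, 2, 3}
print(lower_approximation(table, ['a', 'b'], X))   # {2, 3}
print(upper_approximation(table, ['a', 'b'], X))   # {0, 1, 2, 3}
```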

Let R be a family of equivalence relations and r ∈ R. If Ind(R) = Ind(R − {r}), we call r superfluous; otherwise, the attribute r is indispensable in R. Given Q ⊆ R, if Ind(Q) = Ind(R) and every q ∈ Q is indispensable, we call Q a reduct of R. Obviously, there is usually more than one reduct. The core is defined as the common part of all reducts:

Core(R) = ∩ red(R),

where red(R) denotes the set of all reducts of R. Reduct and core are two fundamental concepts of rough set theory. A reduct is an essential part of the information system that can discern all objects discernible by the original one.
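For intuition only, the reduct definition can be checked by brute force: enumerate the attribute subsets that preserve Ind(A) and contain no superfluous attribute. This exhaustive search is exponential in the number of attributes, reuses the toy helpers above, and is not the paper's reduction algorithm.

```python
from itertools import combinations

def ind_key(table, B):
    """Canonical form of the partition U/B, used to test Ind(Q) = Ind(A)."""
    return frozenset(frozenset(cls) for cls in partition(table, B))

def all_reducts(table):
    """Enumerate every attribute subset Q with Ind(Q) = Ind(A)
    in which every attribute is indispensable."""
    A = list(table[0].keys())
    full = ind_key(table, A)
    reducts = []
    for r in range(1, len(A) + 1):
        for Q in combinations(A, r):
            if ind_key(table, Q) != full:
                continue
            # Q is a reduct only if removing any single attribute changes the partition.
            if all(ind_key(table, [a for a in Q if a != q]) != full for q in Q):
                reducts.append(set(Q))
    return reducts

reducts = all_reducts(table)
core = set.intersection(*reducts) if reducts else set()
```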

2.2 Relative reduct and reduction algorithm

For a decision problem the information system is a 4-tuple S = <U, A, V, f> with A = C ∪ D, where C is the set of condition attributes and D is the set of decision attributes. For a given subset of condition attributes P ⊆ C, we define the positive region POS_P(D) with respect to the relation Ind(D) as

POS_P(D) = ∪ { P̲(X) : X ∈ U/Ind(D) }.

Let P and D be families of equivalence relations over U. A relation r ∈ P is D-dispensable in P if

POS_Ind(P)(Ind(D)) = POS_Ind(P − {r})(Ind(D));

otherwise r is D-indispensable in P. If every r ∈ P is D-indispensable, then P is D-independent. A subfamily Q ⊆ P will be called a D-reduct of P if and only if Q is D-independent and POS_Ind(Q)(Ind(D)) = POS_Ind(P)(Ind(D)). The common set of all D-indispensable relations in C will be called the D-core of C and will be denoted by CORE_D(C).

Let C and D denote the condition attribute set and the decision attribute set, respectively. We say that D depends on C in degree k (0 ≤ k ≤ 1), defined as k = |POS_C(D)| / |U|.
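The paper's reduction algorithm continues beyond this excerpt. As a sketch of the quantities just defined, the positive region and the dependency degree can be computed from the earlier helpers (here D is a list of decision attribute names, e.g. a hypothetical 'd' column added to the toy table), and a common greedy forward-selection heuristic yields a relative reduct. The heuristic is an illustrative assumption, not necessarily the algorithm used by the authors.

```python
def positive_region(table, P, D):
    """POS_P(D): objects whose P-elementary set lies entirely inside one D-class."""
    pos = set()
    for X in partition(table, D):
        pos |= lower_approximation(table, P, X)
    return pos

def dependency_degree(table, P, D):
    """k = |POS_P(D)| / |U|: the degree to which D depends on P."""
    return len(positive_region(table, P, D)) / len(table)

def greedy_relative_reduct(table, C, D):
    """Greedy forward selection: repeatedly add the condition attribute that most
    increases the dependency degree until it matches that of the full set C,
    then drop any attribute that turned out to be superfluous."""
    target = dependency_degree(table, C, D)
    reduct = []
    while dependency_degree(table, reduct, D) < target:
        best = max((a for a in C if a not in reduct),
                   key=lambda a: dependency_degree(table, reduct + [a], D))
        reduct.append(best)
    for a in list(reduct):
        rest = [b for b in reduct if b != a]
        if dependency_degree(table, rest, D) == target:
            reduct.remove(a)
    return reduct
```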