Learning Tolerance Relations by Boolean Descriptors: Automatic Feature Extraction from Data Tables

A. Skowron (1), L. Polkowski (2), J. Komorowski (3)

[email protected], [email protected], [email protected]

(1) Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
(2) Institute of Mathematics, Warsaw University of Technology, Pl. Politechniki 1, 00-650 Warsaw, Poland
(3) Knowledge System Group, Norwegian University of Science and Technology, 7034 Trondheim, Norway

Abstract: We present an approach to the problem of how to learn tolerance relations from decision tables. The tolerance relations constructed (synthesized) in the learning process are applied to extract new features which are often better suited than the original features of a given decision table to the task of classifying new objects. These tolerances are synthesized from boolean combinations of descriptors, and they lead to simpler approximate descriptions of decision classes. The method proposed in the paper extracts tolerance relations encoded in the decision table itself, in contrast to approaches known from the literature, which take some a priori fixed tolerance relation and apply it in clustering.

Keywords: tolerance relation, feature extraction, decision rules.

1 Introduction

A great effort has been made to achieve progress in clustering methods (see e.g. [2], [6], [9], [11]) and to develop efficient methods for feature extraction (see e.g. [5], [7], [12]) for different applications. In this paper we propose a method of searching for strong (in a certain sense) decision rules built from new conditional attributes (features) by applying standard methods to them (see e.g. [20], [21], [22], [23]). These new features are synthesized from a tolerance relation extracted from the data encoded in a given decision table; they are defined as characteristic functions of safe clusters constructed from that tolerance relation. The main steps in searching for new features are the following (a sketch of the resulting search loop is given after the list):

(i) A set TOL(A) of decision tables is defined; its elements can be treated as specifications of characteristic functions of the tolerance relation encoded in the decision table A. For any table B from TOL(A) we define a tolerance relation τ_B on some information vectors (attribute-value vectors) from B. These vectors are obtained by applying rough set methods to B (see e.g. [20-23]); any such vector defines a boolean formula, namely the conjunction of all descriptors included in it.

(ii) From any tolerance relation τ_B a set of "safe" cluster formulas is extracted; any cluster formula is represented by a disjunction of formulas built in step (i).

(iii) The characteristic functions of the cluster formulas are taken as new features.

(iv) A new decision table B′ is created from B by taking the characteristic functions from (iii) as conditional attributes and keeping the same decision as in B.

(v) The fitness of the new decision table B′ is measured as the classification quality [18] of the decision rules generated from B′ on test data.

(vi) An optimization strategy is applied to search for the decision table from TOL(A) that is optimal with respect to the quality measure defined in step (v).
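Steps (i)-(vi) form an outer optimization loop. The following minimal Python sketch shows that control flow only; the function parameters (learn_tolerance, extract_clusters, build_table, induce_rules, quality) are hypothetical stand-ins for the constructions detailed in Sections 5 and 6, not part of the paper.

```python
def search_new_features(candidate_tables, learn_tolerance, extract_clusters,
                        build_table, induce_rules, quality):
    """Outer search loop of steps (i)-(vi); all domain steps are passed in."""
    best_q, best_table = float("-inf"), None
    for B in candidate_tables:              # step (i): B ranges over TOL(A)
        tau = learn_tolerance(B)            # tolerance on information vectors
        clusters = extract_clusters(tau)    # step (ii): "safe" cluster formulas
        B_prime = build_table(B, clusters)  # steps (iii)-(iv): new table B'
        q = quality(induce_rules(B_prime))  # step (v): fitness on test data
        if q > best_q:                      # step (vi): keep the best candidate
            best_q, best_table = q, B_prime
    return best_table, best_q
```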

Let us observe that the new decision rules for the decision table A allow for a more compressed (with respect to length) description of decision classes, so (see e.g. [10]) they can offer better classification quality. The reason is that the left-hand sides of our new decision rules are expressed as conjunctions of formulas which are disjunctions of conjunctions of descriptors, whereas when decision rules are generated directly from the original conditional attributes, the left-hand sides are conjunctions of descriptors only. In our ongoing project we are implementing the above strategy. The first tests are promising.

2 Preliminaries: Tolerance Relations

Tolerance relations provide an attractive and general tool for studying indiscernibility phenomena (cf. [14], [19], [24]). The importance of those phenomena was already noticed by Poincaré and Carnap. Studies by, among others, Menger, Zadeh, and Pawlak have led to the emergence of new approaches to indiscernibility.

We call a relation τ ⊆ X × X a tolerance relation on X if τ is reflexive and symmetric. The pair (X, τ) is called a tolerance space. It leads to a metric space with the distance function

$$ d_\tau(x, y) = \min \{ k : \exists\, x_0, \ldots, x_k \; (x_0 = x \wedge x_k = y \wedge x_i \,\tau\, x_{i+1} \ \text{for} \ i = 0, \ldots, k - 1) \}. $$

Sets of the form τ(x) = {y ∈ X : x τ y} are called tolerance sets. We define a τ-threshold p_τ by

$$ p_\tau = \min \{ d_\tau(x, y) : (x, y) \in X \times X - \tau \}. $$
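As an illustration (ours, not the paper's): d_τ is simply the shortest-path distance in the graph whose edges are the tolerance pairs, and p_τ is the minimum of that distance over non-tolerant pairs. A small self-contained Python sketch:

```python
from collections import deque
from itertools import product

def d_tau(x, y, X, tol):
    """Shortest tolerance-chain length from x to y; tol(a, b) is the relation."""
    if x == y:
        return 0
    seen, frontier = {x}, deque([(x, 0)])
    while frontier:
        u, k = frontier.popleft()
        for v in X:
            if v not in seen and tol(u, v):
                if v == y:
                    return k + 1
                seen.add(v)
                frontier.append((v, k + 1))
    return float("inf")  # x and y lie in different connected components

def p_tau(X, tol):
    """Threshold: minimal d_tau over pairs that are NOT tolerant."""
    return min(d_tau(x, y, X, tol)
               for x, y in product(X, repeat=2) if not tol(x, y))

# Example: integers 0..5 with |a - b| <= 1 as the tolerance.
X = range(6)
tol = lambda a, b: abs(a - b) <= 1
print(d_tau(0, 4, X, tol))  # 4: the chain 0-1-2-3-4
print(p_tau(X, tol))        # 2: the closest non-tolerant pairs differ by 2
```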

3 Preliminaries: Rough Set-Theoretic Notions

Information systems (cf. [13]), sometimes called data tables, attribute-value systems, condition-action tables, knowledge representation systems, etc., are used for representing knowledge. Rough sets have been introduced (cf. [13], [14]) as a tool to deal with inexact, uncertain or vague knowledge in artificial intelligence applications. In this section we recall some basic notions related to information systems and rough sets.

An information system is a pair A = (U, A), where U is a non-empty, finite set called the universe and A is a non-empty, finite set of attributes, i.e. mappings a: U → V_a for a ∈ A, where V_a is called the value set of a. Elements of U are called objects and are interpreted as, e.g., cases, states, processes, patients, observations. Attributes are interpreted as features, variables, characteristic conditions, etc.

Every information system A = (U, A) and non-empty set B ⊆ A determine a B-information function Inf_B: U → P(B × ∪_{a∈B} V_a) defined by Inf_B(x) = {(a, a(x)) : a ∈ B}. The set {Inf_A(x) : x ∈ U} is called the A-information set and is denoted by INF(A).

We consider a special case of information systems called decision tables. A decision table (cf. [13], [14]) is any information system of the form A = (U, A ∪ {d}), where d ∉ A is a distinguished attribute called the decision. The elements of A are called conditions. The cardinality of the image d(U) = {k : d(s) = k for some s ∈ U} is denoted by r(d). We assume that the set V_d of values of the decision d is equal to {1, ..., r(d)}. Let us observe that the decision d determines the partition CLASS_A(d) = {X_1, ..., X_{r(d)}} of the universe U, where X_k = {x ∈ U : d(x) = k} for 1 ≤ k ≤ r(d); the set X_k is called the k-th decision class of A.

Let A = (U, A) be an information system. With every subset of attributes B ⊆ A we associate an equivalence relation, denoted by IND_A(B) (or IND(B)) and called the B-indiscernibility relation, defined by IND(B) = {(s, s′) ∈ U² : a(s) = a(s′) for every a ∈ B}.
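A toy illustration (hypothetical data, not from the paper) of B-information vectors and the IND(B)-partition in Python:

```python
from collections import defaultdict

# Toy information system: each object is a dict of attribute values.
U = [
    {"color": "red",  "size": "big"},
    {"color": "red",  "size": "big"},
    {"color": "red",  "size": "small"},
    {"color": "blue", "size": "small"},
]

def inf(x, B):
    """B-information vector of object x: the set {(a, a(x)) : a in B}."""
    return frozenset((a, x[a]) for a in B)

def ind_classes(U, B):
    """Partition of U (by object index) into IND(B)-equivalence classes."""
    classes = defaultdict(list)
    for i, x in enumerate(U):
        classes[inf(x, B)].append(i)
    return list(classes.values())

print(ind_classes(U, ["color"]))          # [[0, 1, 2], [3]]
print(ind_classes(U, ["color", "size"]))  # [[0, 1], [2], [3]]
```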

Objects s, s′ satisfying the relation IND(B) are indiscernible by the attributes from B. Any minimal subset B ⊆ A such that IND(A) = IND(B) is called a reduct of the information system A; the set of all reducts of A is denoted by RED(A).

Let A be an information system with n objects. By M(A) we denote the n × n matrix (c_ij), called the discernibility matrix of A, where c_ij = {a ∈ A : a(x_i) ≠ a(x_j)} for i, j = 1, ..., n. A discernibility function f_A for an information system A is a boolean function of m boolean variables $\bar a_1, \ldots, \bar a_m$ corresponding to the attributes a_1, ..., a_m, respectively, defined by

$$ f_A(\bar a_1, \ldots, \bar a_m) = \bigwedge \{ \bigvee \bar c_{ij} : 1 \le j < i \le n, \ c_{ij} \ne \emptyset \}, $$

where $\bar c_{ij} = \{\bar a : a \in c_{ij}\}$. The set of all prime implicants of f_A determines the set RED(A) of all reducts of A, i.e. $\bar a_{i_1} \wedge \ldots \wedge \bar a_{i_k}$ is a prime implicant of f_A iff {a_{i_1}, ..., a_{i_k}} ∈ RED(A). The problem of finding a minimal (with respect to cardinality) reduct is NP-hard, and in general the number of reducts of a given information system can be exponential in the number of attributes. Nevertheless, existing procedures for reduct computation are efficient in many practical applications, and for more complex cases one can apply efficient heuristics (see e.g. [21]).
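The prime-implicant characterization can be read as a minimal hitting-set computation over the non-empty cells of M(A). A brute-force Python illustration (ours), feasible only for tiny tables; practical systems use the heuristics of [21]:

```python
from itertools import chain, combinations

# Tiny information system: rows are objects, columns are attributes 0..m-1.
rows = [(0, 0, 0), (1, 1, 0), (1, 0, 1)]
m = 3  # number of attributes

def discernibility_matrix(rows):
    """c_ij = set of attribute indices on which objects i and j differ (j < i)."""
    n = len(rows)
    return {(i, j): {a for a in range(m) if rows[i][a] != rows[j][a]}
            for i in range(n) for j in range(i)}

def reducts(rows):
    """All minimal attribute sets hitting every non-empty matrix cell."""
    cells = [c for c in discernibility_matrix(rows).values() if c]
    subsets = chain.from_iterable(
        combinations(range(m), k) for k in range(1, m + 1))
    hitting = [set(s) for s in subsets if all(set(s) & c for c in cells)]
    return [s for s in hitting if not any(t < s for t in hitting)]

print(reducts(rows))  # [{0, 1}, {0, 2}, {1, 2}]: three two-attribute reducts
```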

If A = (U, A) is an information system, B ⊆ A is a set of attributes, and X ⊆ U is a set of objects, then the sets {s ∈ U : [s]_B ⊆ X} and {s ∈ U : [s]_B ∩ X ≠ ∅} are called the B-lower and the B-upper approximation of X in A, and are denoted by $\underline{B}X$ and $\overline{B}X$, respectively. The set $BN_B(X) = \overline{B}X - \underline{B}X$ is called the B-boundary of X; when B = A we also write BN(X) instead of BN_A(X). Sets which are unions of some classes of the indiscernibility relation IND(B) are called definable by B; the set X is B-definable iff $\underline{B}X = \overline{B}X$. Some subsets (categories) of objects in an information system cannot be expressed exactly by employing the available attributes, but they can be defined roughly. The set $\underline{B}X$ is the set of all elements of U which can be classified with certainty as elements of X, given the knowledge represented by the attributes from B; $\overline{B}X$ is the set of elements of U which can possibly be classified as elements of X, employing the knowledge represented by the attributes from B; the set $BN_B(X)$ is the set of elements which can be classified neither into X nor into U − X given the knowledge B.

If X_1, ..., X_{r(d)} are the decision classes of A, then the set $\underline{B}X_1 \cup \ldots \cup \underline{B}X_{r(d)}$ is called the B-positive region of A and is denoted by POS_B(d). If C ⊆ A, then the set POS_B(C) is defined as POS_B(d), where d(x) = (a(x) : a ∈ C) for x ∈ U is an attribute representing the set C of attributes.
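A self-contained Python illustration (toy data, ours) of lower/upper approximations and the positive region; the deliberately inconsistent pair produces a non-empty boundary:

```python
from collections import defaultdict

# Toy decision table: (tuple of condition attribute values, decision value).
U = [(("red", "big"), 1), (("red", "big"), 2),   # an inconsistent pair
     (("red", "small"), 2), (("blue", "small"), 2)]

def classes_by(key):
    groups = defaultdict(set)
    for i, obj in enumerate(U):
        groups[key(obj)].add(i)
    return list(groups.values())

ind = classes_by(lambda o: o[0])               # IND(A) classes on conditions
decision_classes = classes_by(lambda o: o[1])  # CLASS_A(d)

def lower(X):  # union of the IND classes fully contained in X
    return set().union(*(c for c in ind if c <= X))

def upper(X):  # union of the IND classes meeting X
    return set().union(*(c for c in ind if c & X))

for X in decision_classes:
    print(sorted(X), "lower:", sorted(lower(X)), "upper:", sorted(upper(X)))
# [0]       lower: []     upper: [0, 1]
# [1, 2, 3] lower: [2, 3] upper: [0, 1, 2, 3]

pos = set().union(*(lower(X) for X in decision_classes))
print("POS_A(d):", sorted(pos))  # [2, 3]: objects classified with certainty
```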

If A = (U, A ∪ {d}) is a decision table, then we define a function ∂_A: U → P({1, ..., r(d)}), called the generalized decision in A, by

$$ \partial_A(x) = \{ i : \exists\, x' \in U \; (x' \, \mathrm{IND}(A) \, x \ \text{and} \ d(x') = i) \}. $$

A decision table A is called consistent (deterministic) if card(∂_A(x)) = 1 for any x ∈ U; otherwise A is inconsistent (non-deterministic). It is easy to see that a decision table A is consistent iff POS_A(d) = U. Moreover, if ∂_B = ∂_{B′}, then POS_B(d) = POS_{B′}(d) for any non-empty sets B, B′ ⊆ A. A subset B of the set A of attributes of the decision table A = (U, A ∪ {d}) is a relative reduct of A iff B is a minimal set with the property ∂_B = ∂_A. The set of all relative reducts of A is denoted by RED(A, d).

Now we recall the definition of decision rules. Let A = (U, A ∪ {d}) be a decision table and let V = ∪_{a∈A} V_a ∪ V_d. The atomic formulas over B ⊆ A ∪ {d} and V are expressions of the form a = v, called descriptors over B and V, where a ∈ B and v ∈ V_a. The set C(B, V) of formulas over B and V is the least set containing all atomic formulas over B and V and closed with respect to the classical propositional connectives ∨ (disjunction) and ∧ (conjunction).

Let α ∈ C(B, V). By [α]_A we denote the meaning of α in the decision table A, i.e. the set of all objects in U with the property α, defined inductively as follows:

1. if α is of the form a = v then [α]_A = {x ∈ U : a(x) = v};
2. [α ∧ α′]_A = [α]_A ∩ [α′]_A; [α ∨ α′]_A = [α]_A ∪ [α′]_A.

We do not use the negation connective because, in the case of information systems, the complement of any definable set is definable by means of union and intersection. The set C(A, V) is called the set of conditional formulas of A. A decision rule of A is any expression of the form

α ⇒ (d = v), where α ∈ C(A, V) and v ∈ V_d.

The decision rule α ⇒ (d = v) is true in A iff [α]_A ⊆ [d = v]_A; if [α]_A = [d = v]_A then we say that the rule is A-exact. Problems concerning the generation of decision rules are discussed in [21-23].
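The generalized decision and the meaning function [α]_A are straightforward to compute. A Python sketch on toy data; the nested-tuple encoding of formulas is our assumption, not the paper's:

```python
# Toy decision table: each object is a dict; "d" holds the decision value.
U = [{"color": "red", "size": "big", "d": 1},
     {"color": "red", "size": "big", "d": 2},
     {"color": "blue", "size": "small", "d": 2}]

def generalized_decision(x, U, conds):
    """All decision values seen on objects IND(A)-equivalent to x."""
    key = tuple(x[a] for a in conds)
    return {y["d"] for y in U if tuple(y[a] for a in conds) == key}

def meaning(formula, U):
    """[alpha]_A for alpha as nested tuples: ('=',a,v), ('and',f,g), ('or',f,g)."""
    op = formula[0]
    if op == "=":
        _, a, v = formula
        return {i for i, y in enumerate(U) if y[a] == v}
    f, g = meaning(formula[1], U), meaning(formula[2], U)
    return f & g if op == "and" else f | g

conds = ["color", "size"]
print(generalized_decision(U[0], U, conds))  # {1, 2}: the table is inconsistent
lhs = ("and", ("=", "color", "blue"), ("=", "size", "small"))
print(meaning(lhs, U) <= meaning(("=", "d", 2), U))  # True: the rule is true in A
```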

4 The Language of Boolean Combinations of Descriptors (LBCD)

The language of boolean combinations of descriptors (LBCD, for short) is built from descriptors by taking their boolean combinations. Let us observe that we can eliminate negation from the set of boolean connectives, as it can be replaced by disjunction:

$$ \neg(a = v) \Leftrightarrow \bigvee \{ (a = v') : v' \in V_a, \ v' \ne v \}. $$

Any formula α of LBCD defines a function f(α) which is the characteristic function of the set of objects satisfying α. We apply the language LBCD to the task of expressing new features of objects, defined from the original descriptors but possibly better suited to classifying objects: any formula (word) of LBCD is interpreted as a feature of objects. The search space of functions defined by all formulas of the full language LBCD is too complex, hence we restrict ourselves to the first two levels of LBCD: the first level consists of conjunctions of descriptors, and the second level is formed by disjunctions of formulas of the first level. We recall that formulas of the first level (i.e. conjunctions of descriptors) are instrumental in the definitions of decision rules. For a given decision class corresponding to (d = v), the disjunctions of the left-hand sides of decision rules with the fixed right-hand side (d = v) either describe the decision class (d = v) exactly (in the deterministic case) or describe (cover) this decision class approximately (cf. [20-23]). Experiments (cf. [4], [20], [23]) show that the classifiers obtained in this way are not always adequate to the task of classifying new objects. One can thus pose the question of how to distinguish among decision rules, i.e. how to select rules that are "better" (or "stronger") than the others. One can mention some criteria for such a choice, e.g. the number of supporting examples in the training set (cf. [4]), or/and the degree to which the set of objects satisfying α is contained in the decision class (cf. [25]), or/and the degree to which the set of objects satisfying α covers the decision class (cf. [25]).

Let us stress that in this approach one considers coverings of decision classes obtained by taking some conjunctions of descriptors (the left-hand sides of decision rules) as the basic elements of the covering; any basic element can be interpreted as a "safe region": any object satisfying it can be classified into the class corresponding to that basic element. The following question naturally comes up in this context: how can one define possibly larger safe regions starting from conjunctions of descriptors? The idea applied here consists in taking appropriate set-theoretic unions of the meanings of conjunctions of descriptors; given such a union, say C = [α_1] ∪ [α_2] ∪ ... ∪ [α_k], we take its characteristic function f_C as a new feature (a conditional attribute). Our goal now reduces to the construction, for a given decision table, of a set {C_1, C_2, ..., C_k} of such unions with the property that the new set {f_{C_1}, f_{C_2}, ..., f_{C_k}} of conditional attributes has better classification quality on test samples of new objects than the previous set of attributes. In the above discussion, a formula α was taken to be the left-hand side of a decision rule; however, one can instead take some of its subformulas. This is the approach we adopt here. We begin with a technique for constructing a tolerance relation on information (attribute-value) vectors. This tolerance relation allows for building collections of objects defined by tolerant information vectors. These collections should satisfy the condition that they lie within their corresponding decision classes and, moreover, that their distance from the other classes is sufficiently large. This informal description will be rendered in a precise technical way in the next section.
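Before moving on, here is a sketch of how the first two LBCD levels can be encoded (our encoding, not the paper's): a level-one formula as a set of descriptors read conjunctively, a level-two formula as a list of such sets read disjunctively, and f_C as the resulting characteristic function:

```python
# Level 1: a frozenset of descriptors (a, v), read as their conjunction.
# Level 2: a list of such conjunctions, read as their disjunction.
def satisfies_conjunction(x, conj):
    return all(x.get(a) == v for a, v in conj)

def feature(disjunction):
    """Characteristic function f_C of C = union of the conjunctions' meanings."""
    return lambda x: int(any(satisfies_conjunction(x, c) for c in disjunction))

C = [frozenset({("color", "red"), ("size", "big")}),
     frozenset({("color", "blue")})]
f_C = feature(C)
print(f_C({"color": "red", "size": "big"}))    # 1: matches the first conjunct
print(f_C({"color": "red", "size": "small"}))  # 0: matches neither conjunct
```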

5 Feature Extraction from Tolerances

Consider a decision system A = (U, A ∪ {d}) along with a given tolerance relation τ on the set INF(A) = ∪{INF(B) : B ⊆ A}. For a given u ∈ INF(A) and a natural number p > 0, where [u]_A ⊆ X_i (the i-th decision class), we will define a set C such that

(i) [u]_A ⊆ C ⊆ X_i;
(ii) dist_τ(X_j, C) ≥ p for each j ≠ i, where dist_τ denotes the distance between sets induced by d_τ.

To this end we define a sequence (u_n)_n in INF(A). There are quite a few strategies for selecting this sequence; we describe three of them.

Let u_0 = u; assuming u_1, u_2, ..., u_k are already defined, we select u_{k+1} in case there exists v ∈ INF(A) such that

(iii) u_k τ v;
(iv) the union C = [u_0] ∪ [u_1] ∪ ... ∪ [u_k] ∪ [v] satisfies dist_τ(X_j, C) ≥ p for each j ≠ i.

We take as u_{k+1} any v satisfying (iii) and (iv). The procedure halts after, say, m steps, when no v satisfying (iii) and (iv) remains. We obtain a sequence u_0, u_1, u_2, ..., u_m; the set C = [u_0] ∪ [u_1] ∪ ... ∪ [u_m] is as required.

The second procedure consists in replacing condition (iii) by

(iii)′ there exists l = 0, 1, ..., k with the property that u_l τ v.

The last procedure, which is the most time-consuming, consists in defining at each step u_1, u_2, ..., u_k the set T = {v ∈ INF(A) : there exists l = 0, 1, ..., k with the property that u_l τ v} and selecting a maximal subset T_0 ⊆ T with the property that

(ii)′ dist_τ(X_j, ∪{[∧v]_A : v ∈ T_0}) ≥ p for each j ≠ i.

Any of these procedures yields a set of information vectors S_τ(u) and the set

$$ C_\tau(u) = \bigcup \{ [\wedge v]_A : v \in S_\tau(u) \}. $$

The corresponding property of objects is given by the formula ∨{∧v : v ∈ S_τ(u)}; it determines the characteristic function of a new feature. The threshold distance value p can be taken to be the characteristic parameter p_τ, the minimal distance which secures separation of the tolerance approximations of distinct decision classes. We now proceed to the algorithms for determining a tolerance τ; these tolerances will be determined from the given decision table by the learning strategy presented in the next section.
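Before that, a Python sketch of the first (greedy chain-growing) procedure above; tol, meaning and dist are hypothetical stand-ins for the learned tolerance, the meaning map u ↦ [u]_A, and the set distance dist_τ:

```python
# Sketch of the first cluster-growing strategy (hypothetical helper names).
# tol(u, v): tolerance on information vectors; meaning(u): object ids in [u]_A;
# dist(Xj, C): tolerance distance between a decision class and the cluster.
def grow_cluster(u, vectors, decision_classes, i, p, tol, meaning, dist):
    """Grow C = [u0] ∪ ... ∪ [uk] while every other class stays >= p away."""
    chain, C = [u], set(meaning(u))
    while True:
        candidate = next(
            (v for v in vectors
             if v not in chain and tol(chain[-1], v)          # condition (iii)
             and all(dist(Xj, C | set(meaning(v))) >= p        # condition (iv)
                     for j, Xj in enumerate(decision_classes) if j != i)),
            None)
        if candidate is None:   # no v satisfies (iii) and (iv): halt
            return chain, C     # chain = S_tau(u), C = C_tau(u)
        chain.append(candidate)
        C |= set(meaning(candidate))
```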

6 Learning Tolerance

A tolerance relation leads, as outlined above, to a clustering of objects into distinct aggregates (clusters). We adopt an analytic approach in the sense that the tolerance relations proper for discerning among decision classes should be extracted from the given decision system by means of the information provided by the system itself. We now give the details of our approach.

Consider a decision system (decision table) A = (U, A ∪ {d}). The decision system A induces the family TOL(A) of decision tables; any decision table B in TOL(A) provides us with a description of a tolerance relation τ(B) on the set INF(A). The relation τ(B) is an approximation to a tolerance relation on the objects in the universe U of A.

Any decision table B in TOL(A) is of the form B = (U′, A′ ∪ {D}) where:

(i) U′ ⊆ U × U is a non-empty set of pairs;
(ii) A′ = A_# ∪ A_$ where A_# = A × {#}, A_$ = {$} × A, (a, #)(x, y) = a(x) and ($, a)(x, y) = a(y) for any (x, y) ∈ U′;
(iii) D(x, y) = 0 if d(x) ≠ d(y).

Thus, the conditional attributes of B act coordinate-wise, i.e. the description of any pair (x, y) is obtained by concatenating the description of x and then that of y, as provided by the attributes in A. Keeping in mind that our ultimate set of attributes consists of the characteristic functions f_C of tolerance clusters of τ(B), we impose condition (iii), which secures that objects from distinct decision classes are not τ(B)-tolerant. On the other hand, the decision whether to assign the value 1 or 0 to D(x, y) in the case when d(x) = d(y) can be regarded as a parameter to be learned. (A sketch of the pair-table construction is given at the end of this section.)

The method to determine the tolerance τ(B) from the decision table B is as follows.

Step 1. Apply the standard (cf. [20-23]) methods for determining the decision rules for the decision table B. This step yields decision rules of the form

$$ \wedge u \ \wedge \ \wedge v \Rightarrow (D = i), \quad \text{where } i = 0, 1, \ u \in \mathrm{INF}(A_\#) \text{ and } v \in \mathrm{INF}(A_\$). $$

Step 2. Select a set of decision rules of the form ∧u ∧ ∧v ⇒ (D = 1) obtained in Step 1.

Step 3. Define the set Pre-τ(B) = {(u, v) : ∧u ∧ ∧v ⇒ (D = 1) is selected in Step 2} and let τ(B) be the reflexive and symmetric closure of the relation Pre-τ(B).

The set selected in Step 2 can be regarded as a parameter of the algorithm; one can, for instance, select a set of decision rules which forms a covering of the set of objects.

The above decomposition algorithm yields a family {τ(B) : B ∈ TOL(A)} of tolerance relations. Each of these relations, say τ(B), yields in turn a set of new features F(τ(B)) for object classification, and can thus be judged by the quality of classification (cf. [18]) provided by the set of decision rules generated from the features in F(τ(B)). The quality of classification can be regarded as a cost function (fitness function) of a tolerance relation, and one can search for optimal or suboptimal tolerance relation(s) for a given decision system A. We are now implementing several strategies of searching for optimal tolerance relations; a brief description of some of them follows.

1. Dynamic random sampling. By analogy with dynamic reducts (cf. [4], [21]), we consider a set R of random samples from the set TOL(A) containing a given object x. For each sample B, we apply our method to find the set (cluster) S_{τ(B)}(u), where u is the information vector of x. Next, we select from the family {S_{τ(B)}(u) : B ∈ R} the most frequent information vectors. These vectors are regarded as the most stable, and the clusters they provide are expected to produce reliable features and, a fortiori, reliable classification rules.

2. Simulated annealing. The two parameters of our procedure, viz. the assignment of 0 or 1 as the value of D(x, y) (when d(x) = d(y)) and the choice of the set of decision rules in Step 2, can be tuned in order to proceed from a given tolerance towards a possibly better one. The search can be based on simulated annealing (cf. [1]).

3. Genetic strategies. Given a population R of samples from the set TOL(A), one can define genetic operators (cf. [8]) on R by performing crossover on tables from R (exchanging sets of rows between a pair of tables) and defining mutation in a table as the process of changing the assignment of 1 as a value of D(x, y) in the table or/and removing/adding row(s) from/to the table. The fitness of a chromosome (a sample decision table) is defined as the quality of classification on test samples of objects by the rules derived from this table.

From the point of view of time complexity, the first method is the most efficient. Preliminary experiments show that the proposed methods offer promising tools for the automated search for new features.
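As promised above, a minimal Python sketch (illustrative names, not the authors' implementation) of the pair-table construction: condition (iii) forces D = 0 across distinct decision classes, while the label for same-decision pairs is the learnable parameter:

```python
from itertools import combinations

def build_pair_table(U, conds, decision, label_same):
    """Rows of B: concatenated descriptions of x and y, plus the decision D.
    label_same(x, y) in {0, 1} is the learnable assignment when d(x) = d(y);
    D is forced to 0 when the decisions differ (condition (iii))."""
    rows = []
    for x, y in combinations(U, 2):            # a sample of U x U without repeats
        left = [(a, "#", x[a]) for a in conds]     # attributes (a, #) act on x
        right = [("$", a, y[a]) for a in conds]    # attributes ($, a) act on y
        D = 0 if x[decision] != y[decision] else label_same(x, y)
        rows.append((left + right, D))
    return rows

U = [{"color": "red", "size": "big", "d": 1},
     {"color": "red", "size": "small", "d": 1},
     {"color": "blue", "size": "small", "d": 2}]
B = build_pair_table(U, ["color", "size"], "d", lambda x, y: 1)
for row, D in B:
    print(D, row)  # D = 1 only for the same-decision pair (objects 0 and 1)
```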

Conclusions

We have proposed a technique for searching for new features and, a fortiori, for new classifiers of new objects. Searching for tolerance relations can also be perceived as searching for standards, understood intuitively as centers of tolerance clusters. These standards play a fundamental role in approximate reasoning about complex objects carried out in a distributed environment and based on rough mereological techniques and ideas (cf. [15-17]).

This work was supported by grant 8T11 C01 011 from the State Committee for Scientific Research.

References

[1] Aarts, E. and Korst, J. (1989). Simulated Annealing and Boltzmann Machines. Wiley, New York.

[2] Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New York.

[3] Brown, E.M. (1980). Boolean Reasoning. Kluwer, Dordrecht.

[4] Bazan, J., Skowron, A. and Synak, P. (1994). Dynamic reducts as a tool for extracting laws from decision tables. In: Ras, Z.W. and Zemankova, M. (eds.), Proc. of the Symposium on Methodologies for Intelligent Systems, Charlotte, NC, October 16-19, 1994, Lecture Notes in Artificial Intelligence Vol. 869, Springer-Verlag, Berlin, 346-355.

[5] Bazan, J., Nguyen, Son H., Nguyen, Trung T., Skowron, A. and Stepaniuk, J. (1995). Application of modal logics and rough sets for classifying objects. In: De Glas, M. and Pawlak, Z. (eds.), Proc. Second World Conference on Fundamentals of Artificial Intelligence, Angkor, Paris, 15-26.

[6] Dubes, R.C. (1993). Cluster analysis and related issues. In: Chen, C.H., Pau, L.F. and Wang, P.S.P. (eds.), Handbook of Pattern Recognition and Computer Vision, 3-32.

[7] Fawcett, T.E. and Utgoff, P.E. (1992). Automatic feature generation for problem solving systems. In: Sleeman, D. (ed.), Proc. of the Ninth International Workshop on Machine Learning (ML92), Morgan Kaufmann, 144-153.

[8] Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.

[9] Michalski, R. and Stepp, R. (1985). Learning from observation: conceptual clustering. In: Michalski, R., Carbonell, J.G. and Mitchell, T.M. (eds.), Machine Learning, Vol. I, Tioga/Morgan Kaufmann, Los Altos, CA.

[10] Li, Ming and Vitanyi, P. (1993). An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York.

[11] Nadler, M. and Smith, E.P. (1993). Pattern Recognition Engineering. Wiley, New York.

[12] Nguyen, Son H. and Skowron, A. (1995). Quantization of real value attributes. In: Proc. Second Joint Annual Conference on Information Sciences, Wrightsville Beach, North Carolina, USA, September 28 - October 1, 1995, 34-37.

[13] Pawlak, Z. (1991). Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Dordrecht.

[14] Pawlak, Z. (1994). Vagueness and Uncertainty: A Rough Set Perspective. ICS Research Report 19/94, Warsaw University of Technology.

[15] Polkowski, L. and Skowron, A. (1995). Rough mereology and analytical morphology: new developments in rough set theory. In: De Glas, M. and Pawlak, Z. (eds.), Proc. Second World Conference on Fundamentals of Artificial Intelligence, Angkor, Paris, 343-354.

[16] Polkowski, L. and Skowron, A. (1996). Rough mereological approach to knowledge-based distributed AI. In: Lee, J.K., Liebowitz, J. and Chae, J.M. (eds.), Critical Technology: Proc. Third World Congress on Expert Systems, February 5-9, Seoul, Korea, Cognizant Communication Corporation, New York, 774-781.

[17] Polkowski, L. and Skowron, A. (1996). Rough mereology: a new paradigm for approximate reasoning. To appear in: Journal of Approximate Reasoning.

[18] Shavlik, J.W. and Dietterich, T. (1990). Readings in Machine Learning. Morgan Kaufmann.

[19] Skowron, A. and Stepaniuk, J. (1994). Generalized approximation spaces. To appear in: Fundamenta Informaticae; see also: ICS Research Report 41/94, Warsaw University of Technology.

[20] Skowron, A. (1995). Extracting laws from decision tables. Computational Intelligence, 11(2), 371-388.

[21] Skowron, A. (1995). Synthesis of adaptive decision systems from experimental data. In: Aamodt, A. and Komorowski, J. (eds.), Proc. of the Fifth Scandinavian Conference on Artificial Intelligence SCAI-95, IOS Press, Amsterdam, 220-238.

[22] Skowron, A. and Polkowski, L. (1995). Synthesis of decision systems from data tables. To appear in the book edited by T.Y. Lin, Kluwer.

[23] Skowron, A. and Polkowski, L. (1996). Decision algorithms: a survey of rough set theoretic methods. To appear in: Fundamenta Informaticae.

[24] Slowinski, R. and Stefanowski, J. (1994). Rough classification with valued closeness relation. In: Diday, E. et al. (eds.), New Approaches in Classification and Data Analysis, Springer-Verlag, Berlin, 482-488.

[25] Tsumoto, S. and Tanaka, H. (1996). Automated induction of medical expert system rules from clinical databases based on rough sets. In: Proceedings EUFIT-96, to appear.