International Journal of Approximate Reasoning xxx (2003) xxx-xxx

www.elsevier.com/locate/ijar

Hyperrelations in version space

Hui Wang a,*, Ivo Düntsch b, Günther Gediga c, Andrzej Skowron d

a School of Information and Software Engineering, Faculty of Informatics, University of Ulster, Newtownabbey, Co. Antrim BT37 0QB, Ireland
b Department of Computer Science, Brock University, St. Catharines, Ont., Canada L2S 3A1
c Institut für Evaluation und Marktanalysen, Brinkstr. 19, 49143 Jeggen, Germany
d Institute of Mathematics, University of Warsaw, 02-097 Warszawa, Poland

Received 1 October 2002; accepted 1 October 2003

Abstract


A version space is the set of all hypotheses consistent with a given set of training examples, delimited by the specific boundary and the general boundary. In existing studies [Machine Learning 17(1) (1994) 5; Proc. 5th IJCAI (1977) 305; Artificial Intelligence 18 (1982)] a hypothesis is a conjunction of attribute-value pairs, which has been shown to have limited expressive power [Machine Learning, The McGraw-Hill Companies, Inc. (1997)]. In a more expressive hypothesis space, e.g., disjunctions of conjunctions of attribute-value pairs, a general version space becomes uninteresting unless some restriction (inductive bias) is imposed [Machine Learning, The McGraw-Hill Companies, Inc. (1997)]. In this paper we investigate the version space in a hypothesis space where a hypothesis is a hyperrelation, which is in effect a disjunction of conjunctions of disjunctions of attribute-value pairs. Such a hypothesis space is more expressive than both the conjunctions of attribute-value pairs and the disjunctions of conjunctions of attribute-value pairs. However, given a dataset, we focus our attention only on those hypotheses which are consistent with the given data and are maximal, in the sense that the elements in a hypothesis cannot be merged further. Such a hypothesis is called an E-set for the given data, and the set of all E-sets is the version space, which is delimited by the least E-set (the specific boundary) and the greatest E-set (the general boundary).


* Corresponding author. Tel.: +44-1232-368981; fax: +44-1232-366068. E-mail addresses: [email protected] (H. Wang), [email protected] (I. Düntsch), [email protected] (G. Gediga), [email protected] (A. Skowron).
0888-613X/$ - see front matter © 2003 Published by Elsevier Inc. doi:10.1016/j.ijar.2003.10.007


Based on this version space we propose three classification rules for use in different situations. The first two are based on E-sets, and the third is based on "degraded" E-sets, called weak hypotheses, where the maximality constraint is relaxed. We present an algorithm to calculate E-sets, though it is computationally expensive in the worst case, as well as an efficient algorithm to calculate weak hypotheses. The third rule is evaluated using public datasets, and the results compare well with those of the C5.0 decision tree classifier.
© 2003 Published by Elsevier Inc.


1. Introduction


Version space is a useful and powerful concept in the area of concept learning: a version space is the set of all hypotheses in a hypothesis space consistent with a dataset. A version space is delimited by the specific boundary and the general boundary, the sets of most specific and most general hypotheses respectively, to be explained below. The ELIMINATE-CANDIDATE algorithm [6,8] is the flagship algorithm used to construct a version space from a dataset. The hypothesis space used in this algorithm is the conjunction of attribute-value pairs, and it has been shown to have limited expressive power [9]. It was also shown by [4] that the size of the general boundary can grow exponentially in the number of training examples, even when the hypothesis space consists of simple conjunctions of attribute-value pairs.

Table 3 was used in [9] to show the limitation of Mitchell's representation: the ELIMINATE-CANDIDATE algorithm cannot come up with a consistent specific boundary or general boundary for this example, due to the chosen hypothesis space. A possible solution, increasing the expressive power of the representation, is to extend the hypothesis space from the restrictive conjunctions of attribute-value pairs to disjunctions of conjunctions of attribute-value pairs. But it turns out that, without a proper inductive bias, the specific boundary would always be overly specific (the disjunction of the observed positive examples) and the general boundary would always be overly general (the negated disjunction of the observed negative examples) [9], so they carry no useful information; they are "uninteresting". The desired inductive bias, however, has not been explored.

In this paper we report a study of version space which is aimed at increasing the expressive power of the hypothesis space while, at the same time, keeping the hypotheses "interesting" through the introduction of an inductive bias. We explore a semilattice structure that exists in the set of all hypertuples of a domain, where hypertuples generalise traditional tuples from value-based to set-based. This semilattice structure can be used as a base for a hypothesis space. We take a hypothesis to be a hyperrelation, i.e., a set of hypertuples. A hyperrelation can



be interpreted as a disjunction of hypertuples, which can be further interpreted as a conjunction of disjunctions of attribute-value pairs. Such a hypothesis space is much more expressive than the conjunctions of attribute-value pairs and the disjunctions of conjunctions of attribute-value pairs.

For a dataset there is a large number of hypertuples which are consistent with the data, some of which can be merged (through the semilattice operation) to form a different consistent hypertuple. We propose to focus on those hypertuples which are consistent with the given data and cannot be merged further; they are said to be maximal. A hypothesis is then a set of these hypertuples, and the version space is the set of all such hypertuples. Clearly this version space is a subset of the semilattice. An algorithm is presented which is able to construct the version space. A detailed analysis of Mitchell's version space shows that our version space is indeed a generalisation of Mitchell's, and that it is more expressive than Mitchell's.

Classification within our version space is not trivial. We explore ways in which data can be classified using different hypotheses in the version space. We also examine how this whole process can be sped up for practical benefit. Experimental results are presented to show the effectiveness of this approach.

Most of the theory (and results) of version space can be expressed with the tools offered by Boolean reasoning as investigated, for example, in [10-13,15]. Unlike some of these publications, we will only be concerned with consistent decision systems and exact classification rules.


2. Definitions and notation


For a set $U$, we let $2^U$ be the powerset of $U$. If $\le$ is a partial ordering on $U$ and $X \subseteq U$, we let

$\uparrow X = \{y \in U : (\exists x \in X)\ x \le y\}$,
$\downarrow X = \{y \in U : (\exists x \in X)\ y \le x\}$.

If $X, Y \subseteq U$, we say that $Y$ covers $X$, written $X \preceq Y$, if $X \subseteq \downarrow Y$, i.e. if for each $x \in X$ there is some $y \in Y$ with $x \le y$. We call $X$ dense for $Y$, written $X \trianglelefteq Y$, if for any $y \in Y$ there is some $x \in X$ such that $x \le y$, i.e. $Y \subseteq \uparrow X$. Note that $Y$ covers $X$ iff $Y$ is dually dense for $X$.

A semilattice is an algebra $\langle A, + \rangle$ such that $+$ is a binary operation on $A$ satisfying

$x + y = y + x$,
$(x + y) + z = x + (y + z)$,
$x + x = x$.


If $X \subseteq A$, we denote by $[X]$ the sub-semilattice of $A$ generated by $X$. With some abuse of notation, we also use $+$ for the complex sum, i.e. $X + Y = \{x + y : x \in X,\ y \in Y\}$. Each semilattice has an intrinsic order defined by

$x \le y \iff x + y = y$. (2.1)


$x$ is called minimal (maximal) in $X \subseteq U$ if $y \le x$ ($x \le y$) implies $x = y$. The set of all minimal (maximal) elements of $X$ is denoted by $\min X$ ($\max X$).

A decision system is a tuple $I = \langle U, \Omega, \{V_a : a \in \Omega\}, d, V_d \rangle$, where
1. $U = \{x_0, \ldots, x_N\}$ is a nonempty finite set.
2. $\Omega = \{a_0, \ldots, a_T\}$ is a nonempty finite set of mappings $a_i : U \to V_{a_i}$.
3. $d : U \to V_d$ is a mapping.

Set $V = \prod_{a \in \Omega} V_a$. We let $I : U \to V$ be the mapping

$x \mapsto \langle a_0(x), a_1(x), \ldots, a_T(x) \rangle$ (2.2)


which assigns to each object $x$ its feature vector. $V$ is called the data space, and the collection $D = \{I(x) : x \in U\}$ is called the training space. In the sequel $V$ is assumed finite (consequently $D$ is finite) unless otherwise stated. The mapping $d$ defines a labelling (or partition) of $D$ into classes $D_0, D_1, \ldots, D_K$, where $I(x)$ and $I(y)$ are in the same class $D_i$ if and only if $d(x) = d(y)$. If $a \in \Omega$ and $v \in V_a$, then $\langle a, v \rangle$ is called a descriptor. We call

$L = \prod_{a \in \Omega} 2^{V_a}$ (2.3)

the extended data space, its elements hypertuples, and its subsets hyperrelations. In contrast we call the elements of $V$ simple tuples and its subsets simple relations. If $s \in L$ and $a \in \Omega$, we let $s(a)$ be the projection of $s$ with respect to $a$, i.e. the set appearing in the $a$th component. $L$ is a lattice under the following natural order


$s \le t \iff s(a) \subseteq t(a)$ for all $a \in \Omega$,

with the least upper bound ($+$) and greatest lower bound ($\cdot$) operations given by


$(t + s)(a) = t(a) \cup s(a)$,
$(t \cdot s)(a) = t(a) \cap s(a)$

for all $a \in \Omega$. $s$ and $t$ are said to be overlapping if there is some simple tuple $d$ such that $d \le t \cdot s$. By identifying a singleton with the element it contains, we may suppose that $V \subseteq L$; in particular, we have $D \subseteq L$.
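To make these operations concrete, here is a minimal Python sketch (our own illustration, not the paper's notation) representing a hypertuple as a tuple of frozensets, one set per attribute. The helper names (h_sum, h_meet, h_le, simple, overlapping) are assumptions of this sketch and are reused in the later sketches.

```python
# A hypertuple is modelled as a tuple of frozensets, one set per attribute.

def h_sum(s, t):
    """Least upper bound +: (t + s)(a) = t(a) ∪ s(a), componentwise."""
    return tuple(a | b for a, b in zip(s, t))

def h_meet(s, t):
    """Greatest lower bound: (t · s)(a) = t(a) ∩ s(a), componentwise."""
    return tuple(a & b for a, b in zip(s, t))

def h_le(s, t):
    """Natural order: s <= t iff s(a) is a subset of t(a) for every a."""
    return all(a <= b for a, b in zip(s, t))

def simple(values):
    """Embed a simple tuple into L by wrapping each value as a singleton."""
    return tuple(frozenset({v}) for v in values)

def overlapping(s, t):
    """s and t overlap iff some simple tuple lies below s · t,
    i.e. iff no component of the meet is empty."""
    return all(h_meet(s, t))

# The two hypertuples of Table 1(b) below:
h1 = (frozenset({'a', 'b'}), frozenset({0, 1}))
h2 = (frozenset({'b'}), frozenset({2}))
assert h_le(simple(['a', 0]), h1)   # the simple tuple (a, 0) lies below h1
assert not overlapping(h1, h2)      # their meet has an empty component
```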


Table 1

(a) A set of simple tuples

a1   a2   y
a    0    a
a    1    a
b    1    a
b    2    b

(b) A set of hypertuples

a1      a2      y
{a,b}   {0,1}   a
{b}     {2}     b

Table 1(a) shows a simple relation (i.e., a set of simple tuples) and Table 1(b) a hyperrelation. For unexplained notation and background reading in lattice theory, we invite the reader to consult [3].


3. Hypertuples and hyperrelations

Suppose that $D_q$ is a labelled class. The idea of [2,19] was to collect, if possible, across the elements of $D_q$ the values of the same attributes without destroying the labelling information. For example, if the system generates the rules


If x is blue, then sell;
if x is green, then sell;

we can identify the values "blue" and "green" and collect these into the single rule:

If x is blue or green, then sell.

Our aim is to maximise the number of attribute values on the left hand side of the rule, in order to increase the generality of the rule and also to increase its base so as to guard against random influences [1]. This leads to the following definition [18]: we call an element $r \in L$ equilabelled with respect to the class $D_q$ if

$\downarrow r \cap D \ne \emptyset$ (3.1)

and

$\downarrow r \cap D \subseteq D_q$. (3.2)

These two conditions express two fundamental principles in machine learning: (3.1) says that $r$ is supported by some $d \in D$; it is a minimality condition in the sense that we do not want to consider hypertuples that have nothing to do with the training set. Condition (3.2) tells us that $r$ needs to be consistent with the training set: all elements below $r$ which are in the training set $D$ are in the unique class $D_q$. We denote the set of all elements equilabelled with respect to $D_q$ by $E_q$, and let $E = \bigcup_{q \le K} E_q$ be the set of all equilabelled elements. Note that $D \subseteq E$, and that

$q \ne r$ implies $E_q \cap E_r = \emptyset$ for $q, r \le K$, (3.3)

since $\{D_0, \ldots, D_K\}$ partitions $D$. Furthermore,

$\uparrow E_q \cap E = E_q$. (3.4)

If $D \subseteq P \subseteq L$, we let $E(P) = \{t \in L : t \text{ is maximal in } [P] \cap E\}$, and set $\mathrm{VSp} = \bigcup\{E(P) : D \subseteq P \subseteq V\}$. For each $q \le K$, let $E_q(P) = E(P) \cap E_q$. Each set of the form $E(P)$ is called an E-set or a hypothesis. We now have

Lemma 3.1. Let $D \subseteq P \subseteq Q \subseteq L$. Then,


1. $E(P) \preceq E(Q)$.
2. $E(P) \trianglelefteq E(Q)$.

Proof. Both claims follow immediately from the fact that $\emptyset \ne [P] \cap E \subseteq [Q] \cap E$, and that $[Q] \cap E$ is finite. □

Theorem 3.2. $\min \mathrm{VSp} = E(D)$, $\max \mathrm{VSp} = E(V)$.

Proof. We only prove the first part; the second part can be proved similarly.
"$\subseteq$": Suppose that $t$ is minimal in $\mathrm{VSp}$. Then there is some $D \subseteq P \subseteq V$ such that $t \in E(P)$. Since $E(D) \trianglelefteq E(P)$ by Lemma 3.1, there is some $s \in E(D)$ such that $s \le t$, and the minimality of $t$ implies $s = t$.
"$\supseteq$": If $t \in E(D)$ and $s \le t$ is minimal in $\mathrm{VSp}$, then $s \in E(D)$ as well. Since the elements of $E(D)$ are pairwise incomparable, we have $s = t$. □


In the sequel we will denote $E(D)$ by $S$, and $E(V)$ by $G$, since they correspond, respectively, to the specific and general boundaries of $D$ in the sense of [7]; we also set $S_q = E_q(D)$ and $G_q = E_q(V)$. By Lemma 3.1, $S$ is the least E-set and $G$ is the greatest E-set. An algorithm to find $E(D)$, called the lattice machine (LM), has been presented in [18], and it can easily be generalised to find $E(P)$ for each $D \subseteq P \subseteq V$:


Algorithm (LM algorithm).
1. $H_0 := P$.
2. $H_{k+1} :=$ the set of maximal elements of $\downarrow(H_k + P) \cap E$.
3. Continue until $H_n = H_{n+1}$ for some $n$.


It was shown in [19] that $H_n = E(P)$. If $|P| = m$, then the complexity of the algorithm is $O(2^m)$ in the worst case.
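The following Python sketch (our own, reusing h_sum and h_le from the earlier sketch) implements the equilabelledness test (3.1)-(3.2) and a simplified LM iteration. One simplifying assumption: instead of intersecting the full down-set of $H_k + P$ with $E$, we only examine the complex sums themselves, which suffices for small examples such as Table 1.

```python
from itertools import product

def equilabelled(r, labelled_data):
    """r is equilabelled iff it covers at least one training tuple (3.1)
    and all covered training tuples carry the same label (3.2).
    labelled_data is a list of (simple_tuple, label) pairs."""
    labels = {q for d, q in labelled_data if h_le(d, r)}
    return len(labels) == 1

def maximal(elems):
    """Keep only elements that are not strictly below another element."""
    return [s for s in elems
            if not any(s != t and h_le(s, t) for t in elems)]

def lm(P, labelled_data):
    """Simplified LM iteration: repeatedly form the complex sum H_k + P,
    keep the equilabelled sums, take maximal elements, stop at a fixed point."""
    H = [r for r in P if equilabelled(r, labelled_data)]
    while True:
        candidates = {h_sum(h, p) for h, p in product(H, P)} | set(H)
        nxt = maximal([r for r in candidates
                       if equilabelled(r, labelled_data)])
        if set(nxt) == set(H):
            return H
        H = nxt

# E(D) for Table 1(a): recovers the two hypertuples of Table 1(b).
D = [(simple(v), y) for v, y in
     [(('a', 0), 'a'), (('a', 1), 'a'), (('b', 1), 'a'), (('b', 2), 'b')]]
print(lm([t for t, _ in D], D))
```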

To illustrate the above notions we consider the following example.

Example. A dataset is shown in Fig. 1. The small circles and crosses are simple tuples of different classes, while the rectangles are hypertuples. Every hypertuple depicted here is equilabelled, since it covers only simple tuples of the same class. All the simple tuples are also equilabelled. Each hypertuple here is maximal, since we cannot move its boundary in any dimension while keeping it equilabelled. The set of all hypertuples in the figure is an E-set, or a hypothesis, for the underlying concept implied by the dataset, since all hypertuples are equilabelled and maximal, and together they cover all the simple tuples. In fact this E-set is $E(D)$, the specific boundary. The general boundary is not depicted here, as we would need to know the domain of each attribute to calculate it. Clearly the E-set does not cover the whole data space; in other words, there are regions that are not covered by the E-set. Primary elements are those that are covered explicitly by some equilabelled hypertuple; secondary elements are those not covered explicitly, but which can be covered by extending some equilabelled hypertuple without compromising its equilabelledness. We can easily identify primary and secondary elements in this figure.



Our next aim is to show that every simple tuple is covered by some equilabelled element:


Lemma 3.3. For each $t \in V$ there is some $h \in E$ such that $t \le h$.

Fig. 1. A dataset and its specific boundary. Each circle or cross is a simple tuple, and each rectangle represents a hypertuple.


Proof. We show that there is some $x \in D$ such that $\downarrow(t + x) \cap D = \{x\}$; then $t \le t + x \in E$. Let $t \in V$, and set $M = \{t + x : x \in D\}$. Suppose that $x \in D$ is such that $t + x$ is minimal in $M$ with respect to $\le$. Let $y \in D$ with $y \le t + x$, and assume that $y \ne x$; then there is some $a \in \Omega$ such that $y(a) \ne x(a)$. Since both $y$ and $x$ are simple tuples, we have in fact $y(a) \cap x(a) = \emptyset$. Now $y \le t + x$ implies that $y(a) \subseteq t(a) \cup x(a)$, and it follows from $y(a) \cap x(a) = \emptyset$ (and the fact that $t$ is simple) that $y(a) = t(a) \subsetneq t(a) \cup x(a)$. Hence $t + y < t + x$, contradicting the minimality of $t + x$ in $M$. □


Since each element of $E$ is covered by some element of $G$, and the sets $G_i$, $i \le K$, partition $G$, we obtain

Corollary 3.4. For each $t \in V$, there is some $i \le K$ such that $t \preceq G_i$.

Observe that such $G_i$ need not be unique, and we need a selection procedure to label $t$. Our chosen method is majority voting, combined with random selection for tied maxima. Let

$m'(t, n, G) = |\{h \in G_n : t \le h\}|$ for each $n \le K$, (3.5)
$m(t, G) = \max\{m'(t, n, G) : n \le K\}$, (3.6)
$M(t, G) = \{i \le K : m'(t, i, G) = m(t, G)\}$. (3.7)

Rule 1. If $t \in V$, then label $t$ by a randomly chosen element of $M(t, G)$.

Now we consider an arbitrary hypothesis $H$. We can replace $G$ by $H$ in Eqs. (3.5)-(3.7) and apply Rule 1 for classification. However, unlike the case where $H = G$, there is no guarantee that $t \preceq H$, i.e. that $m(t, H) \ne 0$. This motivates us to distinguish between different elements of $V$, given $H = E(P)$: $t \in V$ is called primary if there is some $H_q$ with $t \preceq H_q$; $t$ is secondary if there is some $h \in H_q$ such that $t + h$ is equilabelled (i.e., $t + h$ does not overlap $H_p$ for $p \ne q$); and $t$ is tertiary otherwise. Note that an element can be both primary and secondary; in fact the primary data are a subset of the secondary data. Now we let

$m'_p(t, n, H) = |\{h \in H_n : t \le h\}|$ for each $n \le K$, (3.8)
$m_p(t, H) = \max\{m'_p(t, n, H) : n \le K\}$, (3.9)
$M_p(t, H) = \{i \le K : m'_p(t, i, H) = m_p(t, H)\}$, (3.10)

and


$m'_s(t, n, H) = |\{h \in H_n : t + h \in E\}|$ for each $n \le K$, (3.11)
$m_s(t, H) = \max\{m'_s(t, n, H) : n \le K\}$, (3.12)
$M_s(t, H) = \{i \le K : m'_s(t, i, H) = m_s(t, H)\}$. (3.13)

Rule 2.
• If $t$ is primary, then label $t$ by a randomly chosen element of $M_p(t, H)$.
• If $t$ is secondary, then label $t$ by a randomly chosen element of $M_s(t, H)$.
• Otherwise, label $t$ as unclassified.
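Rules 1 and 2 are easily expressed over the frozenset representation of the earlier sketches (h_le, h_sum, equilabelled are reused); the following Python sketch is our own reading of Eqs. (3.8)-(3.13), with H given as a map from class label to a list of equilabelled hypertuples.

```python
import random

def classify(t, H, labelled_data):
    """Rule 2 sketch: try primary voting first (3.8)-(3.10),
    then secondary voting (3.11)-(3.13), else report unclassified."""
    # m'_p(t, n, H): number of hypertuples of class n covering t
    mp = {n: sum(1 for h in Hn if h_le(t, h)) for n, Hn in H.items()}
    if max(mp.values()) > 0:                        # t is primary
        best = max(mp.values())
        return random.choice([n for n, c in mp.items() if c == best])
    # m'_s(t, n, H): hypertuples h of class n with t + h still equilabelled
    ms = {n: sum(1 for h in Hn
                 if equilabelled(h_sum(t, h), labelled_data))
          for n, Hn in H.items()}
    if max(ms.values()) > 0:                        # t is secondary
        best = max(ms.values())
        return random.choice([n for n, c in ms.items() if c == best])
    return None                                     # tertiary: unclassified
```

With H built from $G$ this reduces to Rule 1, since by Corollary 3.4 every $t \in V$ is then primary.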

Now we explore a generalisation of both primary and secondary data in the following sense:

Lemma 3.5. If $H = E(P)$, we let $P' = P \cup \{t\}$ and $H' = E(P')$. Then,

$t$ is secondary for $H$ $\Rightarrow$ $t$ is primary for $H'$.


Proof. Let $h \in H$ such that $t + h$ is equilabelled. Then $t + h \in [P'] \cap E$, and thus $t \le g$ for some $g \in E(P')$. □

The converse is not true: let $D = \{a, b, c, d, e, f\}$ with

$a = \langle 0, 0, 0 \rangle$, $b = \langle 1, 0, 0 \rangle$, $c = \langle 1, 0, 1 \rangle$,
$d = \langle 1, 1, 1 \rangle$, $e = \langle 0, 1, 1 \rangle$, $f = \langle 0, 1, 0 \rangle$.

Suppose that $a$, $b$, $e$ are coloured blue and $c$, $d$, $f$ are coloured red. Then

$E(D) = \{a + b,\ e,\ c + d,\ f\}$.


The aim is to classify $t = \langle 0, 0, 1 \rangle$ with respect to the hypothesis $H = E(D)$. Now $a + t = \langle 0, 0, \{0,1\} \rangle$ is equilabelled blue, $e + t = \langle 0, \{0,1\}, 1 \rangle$ is equilabelled blue, and $c + t = \langle \{0,1\}, 0, 1 \rangle$ is equilabelled red, while $b + t$, $d + t$, $f + t$ are not equilabelled. Furthermore, $q + t$ is not equilabelled for any $q \in E(D)$, and thus $t$ is not secondary with respect to $E(D)$. If we accept that the knowledge provided by $t$ itself should be admitted when we try to classify $t$ by $H = E(P)$, then we should classify using the primary data of $E(P \cup \{t\})$. At any rate, admitting $t$ does not cause any inconsistencies, and using the primary data of $H'$ is well in line with our aim of maximising consistency.


3.1. Weak hypotheses


In the previous discussion we introduced two classification rules based on E-sets. Rule 1 can be applied if $G$ can be practically constructed from $V$ by the LM algorithm, and Rule 2 can be applied if $S$ (or $E(P)$) can be practically constructed from $D$ (for $D \subseteq P \subseteq V$). In both cases we have to construct E-sets for $P$ with $D \subseteq P \subseteq V$. From the LM algorithm we know that constructing an E-set is expensive, in the worst case exponential. Since most of the time is spent finding maximal hypertuples, we make the following definition: a weak hypothesis is a set of equilabelled hypertuples which covers $D$. The following algorithm finds a weak hypothesis $H$ [17]:

H := ∅
for q = 0 to K do
    X := D_q
    while X ≠ ∅ do
        Order X as ⟨g_0, ..., g_{m(X)}⟩
        h := g_0
        for i = 1 to m(X) do
            if h + g_i is equilabelled then
                h := h + g_i
            end if
        end for
        X := X \ ↓h
        H := H ∪ {h}
    end while
end for
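A Python sketch of this greedy procedure, under the same representation as before (h_sum, h_le and equilabelled are reused; the dictionary-based interface is our own):

```python
def weak_hypothesis(classes, labelled_data):
    """Greedy weak-hypothesis construction: for each class, merge training
    tuples into a hypertuple while it stays equilabelled, discard the
    covered tuples, and repeat until the class is exhausted.
    classes maps a label q to the list of simple tuples of D_q."""
    H = {}
    for q, Dq in classes.items():
        Hq, X = [], list(Dq)          # X keeps the given order
        while X:
            h = X[0]
            for g in X[1:]:
                if equilabelled(h_sum(h, g), labelled_data):
                    h = h_sum(h, g)   # merge g into h
            X = [g for g in X if not h_le(g, h)]   # X := X \ ↓h
            Hq.append(h)
        H[q] = Hq
    return H
```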

The algorithm does not necessarily produce disjoint hypertuples: let

$D_0 = \{\langle 0,1,1 \rangle, \langle 1,0,1 \rangle, \langle 1,1,1 \rangle, \langle 1,1,0 \rangle, \langle 2,1,1 \rangle\}$,
$D_1 = \{\langle 2,0,1 \rangle, \langle 0,0,0 \rangle\}$.

Order $X = D_0$ by

$g_0 = \langle 0,1,1 \rangle$, $g_1 = \langle 1,0,1 \rangle$, $g_2 = \langle 1,1,1 \rangle$, $g_3 = \langle 1,1,0 \rangle$, $g_4 = \langle 2,1,1 \rangle$.

Now $h_0 = g_0 + g_1 + g_2$ is equilabelled, while $h_0 + g_3$ and $h_0 + g_4$ are not. The next step produces $h_1 = g_3 + g_4$. However, $g_2 \le h_0$ and $g_2 \le h_1$.
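Running the sketch above on this worked example (class 0 first, ordered as given) reproduces $h_0$ and $h_1$:

```python
D0 = [simple(v) for v in [(0, 1, 1), (1, 0, 1), (1, 1, 1), (1, 1, 0), (2, 1, 1)]]
D1 = [simple(v) for v in [(2, 0, 1), (0, 0, 0)]]
data = [(t, 0) for t in D0] + [(t, 1) for t in D1]
H = weak_hypothesis({0: D0, 1: D1}, data)
# H[0] == [h0, h1] with h0 = g0 + g1 + g2 and h1 = g3 + g4;
# g2 = (1, 1, 1) lies below both, so the hypertuples are not disjoint.
```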


In the worst case the time complexity for building $H_q$ (the hypothesis for class $D_q$) is $O(|D_q|^2)$. Therefore the worst case complexity for building $H$ (the whole hypothesis) is $O(K \cdot |D_q|^2)$, where $K$ is the number of classes.

Let $H = \bigcup_{i \le K} H_i$ be a weak hypothesis for $D$, where $H_i = \{h_0, \ldots, h_{t(i)}\}$ is a weak hypothesis for class $i$ and each $h_j$ is an equilabelled hypertuple. Let $M_p(t, H)$ and $M_s(t, H)$ be defined as in Eqs. (3.8) and (3.11). Then we introduce the following rule, which is the same as Rule 2 except that the hypothesis here is weak.


Rule 3.
• If $t$ is primary, then label $t$ by a randomly chosen element of $M_p(t, H)$.
• If $t$ is secondary, then label $t$ by a randomly chosen element of $M_s(t, H)$.
• Otherwise, label $t$ as unclassified.

4. Mitchell's version space

The situation in [9] can be described as a special decision system where d : U ! f0; 1g, and it is called the target concept. The set of positive examples is denoted by D1 , and that of negative examples by D0 . We will describe the concepts introduced there in our notation and we will follow the explanation in [9, p. 22, Table 2.2]. A hypothesis is a hypertuple t 2 L such that for all a 2 X jtðaÞj 6 1 or tðaÞ ¼ Va :

TED

313 314 315 316 317 318

ð4:1Þ

The set of all hypotheses is denoted by H. Observe that H is a hyperrelation, and that H is partially ordered by 6 as defined by (2.1), and that V  H. If s; t 2 H, and s 6 t, we say that s is more specific than t or, equivalently, that t is more general than s. We say that s satisfies hypothesis t, if

COR

322 323 324 325

REC

320 Thus, for each a 2 X we have 8 < ;; or tðaÞ ¼ m; for some m 2 Va ; or : Va :

1. $s$ is a simple tuple, i.e. $s \in V$,
2. $s \le t$,


and we denote the set of all (simple) tuples that satisfy $t$ by $\mathrm{sat}(t)$; observe that $\mathrm{sat}(t) = \downarrow t \cap V$. More generally, for $A \subseteq L$ we let

$\mathrm{sat}(A) = \downarrow A \cap V$.

If $t(a) = \emptyset$ for some $a \in \Omega$, then $t$ cannot be satisfied. We interpret $s \in \mathrm{sat}(t)$ as "instance $s$ is classified by hypothesis $t$". According to [9], $t$ is more general than $s$ if any instance classified by $s$ is also classified by $t$; in other words, $\mathrm{sat}(s) \subseteq \mathrm{sat}(t)$. That our notion captures this concept is shown by the following result, the easy proof of which is left to the reader.

Theorem 4.1. Suppose that $s, t \in \mathcal{H}$. Then

$s \le t \iff \mathrm{sat}(s) \subseteq \mathrm{sat}(t)$.


Since we are interested in hypotheses whose satisfiable observations are within one class of $d$, we say that $t \in \mathcal{H}$ is consistent with $\langle D, d \rangle$ if

$\downarrow t \cap D = D_1$. (4.2)

Thus, in this case, the training examples satisfying $t$ are exactly the positive ones. The version space $\mathrm{VSp}^m$ is now the set of all hypotheses consistent with $\langle D, d \rangle$. In other words,

$\mathrm{VSp}^m = \{t \in \mathcal{H} : (\forall s)[s \in \mathrm{sat}(t) \cap D \iff d(s) = 1]\}$.

The general boundary $G^m$ is the set of maximal members of $\mathcal{H}$ consistent with $\langle D, d \rangle$, i.e.

$G^m = \max \mathrm{VSp}^m$.

The specific boundary $S^m$ is the set of minimal members of $\mathcal{H}$ consistent with $\langle D, d \rangle$, i.e.

$S^m = \min \mathrm{VSp}^m$.

The two boundaries delimit the version space in the sense that for any consistent hypothesis $t$ in the version space there are $g \in G^m$ and $s \in S^m$ such that $s \le t \le g$. The example in Table 2 illustrates the idea of a version space. We follow [9] in writing ? in column $a$ if $a(x) = V_a$.


5. Expressive power of version space and Boolean reasoning

A simple tuple $t$ can be regarded as a conjunction of descriptors

$\langle a_0, t(a_0) \rangle \wedge \langle a_1, t(a_1) \rangle \wedge \cdots \wedge \langle a_T, t(a_T) \rangle$.


If $t$ is a satisfiable hypothesis, then no $t(a)$ is empty, and $t(a) = V_a$ tells us that any value is allowed in this column; such a descriptor places no restriction on satisfiability in column $a$. If $I = \{i \le T : t(a_i) \ne V_{a_i}\}$, then

$s \in \mathrm{sat}(t) \iff (\forall i \in I)\ s(a_i) = t(a_i)$,

so that we can interpret $t$ as the conjunction

$\bigwedge_{i \in I} \langle a_i, t(a_i) \rangle$.


Table 2
Training data, hypotheses and boundaries [9]

                    Sky     ATemp   Humid    Wind     Water   FCast    d
Training data D     Sunny   Warm    Normal   Strong   Warm    Same     1
                    Sunny   Warm    High     Strong   Warm    Same     1
                    Rainy   Cold    High     Strong   Warm    Change   0
                    Sunny   Warm    High     Strong   Cool    Change   1

Hypotheses          Sunny   Warm    ?        Strong   ?       ?        1
                    Sunny   ?       ?        Strong   ?       ?        1
                    Sunny   Warm    ?        ?        ?       ?        1
                    ?       Warm    ?        Strong   ?       ?        1
                    Sunny   ?       ?        ?        ?       ?        1
                    ?       Warm    ?        ?        ?       ?        1

Specific boundary   Sunny   Warm    ?        Strong   ?       ?        1

General boundary    Sunny   ?       ?        ?        ?       ?        1
                    ?       Warm    ?        ?        ?       ?        1
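To make condition (4.2) and the boundaries of Table 2 concrete, here is a small self-contained Python check (our own sketch; the string '?' plays the role of $V_a$):

```python
def satisfies(s, t):
    """Simple tuple s satisfies a Mitchell hypothesis t:
    every component matches exactly or t has the wildcard '?'."""
    return all(tv == '?' or sv == tv for sv, tv in zip(s, t))

def consistent(t, data):
    """Condition (4.2): t covers exactly the positive training examples."""
    return all(satisfies(s, t) == (label == 1) for s, label in data)

D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
     (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), 1),
     (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), 0),
     (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), 1)]

# The specific boundary and both general-boundary hypotheses are consistent:
assert consistent(('Sunny', 'Warm', '?', 'Strong', '?', '?'), D)
assert consistent(('Sunny', '?', '?', '?', '?', '?'), D)
assert consistent(('?', 'Warm', '?', '?', '?', '?'), D)
```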


Such an expression is called an exact template in [13]. The expressive power of this type of conjunctive hypothesis is limited. An example from [9] illustrating this is shown in Table 3: for such a simple dataset, there is no consistent hypothesis in the sense of (4.2).

Table 3
Limitation of version space

     Sky      ATemp   Humid    Wind     Water   FCast    d
1    Sunny    Warm    Normal   Strong   Cool    Change   1
2    Cloudy   Warm    Normal   Strong   Cool    Change   1
3    Rainy    Warm    Normal   Strong   Cool    Change   0

A hypothesis in $\mathrm{VSp}^m$ is a very special kind of hypertuple, and it is our aim to extend this notion of hypothesis so that the resulting structures are more expressive, while at the same time not so general as to carry no useful information. As suggested by Mitchell, a possible solution is to use arbitrary disjunctions of conjunctions of descriptors as the hypothesis representation. It is not hard to see that this is overly general, since any positive Boolean expression can then serve as a hypothesis. In contrast, we suggest using a specific class of disjunctions of conjunctions which is significantly narrower than the class of all positive Boolean expressions. The building blocks will be the hypertuples contained in the sub-semilattice $[V]$ of $L = \prod_{a \in \Omega} 2^{V_a}$ generated by $V$ under the $+$ operator.


Since we are only interested in the finite sums of elements of $V$, we will from now on assume that each $V_a$ is finite.

Suppose that $t = \langle \{m^a_0, m^a_1, \ldots, m^a_{t(a)}\} \rangle_{a \in \Omega}$ is a hypertuple. We interpret $t$ as a conjunction of disjunctions of descriptors

$\bigwedge_{a \in \Omega} (\langle a, m^a_0 \rangle \vee \cdots \vee \langle a, m^a_{t(a)} \rangle)$. (5.1)

By the distributivity of $\wedge$ and $\vee$ this can always be turned into a disjunction of simple tuples, but not every disjunction of simple tuples (considered as conjunctions of descriptors) is equivalent to an expression such as (5.1); consider, for example,

$(\langle a_0, t^0_0 \rangle \wedge \langle a_1, t^0_1 \rangle) \vee (\langle a_0, t^1_0 \rangle \wedge \langle a_1, t^1_1 \rangle)$.
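For illustration, interpretation (5.1) can be rendered mechanically from the frozenset representation of the earlier sketches (function name and formatting are our own):

```python
def to_formula(t, attrs):
    """Render a hypertuple as a conjunction of disjunctions of descriptors."""
    clauses = (' v '.join(f'<{a},{v}>' for v in sorted(comp))
               for a, comp in zip(attrs, t))
    return ' ^ '.join(f'({c})' for c in clauses)

# The first hypertuple of Table 1(b):
print(to_formula((frozenset({'a', 'b'}), frozenset({0, 1})), ['a1', 'a2']))
# -> (<a1,a> v <a1,b>) ^ (<a2,0> v <a2,1>)
```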


A hypertuple can be viewed as a construction similar to a hypercube, delineating a solid in an appropriate space. Now we compare our hypothesis space with Mitchell's from the perspective of expressive power. Consider a dataset $D$ as defined earlier, and suppose all attributes $x$ are discrete and all $V_x$ are finite. In our hypothesis space each attribute $x$ takes on a subset of $V_x$, so there are $2^{|V_x|}$ possible subsets for that attribute, and therefore $\prod_{x \in \Omega} 2^{|V_x|}$ different hypertuples. Since a hypothesis is a set of hypertuples (i.e., a hyperrelation), there are $2^{\prod_{x \in \Omega} 2^{|V_x|}}$ distinct hypotheses. In Mitchell's hypothesis space each attribute $x$ takes on a single value in $V_x$ plus two other special values, "?" and "$\emptyset$"; therefore there are $|V_x| + 2$ different values per attribute, and $\prod_{x \in \Omega} (|V_x| + 2)$ different tuples. In his conjunctive hypothesis representation each hypothesis is a (simple) tuple, so there are $\prod_{x \in \Omega} (|V_x| + 2)$ distinct hypotheses. In his disjunctive hypothesis representation each hypothesis is a set of (simple) tuples (i.e., a simple relation), so there are $2^{\prod_{x \in \Omega} (|V_x| + 2)}$ distinct hypotheses. Clearly our hypothesis space can represent more distinct objects than Mitchell's, and in this sense we say our hypothesis space is more expressive than Mitchell's. Note that $E$ characterises the eligible hypotheses. So Mitchell's version space can be specialised from our version space in the following way:


PRO OF

384 385 386 387

UN

409 • The E is restricted to Em ¼ fcðtÞ : t 2 E; D ^ tg, where c is an operation 410 to turn a hypertuple into a simple tuple in such a way that 411 cðtÞ ¼ ht0 ; t1 ; . . . ; tT i, where  tðxi Þ; if jtðxi Þj ¼ 1; ti ¼ ?; otherwise:

IJA 6238 ARTICLE 21 November 2003 Disk used

No. of Pages 19, DTD = 4.3.1

IN PRESS

H. Wang et al. / Internat. J. Approx. Reason. xxx (2003) xxx–xxx

15

Note that $t(x_i)$ is the projection of the tuple $t$ onto its attribute $x_i$.
• Each hypothesis is an element of $E^m$.
• There are two classes, i.e., $K = 2$.
• The version space is built for only one class.


Given the above restrictions, the specific boundary is $S^m = \{S^m_0, S^m_1\}$, where $S^m_0$ is the specific boundary for the negative class and $S^m_1$ is the specific boundary for the positive class. Similarly $G^m = \{G^m_0, G^m_1\}$.
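The operation $c$ that mediates between the two hypothesis spaces is a one-liner over the same representation (a sketch under our frozenset convention):

```python
def c(t):
    """Collapse a hypertuple to a Mitchell-style tuple: keep singleton
    components, replace every non-singleton component by the wildcard '?'."""
    return tuple(next(iter(comp)) if len(comp) == 1 else '?' for comp in t)

# c((frozenset({'Sunny'}), frozenset({'Warm', 'Cold'}))) == ('Sunny', '?')
```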

6. Experimental


We have evaluated Rule 3 on 17 public datasets, available from the UC Irvine Machine Learning Repository. General information about these datasets is shown in the first three columns of Table 4. We used the CASEEXTRACT algorithm to construct weak hypotheses for the datasets, and applied Rule 3 to classify new data. For presentation purposes we refer to our classification procedure as GLM. The experimental results are shown in the last five columns of Table 4; for comparison, the C5.0 results on the same datasets are also shown. It is clear from this table that GLM performs extremely well on primary data, but such data accounts for only 76.4% of all data on average. Over all data GLM compares well with C5.0.

Parity problems are well known to be difficult for many machine learning algorithms. We evaluated the GLM algorithm using three well known parity datasets [16]: Monk-1, Monk-2 and Monk-3. (The target concepts associated with the Monk's problems are: Monk-1: (a1 = a2) or (a5 = 1); Monk-2: exactly two of a1 = 1, a2 = 1, a3 = 1, a4 = 1, a5 = 1, a6 = 1; Monk-3: (a5 = 3 and a4 = 1) or (a5 ≠ 4 and a2 ≠ 3), with 5% class noise added to the training set.) Experimental results show that GLM works well for these parity datasets.

Table 4
General information about the datasets, and the classification accuracy of C5.0 on all data and of GLM on primary and secondary data

Dataset       #Attr   #Train   #Test   C5.0    GLM/SR   GLM/PSR   GLM/SSR   %PP
Annealing     38      798      CV-5    96.6    96.4     98.0      62.9      92.5
Australian    14      690      CV-5    90.6    95.1     95.1      100.0     87.2
Auto          25      205      CV-5    70.7    82.4     87.0      63.0      56.1
Diabetes      8       768      CV-5    72.7    70.7     71.0      40.0      66.8
German        20      1000     CV-5    71.7    72.6     72.6      N/A       65.4
Glass         9       214      CV-5    80.4    86.6     87.6      76.5      79.4
Heart         13      270      CV-5    77.0    81.9     81.9      N/A       61.5
Hepatitis     19      155      CV-5    80.6    82.9     84.1      50.0      69.0
Horse-Colic   22      368      CV-5    85.1    82.7     82.7      N/A       67.7
Iris          4       150      CV-5    94.7    93.1     97.6      66.7      82.7
Monk-1        6       124      432     74.3    100.0    100.0     N/A       100.0
Monk-2        6       169      432     65.1    81.4     87.3      41.5      83.6
Monk-3        6       122      432     97.2    93.8     94.3      89.2      89.1
Sonar         60      208      CV-5    71.6    81.6     81.3      100.0     59.1
TTT           9       958      CV-5    86.2    96.2     96.1      100.0     94.9
Vote          18      232      CV-5    96.5    96.5     98.6      77.3      89.7
Wine          13      178      CV-5    94.3    99.0     99.0      N/A       53.9
Average                                82.7    87.8     89.1      72.3      76.4

The validation method is either 5-fold cross validation (CV-5) or explicit train/test validation. The acronyms are: SR, overall success ratio on primary and secondary data (i.e., the percentage of successfully classified primary and secondary data tuples over all primary and secondary data tuples); PSR, success ratio on primary data; SSR, success ratio on secondary data; PP, the percentage of primary data; N/A, not available.


7. Discussion and conclusion

Mitchell's classical work on version space has been followed by many, most notably Hirsh and Sebag. Hirsh [5] discusses how to merge version spaces when a central idea in Mitchell's work, namely that a version space is the set of concepts strictly consistent with the training data, is removed; this merging process can therefore accommodate noise. Sebag [14] presents what she calls a disjunctive version space approach to learning disjunctive concepts from noisy data: a separate version space is learned for each positive training example, and new instances are then classified by combining the votes of these different version spaces.

In this paper we investigate version spaces in a more expressive hypothesis space (disjunctions of conjunctions of disjunctions), where each hypothesis is a set of hypertuples. Without a proper inductive bias the version space is uninteresting. We show that, with the E-set as an inductive bias, this version space is a generalisation of Mitchell's original version space, which employs a different type of inductive bias.

For classification within the version space we proposed three classification rules for use in different situations. The first two rules are based on E-sets as hypotheses, and they can be applied when the data space ($V$) is finite and small. We showed that constructing E-sets is computationally expensive, so we introduced a third rule, which is based on weak hypotheses, and presented an algorithm to construct weak hypotheses efficiently. Experimental results show that this classification approach performs extremely well on primary data, which accounts for over 75% of all data; over all data the approach is comparable to C5.0.




Acknowledgements




The work of Hui Wang was partly supported by the European Commission project ICONS, project no. IST-2001-32429.

Appendix A. Notation

$X \preceq Y \iff (\forall x \in X)(\exists y \in Y)\ x \le y$ ($Y$ covers $X$)
$X \trianglelefteq Y \iff (\forall y \in Y)(\exists x \in X)\ x \le y$ ($X$ is dense for $Y$)
$U = \{x_0, \ldots, x_N\}$
$\Omega = \{a_0, \ldots, a_T\}$
$V = \prod_{a \in \Omega} V_a$
$L = \prod_{a \in \Omega} 2^{V_a}$
$I(x) = \langle a_0(x), a_1(x), \ldots, a_T(x) \rangle$
$D = \{I(x) : x \in U\} = D_0 \cup D_1 \cup \cdots \cup D_K$
$E_q$: the set of all elements equilabelled with respect to $D_q$
$E = \bigcup_{q \le K} E_q$
$E(P) = \{t \in L : t \text{ is maximal in } [P] \cap E\}$
$E_q(P) = E(P) \cap E_q$
$S = E(D)$, $S_q = E_q(D)$
$G = E(V)$, $G_q = E_q(V)$
$\mathrm{VSp} = \bigcup\{E(P) : D \subseteq P \subseteq V\}$


References


[1] I. Düntsch, G. Gediga, Statistical evaluation of rough set dependency analysis, International Journal of Human-Computer Studies 46 (1997) 589-604.
[2] I. Düntsch, G. Gediga, Simple data filtering in rough set systems, International Journal of Approximate Reasoning 18 (1-2) (1998) 93-106.
[3] G. Grätzer, General Lattice Theory, Birkhäuser, Basel, 1978.
[4] D. Haussler, Quantifying inductive bias: AI learning algorithms and Valiant's learning framework, Artificial Intelligence 36 (1988) 177-221.
[5] H. Hirsh, Generalizing version spaces, Machine Learning 17 (1) (1994) 5-46.
[6] T.M. Mitchell, Version spaces: a candidate elimination approach to rule learning, in: Proc. 5th IJCAI, 1977, pp. 305-310.
[7] T.M. Mitchell, Version spaces: an approach to concept learning, PhD thesis, Department of Electrical Engineering, Stanford University, Stanford, CA, 1979.
[8] T.M. Mitchell, Generalization as search, Artificial Intelligence 18 (1982).
[9] T.M. Mitchell, Machine Learning, The McGraw-Hill Companies, Inc., 1997.
[10] H.S. Nguyen, From optimal hyperplanes to optimal decision trees, Fundamenta Informaticae 34 (1998) 145-174.
[11] H.S. Nguyen, S.H. Nguyen, Pattern extraction from data, Fundamenta Informaticae 34 (1998) 129-144.
[12] H.S. Nguyen, A. Skowron, Boolean reasoning for feature extraction problems, in: Z.W. Ras, A. Skowron (Eds.), Proc. of the 10th International Symposium on Methodologies for Intelligent Systems (ISMIS'97), Lecture Notes in Artificial Intelligence, vol. 1325, Springer-Verlag, Berlin, 1997, pp. 117-126.
[13] S.H. Nguyen, A. Skowron, P. Synak, Discovery of data patterns with applications to decomposition and classification problems, in: L. Polkowski, A. Skowron (Eds.), Rough Sets in Knowledge Discovery, vol. 2, Physica-Verlag, Heidelberg, 1998, pp. 55-97.
[14] M. Sebag, Delaying the choice of bias: a disjunctive version space approach, in: Proc. of the 13th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 444-452.
[15] A. Skowron, Extracting laws from decision tables: a rough set approach, Computational Intelligence 11 (1995) 371-388.
[16] S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, J. Cheng, K. De Jong, S. Dzeroski, S.E. Fahlman, D. Fisher, R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R.S. Michalski, T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, J. Zhang, The Monk's problems: a performance comparison of different learning algorithms, Technical Report CS-CMU-91-197, Carnegie Mellon University, 1991.
[17] H. Wang, W. Dubitzky, I. Düntsch, D. Bell, A lattice machine approach to automated casebase design: marrying lazy and eager learning, in: Proc. IJCAI99, Stockholm, Sweden, 1999, pp. 254-259.


[18] H. Wang, I. Düntsch, D. Bell, Data reduction based on hyper relations, in: R. Agrawal, P. Stolorz, G. Piatetsky-Shapiro (Eds.), Proceedings of KDD'98, New York, 1998, pp. 349-353.
[19] H. Wang, I. Düntsch, G. Gediga, Classificatory filtering in decision systems, International Journal of Approximate Reasoning 23 (2000) 111-136.