IJA 6238 ARTICLE 21 November 2003 Disk used
No. of Pages 19, DTD = 4.3.1
IN PRESS
International Journal of Approximate Reasoning xxx (2003) xxx–xxx
www.elsevier.com/locate/ijar
Hyperrelations in version space

Hui Wang a,*, Ivo Düntsch b, Günther Gediga c, Andrzej Skowron d

a School of Information and Software Engineering, Faculty of Informatics, University of Ulster, Newtownabbey, Co. Antrim BT37 0QB, Ireland
b Department of Computer Science, Brock University, St. Catharines, Ont., Canada L2S 3A1
c Institut für Evaluation und Marktanalysen, Brinkstr. 19, 49143 Jeggen, Germany
d Institute of Mathematics, University of Warsaw, 02-097 Warszawa, Poland

Received 1 October 2002; accepted 1 October 2003
Abstract

A version space is the set of all hypotheses consistent with a given set of training examples, delimited by the specific boundary and the general boundary. In existing studies [Machine Learning 17(1) (1994) 5; Proc. 5th IJCAI (1977) 305; Artificial Intelligence 18 (1982)] a hypothesis is a conjunction of attribute-value pairs, which has been shown to have limited expressive power [Machine Learning, The McGraw-Hill Companies, Inc. (1997)]. In a more expressive hypothesis space, e.g., disjunctions of conjunctions of attribute-value pairs, a general version space becomes uninteresting unless some restriction (inductive bias) is imposed [Machine Learning, The McGraw-Hill Companies, Inc. (1997)]. In this paper we investigate a version space in a hypothesis space where a hypothesis is a hyperrelation, which is in effect a disjunction of conjunctions of disjunctions of attribute-value pairs. Such a hypothesis space is more expressive than both conjunctions of attribute-value pairs and disjunctions of conjunctions of attribute-value pairs. However, given a dataset, we focus our attention only on those hypotheses which are consistent with the given data and are maximal in the sense that the elements in a hypothesis cannot be merged further. Such a hypothesis is called an E-set for the given data, and the set of all E-sets is the version space, which is delimited by the least E-set (the specific boundary) and the greatest E-set (the general boundary).
* Corresponding author. Tel.: +44-1232-368981; fax: +44-1232-366068.
E-mail addresses: [email protected] (H. Wang), [email protected] (I. Düntsch), [email protected] (G. Gediga), [email protected] (A. Skowron).

0888-613X/$ - see front matter © 2003 Published by Elsevier Inc.
doi:10.1016/j.ijar.2003.10.007
Based on this version space we propose three classification rules for use in different situations. The first two are based on E-sets, and the third is based on ''degraded'' E-sets called weak hypotheses, in which the maximality constraint is relaxed. We present an algorithm to calculate E-sets, though it is computationally expensive in the worst case. We also present an efficient algorithm to calculate weak hypotheses. The third rule is evaluated on public datasets, and the results compare well with the C5.0 decision tree classifier.
© 2003 Published by Elsevier Inc.
1. Introduction
Version space is a useful and powerful concept in the area of concept learning: a version space is the set of all hypotheses in a hypothesis space that are consistent with a dataset. A version space is delimited by the specific boundary and the general boundary––the sets of most specific and most general hypotheses respectively, to be explained below. The ELIMINATE-CANDIDATE algorithm [6,8] is the flagship algorithm used to construct a version space from a dataset.

The hypothesis space used in this algorithm is the conjunction of attribute-value pairs, which has been shown to have limited expressive power [9]. It was also shown in [4] that the size of the general boundary can grow exponentially in the number of training examples, even when the hypothesis space consists of simple conjunctions of attribute-value pairs. Table 3 was used in [9] to show the limitation of Mitchell's representation: due to this limitation (i.e., the chosen hypothesis space), the ELIMINATE-CANDIDATE algorithm cannot come up with a consistent specific boundary or general boundary for this example.

A possible solution, increasing the expressive power of the representation, is to extend the hypothesis space from the restrictive conjunctions of attribute-value pairs to disjunctions of conjunctions of attribute-value pairs. But it turns out that, without a proper inductive bias, the specific boundary would always be overly specific (the disjunction of the observed positive examples) and the general boundary would always be overly general (the negated disjunction of the observed negative examples) [9], so they carry no useful information––they are ''uninteresting''. The desired inductive bias, however, has not been explored. In this paper we report a study of version space which aims at increasing the expressive power of the hypothesis space while, at the same time, keeping the hypotheses ''interesting'' by introducing an inductive bias.
We explore a semilattice structure that exists in the set of all hypertuples of a domain, where hypertuples generalise traditional tuples from value-based to set-based. This semilattice structure can be used as the basis for a hypothesis space. We take a hypothesis to be a hyperrelation, i.e., a set of hypertuples. A hyperrelation can
be interpreted as a disjunction of hypertuples, which can be further interpreted as a conjunction of disjunctions of attribute-value pairs. Such a hypothesis space is much more expressive than either conjunctions of attribute-value pairs or disjunctions of conjunctions of attribute-value pairs. For a dataset there is a large number of hypertuples which are consistent with the data, some of which can be merged (through the semilattice operation) to form a different consistent hypertuple. We propose to focus on those hypertuples which are consistent with the given data and cannot be merged further––they are said to be maximal. A hypothesis is then a set of these hypertuples, and the version space is the set of all such hypotheses. Clearly this version space is a subset of the semilattice. An algorithm is presented which is able to construct the version space. A detailed analysis of Mitchell's version space shows that our version space is indeed a generalisation of Mitchell's and is more expressive than Mitchell's.

Classification within our version space is not trivial. We explore ways in which data can be classified using different hypotheses in the version space. We also examine how this whole process can be sped up for practical benefit. Experimental results are presented to show the effectiveness of this approach.

Most of the theory (and results) of version space can be expressed with the tools offered by Boolean reasoning as investigated, for example, in [10–13,15]. Unlike some of these publications, we will only be concerned with consistent decision systems and exact classification rules.
2. Definitions and notation
For a set U, we let 2^U be the powerset of U. If ≤ is a partial ordering on U, and X ⊆ U, we let

↑X = {y ∈ U : (∃x ∈ X) x ≤ y},   ↓X = {y ∈ U : (∃x ∈ X) y ≤ x}.

If X, Y ⊆ U, we say that Y covers X, written X ⪯ Y, if X ⊆ ↓Y, i.e. if for each x ∈ X there is some y ∈ Y with x ≤ y. We call X dense for Y, written X ⊑ Y, if for any y ∈ Y there is some x ∈ X such that x ≤ y, i.e. Y ⊆ ↑X. Note that Y covers X iff Y is dually dense for X.

A semilattice is an algebra ⟨A, +⟩ such that + is a binary operation on A satisfying

x + y = y + x,
(x + y) + z = x + (y + z),
x + x = x.
If X ⊆ A, we denote by [X] the sub-semilattice of A generated by X. With some abuse of notation, we also use + for the complex sum, i.e. X + Y = {x + y : x ∈ X, y ∈ Y}. Each semilattice has an intrinsic order defined by

x ≤ y ⟺ x + y = y.   (2.1)
x is called minimal (maximal) in X ⊆ U if y ≤ x (x ≤ y) implies x = y. The set of all minimal (maximal) elements of X is denoted by min X (max X).

A decision system is a tuple I = ⟨U, Ω, {V_a : a ∈ Ω}, d, V_d⟩, where
1. U = {x_0, ..., x_N} is a nonempty finite set.
2. Ω = {a_0, ..., a_T} is a nonempty finite set of mappings a_i : U → V_{a_i}.
3. d : U → V_d is a mapping.
Set V = ∏_{a∈Ω} V_a. We let I : U → V be the mapping

x ↦ ⟨a_0(x), a_1(x), ..., a_T(x)⟩   (2.2)

which assigns to each object x its feature vector. V is called the data space, and the collection D = {I(x) : x ∈ U} is called the training space. In the sequel V is assumed finite (consequently D is finite) unless otherwise stated. The mapping d defines a labelling (or partition) of D into classes D_0, D_1, ..., D_K, where I(x) and I(y) are in the same class D_i if and only if d(x) = d(y). If a ∈ Ω, v ∈ V_a, then ⟨a, v⟩ is called a descriptor. We call

L = ∏_{a∈Ω} 2^{V_a}   (2.3)

the extended data space, its elements hypertuples, and its subsets hyperrelations. In contrast we call the elements of V simple tuples and its subsets simple relations. If s ∈ L and a ∈ Ω we let s(a) be the projection of s with respect to a, i.e. the set appearing in the a-th component. L is a lattice under the natural order

s ≤ t ⟺ s(a) ⊆ t(a) for all a ∈ Ω,

with the least upper bound (+) and greatest lower bound (·) operations given by

(t + s)(a) = t(a) ∪ s(a),   (t · s)(a) = t(a) ∩ s(a)

for all a ∈ Ω. s and t are said to be overlapping if there is some simple tuple d such that d ≤ t · s. By identifying a singleton with the element it contains, we may suppose that V ⊆ L; in particular, we have D ⊆ L.
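The order and the lattice operations on L can be sketched concretely. The following is a toy encoding of ours (not code from the paper): a hypertuple is a tuple of sets, + is the componentwise union, · the componentwise intersection, and two hypertuples overlap iff their meet has no empty component.

```python
def join(s, t):
    # (t + s)(a) = t(a) ∪ s(a): componentwise union
    return tuple(sa | ta for sa, ta in zip(s, t))

def meet(s, t):
    # (t · s)(a) = t(a) ∩ s(a): componentwise intersection
    return tuple(sa & ta for sa, ta in zip(s, t))

def leq(s, t):
    # s ≤ t iff s(a) ⊆ t(a) for all attributes a
    return all(sa <= ta for sa, ta in zip(s, t))

def as_hyper(simple):
    # identify a simple tuple with the hypertuple of its singletons
    return tuple(frozenset([v]) for v in simple)

def overlapping(s, t):
    # s, t overlap iff some simple tuple lies below t · s,
    # i.e. every component of the meet is nonempty
    return all(meet(s, t))

s = as_hyper(('a', 0))
t = as_hyper(('b', 1))
u = join(s, t)
assert u == (frozenset({'a', 'b'}), frozenset({0, 1}))
assert leq(s, u) and leq(t, u)
assert not overlapping(s, t)   # no simple tuple below their meet
assert overlapping(u, s)
```

The identification of a simple tuple with a hypertuple of singletons is what lets us write V ⊆ L in code as well as in the text.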
Table 1

(a) A set of simple tuples
a1  a2  y
a   0   a
a   1   a
b   1   a
b   2   b

(b) A set of hypertuples
a1      a2      y
{a,b}   {0,1}   a
{b}     {2}     b
Table 1(a) above shows a simple relation (i.e., a set of simple tuples) and Table 1(b) a hyperrelation. For unexplained notation and background reading in lattice theory, we invite the reader to consult [3].
3. Hypertuples and hyperrelations
Suppose that D_q is a labelled class. The idea of [2,19] was to collect, if possible, across the elements of D_q the values of the same attributes without destroying the labelling information. For example, if the system generates the rules

If x is blue, then sell;
if x is green, then sell;

we can identify the values ''blue'' and ''green'' and collect these into the single rule

If x is blue or green, then sell.
Our aim is to maximise the number of attribute values on the left hand side of the rule, in order to increase the generality of the rule and also to increase its base so as to guard against random influences [1]. This leads to the following definition [18]: we call an element r ∈ L equilabelled with respect to the class D_q if

↓r ∩ D ≠ ∅   (3.1)

and

↓r ∩ D ⊆ D_q.   (3.2)

These two conditions express two fundamental principles in machine learning: (3.1) says that r is supported by some d ∈ D; it is a minimality condition in the
sense that we do not want to consider hypertuples that have nothing to do with the training set. Condition (3.2) tells us that r needs to be consistent with the training set: all elements below r which are in the training set D are in the unique class D_q. We denote the set of all elements equilabelled with respect to D_q by E_q, and let E = ⋃_{q≤K} E_q be the set of all equilabelled elements. Note that D ⊆ E, and that

q, r ≤ K, q ≠ r implies E_q ∩ E_r = ∅,   (3.3)

since {D_0, ..., D_K} partitions D. Furthermore,

↑E_q ∩ E = E_q.   (3.4)
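Conditions (3.1) and (3.2) are easy to check mechanically. The following sketch (our own encoding and helper names, not code from the paper) tests whether a hypertuple is equilabelled, using the blue/green selling rule above as data:

```python
def leq(s, t):
    # s ≤ t iff s(a) ⊆ t(a) for all attributes a
    return all(sa <= ta for sa, ta in zip(s, t))

def as_hyper(simple):
    # identify a simple tuple with the hypertuple of its singletons
    return tuple(frozenset([v]) for v in simple)

def covered_labels(r, labelled_data):
    """Labels of the training tuples in ↓r ∩ D."""
    return {q for d, q in labelled_data if leq(as_hyper(d), r)}

def equilabelled(r, labelled_data):
    # (3.1): ↓r ∩ D ≠ ∅ and (3.2): ↓r ∩ D ⊆ D_q, i.e. exactly one label
    return len(covered_labels(r, labelled_data)) == 1

D = [(('blue',), 'sell'), (('green',), 'sell'), (('red',), 'keep')]
r = (frozenset({'blue', 'green'}),)      # "blue or green" merged into one hypertuple
assert equilabelled(r, D)
assert not equilabelled((frozenset({'blue', 'red'}),), D)   # violates (3.2)
assert not equilabelled((frozenset({'yellow'}),), D)        # violates (3.1)
```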
If D ⊆ P ⊆ L, we let E(P) = {t ∈ L : t is maximal in [P] ∩ E}, and set VSp = {E(P) : D ⊆ P ⊆ V}. For each q ≤ K, let E_q(P) = E(P) ∩ E_q. Each set of the form E(P) is called an E-set or a hypothesis. We now have

Lemma 3.1. Let D ⊆ P ⊆ Q ⊆ L. Then,
1. E(P) ⪯ E(Q).
2. E(P) ⊑ E(Q).

Proof. Both claims follow immediately from the fact that ∅ ≠ [P] ∩ E ⊆ [Q] ∩ E, and that [Q] ∩ E is finite. □

Theorem 3.2. min VSp = E(D), max VSp = E(V).
Proof. We only prove the first part; the second part can be proved similarly.
''⊆'': Suppose that t is minimal in VSp. Then there is some D ⊆ P ⊆ V such that t ∈ E(P). Since E(D) ⊑ E(P) by Lemma 3.1, there is some s ∈ E(D) such that s ≤ t, and the minimality of t implies s = t.
''⊇'': If t ∈ E(D) and s ≤ t is minimal in VSp, then s ∈ E(D) as well. Since the elements of E(D) are pairwise incomparable, we have s = t. □

In the sequel, we will denote E(D) by S, and E(V) by G, since they correspond, respectively, to the specific and general boundaries of D in the sense of [7]; we also set S_q = E_q(D) and G_q = E_q(V). By Lemma 3.1, S is the least E-set and G is the greatest E-set. An algorithm to find E(D), called the lattice machine (LM), has been presented in [18], and it can easily be generalised to find E(P) for each D ⊆ P ⊆ V:

Algorithm (LM algorithm).
1. H_0 := P.
2. H_{k+1} := the set of maximal elements of [↓(H_k + P)] ∩ E.
3. Continue until H_n = H_{n+1} for some n.
It was shown in [19] that H_n = E(P). If |P| = m, then the complexity of the algorithm is O(2^m) in the worst case.
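For very small data, E(D) can also be computed by brute force straight from the definitions: close D under +, keep the equilabelled elements, and take the maximal ones. The sketch below is our own illustration (the real LM algorithm iterates the fixed-point step above); it reproduces the two hypertuples of Table 1(b).

```python
def join(s, t):
    return tuple(sa | ta for sa, ta in zip(s, t))

def leq(s, t):
    return all(sa <= ta for sa, ta in zip(s, t))

def as_hyper(d):
    return tuple(frozenset([v]) for v in d)

def equilabelled(r, data):
    labels = {q for d, q in data if leq(as_hyper(d), r)}
    return len(labels) == 1

def e_set(data):
    # generate the sub-semilattice [D] by closing under +
    elems = {as_hyper(d) for d, _ in data}
    while True:
        new = {join(s, t) for s in elems for t in elems} - elems
        if not new:
            break
        elems |= new
    eq = [r for r in elems if equilabelled(r, data)]
    # E(D): the maximal equilabelled elements of [D]
    return {r for r in eq if not any(r != s and leq(r, s) for s in eq)}

# Table 1(a): simple tuples with label y
data = [(('a', 0), 'a'), (('a', 1), 'a'), (('b', 1), 'a'), (('b', 2), 'b')]
E = e_set(data)
assert (frozenset({'a', 'b'}), frozenset({0, 1})) in E   # first row of Table 1(b)
assert (frozenset({'b'}), frozenset({2})) in E           # second row of Table 1(b)
```

The exhaustive closure makes the exponential worst case obvious: [D] can contain up to 2^|D| joins.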
To illustrate the above notions we consider the following example.

Example. A dataset is shown in Fig. 1. The small circles and crosses are simple tuples of different classes, while the rectangles are hypertuples. Every hypertuple depicted here is equilabelled, since it covers only simple tuples of the same class. All the simple tuples are also equilabelled. Each hypertuple here is maximal, since we cannot move its boundary in any dimension while keeping it equilabelled. The set of all hypertuples in the figure is an E-set, or a hypothesis, for the underlying concept implied by the dataset, since all hypertuples are equilabelled and maximal, and together they cover all the simple tuples. In fact this E-set is E(D)––the specific boundary. The general boundary is not depicted here, as we would need to know the domain of each attribute to calculate it. Clearly the E-set does not cover the whole data space; in other words, there are regions that are not covered by the E-set. Primary elements are those that are covered explicitly by some equilabelled hypertuple; secondary elements are those not covered explicitly but which can be covered by extending some equilabelled hypertuple without compromising its equilabelledness. Primary and secondary elements can easily be identified in the figure.
Our next aim is to show that every simple tuple is covered by some equilabelled element:

Lemma 3.3. For each t ∈ V there is some h ∈ E such that t ≤ h.
Fig. 1. A dataset and its specific boundary. Each circle or cross is a simple tuple, and each rectangle represents a hypertuple.
Proof. We show that there is some x ∈ D such that ↓(t + x) ∩ D = {x}; then t ≤ t + x ∈ E. Let t ∈ V, and set M = {t + x : x ∈ D}. Suppose that x ∈ D is such that t + x is minimal in M with respect to ≤. Let y ∈ D be such that y ≤ t + x, and assume that y ≠ x; then there is some a ∈ Ω such that y(a) ≠ x(a). Since both y and x are simple tuples, we have in fact y(a) ∩ x(a) = ∅. Now, y ≤ t + x implies that y(a) ⊆ t(a) ∪ x(a), and it follows from y(a) ∩ x(a) = ∅––and the fact that t is simple––that y(a) = t(a) ⊊ t(a) ∪ x(a). Hence t + y < t + x, contradicting the minimality of t + x in M. □
Since each element of E is covered by some element of G, and the sets G_i, i ≤ K, partition G, we obtain

Corollary 3.4. For each t ∈ V, there is some i ≤ K such that t ⪯ G_i.

Observe that such a G_i need not be unique, and we need a selection procedure to label t. Our chosen method is majority voting, combined with random selection for tied maxima. Let

m′(t, n, G) = |{h ∈ G_n : t ≤ h}| for each n ≤ K,   (3.5)
m(t, G) = max{m′(t, n, G) : n ≤ K},   (3.6)
M(t, G) = {i ≤ K : m′(t, i, G) = m(t, G)}.   (3.7)

Rule 1. If t ∈ V, then label t by a randomly chosen element of M(t, G).

Now we consider an arbitrary hypothesis H. We can replace G by H in Eqs. (3.5)–(3.7) and apply Rule 1 for classification. However, unlike the case where H = G, there is no guarantee that t ⪯ H, i.e. that m(t, H) ≠ 0. This motivates us to distinguish between different elements in V given H = E(P): t ∈ V is called primary if there is an H_q with t ⪯ H_q; t is secondary if there is some h ∈ H_q such that t + h is equilabelled (i.e., t + h does not overlap H_p for p ≠ q); and t is tertiary otherwise. Note that an element can be both primary and secondary; in fact the primary data form a subset of the secondary data. Now we let

m′_p(t, n, H) = |{h ∈ H_n : t ≤ h}| for each n ≤ K,   (3.8)
m_p(t, H) = max{m′_p(t, n, H) : n ≤ K},   (3.9)
M_p(t, H) = {i ≤ K : m′_p(t, i, H) = m_p(t, H)},   (3.10)

and
m′_s(t, n, H) = |{h ∈ H_n : t + h ∈ E}| for each n ≤ K,   (3.11)
m_s(t, H) = max{m′_s(t, n, H) : n ≤ K},   (3.12)
M_s(t, H) = {i ≤ K : m′_s(t, i, H) = m_s(t, H)}.   (3.13)

Rule 2
• If t is primary, then label t by a randomly chosen element of M_p(t, H).
• If t is secondary, then label t by a randomly chosen element of M_s(t, H).
• Otherwise, label t as unclassified.
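The majority voting of Eqs. (3.8)–(3.13) can be sketched in a few lines. The encoding, data layout and helper names below are ours, not the paper's; H maps each class label to its list of equilabelled hypertuples, and the data from Table 1 serve as a worked example.

```python
import random

def join(s, t):
    return tuple(sa | ta for sa, ta in zip(s, t))

def leq(s, t):
    return all(sa <= ta for sa, ta in zip(s, t))

def as_hyper(d):
    return tuple(frozenset([v]) for v in d)

def equilabelled(r, data):
    return len({q for d, q in data if leq(as_hyper(d), r)}) == 1

def classify(t, H, data):
    """Rule 2: primary vote (3.8)-(3.10), then secondary vote (3.11)-(3.13)."""
    th = as_hyper(t)
    # m'_p: per class, count hypertuples that t falls under
    m_p = {q: sum(leq(th, h) for h in hs) for q, hs in H.items()}
    if max(m_p.values()) > 0:                       # t is primary
        best = max(m_p.values())
        return random.choice([q for q, m in m_p.items() if m == best])
    # m'_s: per class, count hypertuples that stay equilabelled when joined with t
    m_s = {q: sum(equilabelled(join(th, h), data) for h in hs)
           for q, hs in H.items()}
    if max(m_s.values()) > 0:                       # t is secondary
        best = max(m_s.values())
        return random.choice([q for q, m in m_s.items() if m == best])
    return None                                     # tertiary: unclassified

data = [(('a', 0), 'a'), (('a', 1), 'a'), (('b', 1), 'a'), (('b', 2), 'b')]
H = {'a': [(frozenset({'a', 'b'}), frozenset({0, 1}))],
     'b': [(frozenset({'b'}), frozenset({2}))]}
assert classify(('b', 0), H, data) == 'a'   # primary: falls under the 'a' hypertuple
assert classify(('a', 2), H, data) == 'b'   # secondary: only the 'b' join stays equilabelled
```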
Now we explore a generalisation of both primary and secondary data in the following sense:

Lemma 3.5. If H = E(P), we let P′ = P ∪ {t} and H′ = E(P′). Then,

t is secondary for H ⇒ t is primary for H′.

Proof. Let h ∈ H be such that t + h is equilabelled. Then t + h ∈ [P′] ∩ E, and thus t ≤ g for some g ∈ E(P′). □

The converse is not true: let D = {a, b, c, d, e, f} with

a = ⟨0,0,0⟩,  b = ⟨1,0,0⟩,  c = ⟨1,0,1⟩,
d = ⟨1,1,1⟩,  e = ⟨0,1,1⟩,  f = ⟨0,1,0⟩.

Suppose that a, b, e are coloured blue and c, d, f are coloured red. Then

E(D) = {a + b, e, c + d, f}.

The aim is to classify t = ⟨0,0,1⟩ with respect to the hypothesis H = E(D). Now,

a + t = ⟨{0}, {0}, {0,1}⟩ is equilabelled blue;
e + t = ⟨{0}, {0,1}, {1}⟩ is equilabelled blue;
c + t = ⟨{0,1}, {0}, {1}⟩ is equilabelled red;

while b + t, d + t, f + t are not equilabelled. Furthermore, q + t is not equilabelled for any q ∈ E(D), and thus t is not secondary with respect to E(D). If we accept that the knowledge provided by t should be admitted when we try to classify t by H = E(P), then we should classify using the primary data of E(P ∪ {t}). At any rate, admitting t does not cause any inconsistencies, and using the primary data of H′ is well in line with our aim of maximising consistency.
3.1. Weak hypotheses
In the previous discussion we have introduced two classification rules based on E-sets. Rule 1 can be applied if G can be practically constructed from V by the LM algorithm, and Rule 2 can be applied if S (or E(P)) can be practically constructed from D (for D ⊆ P ⊆ V). In both cases we have to construct E-sets for P where D ⊆ P ⊆ V. From the LM algorithm we know that constructing an E-set is expensive; in the worst case it is exponential. Since most of the time is spent finding maximal hypertuples, we make the following definition: a weak hypothesis is a set of equilabelled hypertuples which covers D. The following algorithm finds a weak hypothesis H [17]:

H ← ∅
for q = 0 to K do
    X ← D_q
    while X ≠ ∅ do
        order X as ⟨g_0, ..., g_{m(X)}⟩
        h ← g_0
        for i = 1 to m(X) do
            if h + g_i is equilabelled then
                h ← h + g_i
            end if
        end for
        X ← X \ ↓h
        H ← H ∪ {h}
    end while
end for
The algorithm does not necessarily produce disjoint hypertuples: let

D_0 = {⟨0,1,1⟩, ⟨1,0,1⟩, ⟨1,1,1⟩, ⟨1,1,0⟩, ⟨2,1,1⟩},
D_1 = {⟨2,0,1⟩, ⟨0,0,0⟩}.

Order X = D_0 by

g_0 = ⟨0,1,1⟩,  g_1 = ⟨1,0,1⟩,  g_2 = ⟨1,1,1⟩,  g_3 = ⟨1,1,0⟩,  g_4 = ⟨2,1,1⟩.

Now, h_0 = g_0 + g_1 + g_2 is equilabelled, while h_0 + g_3 and h_0 + g_4 are not. The next step produces h_1 = g_3 + g_4. However, g_2 ≤ h_0 and g_2 ≤ h_1.
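The greedy procedure above is simple enough to sketch directly, and the D_0/D_1 example can be checked mechanically. The encoding and helper names below are ours; the ordering of each X is taken to be the order in which the tuples are listed.

```python
def join(s, t):
    return tuple(sa | ta for sa, ta in zip(s, t))

def leq(s, t):
    return all(sa <= ta for sa, ta in zip(s, t))

def as_hyper(d):
    return tuple(frozenset([v]) for v in d)

def equilabelled(r, data):
    return len({q for d, q in data if leq(as_hyper(d), r)}) == 1

def weak_hypothesis(classes):
    """classes: label -> list of simple tuples; returns a list of (label, hypertuple)."""
    data = [(d, q) for q, ds in classes.items() for d in ds]
    H = []
    for q, ds in classes.items():
        X = list(ds)                                # ordered as given
        while X:
            h = as_hyper(X[0])
            for g in X[1:]:
                cand = join(h, as_hyper(g))
                if equilabelled(cand, data):        # greedily absorb g_i
                    h = cand
            X = [g for g in X if not leq(as_hyper(g), h)]   # X := X \ ↓h
            H.append((q, h))
    return H

D0 = [(0, 1, 1), (1, 0, 1), (1, 1, 1), (1, 1, 0), (2, 1, 1)]
D1 = [(2, 0, 1), (0, 0, 0)]
H = weak_hypothesis({0: D0, 1: D1})
hs0 = [h for q, h in H if q == 0]
assert (frozenset({0, 1}), frozenset({0, 1}), frozenset({1})) in hs0   # h0 = g0+g1+g2
assert (frozenset({1, 2}), frozenset({1}), frozenset({0, 1})) in hs0   # h1 = g3+g4
g2 = as_hyper((1, 1, 1))
assert sum(leq(g2, h) for h in hs0) == 2   # g2 lies below both h0 and h1
```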
In the worst case the time complexity for building H_q (the hypothesis for class D_q) is O(|D_q|²). Therefore the worst-case complexity for building H (the whole hypothesis) is O(K·|D_q|²), where K is the number of classes.

Let H = ⋃_{i≤K} H_i be a weak hypothesis for D, where H_i = {h_0, ..., h_{t(i)}} is a weak hypothesis for class i and each h_j is an equilabelled hypertuple. Let M_p(t, H) and M_s(t, H) be defined as in Eqs. (3.8) and (3.11). Then we introduce the following rule, which is the same as Rule 2 except that the hypothesis here is weak.

Rule 3
• If t is primary, then label t by a randomly chosen element of M_p(t, H).
• If t is secondary, then label t by a randomly chosen element of M_s(t, H).
• Otherwise, label t as unclassified.

4. Mitchell's version space
The situation in [9] can be described as a special decision system where d : U → {0, 1} is called the target concept. The set of positive examples is denoted by D_1, and that of negative examples by D_0. We will describe the concepts introduced there in our notation, following the explanation in [9, p. 22, Table 2.2]. A hypothesis is a hypertuple t ∈ L such that for all a ∈ Ω

|t(a)| ≤ 1 or t(a) = V_a.   (4.1)

Thus, for each a ∈ Ω we have

t(a) = ∅, or {m} for some m ∈ V_a, or V_a.

The set of all hypotheses is denoted by H. Observe that H is a hyperrelation, that H is partially ordered by ≤ as defined by (2.1), and that V ⊆ H. If s, t ∈ H and s ≤ t, we say that s is more specific than t or, equivalently, that t is more general than s. We say that s satisfies hypothesis t if
1. s is a simple tuple, i.e. s ∈ V, and
2. s ≤ t,

and we denote the set of all (simple) tuples that satisfy t by sat(t); observe that sat(t) = ↓t ∩ V. More generally, for A ⊆ L we let

sat(A) = ↓A ∩ V.

If t(a) = ∅ for some a ∈ Ω, then t cannot be satisfied. We interpret s ∈ sat(t) as ''instance s is classified by hypothesis t''. According to [9], t is more general
than s, if any instance classified by s is also classified by t; in other words, sat(s) ⊆ sat(t). That our notion captures this concept is shown by the following result, the easy proof of which is left to the reader.

Theorem 4.1. Suppose that s, t ∈ H. Then,

s ≤ t ⟺ sat(s) ⊆ sat(t).
Since we are interested in hypotheses whose satisfiable observations lie within one class of d, we say that t ∈ H is consistent with ⟨D, d⟩ if

↓t ∩ D = D_1.   (4.2)

Thus, in this case, the training examples satisfying t are exactly the positive ones. The version space VSpm is now the set of all hypotheses consistent with ⟨D, d⟩. In other words,

VSpm = {t ∈ H : (∀s)[s ∈ sat(t) ∩ D ⟺ d(s) = 1]}.

The general boundary Gm is the set of maximal members of H consistent with ⟨D, d⟩, i.e.

Gm = max VSpm.

The specific boundary Sm is the set of minimal members of H consistent with ⟨D, d⟩, i.e.

Sm = min VSpm.

The two boundaries delimit the version space in the sense that for any consistent hypothesis t in the version space there are g ∈ Gm and s ∈ Sm such that s ≤ t ≤ g. The example in Table 2 illustrates the idea of a version space. We follow [9] in writing ? in column a if a(x) = V_a.
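For data as small as Table 2, Sm and Gm can be found by brute force straight from (4.2). The sketch below is our own illustration, with attribute domains restricted to the observed values (so hypotheses that differ only on unseen values coincide, and the six hypotheses of Table 2 collapse to three): the minimal consistent hypothesis is ⟨Sunny, Warm, ?, Strong, ?, ?⟩ and the maximal ones are ⟨Sunny, ?, ?, ?, ?, ?⟩ and ⟨?, Warm, ?, ?, ?, ?⟩.

```python
from itertools import product

rows = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   1),
        (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   1),
        (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
        (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1)]

domains = [sorted({r[i] for r, _ in rows}) for i in range(6)]
positives = {r for r, d in rows if d == 1}

def choices(dom):
    # each component is a singleton value, or "?" (the full domain)
    return {frozenset([v]) for v in dom} | {frozenset(dom)}

def leq(s, t):
    return all(sa <= ta for sa, ta in zip(s, t))

# hypotheses consistent with <D, d> in the sense of (4.2):
# the covered training rows are exactly the positives
vs = [t for t in product(*(choices(d) for d in domains))
      if {r for r, _ in rows if all(v in c for v, c in zip(r, t))} == positives]

Sm = [t for t in vs if not any(s != t and leq(s, t) for s in vs)]
Gm = [t for t in vs if not any(g != t and leq(t, g) for g in vs)]
assert len(Sm) == 1 and len(Gm) == 2    # one specific, two general boundary members
```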
5. Expressive power of version space and Boolean reasoning

A simple tuple t can be regarded as a conjunction of descriptors

⟨a_0, t(a_0)⟩ ∧ ⟨a_1, t(a_1)⟩ ∧ ⋯ ∧ ⟨a_T, t(a_T)⟩.
If t is a satisfiable hypothesis, then no t(a) is empty, and t(a) = V_a tells us that any value is allowed in this column; thus, such a descriptor places no restriction on satisfiability in column a. If I = {i ≤ T : t(a_i) ≠ V_{a_i}}, then

s ∈ sat(t) ⟺ (∀i ∈ I) s(a_i) = t(a_i),

so that we can interpret t as the conjunction

⋀_{i∈I} ⟨a_i, t(a_i)⟩.
Table 2
Training data, hypotheses and boundaries [9]

                  Sky     ATemp  Humid   Wind    Water  FCast   d
Training data D
                  Sunny   Warm   Normal  Strong  Warm   Same    1
                  Sunny   Warm   High    Strong  Warm   Same    1
                  Rainy   Cold   High    Strong  Warm   Change  0
                  Sunny   Warm   High    Strong  Cool   Change  1
Hypotheses
                  Sunny   Warm   ?       Strong  ?      ?       1
                  Sunny   ?      ?       Strong  ?      ?       1
                  Sunny   Warm   ?       ?       ?      ?       1
                  ?       Warm   ?       Strong  ?      ?       1
                  Sunny   ?      ?       ?       ?      ?       1
                  ?       Warm   ?       ?       ?      ?       1
Specific boundary
                  Sunny   Warm   ?       Strong  ?      ?       1
General boundary
                  Sunny   ?      ?       ?       ?      ?       1
                  ?       Warm   ?       ?       ?      ?       1
Such an expression is called an exact template in [13]. The expressive power of this type of conjunctive hypothesis is limited. An example from [9] illustrating this is shown in Table 3. For such a simple dataset, there is no consistent hypothesis in the sense of (4.2).

Table 3
Limitation of version space

    Sky     ATemp  Humid   Wind    Water  FCast   d
1   Sunny   Warm   Normal  Strong  Cool   Change  1
2   Cloudy  Warm   Normal  Strong  Cool   Change  1
3   Rainy   Warm   Normal  Strong  Cool   Change  0

A hypothesis in VSpm is a very special kind of hypertuple, and it is our aim to extend this notion of hypothesis so that the resulting structures are more expressive, while at the same time not so general as to carry no useful information. As suggested by Mitchell, a possible solution is to use arbitrary disjunctions of conjunctions of descriptors as the hypothesis representation. It is not hard to see that this is overly general, since any positive Boolean expression can then serve as a hypothesis. In contrast, we suggest using a specific class of disjunctions of conjunctions which is significantly narrower than the class of all positive Boolean expressions. The building blocks will be the hypertuples contained in the sub-semilattice [V] of L = ∏_{a∈Ω} 2^{V_a} generated by V with the + operator. Since we are only interested in the finite sums of elements of V, we will from now on assume that each V_a is finite.

Suppose that t = ⟨{m_{a0}, m_{a1}, ..., m_{at(a)}}⟩_{a∈Ω} is a hypertuple. We interpret t as a conjunction of disjunctions of descriptors

⋀_{a∈Ω} (⟨a, m_{a0}⟩ ∨ ⋯ ∨ ⟨a, m_{at(a)}⟩).   (5.1)
By the distributivity of ∧ and ∨ this can always be turned into a disjunction of simple tuples, but not every disjunction of simple tuples (each considered as a conjunction of descriptors) is equivalent to an expression such as (5.1); consider, for example,

(⟨a_0, t_00⟩ ∧ ⟨a_1, t_10⟩) ∨ (⟨a_0, t_01⟩ ∧ ⟨a_1, t_11⟩).
A hypertuple can be viewed as a construction similar to a hypercube, which delineates a solid in an appropriate space.

Now we compare our hypothesis space with Mitchell's from the perspective of expressive power. Consider a dataset D as defined earlier. Suppose all attributes x are discrete, and all V_x are finite. In our hypothesis space each attribute x takes on a subset of V_x, so there are 2^|V_x| different subsets altogether. As a result there are ∏_{x∈Ω} 2^|V_x| different hypertuples. Since a hypothesis is a set of hypertuples (i.e., a hyperrelation), there are 2^(Σ_{x∈Ω} 2^|V_x|) distinct hypotheses. In Mitchell's hypothesis space each attribute x takes on a single value in V_x plus two other special values, ''?'' and ''∅''. Therefore there are |V_x| + 2 different values, and ∏_{x∈Ω} (|V_x| + 2) different tuples. In his conjunctive hypothesis representation, each hypothesis is a (simple) tuple, so there are ∏_{x∈Ω} (|V_x| + 2) distinct hypotheses. In his disjunctive hypothesis representation, each hypothesis is a set of (simple) tuples (i.e., a simple relation), so there are 2^(Σ_{x∈Ω} (|V_x| + 2)) distinct hypotheses. Clearly our hypothesis space can represent more distinct objects than Mitchell's can; in this sense we say our hypothesis space is more expressive than Mitchell's.

Note that E characterises the eligible hypotheses. So Mitchell's version space can be specialised from our version space in the following way:
• E is restricted to E_m = {c(t) : t ∈ E, D ⪯ t}, where c is an operation that turns a hypertuple into a simple tuple in such a way that c(t) = ⟨t_0, t_1, ..., t_T⟩, where

    t_i = t(x_i)  if |t(x_i)| = 1,
    t_i = ?       otherwise.
Note that t(x_i) is the projection of tuple t onto its attribute x_i.
• Each hypothesis is an element of E_m.
• There are two classes, i.e., K = 2.
• The version space is built for only one class.

Given the above restrictions, the specific boundary is S_m = {S_m^0, S_m^1}, where S_m^0 is the specific boundary for the negative class and S_m^1 that for the positive class. Similarly G_m = {G_m^0, G_m^1}.
420 6. Experimental
TED
We have evaluated Rule 3 using public datasets. We used 17 public datasets in our evaluation, which are available from UC Irvine Machine Learning Repository. General information about these datasets is shown in the first three columns of Table 4. We used the C A S E E X T R A C T algorithm to construct weak hypotheses for the datasets, and applied Rule 3 to classify new data. For presentation purpose we refer to our classification procedure by G L M . The experimental results are shown in the last 5 columns of Table 4. As a comparison the C5.0 results on the same datasets are also shown. It is clear from this table that G L M performs extremely well on primary data, but it accounts for only 76.4% of all data on average. Over all data G L M compares well with C5.0. Parity problems are well known to be difficult for many machine learning algorithms. We evaluated the GLM algorithm using three well known parity datasets [16]––Monk-1, Monk-2, Monk-3. 1 Experimental results show that GLM works well for these parity datasets.
¹ Target concepts associated with the Monk's problems: Monk-1: (a1 = a2) or (a5 = 1); Monk-2: exactly two of a1 = 1, a2 = 1, a3 = 1, a4 = 1, a5 = 1, a6 = 1; Monk-3: (a5 = 3 and a4 = 1) or (a5 ≠ 4 and a2 ≠ 3) (5% class noise added to the training set).

7. Discussion and conclusion

Mitchell's classical work on version space has been followed by many, most notably Hirsh and Sebag. Hirsh [5] discusses how to merge version spaces when a central idea in Mitchell's work is removed, namely that a version space is the set of concepts strictly consistent with the training data. This merging process can therefore accommodate noise. Sebag [14] presents what she calls a disjunctive version space approach to learning disjunctive concepts from noisy data.
Table 4
General information about the datasets and the classification accuracy of C5.0 on all data and of GLM on primary and secondary data

Dataset       #Attr.   #Train   #Test   C5.0    GLM/SR   GLM/PSR   GLM/SSR   %PP
Annealing     38       798      CV-5    96.6    96.4     98.0      62.9      92.5
Australian    14       690      CV-5    90.6    95.1     95.1      100.0     87.2
Auto          25       205      CV-5    70.7    82.4     87.0      63.0      56.1
Diabetes      8        768      CV-5    72.7    70.7     71.0      40.0      66.8
German        20       1000     CV-5    71.7    72.6     72.6      N/A       65.4
Glass         9        214      CV-5    80.4    86.6     87.6      76.5      79.4
Heart         13       270      CV-5    77.0    81.9     81.9      N/A       61.5
Hepatitis     19       155      CV-5    80.6    82.9     84.1      50.0      69.0
Horse-Colic   22       368      CV-5    85.1    82.7     82.7      N/A       67.7
Iris          4        150      CV-5    94.7    93.1     97.6      66.7      82.7
Monk-1        6        124      432     74.3    100.0    100.0     N/A       100.0
Monk-2        6        169      432     65.1    81.4     87.3      41.5      83.6
Monk-3        6        122      432     97.2    93.8     94.3      89.2      89.1
Sonar         60       208      CV-5    71.6    81.6     81.3      100.0     59.1
TTT           9        958      CV-5    86.2    96.2     96.1      100.0     94.9
Vote          18       232      CV-5    96.5    96.5     98.6      77.3      89.7
Wine          13       178      CV-5    94.3    99.0     99.0      N/A       53.9
Average                                 82.7    87.8     89.1      72.3      76.4

The C5.0, GLM/SR, GLM/PSR and GLM/SSR columns report classification accuracy (%). The validation method used is either 5-fold cross-validation (CV-5) or explicit train/test validation. The acronyms are: SR – overall success ratio on primary and secondary data (i.e., the percentage of successfully classified primary and secondary data tuples over all primary and secondary data tuples); PSR – success ratio on primary data; SSR – success ratio on secondary data; PP – the percentage of primary data; N/A – not available.
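As a quick sanity check, the "Average" row of Table 4 is the plain arithmetic mean over the 17 datasets; for example, for the C5.0 column:

```python
# C5.0 accuracies for the 17 datasets, read down the C5.0 column of Table 4.
c50 = [96.6, 90.6, 70.7, 72.7, 71.7, 80.4, 77.0, 80.6, 85.1,
       94.7, 74.3, 65.1, 97.2, 71.6, 86.2, 96.5, 94.3]

print(round(sum(c50) / len(c50), 1))   # 82.7
```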
A separate version space is learned for each positive training example; new instances are then classified by combining the votes of these different version spaces.

In this paper we have investigated version spaces in a more expressive hypothesis space, disjunction of conjunctions of disjunctions, where each hypothesis is a set of hypertuples. Without a proper inductive bias this version space is uninteresting. We showed that, with the E-set as an inductive bias, this version space is a generalisation of Mitchell's original version space, which employs a different type of inductive bias. For classification within the version space we proposed three classification rules for use in different situations. The first two rules are based on E-sets as hypotheses and can be applied when the data space ($V$) is finite and small. We showed that constructing E-sets is computationally expensive, so we introduced a third rule based on weak hypotheses, and we presented an algorithm to construct weak hypotheses efficiently. Experimental results show that this classification approach performs extremely well on primary data, which accounts for over 75.0% of all data. Over all data this classification approach is comparable to C5.0.

Acknowledgements

The work of Hui Wang was partly supported by the European Commission project ICONS, project no. IST-2001-32429.
$X \sqsubseteq Y \iff (\forall x \in X)(\exists y \in Y)\ x \le y$
$X \trianglelefteq Y \iff (\forall y \in Y)(\exists x \in X)\ x \le y$
$U = \{x_0, \ldots, x_N\}$
$X = \{a_0, \ldots, a_T\}$
$V = \prod_{a \in X} V_a$
$L = \prod_{a \in X} 2^{V_a}$
$I(x) = \langle a_0(x), a_1(x), \ldots, a_T(x) \rangle$
$D = \{I(x) : x \in U\} = D_0 \cup D_1 \cup \ldots \cup D_K$
$E_q$ = set of all elements equilabelled with respect to $D_q$
$E = \bigcup_{q \le K} E_q$
Appendix A. Notation
$E(P) = \{t \in L : t \text{ is maximal in } [P] \cap E\}$
$E_q(P) = E(P) \cap E_q$
$S = E(D)$
$S_q = E_q(D)$
$G = E(V)$
$G_q = E_q(V)$
$VS = \{E(P) : D \sqsubseteq P \sqsubseteq V\}$
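Assuming hypertuples are represented as tuples of value sets ordered componentwise by set inclusion (a representational assumption of ours), the ordering and the maximal-element selection used in $E(P)$ can be sketched as:

```python
def leq(s, t):
    """s <= t for hypertuples as tuples of value sets:
    componentwise subset inclusion (our representational assumption)."""
    return all(a <= b for a, b in zip(s, t))

def below(X, Y):
    """X below Y: every element of X lies below some element of Y."""
    return all(any(leq(x, y) for y in Y) for x in X)

def maximal(S):
    """Elements of S not strictly below any other element of S."""
    return [t for t in S if not any(leq(t, u) and t != u for u in S)]

S = [({1}, {2}), ({1, 3}, {2}), ({4}, {5})]
print(maximal(S))   # [({1, 3}, {2}), ({4}, {5})]
```

Here ({1}, {2}) is dropped because it lies strictly below ({1, 3}, {2}); the two remaining hypertuples are incomparable and hence both maximal.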
References
[1] I. Düntsch, G. Gediga, Statistical evaluation of rough set dependency analysis, International Journal of Human-Computer Studies 46 (1997) 589–604.
[2] I. Düntsch, G. Gediga, Simple data filtering in rough set systems, International Journal of Approximate Reasoning 18 (1–2) (1998) 93–106.
[3] G. Grätzer, General Lattice Theory, Birkhäuser, Basel, 1978.
[4] D. Haussler, Quantifying inductive bias: AI learning algorithms and Valiant's learning framework, Artificial Intelligence 36 (1988) 177–221.
[5] H. Hirsh, Generalizing version spaces, Machine Learning 17 (1) (1994) 5–46.
[6] T.M. Mitchell, Version spaces: a candidate elimination approach to rule learning, in: Proc. 5th IJCAI, 1977, pp. 305–310.
[7] T.M. Mitchell, Version spaces: an approach to concept learning, PhD thesis, Department of Electrical Engineering, Stanford University, Stanford, CA, 1979.
[8] T.M. Mitchell, Generalization as search, Artificial Intelligence 18 (1982).
[9] T.M. Mitchell, Machine Learning, The McGraw-Hill Companies, Inc., 1997.
[10] H.S. Nguyen, From optimal hyperplanes to optimal decision trees, Fundamenta Informaticae 34 (1998) 145–174.
[11] H.S. Nguyen, S.H. Nguyen, Pattern extraction from data, Fundamenta Informaticae 34 (1998) 129–144.
[12] H.S. Nguyen, A. Skowron, Boolean reasoning for feature extraction problems, in: Z.W. Ras, A. Skowron (Eds.), Proc. of the 10th International Symposium on Methodologies for Intelligent Systems (ISMIS'97), Lecture Notes in Artificial Intelligence, vol. 1325, Springer-Verlag, Berlin, 1997, pp. 117–126.
[13] S.H. Nguyen, A. Skowron, P. Synak, Discovery of data patterns with applications to decomposition and classification problems, in: L. Polkowski, A. Skowron (Eds.), Rough Sets in Knowledge Discovery, vol. 2, Physica-Verlag, Heidelberg, 1998, pp. 55–97.
[14] M. Sebag, Delaying the choice of bias: a disjunctive version space approach, in: Proc. of the 13th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 444–452.
[15] A. Skowron, Extracting laws from decision tables: a rough set approach, Computational Intelligence 11 (1995) 371–388.
[16] S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, J. Cheng, K. De Jong, S. Dzeroski, S.E. Fahlman, D. Fisher, R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R.S. Michalski, T. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, J. Zhang, The Monk's problems: a performance comparison of different learning algorithms, Technical Report CS-CMU-91-197, Carnegie Mellon University, 1991.
[17] H. Wang, W. Dubitzky, I. Düntsch, D. Bell, A lattice machine approach to automated casebase design: marrying lazy and eager learning, in: Proc. IJCAI99, Stockholm, Sweden, 1999, pp. 254–259.
[18] H. Wang, I. Düntsch, D. Bell, Data reduction based on hyper relations, in: R. Agrawal, P. Stolorz, G. Piatetsky-Shapiro (Eds.), Proceedings of KDD'98, New York, 1998, pp. 349–353.
[19] H. Wang, I. Düntsch, G. Gediga, Classificatory filtering in decision systems, International Journal of Approximate Reasoning 23 (2000) 111–136.