Applied Soft Computing 11 (2011) 3887–3897


Dynamic discreduction using Rough Sets

P. Dey a,∗, S. Dey a, S. Datta b, J. Sil c

a School of Materials Science and Engineering, Bengal Engineering and Science University, Shibpur, Howrah 711 103, India
b Birla Institute of Technology, Deoghar, Jasidih, Deoghar 814 142, India
c Department of Computer Science and Technology, Bengal Engineering and Science University, Shibpur, Howrah 711 103, India

Article history: Received 28 January 2010; Received in revised form 1 September 2010; Accepted 3 January 2011; Available online 12 January 2011.

Keywords: Rough Set; Discretization; Classification; Data mining; TRIP steel

Abstract

Discretization of continuous attributes is a necessary pre-requisite in deriving association rules and discovering knowledge from databases. The derived rules are simpler and intuitively more meaningful if only a small number of attributes are used, and each attribute is discretized into a few intervals. The present research paper explores the interrelation between discretization and reduction of attributes. A method has been developed that uses Rough Set Theory and notions of Statistics to merge the two tasks into a single seamless process named dynamic discreduction. The method is tested on benchmark data sets and the results are compared with those obtained by existing state-of-the-art techniques. A real life data set on TRIP steel is also analysed using the proposed method.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

With the explosion in digital data over the past few decades, it has become imperative to search for methods that attach semantics to this information. Deriving easily understandable rules and discovering meaningful knowledge in data is a non-trivial task, and a bottleneck in industry as well as in research. There exist different methods for the analysis of data tables or information systems, using techniques ranging from Statistics to formal logic. Most of them are designed to work on data where attributes can take only a few possible values [1]. But the majority of practical data mining problems are characterised by continuous attribute values. It has been common practice to resolve the problem in two separate modules. The first step involves transforming the continuous data into a few discrete intervals, which is termed discretization. The discretized data is then analysed using the available techniques, loosely called machine learning algorithms. There has been a considerable volume of research on discretization. A couple of review papers [2,3] summarize and categorize the major studies, whereas [4] lists 33 different discretization methods. But no single method has been reported to perform better than the others in all respects, nor is any one of them known to apply indiscriminately to any data whatsoever [5].

∗ Corresponding author. E-mail address: [email protected] (P. Dey).
doi:10.1016/j.asoc.2011.01.015

Rough Set (RS) is known to be a powerful tool for deriving rules from data using a minimum number of attributes. The method involves finding a reduct containing a minimal subset of attributes which is just sufficient to classify all objects in the data. However, RS methods are typically devised to deal with discrete attribute values, which calls for methods that discretize the data either beforehand or after the reduct has been found. RS has been successfully used to discretize continuous data through different innovative methods [4–6], often in combination with ant colony [7] or particle swarm optimization [8]. Different statistical techniques and measures have also been used to devise or improve a discretization procedure [9–11]. But the problem with such a modular methodology is that the attribute reduction and discretization processes are usually assumed to be independent of one another. This entails either high computation cost or low prediction accuracy. If the data is first discretized, the cost is escalated by discretizing the redundant attributes, particularly if the data is large. On the other hand, giving prime emphasis to the elimination of attributes may lead to the loss of important information and over-fitting of the data. The rules thus derived will have fewer objects in their support and suffer from low prediction accuracy. The presence of noise in the data often increases the likelihood of this. Some very recent works have tried to fuse the two tasks of discretization and feature selection [12,13]. Researchers have also proceeded along more generalized paths like symbolic value partitioning [14] or a tolerance Rough Set method based on a similarity measure [15] that performs the two tasks simultaneously. There have been studies [16] where attribute values are grouped differently for each rule. Evidently there is a need for further research in this area.

In this paper the possible relations between discretization and reduction of attributes are explored. A method is proposed that uses RS and notions from Statistics to discretize, and at the same time distinguish, the important attributes in a data set based on random samples. Using samples instead of the whole data to discretize the variables reduces the computation time, especially for large data sets. The resulting rules become simpler, and the classification accuracy increases, particularly in the case of noisy data sets. The method is tested on some benchmark data, and a comparison is made with the results obtained using state-of-the-art techniques. A real life data set from the Materials Engineering field is also analysed using the proposed method, and the rules derived are assessed in view of knowledge discovery.

The remaining part of the paper is organized into the following five sections. Section 2 briefs the basic notions of Rough Set Theory and exemplifies the interrelation between discretization and reduction of attributes. The concept of discreduction is also introduced. In Section 3 the dependence of a discreduct on the choice of attributes and sample size is explored. Section 4 describes the method of finding a dynamic discreduct, whereas results of applying the method to different data sets are presented in Section 5. Finally, Section 6 concludes the article.

2. Rough Set Theory

Rough Set Theory (RST), introduced by Pawlak [17], is an extension of the classical Set Theory proposed by Georg Cantor [18]. It brings forth the concept of a boundary region which accommodates elements lying midway between belonging and not belonging to a set. The theory has shown immense usefulness in distilling out requisite information through a minimal set of attributes that classify all possible objects in a data set. An information system I essentially consists of a set (universe) of objects U. Each object has definite values for a set of conditional attributes A, depending on which the value of a decision attribute d is ascertained. Formally, it is the ordered pair

I = (U, A ∪ {d})

In the following subsections the basic notions of RST are illustrated with the help of an example (Table 1) consisting of 10 samples from Fisher's Iris data set [19].

Table 1
An illustrative example: 10 samples from the Iris data set.

Objects U   a1 Sepal length   a2 Sepal width   a3 Petal length   a4 Petal width   d Iris        Class
u1          4.8               3.0              1.4               0.1              Setosa        C1
u2          5.7               4.4              1.5               0.4              Setosa        C1
u3          5.1               3.5              1.4               0.3              Setosa        C1
u4          4.9               3.6              1.4               0.1              Setosa        C1
u5          4.9               2.4              3.3               1.0              Versicolor    C2
u6          5.9               3.2              4.8               1.8              Versicolor    C2
u7          5.5               2.5              4.0               1.3              Versicolor    C2
u8          6.7               2.5              5.8               1.8              Virginica     C3
u9          6.2               2.8              4.8               1.8              Virginica     C3
u10         6.2               3.4              5.4               2.3              Virginica     C3

2.1. Classification

We can choose a non-empty subset of attributes S ⊆ A and proceed to examine the capability of S in classifying the objects in U. For example, if only the Sepal Length attribute is considered, i.e. S = {a1}, the objects u4 and u5 are indiscernible, and so are u9 and


u10. Whereas the latter pair of objects, by virtue of being in the same class, pose no problem in classification, the former pair belong to different classes of Iris but are indistinguishable from one another by Sepal Length alone. {a1} can thus definitely classify 3 of the 4 Iris Setosa flowers in the sample (u1, u2, u3), as also 2 of the 3 Versicolor, and 3 of the 3 Virginica flowers. This puts the classification accuracy of S at 8/10. The concepts may be formalized as follows. Let Ci represent the set of objects in the ith class for i = 1, 2, ..., l (l being the number of decision classes in I). Now U/d = {C1, C2, ..., Cl} is the partition of U induced by the decision attribute d. The S-lower approximation of Ci is defined as the set of objects definitely classified as Ci using the information in S. It is denoted by

S(Ci) = {u | [u]_S ⊆ Ci},   i = 1, 2, ..., l

where [u]_S is the set of objects indiscernible from u, given the information in S (i.e. the objects for which the values of all the attributes in S are identical to those of u). The corresponding S-upper approximation is the set of objects possibly classified as Ci using S:

S̄(Ci) = {u | [u]_S ∩ Ci ≠ ∅},   i = 1, 2, ..., l

and the difference of the two sets gives the S-boundary region of Ci:

BN_S(Ci) = S̄(Ci) − S(Ci),   i = 1, 2, ..., l

The positive region of U/d with respect to S is the union of the lower approximations of all classes in U/d. It represents the set of objects that can be definitely classified into one of the decision classes C1, C2, ..., Cl using the information in S. It is denoted by

POS_S(d) = ∪_{Ci ∈ U/d} S(Ci)

whereas the classification accuracy of S with respect to d is the fraction of definitely classifiable objects, expressed as

γ(S, d) = |POS_S(d)| / |U|    (1)
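As a concrete illustration, the indiscernibility classes and γ(S, d) of Eq. (1) can be computed directly on the sample of Table 1. The following is a minimal sketch (not the paper's code), with attribute indices 0–3 standing for a1–a4:

```python
from collections import defaultdict

# The 10 objects of Table 1: ((a1, a2, a3, a4), decision class).
ROWS = [((4.8, 3.0, 1.4, 0.1), "Setosa"),     ((5.7, 4.4, 1.5, 0.4), "Setosa"),
        ((5.1, 3.5, 1.4, 0.3), "Setosa"),     ((4.9, 3.6, 1.4, 0.1), "Setosa"),
        ((4.9, 2.4, 3.3, 1.0), "Versicolor"), ((5.9, 3.2, 4.8, 1.8), "Versicolor"),
        ((5.5, 2.5, 4.0, 1.3), "Versicolor"), ((6.7, 2.5, 5.8, 1.8), "Virginica"),
        ((6.2, 2.8, 4.8, 1.8), "Virginica"),  ((6.2, 3.4, 5.4, 2.3), "Virginica")]

def gamma(S):
    """gamma(S, d) of Eq. (1): |POS_S(d)| / |U|."""
    # Partition U into indiscernibility classes [u]_S.
    blocks = defaultdict(list)
    for vals, d in ROWS:
        blocks[tuple(vals[k] for k in sorted(S))].append(d)
    # u lies in POS_S(d) iff every object indiscernible from u shares its class.
    pos = sum(len(b) for b in blocks.values() if len(set(b)) == 1)
    return pos / len(ROWS)

print(gamma({0}))      # Sepal Length alone: u4/u5 collide on 4.9 -> 0.8
print(gamma({0, 1}))   # adding Sepal Width separates them -> 1.0
```

The output reproduces the 8/10 accuracy of {a1} discussed above.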

Clearly, the positive region of I w.r.t. Sepal Length is {u1, u2, u3, u6, u7, u8, u9, u10}, denoting the set of objects about which a definite conclusion may be drawn (regarding the class of Iris) using only their Sepal Lengths. From Eq. (1), the classification accuracy of {a1} comes to 0.8. An information system is said to be consistent if the entire set of attributes A can classify all the objects in U, that is, if

γ(A, d) = 1

The information system I in Table 1 is a consistent system. The significance of an attribute a ∈ S is defined by the relative decrease in classification accuracy on removing a from S. It is

Fig. 1. (a–c) Discretization of the Iris sample (Table 1) with different sets of attributes: (a) Sepal Length; (b) Sepal Length and Sepal Width; (c) Sepal Length and Petal Length.

denoted by

σ_(S,d)(a) = [γ(S, d) − γ(S′, d)] / γ(S, d)    (2)

where S′ = S − {a}. As an example, S = {a1, a2} can classify all the 10 objects, while S′ = {a1} classifies only 8. Thus σ_(S,d)(a2) = 0.2.

2.2. Redundancy

If two attributes are sufficient to classify all the objects in a data set, the other attributes are considered redundant, and are eventually eliminated to form a reduct. A reduct R is a minimal subset of attributes which has a classification accuracy of 1, minimal in the sense that every proper subset of R has a classification accuracy less than 1. Formally, R ⊆ A is a reduct if and only if both Eqs. (3) and (4) are satisfied:

γ(R, d) = 1    (3)

and

γ(S, d) < 1,   ∀ S ⊊ R    (4)
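Conditions (3) and (4) can be verified exhaustively on a table this small. The sketch below (not from the paper) enumerates attribute subsets of Table 1 by increasing size, with indices 0–3 standing for a1–a4:

```python
from itertools import combinations

# Table 1 as ((a1..a4 values), decision class) tuples.
ROWS = [((4.8, 3.0, 1.4, 0.1), "Se"), ((5.7, 4.4, 1.5, 0.4), "Se"),
        ((5.1, 3.5, 1.4, 0.3), "Se"), ((4.9, 3.6, 1.4, 0.1), "Se"),
        ((4.9, 2.4, 3.3, 1.0), "Ve"), ((5.9, 3.2, 4.8, 1.8), "Ve"),
        ((5.5, 2.5, 4.0, 1.3), "Ve"), ((6.7, 2.5, 5.8, 1.8), "Vi"),
        ((6.2, 2.8, 4.8, 1.8), "Vi"), ((6.2, 3.4, 5.4, 2.3), "Vi")]

def gamma(S):
    """gamma(S, d): fraction of objects whose block [u]_S is decision-pure."""
    proj = lambda vals: tuple(vals[k] for k in sorted(S))
    pure = sum(len({d for v, d in ROWS if proj(v) == proj(vals)}) == 1
               for vals, _ in ROWS)
    return pure / len(ROWS)

def reducts(n_attrs=4):
    """All R satisfying Eq. (3) none of whose proper subsets do (Eq. (4))."""
    found = []
    for size in range(1, n_attrs + 1):
        for R in map(set, combinations(range(n_attrs), size)):
            if gamma(R) == 1.0 and not any(r < R for r in found):
                found.append(R)
    return found

print(reducts())   # the minimal reduct R1 = {a1, a2} appears as {0, 1}
```

Iterating by subset size guarantees that a superset of an already-found reduct is never reported, which is exactly the minimality requirement of Eq. (4).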

There may be more than one reduct in any information system, and a reduct with the lowest cardinality is called a minimal reduct. Thus, R1 = {a1, a2} is a minimal reduct for Table 1. A set of attributes satisfying Eqs. (3) and (4) is called an 'exact' reduct. A known problem with exact reducts is that they are highly susceptible to noise, and often perform badly in classifying external data [20]. Dynamic reducts, on the other hand, consist of those attributes that appear 'most frequently' in the reducts of a series of samples taken from the data. Dynamic reducts and the general role of samples are dealt with in greater detail in Section 2.6.

2.3. Discretization and association rules

Data, in most practical cases, is presented as real-valued (continuous) attributes, which per se generate rules with very few objects in their support. These rules are inefficient in classifying external data. For example, the rule

if Sepal Length is 6.2, then the Iris is Virginica

has a set of just two objects, {u9, u10}, in its support. However, the rule could be generalized to the form

if Sepal Length ≥ 6, then Virginica    (R1)

which has a three-object support {u8, u9, u10} in its favour. The value 6, in that case, is called a cut on the attribute Sepal Length. It is denoted by the ordered pair (a1, 6).

At least three more cuts (shown by the dotted lines in Fig. 1(a)) are required to classify the remaining objects. This results in four more rules

if 5.8 ≤ Sepal Length < 6.0, then Versicolor    (R2)
if 5.6 ≤ Sepal Length < 5.8, then Setosa    (R3)
if 5.3 ≤ Sepal Length < 5.6, then Versicolor    (R4)
if Sepal Length < 5.3, then Setosa    (R5)
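The support and accuracy of these rules can be checked numerically. A minimal sketch (not the authors' code) evaluating (R5) against the Sepal Length column of Table 1, using accuracy and coverage in the sense of Eqs. (5) and (6) formalized below:

```python
# Sepal Length (a1) and decision class for u1..u10 (Table 1).
SL  = [4.8, 5.7, 5.1, 4.9, 4.9, 5.9, 5.5, 6.7, 6.2, 6.2]
CLS = ["Se", "Se", "Se", "Se", "Ve", "Ve", "Ve", "Vi", "Vi", "Vi"]

def evaluate(antecedent, consequent):
    """Support, accuracy and coverage of a rule 'if phi then psi'."""
    phi = {i for i in range(len(SL)) if antecedent(i)}   # ||phi||_I
    psi = {i for i in range(len(SL)) if consequent(i)}   # ||psi||_I
    support = phi & psi
    return support, len(support) / len(phi), len(support) / len(psi)

# Rule (R5): "if Sepal Length < 5.3, then Setosa"
support, acc, cov = evaluate(lambda i: SL[i] < 5.3, lambda i: CLS[i] == "Se")
print(sorted(support), acc, cov)   # [0, 2, 3], i.e. {u1, u3, u4}; accuracy 0.75
```

The misclassified object u5 (Sepal Length 4.9) matches the antecedent but not the consequent, which is what pulls the accuracy down to 3/4.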

The first three of these, (R2), (R3) and (R4), are supported by the singletons {u6}, {u2} and {u7}, respectively. (R5), however, has a triplet {u1, u3, u4} in its support, as well as one misclassified object (u5) that diminishes its accuracy to 3/4. To formalize the concepts: discretization is the division of the range of an attribute's values into a few intervals, and the values at the interval boundaries are called cuts. A set of cuts spanning one or more attributes is said to be consistent if all the objects in the information system can be properly classified using the intervals formed by the cuts. For example, the set of cuts {(a1, 5.3), (a1, 5.6), (a1, 5.8), (a1, 6.0)} is inconsistent, since u5 ∈ C2 is indiscernible from u1, u3, u4 ∈ C1. For any rule "if φ then ψ," the accuracy and coverage [21] may be expressed as

Accuracy = |‖φ‖_I ∩ ‖ψ‖_I| / |‖φ‖_I|    (5)

and

Coverage = |‖φ‖_I ∩ ‖ψ‖_I| / |‖ψ‖_I|    (6)

where ‖φ‖_I and ‖ψ‖_I are the sets of objects in I matching the antecedent (φ) and consequent (ψ) of a rule, and |·| is the cardinality of a set. The set of objects ‖φ‖_I ∩ ‖ψ‖_I, which match both the antecedent and the consequent, is the support of a rule.

2.4. Linguistics and semantics

An improvement in the rule set (R) (i.e. (R1)–(R5)) is evident if a second attribute is taken into account. Including Sepal Width along with Sepal Length reduces the number of necessary cuts to three (denoted by the 3 dotted lines in Fig. 1(b)), and all the 10 objects are correctly classified by them. Thus {(a1, 5.8), (a1, 6), (a2, 2.75)} is a consistent set of cuts. The intervals formed by the cuts are often assigned linguistic labels like short, moderate and long, or wide and narrow, to represent the rules in a more familiar form. As such, the process of discretization not only increases the efficacy of classification, the


resulting rules are also simplified, reduced in number, and rendered meaningful from the intuitive aspect. The four rules thus formed are

short, wide Sepals are Iris Setosa    (S1)
short, narrow Sepals are Iris Versicolor    (S2)
moderately long Sepals are Iris Versicolor    (S3)
long Sepals are Iris Virginica    (S4)

where the intervals for Sepal Width are denoted as follows:

A Sepal is narrow if Sepal Width < 2.75 cm, and wide otherwise

and, as regards the Sepal Length, the intervals are:

A Sepal is short if Sepal Length < 5.8 cm, long if Sepal Length ≥ 6 cm, and moderately long otherwise

The rule set (S) (i.e. (S1)–(S4)) is evidently better than (R), both in terms of classification accuracy and rule semantics. Only the moderately long interval is sarcastically narrow. To take a last example, we consider another pair of attributes, Sepal Length and Petal Length. Together they excellently classify all the 10 objects with just two cuts, one on each attribute: {(a1, 6), (a3, 2)}. The set of rules then becomes extremely simple

short Sepals and short Petals are Iris Setosa    (P1)
short Sepals and long Petals are Iris Versicolor    (P2)
long Sepals and long Petals are Iris Virginica    (P3)

where the thresholds distinguishing short and long are 6 cm for the sepals and 2 cm for the petals, corresponding to the two dotted lines in Fig. 1(c). Evidently the set of rules (P1)–(P3) (or (P)) appears much simpler and is intuitively easier to understand than the rule set (S). On pondering over (P), one may even raise a 'childish' question: why are there no Iris flowers with long sepals and short petals? To a botanist it might lead to some specialized knowledge discovery, but we may rather content ourselves with the simple answer that nature has a certain sense of proportion, which she abhors to disrupt.

2.5. Unification

To sum up, the clarity and classification accuracy of a set of rules are governed by two major factors: (i) choice of attributes (resulting in the reduct), and (ii) selection of cuts (forming the discretization). In order to obtain an optimum balance between them, the interrelation between the two needs to be taken into account. A reduct is the result of eliminating redundant attributes, while redundant cuts are eliminated in discretization. The heuristic for finding reducts (or discretizations) essentially consists of selecting a minimal set of attributes (or cuts) [20]. Whereas a few finely discretized attributes may satisfy Eqs. (3) and (4), the discretization could be coarse if more attributes are considered. In other words, the problems of discretization and finding a reduct are

• self-similar across scales, and
• complementary and inter-dependent.

Considered in unison, the interdependence of the two factors (i) and (ii) very much determines the clarity of knowledge extracted from the data (expressed by the comprehensibility of the rules) and to some extent influences the classification accuracy. In order to arrive at an optimally discretized set of attributes, discretization and finding reducts need to be merged into a single seamless process. With this unification in view, the following terms may be introduced.

Definition 0. Discreduction is essentially a process that successfully blends the two related processes of discretization and finding a reduct for an information system.

Definition 1. A discreduct is a set of ordered pairs that at the same time discretizes and determines a reduct of an information system. It may be expressed as

D = {(ak, t_i^k) | ak ∈ A and t_i^k ∈ V_ak},   k ∈ {1, 2, ..., na},  i = 1, 2, ..., mk

where na is the no. of attributes in A, mk is the no. of cuts on ak, and V_ak = [l_ak, r_ak) ⊂ R is the range of values of ak (R being the set of real numbers). Without loss of generality, it may be assumed that the mk cuts on any attribute ak are arranged in ascending order of their values, i.e.

i < j ⇔ t_i^k < t_j^k,   i, j ∈ {1, 2, ..., mk}

Definition 2. The reduct of a discreduct is the set of attributes which have at least one cut in the discreduct. In other words, it is the domain of the discreduct

R = dom D = {a | (a, t) ∈ D for some t}

Definition 3. The discreduced information system ID may be defined as the ordered pair consisting of the same set of objects with a reduced set of attributes R ⊆ A and discrete attribute values

ID = (U, R ∪ {d})

where the attributes in the reduct are transformed as

a_k^D(u) = i ⇔ ak(u) ∈ [t_i^k, t_{i+1}^k),   i = 0, 1, 2, ..., mk

with t_0^k = l_ak and t_{mk+1}^k = r_ak. The transformed attribute values may be suitably mapped to a set of linguistic labels as discussed in Section 2.4. The set of discreduced attributes in the reduct may also be denoted by RD.

Definition 4. The classification accuracy of a discreduct is the ratio of the number of objects classified by the discreduct to the number of objects classified by all the attributes prior to discreduction:

γ(D) = |POS_RD(d)| / |POS_A(d)|
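Definitions 3 and 4 can be made concrete with the two-cut discreduct of Fig. 1(c). The sketch below (not from the paper) transforms the toy table of Table 1 and computes γ(D); since the system is consistent, |POS_A(d)| = |U| = 10:

```python
from bisect import bisect_right
from collections import defaultdict

ROWS = [((4.8, 3.0, 1.4, 0.1), "Se"), ((5.7, 4.4, 1.5, 0.4), "Se"),
        ((5.1, 3.5, 1.4, 0.3), "Se"), ((4.9, 3.6, 1.4, 0.1), "Se"),
        ((4.9, 2.4, 3.3, 1.0), "Ve"), ((5.9, 3.2, 4.8, 1.8), "Ve"),
        ((5.5, 2.5, 4.0, 1.3), "Ve"), ((6.7, 2.5, 5.8, 1.8), "Vi"),
        ((6.2, 2.8, 4.8, 1.8), "Vi"), ((6.2, 3.4, 5.4, 2.3), "Vi")]

# Discreduct D = {(a1, 6), (a3, 2)}; its reduct is dom D = {a1, a3}.
D = {0: [6.0], 2: [2.0]}

def transform(vals):
    """a_k^D(u) = i iff a_k(u) lies in [t_i^k, t_{i+1}^k) (Definition 3)."""
    return tuple(bisect_right(cuts, vals[k]) for k, cuts in sorted(D.items()))

def gamma_D():
    """Definition 4: |POS_RD(d)| / |POS_A(d)|."""
    blocks = defaultdict(set)
    for vals, d in ROWS:
        blocks[transform(vals)].add(d)
    pos = sum(len(blocks[transform(vals)]) == 1 for vals, _ in ROWS)
    return pos / len(ROWS)

print(gamma_D())   # 1.0: the two cuts classify all 10 objects
```

`bisect_right` maps a value to its interval index, so a Sepal Length of exactly 6.0 falls in the upper ("long") interval, matching the ≥ 6 convention of rule (R1).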

2.6. Role of samples

A consistent set of cuts suffers from two major drawbacks: one quantitative and the other qualitative. First, the amount of computation drastically increases with the size of the data. Secondly, a consistent set of cuts that exactly classifies every object in the given data (particularly with a minimal set of attributes) suffers from a kind of over-fitting, which decreases its accuracy in predicting unknown external data. The first drawback makes the process time-consuming and costly, especially for large data sets with many distinct values of each attribute, while the second renders the final classification or decision rules clumsy, incomprehensible, inaccurate or even incorrect, particularly when the data is imprecise or noisy.


Table 2
Data set properties.

Name of the data set   # of attrib.'s |A|   # of objects |U|   # of classes |U/d|   # of minimal reducts n   |·| of minimal reducts |R|   Avg. # of cuts nc
Iris                   4                    150                3                    4                        3                            10.0
Glass                  9                    214                6                    10                       2                            45.5
Breast                 9                    683                2                    8                        4                            16.7

To resolve the problem, a series of data samples are taken and reducts are found for these samples. The attributes that are present in 'most' of the sample reducts are collected to form a dynamic reduct, which has been shown [20] to improve classification accuracy. But there are some open problems, such as the quantification of 'most,' or the determination of the size and number of samples, so as to ensure that the major trends in the data are sufficiently reflected in the rules while, at the same time, noise is satisfactorily filtered out. These quantities should be carefully calibrated in order to get optimum results. Unfortunately, there exists no unanimity in the standard literature for determining such thresholds. In this paper we devise a method that uses samples taken from the data to find a dynamic discreduct, which determines (i) the requisite attributes, (ii) the number of cuts on each attribute, and (iii) the value (or position) of the cuts.

3. Behaviour of dynamic discreducts

Two major arguments are made in this section:

• Cuts from more significant attributes are more effective in a discreduct; this means that if we go on choosing the most effective cut in each step (as the MD-heuristic does), more cuts will finally be selected from attributes with a higher significance.
• The number of cuts needed to consistently classify a sample of objects is proportional to the square root of the sample size.

The first claim has no direct bearing on the methodology for finding discreducts. It is a study which warrants that cuts in a discreduct are not chosen randomly from any attribute; rather, each cut chosen from an attribute (using the heuristic method) stresses and signifies the presence of that attribute in the reduct. In other words, it asserts that a discreduct is also a reduct (see Definition 1). The second claim is a lemma that is directly applied in the dynamic discreduct algorithm. Three commonly used data sets, namely Iris, Glass, and Breast [i.e. Wisconsin Breast Cancer (Original)], from the UCI Machine Learning Repository [19] have been used to study these behaviours. The properties of the data sets are given in the first four columns of Table 2. Records with missing values have been removed from the last data set.

3.1. Significance vs. no. of cuts

For the first claim, all the minimal reducts (R1, R2, ..., Rn) are computed for a data set, and each reduct is then discretized with the MD-heuristic algorithm [20] to get n discreducts (D1, D2, ..., Dn). The number and cardinality of the minimal reducts for the three data sets are given in columns 5 and 6 of Table 2, while the last column gives the average number of cuts, nc, in the discreducts. Since a tie in choosing the most effective cut (the one that discerns the maximum number of object pairs from different decision classes) was resolved by a toss, 10 runs are taken for each reduct. Considering all the discretized minimal reducts, the average number of cuts contributed by an attribute, ν(a), is plotted against the significance of that attribute, σ(a), in Fig. 2, where

ν(a) = (1/n) Σ_i (cuts contributed by a in Di)

σ(a) = (1/n) Σ_i σ_(Ri,d)(a)

A weighted average ν̄(a) of the number of cuts contributed by an attribute is also calculated. The logic of placing the weight is that if a set of objects can be classified by fewer cuts, each cut should be deemed more effective. Thus the weight carried by a cut varies inversely as the number of cuts in that discreduct, nc/|Di| giving the relative efficiency of every cut in Di:

ν̄(a) = (nc/n) Σ_i (cuts contributed by a in Di) / (total no. of cuts in Di)

where nc = (1/n) Σ_i |Di| is the average number of cuts in the n minimal reducts. The linear fits in Fig. 2 show that the significance of an attribute is roughly proportional to the number of cuts contributed by the attribute. The correlation coefficient (r) is about 90% for all the data sets, with the weighted average giving a slightly better value.

Fig. 2. (a–c) Significance vs. no. of cuts on different data sets: (a) Iris; (b) Glass; (c) Breast.
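The two averages ν(a) and ν̄(a) are straightforward to compute once the discreducts are in hand. A sketch (not the paper's code) over a hypothetical list of discreducts, each represented as a list of (attribute, cut) pairs:

```python
# Hypothetical discreducts for illustration only (not taken from the paper).
discreducts = [
    [(0, 5.3), (0, 6.0), (2, 2.0)],            # D1: 3 cuts
    [(0, 6.0), (2, 2.0)],                      # D2: 2 cuts
    [(0, 5.8), (0, 6.0), (1, 2.75), (2, 2.0)]  # D3: 4 cuts
]
n = len(discreducts)
nc = sum(len(D) for D in discreducts) / n      # average number of cuts

def cuts_by(a, D):
    """Number of cuts contributed by attribute a in discreduct D."""
    return sum(1 for k, _ in D if k == a)

def nu(a):
    """Plain average of the cuts contributed by attribute a."""
    return sum(cuts_by(a, D) for D in discreducts) / n

def nu_bar(a):
    """Weighted average: each cut in D_i carries weight nc / |D_i|."""
    return (nc / n) * sum(cuts_by(a, D) / len(D) for D in discreducts)

print(nu(0), nu_bar(0))
```

The weighting leaves the overall scale intact (the factor nc restores the average magnitude) while favouring cuts that come from leaner discreducts.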


Fig. 3. (a–c) No. of cuts vs. sample size of different data sets: (a) Iris; (b) Glass; (c) Breast.

3.2. No. of cuts vs. sample size

A sample of size nz is taken from the set of objects U using random numbers generated by the code. nz is varied as

nz = z · |U|,   z = 0.1, 0.2, ..., 0.9

where, for each of the nine relative sample sizes (z), 100 samples are taken. The consistent set of cuts Ti is determined for the ith sample Ui using the MD-heuristic algorithm [20], and the average number of cuts needed to consistently classify Ui,

nc = Σ_{i=1}^{100} |Ti| / 100,

is plotted against z. The results in Fig. 3 show that the z–nc plots for the three data sets closely fit the power relation

nc = c · z^p    (7)

with an r² value of 99.6%. Here, c is a constant for any data set denoting the number of cuts needed to consistently classify the entire U (i.e. when z = 1), and the parameter p has a value very near 0.5 for the Glass and Breast data sets. The relation of nc with z thus corresponds to the well known expression for the standard deviation of a sample, ∝ √n, for fairly large sample sizes n ≫ 1. In the case of Iris, the value of p is slightly higher (about 0.6). The deviation is caused by the two points in Fig. 3(a) corresponding to the smallest sample sizes (z = 0.1 and 0.2) that clearly fall out of the parabolic fit. The misfit of these two points is explicable because the number of cuts has fallen to an extremely low value (less than 3), in which condition relation (7) does not hold. We thus fit an alternative form of the power relation

nc = b + c · z^p    (8)

leaving out the two smallest sample sizes (z = 0.1 and 0.2). The closeness of fit (r²) then increases to more than 99.9%. The improvement is also graphically evident: the gray dotted curve representing Eq. (8) passes through all seven points, whereas the (blue) firm line as per Eq. (7) seems a bit stiff in fitting the points. The value of p also touches the expected value 0.5. The small intercept on the z-axis (about 4.7) may be interpreted as the size of a sample where just one cut (rounding zero off to the next highest integer) will be sufficient to discern all objects in the sample. To sum up: firstly, an attribute in an information system contributes more cuts in a discreduct if its removal from a reduct greatly reduces the classification accuracy (i.e. it has a high significance as per Eq. (2)); and secondly, the number of cuts required to consistently classify a random sample of objects varies as the square root of the sample size for fairly sized discreducts (with more than 2 cuts).

4. Methodology for finding dynamic discreduct

The algorithm for finding a dynamic discreduct D for a data set is given below. The number and size of samples, ns and nz, should be tuned so that the major trends in the data are sufficiently represented and, at the same time, any noise is eliminated as far as possible. The optimum values of these two parameters are discussed at the end of Section 4.1.

ALGORITHM: Dynamic Discreduct (I, ns, nz)

1   for i = 1 to ns do
2     create Ui, a random sample of nz objects
3     determine Ti, a minimal consistent set of cuts for Ui
4     for k = 1 to na do
5       Tik ← {t ∈ Ti | t is a cut on attribute ak}
6     end for
7   end for
8   D ← ∅, R ← ∅
9   for k = 1 to na do
10    compute mk = (1/ns) Σ_{i=1}^{ns} |Tik|; round off mk
11    if mk > 0 then
12      Fk ← the mk most frequent cuts in Tik, i = 1, 2, ..., ns
13      D ← D ∪ ({ak} × Fk)
14      R ← R ∪ {ak}
15    end if
16  end for
17  return D, R

where I = (U, A ∪ {d}) is the information system, ns is the number of samples, nz is the size of each sample, and na = |A| is the number of attributes. The set Tik contains the cuts contributed by ak in discretizing the objects in the ith sample Ui. Over the ns successive samples, the cuts that are chosen with the highest frequency are selected into Fk. The cardinality of Fk is determined by averaging the cardinalities of Tik (over i = 1, 2, ..., ns). The principle is one of proportional representation: those attributes which can classify more objects get a higher share of cuts.
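The pseudocode can be rendered directly in Python. The sketch below is not the authors' implementation; in particular, find_cuts() substitutes a simple greedy discerning-pairs heuristic for the MD-heuristic [20] assumed in step 3, and ties are broken deterministically rather than by a toss:

```python
import random
from collections import Counter, defaultdict

def candidate_cuts(rows, k):
    """Midpoints between consecutive distinct values of attribute k."""
    vs = sorted({vals[k] for vals, _ in rows})
    return [(k, (a + b) / 2) for a, b in zip(vs, vs[1:])]

def separates(cut, v1, v2):
    """True if the cut (k, t) puts value tuples v1 and v2 on opposite sides."""
    k, t = cut
    return (v1[k] < t) != (v2[k] < t)

def find_cuts(rows, n_attrs):
    """Greedy stand-in for the MD-heuristic: repeatedly add the cut that
    discerns the most remaining object pairs from different classes."""
    pairs = [(r[0], s[0]) for i, r in enumerate(rows)
             for s in rows[i + 1:] if r[1] != s[1]]
    cands = [c for k in range(n_attrs) for c in candidate_cuts(rows, k)]
    cuts = []
    while pairs and cands:
        best = max(cands, key=lambda c: sum(separates(c, a, b) for a, b in pairs))
        if not any(separates(best, a, b) for a, b in pairs):
            break                         # remaining pairs cannot be discerned
        cuts.append(best)
        pairs = [(a, b) for a, b in pairs if not separates(best, a, b)]
    return cuts

def dynamic_discreduct(rows, n_attrs, ns, nz):
    """ALGORITHM Dynamic Discreduct(I, ns, nz) of Section 4."""
    pooled = defaultdict(list)                 # cuts on a_k pooled over samples
    for _ in range(ns):
        sample = random.sample(rows, nz)       # U_i
        for k, t in find_cuts(sample, n_attrs):
            pooled[k].append(t)                # t belongs to T_i^k
    D, R = set(), set()
    for k in range(n_attrs):
        mk = round(len(pooled[k]) / ns)        # average |T_i^k|, rounded off
        if mk > 0:                             # F_k: the mk most frequent cuts
            D |= {(k, t) for t, _ in Counter(pooled[k]).most_common(mk)}
            R |= {k}
    return D, R

# Usage on the toy sample of Table 1 ((a1..a4 values), class):
ROWS = [((4.8, 3.0, 1.4, 0.1), "Se"), ((5.7, 4.4, 1.5, 0.4), "Se"),
        ((5.1, 3.5, 1.4, 0.3), "Se"), ((4.9, 3.6, 1.4, 0.1), "Se"),
        ((4.9, 2.4, 3.3, 1.0), "Ve"), ((5.9, 3.2, 4.8, 1.8), "Ve"),
        ((5.5, 2.5, 4.0, 1.3), "Ve"), ((6.7, 2.5, 5.8, 1.8), "Vi"),
        ((6.2, 2.8, 4.8, 1.8), "Vi"), ((6.2, 3.4, 5.4, 2.3), "Vi")]
print(dynamic_discreduct(ROWS, n_attrs=4, ns=12, nz=5))
```

The proportional-representation principle is visible in the last loop: an attribute that contributes more cuts per sample ends up with a larger mk, and hence a larger share of the final discreduct.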

4.1. Classification accuracy of dynamic discreduct

In this subsection we explore the optimum size of a sample that can best predict the class of unknown data in a particular domain. For this, the data set is split into 5 equal subsets (S1, S2, ..., S5) using random numbers generated by the code. One of them is chosen as the 'test set,' and the remaining are merged into a 'training set.' A dynamic discreduct is found for the training set using the algorithm in Section 4. Rules are derived from the discreduced training set and used to predict the class of every object in the test set. If no rule is applicable to an object in the test set, it is predicted to fall in the widest class (i.e. the one with the maximum number of objects). The percentage of objects correctly

Fig. 4. (a–c) Variation of classification accuracy with number of cuts for different data sets: (a) Iris; (b) Glass; (c) Breast.

classiﬁed is the classiﬁcation accuracy for that test set i =

data sets. It is clear that the method of dynamic discretization proposed in the present paper outperforms all the prevailing methods.

no. of objects in Si correctly classiﬁed by the rules from U − Si no. of objects in Si

The classiﬁcation accuracies are averaged over the ﬁve test sets, each time merging the remaining four subsets into the training set (the ﬁvefold cross validation scheme [1]) 1 i 5 5

=

(9)

i=1

The classiﬁcation accuracy () for the three data sets is plotted against the number of cuts (nc ) for different values of relative sample size (z) in Fig. 4. The results suggest that 20–30% of the data should be the optimum sample size. The optimum number of cuts can then be found from Eq. (7) according as (optm)

nc

=c·

5.1.1. Run time vs. sample size Experiments were conducted with different sample sizes (z) varying from 10% to 80% of the training data. The number of samples (ns ) was set so as to ensure that each data in the training set is represented thrice on an average in the samples, i.e. z · ns 3. The time taken for discreduction (i.e. selection of cuts) is presented in Table 4. The results suggest that for the larger data sets (excepting Iris) taking samples substantially reduce the computation time. The run time (t) was also plotted against the relative sample size (z) in Fig. 5, and a power ﬁt was tried on the data points for each of the three data sets. Now, the time required for selecting a cut varies as kz ·nz , where the number of objects in a sample nz ∝ z, and

z (optm)

which is near about 1/2 the number of cuts required to consistently classify all the objects in U. But in practice, c is quite difﬁcult to determine; rather the number of cuts needed to consistently classify a sample of size nz is much more readily available. So, it is best to set the value of z to 0.25, and the average cardinality of Ti , doubled, would serve as a good estimate for the optimal number of cuts needed to predict the whole data. If it falls below 3, 1 should be added to it, and it’s safe to round the value off to the next (higher) integer. The number of samples ns can be determined from the thumb rule that each object in the data should be represented 3 or 4 times. Thus, for nz = 0.25, ns would be 12–15.

Table 3
Classification accuracies achieved by different methods.

Data      S-ID3    C4.5     MDLP     1R       RS-D     LEM2     DD
Iris      96.67    94.67    93.93    95.9     95.33    95.3     96.06
Glass     62.79    65.89    63.14    56.4     66.41    –        66.68
Breast    –        –        93.63    –        –        –        95.34
Average   79.73    80.28    83.57    76.15    80.87    95.3     –
DD avg.   81.37    81.37    86.03    81.37    81.37    96.06    –

5. Results

5.1. Classification of UCI data sets

Using the algorithm described in the previous section, the best values of classification accuracy achieved by dynamic discreduction (DD) for the three data sets have been compared with those of other methods in Table 3. The results denoted S-ID3 and C4.5 were taken from [20]. The MDLP algorithm was originally proposed by Fayyad and Irani [22], but the original paper bears no numerical results, so other sources [1,23] were resorted to. The results of 1R were taken from the original paper by Holte [24]. The results denoted RS-D are obtained using classical Rough Set discretization methods [1,20], while LEM2 is a powerful Rough Set algorithm [25]. The results of each reference in Table 3 have been summarized by an average of the existing results, and comparison has been made with the average of the present (DD avg.) results over the corresponding data sets.
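The pairwise-fair averaging behind the last two rows of Table 3 can be reproduced as follows; the values are taken from Table 3, while the function and variable names are ours.

```python
# Each method is averaged over the data sets it was actually tested on
# (missing results are simply omitted from its dict), and DD is averaged
# over the same data sets, so the comparison is like-for-like.
results = {
    "S-ID3": {"Iris": 96.67, "Glass": 62.79},
    "MDLP":  {"Iris": 93.93, "Glass": 63.14, "Breast": 93.63},
}
dd = {"Iris": 96.06, "Glass": 66.68, "Breast": 95.34}

def paired_averages(method_results, dd_results):
    common = sorted(method_results)     # the data sets this method covers
    m_avg = sum(method_results[d] for d in common) / len(common)
    d_avg = sum(dd_results[d] for d in common) / len(common)
    return round(m_avg, 2), round(d_avg, 2)

print(paired_averages(results["S-ID3"], dd))  # (79.73, 81.37)
print(paired_averages(results["MDLP"], dd))   # (83.57, 86.03)
```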

Fig. 5. Run time vs. sample size of different data sets.



Table 4
Run-time for the full training set and for different sample sizes (z), with number of samples ns ≈ 3/z.

(z, ns)   (0.1, 30)  (0.2, 15)  (0.25, 12)  (0.3, 10)  (0.4, 8)  (0.5, 6)  (0.6, 5)  (0.8, 4)  (1, 1)
Iris      0.422      0.242      0.216       0.253      0.188     0.227     0.255     0.354     0.134
Glass     0.888      1.428      1.835       2.244      3.335     4.303     5.597     8.616     3.603
Breast    0.697      0.888      1.060       1.281      1.966     2.372     3.013     4.713     1.950

the number of distinct values in a sample varies as kz ∝ √z. Thus the time taken to discretize a sample should vary as

t ∝ z^1.5    (10)
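The power fit of Fig. 5 amounts to an ordinary least-squares line in log-log space. A minimal sketch follows; the timings here are synthetic, generated to follow Eq. (10) with mild noise, not the measured timings of Table 4.

```python
import math, random

# Least-squares power-law fit t = a * z**b, done linearly in log space:
# ln t = ln a + b * ln z.  Synthetic timings follow t = 2 * z**1.5 with
# small multiplicative noise (illustrative assumption).
rng = random.Random(1)
zs = [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.8]
ts = [2.0 * z ** 1.5 * math.exp(rng.gauss(0, 0.02)) for z in zs]

X = [math.log(z) for z in zs]
Y = [math.log(t) for t in ts]
n = len(zs)
xbar, ybar = sum(X) / n, sum(Y) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
a = math.exp(ybar - b * xbar)   # fitted exponent b should come out near 1.5
```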

The exponents of z in Fig. 5 strongly support the expected value 1.5, with r² greater than 99%. The figure also suggests that a sample size of less than 40%, taken with the suggested number of samples, is computationally economic.

5.2. Mining a real life data set

In this subsection the advantages of the dynamic discreduction algorithm over the conventional process of finding the reduct and then discretizing the attributes in the reduct are examined. Both methods are applied to a data set relating the composition and process parameters to the mechanical properties of TRIP steel, and the rules obtained are compared as to how well they reveal the underlying TRIP phenomena.

Transformation induced plasticity (TRIP) steels, first reported by Zackay et al. [26], exhibit a superior combination of strength and ductility. This characteristic has made TRIP-aided steel a potential material for various automobile parts requiring high strength with adequate formability. As the phenomenon of TRIP is quite complicated, there is still sufficient scope for exploration. Several attempts have been made to model the TRIP phenomenon from its physical understanding [27,28], but no suitable model exists to date to predict the mechanical properties of TRIP steel directly from the composition and processing parameters. This is mainly due to the lack of precise knowledge about the complex and non-linear role of the independent variables on the properties of the steel. Efforts have been made to develop data-driven models, using tools like artificial neural networks and genetic algorithms, to predict TRIP steel properties [29–31], but these models have the inherent complexity and opacity of a black box. In a recent work [32] the properties of TRIP steel were investigated from the RS approach to deduce rules from the data, using the minimal reduct discretization method described in Section 5.2.1.
In Section 5.2.2 the dynamic discreduction algorithm is used to derive another set of rules from the same TRIP steel data. The two methods may be briefed as follows:

1. Determine a minimal reduct, then discretize every attribute therein, and derive a set of rules for the system.
2. Find a dynamic discreduct, and arrive at a set of rules.

Finally, the two sets of rules thus obtained are compared vis-à-vis the existing knowledge on TRIP steel.

The TRIP steel data set contains 90 objects, most of which were collected from the published literature of several workers [30,33–37]. The multiple sources are also an inherent source of noise in the data, which comes through experimental errors of different magnitudes. This renders the application of the dynamic discreduction algorithm all the more relevant here. The ranges of values of the conditional (i.e. compositional and processing) attributes, as well as of the decision attribute (UTS), are listed in Table 5. To start with, UTS is discretized into three equal-frequency classes, roughly allotting the same number of objects to each class. Steels below 730 MPa fall in the class of Low-strength

Table 5
Numerical range of the attributes in the TRIP steel data.

Attributes (units)                           Symbol   Min.     Max.
A. Conditional
  (i) Composition
    1. Carbon (wt%)                          C        0.12     0.29
    2. Manganese (wt%)                       Mn       1.00     2.39
    3. Silicon (wt%)                         Si       0.48     2.00
  (ii) Processing
    4. Cold deformation (%)                  d        56.25    77.14
    5. Intercritical annealing temp. (°C)    Ta       750      860
    6. Intercritical annealing time (s)      ta       51       1200
    7. Bainitic transformation temp. (°C)    Tb       350      500
    8. Bainitic transformation time (s)      tb       30       1200
B. Decision
    Ultimate tensile strength (MPa)          UTS      580.72   887.44

steels, those from 730 to 770 MPa in the Medium-strength class, and steels with UTS above 770 MPa fall in the High-strength class (see the last three columns of Table 6). This discretization of UTS has been used in both Sections 5.2.1 and 5.2.2.

5.2.1. Minimal reduct discretization

The results in this subsection were reported in a recent work by the present authors [32]. We re-present them here to draw a comparison with the results of dynamic discreduction presented in the next subsection.

The first task is to find a minimal reduct for the TRIP steel data set. Since the number of attributes is quite small, an exhaustive search was undertaken to see whether a subset of attributes can classify all the objects of the data consistently, starting from the 8 one-attribute subsets, next searching the 28 two-attribute subsets, and so on. At cardinality 4, just one reduct is found; this is of course the only minimal reduct. The 4 attributes in the reduct are then discretized with the MD-heuristic algorithm, yielding intervals to which names (or labels) are given as in Table 6. Rules are derived from the discretized minimal set of attributes. The number of rules with one to four terms in the antecedent came to nearly 200. From these, only a handful are selected that actually represent the general patterns in the data. This was done on the basis of two qualifying metrics (Eqs. (5) and (6)): the threshold values of accuracy and coverage for selecting the rules were set to 80% and 15%, respectively, on an ad hoc basis, so as to limit the set of rules to a handful. The final set of rules is presented in Rule Set 1; the pair of values in square brackets after each rule gives these two metrics, respectively.

Rule Set 1
Rules obtained by minimal reduct discretization

1. if Si = MH ∧ Tb = ML then UTS = L [100, 21.4]
2. if Si = QH ∧ Ta = L ∧ Tb = MH then UTS = M [90, 23.1]
3. if Si = QH ∧ tb = L then UTS = M [86, 15.4]
4. if Si = QH ∧ Ta = H ∧ tb = ML then UTS = M [100, 17.9]
5. if Si = QH ∧ Ta = L ∧ tb = MH then UTS = M [100, 17.9]
6. if Si = VL ∧ tb = MH then UTS = H [83, 21.7]
7. if Si = L ∧ tb = MH then UTS = H [100, 21.7]
8. if Si = ML ∧ Tb = ML then UTS = H [100, 21.7]
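The accuracy/coverage screening that produced Rule Set 1 can be sketched as below. The eight passing (accuracy, coverage) pairs are those quoted in Rule Set 1; the two failing candidates are invented for illustration.

```python
# Screening candidate rules by accuracy >= 80% and coverage >= 15%,
# the ad hoc thresholds applied to the two qualifying metrics
# (Eqs. (5) and (6)).
candidates = [(100, 21.4), (90, 23.1), (86, 15.4), (100, 17.9), (100, 17.9),
              (83, 21.7), (100, 21.7), (100, 21.7),
              (75, 40.0), (95, 8.0)]   # two invented rules that fail a threshold

def screen(rules, min_acc=80, min_cov=15):
    return [r for r in rules if r[0] >= min_acc and r[1] >= min_cov]

print(len(screen(candidates)))  # 8
```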

The absence of intercritical annealing time (ta) and cold deformation (d) from the minimal reduct seems reasonable, as these variables are known to contribute insignificantly to the final microstructure and properties. On the other hand, it may be noted


Table 6
Intervals on discretizing the four attributes in the minimal reduct of the TRIP steel data.

Si (wt%):  0.48–0.73 (VL), 0.73–0.985 (QL), 0.985–1.09 (L), 1.09–1.19 (ML), 1.19–1.22 (MH), 1.22–1.40 (H), 1.40–1.46 (QH), 1.46–2.00 (VH)
Ta (°C):   750–795 (L), 795–810 (M), 810–860 (H)
Tb (°C):   350–375 (L), 375–415 (ML), 415–440 (M), 440–457 (MH), 457–500 (H)
tb (s):    30–45 (VL), 45–150 (L), 150–230 (ML), 230–280 (M), 280–450 (MH), 450–950 (H), 950–1200 (VH)
UTS (MPa): 580–730 (L), 730–770 (M), 770–890 (H)

VL: very low; QL: quite low; L: low; ML: moderately low; M: medium; MH: moderately high; H: high; QH: quite high; VH: very high.
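Once the cuts are fixed, labelling a new value reduces to a binary search over the cut points. A sketch using the Si cuts of Table 6 follows; the convention that a value lying exactly on a cut goes to the upper interval is our assumption, not stated in the paper.

```python
from bisect import bisect_right

# The Si (wt%) cuts and interval labels of Table 6.
si_cuts   = [0.73, 0.985, 1.09, 1.19, 1.22, 1.40, 1.46]
si_labels = ["VL", "QL", "L", "ML", "MH", "H", "QH", "VH"]

def discretize(value, cuts, labels):
    """Label a continuous value by the interval it falls in; a value lying
    exactly on a cut is assigned to the upper interval (our convention)."""
    return labels[bisect_right(cuts, value)]

print(discretize(0.60, si_cuts, si_labels))  # VL
print(discretize(1.43, si_cuts, si_labels))  # QH
```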

that the only compositional parameter in the reduct, Si (along with the processing parameter bainitic transformation time, tb), is present in almost every rule. Interestingly, Si and tb also have more cuts than the other attributes, which in some way verifies the first claim made at the beginning of Section 3. The rules clearly show that a lower amount of Si and a moderately high bainitic transformation time (tb) are favoured for higher strength of the steel. This can be justified by the fact that TRIP steel with a lower amount of Si may contain carbides in the microstructure, leading to high strength, whereas a somewhat higher transformation time favours a good amount of bainite, resulting in an increase in the strength level. But the absence of C and Mn from the reduct cannot be justified, as these two elements are known to play the most important role in the stability of retained austenite, and consequently in the occurrence of TRIP. Since the data were compiled from various sources (reporting experiments carried out in different situations), there is ample scope for noise in the data. This may have caused the over-fitting, and the under-representation of essential attributes in the reduct.

5.2.2. Dynamic discreduction

Thirty samples were taken from the data set, each containing 30 objects. For each sample a consistent set of cuts was determined using the MD-heuristic algorithm, with all 8 conditional attributes allowed to contribute cuts. The 10 most frequently occurring cuts were then chosen using a proportional representation of attributes, as described in the algorithm of Section 4. The resulting discreduct, with 10 cuts (compared to 19 from the consistent discretization in the previous section), spans six attributes (instead of the previous four). The positions of the cuts and the labels assigned to the respective intervals are shown in Table 7.

Five rules were obtained from the data set that cleared the 80% accuracy and 30% coverage levels. They are presented in Rule Set 2. Two processing attributes (d and ta) failed to contribute any cut and were regarded as redundant, as in the previous method. On the other hand, the introduction of C and Mn is very significant from the metallurgical point of view, since, in the given range of values, they are known to be quite important parameters in deciding the strength of any steel. C and Mn are the most potent austenite stabilizers and also play a significant role in the hardenability of the retained austenite; the TRIP phenomenon in steel is thus chiefly controlled by C and Mn. From this point of view the inclusion of these two attributes is a commendable achievement of the proposed algorithm.

The introduction of these two compositional attributes triggers another interesting series of events. The additional attributes help to reduce the cuts to a meagre 10 (compared to 19 in the previous set of rules). This confines the composition and processing attributes to two or three discrete classes, which can be described by simple qualifiers like 'Low', 'Medium' or 'High', dispensing with finer intervals like 'Quite High' or 'Moderately Low'. This in turn keeps the coverage of the rules at a higher level, indicating that the most general patterns are represented in the rules. The rules are thus much easier to use in understanding the TRIP phenomena, and in subsequent application of the derived knowledge to further development of TRIP steel. This is evidently a marked improvement achieved by dynamic discreduction over the previous method.

Rule Set 2
Rules obtained by dynamic discreduction

1. If C = L ∧ Mn = L ∧ Si = M ∧ Ta = L ∧ Tb = L then UTS = L [86, 46]
2. If C = H ∧ Mn = L ∧ Si = H ∧ Ta = L then UTS = M [92, 33]
3. If Mn = H ∧ Tb = L ∧ tb = M then UTS = H [100, 35]
4. If Mn = H ∧ Ta = H ∧ tb = M then UTS = H [89, 39]
5. If C = H ∧ Si = H ∧ tb = M then UTS = H [87, 30]
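A minimal sketch of the sampling-and-voting core of dynamic discreduction is given below. For brevity the MD-heuristic is replaced here by the simpler device of taking all class-boundary midpoints as a sample's cuts, and the proportional-representation step of Section 4 is omitted; the function names and the toy data are illustrative assumptions.

```python
import random
from collections import Counter

def boundary_cuts(sample, attr):
    """Candidate cuts for one attribute: midpoints between consecutive
    attribute values whose objects fall in different decision classes
    (the boundary points commonly used in RS discretization)."""
    pts = sorted((obj[attr], obj["d"]) for obj in sample)
    cuts = set()
    for (v1, d1), (v2, d2) in zip(pts, pts[1:]):
        if v1 != v2 and d1 != d2:
            cuts.add((attr, round((v1 + v2) / 2, 3)))
    return cuts

def dynamic_discreduct(data, attrs, n_samples=30, sample_size=30, k=10):
    """Draw random samples, collect each sample's cuts, and keep the k most
    frequently occurring cuts across the samples (plain frequency voting)."""
    rng = random.Random(0)
    votes = Counter()
    for _ in range(n_samples):
        sample = rng.sample(data, sample_size)
        for attr in attrs:
            votes.update(boundary_cuts(sample, attr))
    return [cut for cut, _ in votes.most_common(k)]

# Toy data: attribute "x" separates the two classes at 4.5; "noise" does not.
data = [{"x": float(i), "noise": float(i % 3), "d": int(i >= 5)} for i in range(10)]
cuts = dynamic_discreduct(data, ["x", "noise"], n_samples=1, sample_size=10, k=10)
```

The attributes that receive no cut in the returned list are exactly those discarded as redundant.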

5.2.3. Cuts in the dynamic discreducts

Finally, we present two characteristics of the cuts in the 30 samples from which the dynamic discreduct was constructed.

Table 7
Intervals in the dynamic discreduct of the TRIP steel data.

C (wt%):  0.12–0.145 (L), 0.145–0.29 (H)
Mn (wt%): 1.0–1.48 (L), 1.48–2.15 (M), 2.15–2.39 (H)
Si (wt%): 0.48–0.73 (L), 0.73–1.4 (M), 1.4–2.0 (H)
Ta (°C):  750–810 (L), 810–860 (H)
Tb (°C):  350–415 (L), 415–457 (M), 457–500 (H)
tb (s):   30–230 (L), 230–450 (M), 450–1200 (H)

L: low; M: medium; H: high.


Fig. 6. (a and b) Sample cuts in the TRIP steel data.

The number of cuts required to consistently classify a sample is plotted as a bar chart in Fig. 6(a), while the shares of each of the eight attributes in the total number of cuts in the samples are plotted in another bar chart, Fig. 6(b). Fig. 6(a) is interesting in that it almost represents a normal distribution, except for a sharp rise at the value 11. A possible explanation is that a few objects contained special information not present in the other objects; this could represent results in a region not covered by other experiments, or else it would denote noise in the data, i.e. experimental errors. The average value comes to 8.4, which rounds up to 9; keeping a safe margin, we take 10 cuts as the cardinality of the dynamic discreduct. The shares of cuts in the sample discreducts shown in Fig. 6(b) clearly demarcate two attributes (d and ta) as redundant, each getting less than 5% of the cuts in the samples. Two other attributes, C and Ta, get around 10% of the sample cuts, while the remaining four (viz. Mn, Si, Tb and tb) receive 15–20% each. In constructing the dynamic discreduct, C and Ta are thus given one cut each, while Mn, Si, Tb and tb are allotted two cuts each (see Table 7).
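The allotment of the 10 cuts in proportion to the attributes' shares can be sketched with a largest-remainder apportionment; the percentage shares below are approximate values read off Fig. 6(b), not exact figures from the paper.

```python
def allocate_cuts(shares, total):
    """Largest-remainder apportionment of `total` cuts among the attributes
    in proportion to their shares of the sample cuts; attributes with less
    than a 5% share are treated as redundant and receive no cut."""
    shares = {a: (s if s >= 5 else 0) for a, s in shares.items()}
    scale = sum(shares.values())
    quota = {a: total * s / scale for a, s in shares.items()}
    alloc = {a: int(q) for a, q in quota.items()}          # integer parts
    order = sorted(quota, key=lambda a: quota[a] - alloc[a], reverse=True)
    for a in order[: total - sum(alloc.values())]:         # hand out the rest
        alloc[a] += 1
    return alloc

# Approximate shares (%) read off Fig. 6(b) -- illustrative, not exact values.
shares = {"d": 3, "ta": 4, "C": 10, "Ta": 10, "Mn": 18, "Si": 18, "Tb": 18, "tb": 19}
alloc = allocate_cuts(shares, 10)
```

With these assumed shares the allocation reproduces the paper's split: no cut for d and ta, one each for C and Ta, and two each for Mn, Si, Tb and tb.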

6. Conclusion

In the present paper the relation and interdependence between two vital tasks carried out through Rough Set Theory, namely reduction and discretization of attributes, have been investigated. The self-similarity and complementarity of the two processes have been utilized to devise a method that finds an optimally discretized set of attributes. The cuts most frequently required to classify all objects in a series of samples taken from the data are collected to form a dynamic discreduct; the attributes in which one or more cuts are placed form the reduct. The processes of discretization and finding reducts are thus merged into a single seamless process, which has been named dynamic discreduction. The efficiency of the algorithm depends on two parameters, viz. the number of cuts and the size of the samples. To obtain optimum values of these two parameters, their effect on classification accuracy has been studied. The method has been applied to some benchmark data sets, and the results outperform those of the existing methods compared. A real life data set on TRIP steel has also been analysed, where the rules derived from the dynamic discreduct are found to be simpler, more general, and more appropriate from the metallurgical aspect than the rules derived from discretized minimal reducts.

Acknowledgements

The present research was conducted as part of a Fast Track Scheme for Young Scientists supported by the Department of Science and Technology, Government of India, vide Grant no. SR/FTP/ETA-02/2007. The financial support is duly acknowledged.

References

[1] H.S. Nguyen, S.H. Nguyen, Discretization Methods in Data Mining, vol. 1, Springer Physica-Verlag, 1998, pp. 451–482.
[2] H. Liu, F. Hussain, C.L. Tan, M. Dash, Discretization: an enabling technique, Data Mining and Knowledge Discovery 6 (2002) 393–423.
[3] R. Jin, Y. Breitbart, C. Muoh, Data discretization unification, in: Seventh IEEE International Conference on Data Mining, 2007, pp. 183–192, doi:10.1109/ICDM.2007.35.
[4] Y. Yang, Discretization for Naive-Bayes Learning, PhD Thesis, School of Computer Science and Software Engineering, Monash University, 2003.
[5] P. Blajdo, Z.S. Hippe, T. Mroczek, J.W. Grzymala-Busse, M. Knap, L. Piatek, An extended comparison of six approaches to discretization – a Rough Set approach, Fundamenta Informaticae 94 (2009) 121–131.
[6] J. Zhao, Y. Zhou, New heuristic method for data discretization based on Rough Set Theory, The Journal of China Universities of Posts and Telecommunications 16 (2009) 113–120.
[7] Y. He, D. Chen, W. Zhao, Integrated method of compromise-based ant colony algorithm and Rough Set Theory and its application in toxicity mechanism classification, Chemometrics and Intelligent Laboratory Systems 92 (2008) 22–32.
[8] L. Xu, F. Zhang, X. Jin, Discretization algorithm for continuous attributes based on niche discrete particle swarm optimization, Journal of Data Acquisition and Processing 23 (2008) 584–588 (in Chinese: Shuju Caiji Yu Chuli).
[9] M. Boulle, Khiops: a statistical discretization method of continuous attributes, Machine Learning 55 (2004) 53–69.
[10] G. Li, H. Sun, H. Li, X. Jiang, Discretization of continuous attributes based on statistical information, Journal of Computational Information Systems 4 (2008) 1069–1076.
[11] T. Qureshi, D.A. Zighed, Using resampling techniques for better quality discretization, in: 6th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2009, Leipzig, 2009, pp. 1515–1520.
[12] J. Senthilkumar, D. Manjula, R. Krishnamoorthy, NANO: a new supervised algorithm for feature selection with discretization, in: IEEE International Advance Computing Conference, IACC 2009, 2009, pp. 1515–1520.
[13] L. Tinghui, S. Liang, J. Qingshan, W. Beizhan, Reduction and dynamic discretization of multi-attribute based on Rough Set, in: World Congress on Software Engineering, WCSE 09, Xiamen, 2009.
[14] F. Min, Q. Liu, C. Fang, Rough Sets approach to symbolic value partition, International Journal of Approximate Reasoning 49 (2008) 689–700.
[15] Y.-Y. Guan, H.-K. Wang, Y. Wang, F. Yang, Attribute reduction and optimal decision rules acquisition for continuous valued information systems, Information Sciences 179 (2009) 2974–2984.
[16] J. Mata, J.-L. Alvarez, J.-C. Riquelme, Discovering numeric association rules via evolutionary algorithm, in: 6th Conference on Knowledge Discovery and Data Mining, 2002, pp. 40–51.
[17] Z. Pawlak, Rough Sets, International Journal of Computer & Information Sciences 11 (1982) 341–356.
[18] G. Cantor, Contributions to the Founding of the Theory of Transfinite Numbers, Dover Publications, 1915.
[19] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2010. URL: http://archive.ics.uci.edu/ml.
[20] J. Komorowski, Z. Pawlak, L. Polkowski, A. Skowron, Rough Sets: A Tutorial, 2002. URL: alfa.mimuw.edu.pl/prace/1999/D5/Tutor06 09.ps.
[21] I. Düntsch, G. Gediga, Rough Set Data Analysis: A Road to Non-invasive Knowledge Discovery, Methodos, 2000.
[22] U.M. Fayyad, K.B. Irani, On the handling of continuous-valued attributes in decision tree generation, Machine Learning 8 (1992) 87–102.
[23] A. An, N. Cercone, Discretization of continuous attributes for learning classification rules, LNAI 1574, Springer-Verlag, Berlin/Heidelberg, 1999, pp. 509–514.
[24] R.C. Holte, Very simple classification rules perform well on most commonly used datasets, Machine Learning 11 (1993) 63–91.
[25] J.W. Grzymala-Busse, LERS – a system for learning from examples based on Rough Sets, in: Intelligent Decision Support – A Handbook of Applications and Advances in the Rough Set Theory, Kluwer Academic Publishers, 1992, pp. 3–18.
[26] V.F. Zackay, E.R. Parker, D. Fahr, R. Bush, The enhancement of ductility in high strength steels, Transactions of the American Society of Metals 60 (1967) 252–259.
[27] J. Bouquerel, K. Verbeken, B.C. De Cooman, Microstructure-based model for the static mechanical behaviour of multiphase steels, Acta Metallurgica Materiala 54 (2006) 1443–1456.
[28] H.N. Han, C.G. Lee, C.-S. Oh, T.-H. Lee, S.-J. Kim, A model for deformation behavior and mechanically induced martensitic transformation of metastable austenitic steel, Acta Metallurgica Materiala 52 (2004) 5203–5214.
[29] S.M.K. Hosseini, A. Zarei-Hanzaki, M.J.Y. Panah, S. Yue, ANN model for prediction of the effects of composition and process parameters on tensile strength and percent elongation of Si–Mn TRIP steels, Materials Science and Engineering A 374 (2004) 122–128.
[30] M. Mukherjee, S.B. Singh, O.N. Mohanty, Neural network analysis of strain induced transformation behaviour of retained austenite in TRIP-aided steels, Materials Science and Engineering A 434 (2006) 237–245.
[31] S. Datta, F. Pettersson, S. Ganguly, H. Saxén, N. Chakraborti, Identification of factors governing mechanical properties of TRIP-aided steel using genetic algorithms and neural networks, Materials and Manufacturing Processes 23 (2008) 130–137.
[32] S. Dey, P. Dey, S. Datta, J. Sil, Rough Set approach to predict the strength and ductility of TRIP steel, Materials and Manufacturing Processes 24 (2009) 150–154.
[33] H.C. Chen, H. Era, M. Shimizu, Effect of phosphorus on the formation of retained austenite and mechanical properties in Si low-carbon steel sheet, Metallurgical Transactions A 20 (1989) 437–445.
[34] Y. Sakuma, O. Matsumura, O. Akisue, Influence of C content and annealing temperature on microstructure and mechanical properties of 400 °C transformed steel containing retained austenite, ISIJ International 31 (1991) 1348–1353.
[35] M.D. Meyer, D. Vanderschueren, B.D. Cooman, The influence of the substitution of Si by Al on the properties of cold rolled C–Mn–Si TRIP steels, ISIJ International 39 (1999) 813–822.
[36] S. Papaefthymiou, W. Bleck, S. Kruijver, J. Sietsma, L. Zhao, S. van der Zwaag, Influence of intercritical deformation on microstructure of TRIP steels containing Al, Materials Science and Technology 20 (2004) 201–206.
[37] N.R. Bandyopadhyay, S. Datta, Effect of manganese partitioning on transformation induced plasticity characteristics in microalloyed dual phase steels, ISIJ International 44 (2004) 927–934.


∗ Corresponding author. E-mail address: [email protected] (P. Dey). 1568-4946/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2011.01.015

Rough Set (RS) is known to be a powerful tool for deriving rules from data using a minimum number of attributes. The method involves finding a reduct containing a minimal subset of attributes which is just sufficient to classify all objects in the data. However, RS methods are typically devised to deal with discrete attribute values, which calls for methods that discretize the data either beforehand, or after the reduct has been found. RS has been successfully used to discretize continuous data through different innovative methods [4–6], often in combination with ant colony [7] or particle swarm optimization [8]. Different statistical techniques and measures have also been used to devise or improve discretization procedures [9–11].

The problem with such a modular methodology is that the attribute reduction and discretization processes are usually assumed to be independent of one another. This entails high computation cost or low prediction accuracy. If the data is first discretized, cost is escalated by discretizing the redundant attributes, particularly if the data is large. On the other hand, giving prime emphasis to the elimination of attributes may lead to loss of important information and over-fitting of the data. The rules thus derived will have fewer objects in their support and suffer from low prediction accuracy; the presence of noise in the data often heightens this possibility. Some very recent works have tried to fuse the two tasks of discretization and feature selection [12,13]. Researchers have also proceeded along more generalized paths, like symbolic value partitioning [14] or a tolerance Rough Set method based on a similarity measure [15], that perform the two tasks simultaneously. There have been studies [16] where attribute values are grouped differently for each rule. Evidently there is need for further research in this area.

In this paper the possible relations between discretization and reduction of attributes are explored. A method is proposed that uses RS and notions from Statistics to discretize, and at the same time distinguish, the important attributes in a data set based on random samples. Using samples instead of the whole data to discretize the variables reduces the computation time, specially for large data sets. The resulting rules become simpler, and the classification accuracy increases, particularly in the case of noisy data sets. The method is tested on some benchmark data, and comparison is made with the results obtained using state-of-the-art techniques. A real life data set from the Materials Engineering field is also analysed using the proposed method, and the rules derived are assessed in view of knowledge discovery.

The remaining part of the paper is organized into the following five sections. Section 2 briefs the basic notions of Rough Set Theory and exemplifies the interrelation between discretization and reduction of attributes; the concept of discreduction is also introduced. In Section 3 the dependence of a discreduct on the choice of attributes and the sample size is explored. Section 4 describes the method of finding a dynamic discreduct, whereas results of applying the method to different data sets are presented in Section 5. Finally, Section 6 concludes the article.

2. Rough Set Theory

Rough Set Theory (RST), introduced by Pawlak [17], is an extension of the classical Set Theory of Georg Cantor [18]. It brings forth the concept of a boundary region, which accommodates elements lying midway between belonging and not belonging to a set. The theory has shown immense usefulness in distilling out requisite information through a minimal set of attributes that classifies all possible objects in a data set. An information system I essentially consists of a set (universe) of objects U. Each object has definite values for a set of conditional attributes A, depending on which the value of a decision attribute d is ascertained. Formally, it is the ordered pair I = (U, A ∪ {d}). In the following subsections the basic notions of RST are illustrated with the help of an example (Table 1) consisting of 10 samples from Fisher's Iris data set [19].

Table 1
An illustrative example: 10 samples from the Iris data set.

Objects   a1 Sepal length   a2 Sepal width   a3 Petal length   a4 Petal width   d Iris        Class
u1        4.8               3.0              1.4               0.1              Setosa        C1
u2        5.7               4.4              1.5               0.4              Setosa        C1
u3        5.1               3.5              1.4               0.3              Setosa        C1
u4        4.9               3.6              1.4               0.1              Setosa        C1
u5        4.9               2.4              3.3               1.0              Versicolor    C2
u6        5.9               3.2              4.8               1.8              Versicolor    C2
u7        5.5               2.5              4.0               1.3              Versicolor    C2
u8        6.7               2.5              5.8               1.8              Virginica     C3
u9        6.2               2.8              4.8               1.8              Virginica     C3
u10       6.2               3.4              5.4               2.3              Virginica     C3

2.1. Classification

We can choose a non-empty subset of attributes S ⊆ A and proceed to examine the capability of S in classifying the objects in U. For example, if only the Sepal Length attribute is considered, i.e. S = {a1}, the objects u4 and u5 are indiscernible, and so are u9 and

u10. Whereas the latter pair, by virtue of being in the same class, pose no problem in classification, the former pair belong to different classes of Iris but are indistinguishable from one another by Sepal Length alone. {a1} can thus definitely classify 3 of the 4 Iris Setosa flowers in the sample (u1, u2, u3), as also 2 of the 3 Versicolor and 3 of the 3 Virginica flowers. This puts the classification accuracy of S at 8/10.

The concepts may be formalized as follows. Let Ci represent the set of objects in the ith class, for i = 1, 2, ..., l (l being the number of decision classes in I). Then U/d = {C1, C2, ..., Cl} is the partition of U induced by the decision attribute d. The S-lower approximation of Ci is defined as the set of objects definitely classified as Ci using the information in S. It is denoted by

S(Ci) = { u | [u]S ⊆ Ci },  i = 1, 2, ..., l

where [u]S is the set of objects indiscernible from u given the information in S (i.e. the objects for which the values of all the attributes in S are identical to those of u). The corresponding S-upper approximation is the set of objects possibly classified as Ci using S:

S̄(Ci) = { u | [u]S ∩ Ci ≠ ∅ },  i = 1, 2, ..., l

and the difference of the two sets gives the S-boundary region of Ci:

BNS(Ci) = S̄(Ci) − S(Ci),  i = 1, 2, ..., l

The positive region of U/d with respect to S is the union of lower approximations of all classes in U/d. It represents the set of objects that can be deﬁnitely classiﬁed into one of the decision classes C1 , C2 , . . ., Cl using the information in S. It is denoted by

POSS (d) =

S(Ci )

Ci ∈ U/d

whereas the classiﬁcation accuracy of S with respect to d is the fraction of deﬁnitely classiﬁable objects expressed as (S, d) =

|POSS (d)| |U|

(1)

Clearly, the positive region of I w.r.t. Sepal Length is {u1 , u2 , u3 , u6 , u7 , u8 , u9 , u10 }, denoting the set of objects about which deﬁnite conclusion may be drawn (regarding the class of Iris) using only their Sepal Lengths. From Eq. (1), the classiﬁcation accuracy of {a1 } comes to be 0.8. An information system is said to be consistent if the entire set of attributes A can classify all the objects in U. That is, if (A, d) = 1 The information system I in Table 1 is a consistent system. The signiﬁcance of an attribute a ∈ S is deﬁned by the relative decrease in classiﬁcation accuracy on removing a from S. It is
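The approximation machinery above is short enough to sketch directly. The snippet below is an illustrative re-implementation (not the authors' code) of the indiscernibility classes, the lower and upper approximations, and the accuracy of Eq. (1); the four-object table and all function names are assumptions, and the table is deliberately much smaller than Table 1.

```python
# Toy illustration of the S-lower/S-upper approximations and the
# classification accuracy gamma(S, d) of Eq. (1).  The 4-object table is
# hypothetical (it is NOT the paper's Table 1); values are already discrete.

def indiscernibility_classes(objects, attrs):
    """Partition object ids by their value vector on the chosen attributes S."""
    groups = {}
    for uid, row in objects.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(uid)
    return list(groups.values())

def lower_approx(objects, attrs, concept):
    """Union of indiscernibility classes wholly contained in the concept."""
    return set().union(*[g for g in indiscernibility_classes(objects, attrs)
                         if g <= concept])

def upper_approx(objects, attrs, concept):
    """Union of indiscernibility classes that intersect the concept."""
    return set().union(*[g for g in indiscernibility_classes(objects, attrs)
                         if g & concept])

def gamma(objects, attrs, decision):
    """Eq. (1): |POS_S(d)| / |U|, POS_S(d) being the union of lower approximations."""
    classes = {}
    for uid, row in objects.items():
        classes.setdefault(row[decision], set()).add(uid)
    pos = set().union(*[lower_approx(objects, attrs, c) for c in classes.values()])
    return len(pos) / len(objects)

# u3 and u4 agree on a1 but belong to different classes, so only u1 and u2
# are definitely classifiable from a1 alone.
U = {"u1": {"a1": 0, "d": "X"}, "u2": {"a1": 2, "d": "Y"},
     "u3": {"a1": 1, "d": "X"}, "u4": {"a1": 1, "d": "Y"}}
concept_X = {u for u, r in U.items() if r["d"] == "X"}
print(lower_approx(U, ["a1"], concept_X))   # {'u1'}
print(upper_approx(U, ["a1"], concept_X))   # {'u1', 'u3', 'u4'}
print(gamma(U, ["a1"], "d"))                # 0.5
```

The boundary region of the concept is simply the set difference of the two approximations, here {u3, u4}.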

Fig. 1. (a–c) Discretization of the Iris sample (Table 1) with different sets of attributes: (a) Sepal Length; (b) Sepal Length and Sepal Width; (c) Sepal Length and Petal Length.

denoted by

σ(S,d)(a) = (γ(S, d) − γ(S′, d)) / γ(S, d)    (2)

where S′ = S − {a}. As an example, S = {a1, a2} can classify all the 10 objects, while S′ = {a1} classifies only 8. Thus σ(S,d)(a2) = 0.2.

2.2. Redundancy

If two attributes are sufficient to classify all the objects in a data set, the other attributes are considered redundant, and eventually eliminated to form a reduct. A reduct R is a minimal subset of attributes which has a classification accuracy of 1, minimal in the sense that every proper subset of R has a classification accuracy less than 1. Formally, R ⊆ A is a reduct if and only if both Eqs. (3) and (4) are satisfied

γ(R, d) = 1    (3)

and

γ(S, d) < 1    ∀ S ⊊ R    (4)
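Eqs. (2)–(4) admit an equally compact sketch. Below, a brute-force search enumerates attribute subsets of increasing size and keeps the minimal ones with accuracy 1; the `gamma`, `significance` and `reducts` names, and the two-attribute table, are illustrative assumptions (exhaustive search is only practical for small |A|).

```python
# Sketch (not the paper's code): significance of an attribute, Eq. (2), and a
# brute-force search for reducts satisfying Eqs. (3) and (4).
from itertools import combinations

def gamma(rows, attrs, decision="d"):
    """|POS_S(d)| / |U| for attribute subset S = attrs, on discrete data."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(row[decision])
    consistent = sum(1 for row in rows
                     if len(groups[tuple(row[a] for a in attrs)]) == 1)
    return consistent / len(rows)

def significance(rows, S, a, decision="d"):
    """Eq. (2): relative drop in gamma when a is removed from S."""
    g = gamma(rows, S, decision)
    return (g - gamma(rows, [x for x in S if x != a], decision)) / g

def reducts(rows, attrs, decision="d"):
    """All minimal subsets R with gamma(R, d) = 1; exponential in |attrs|."""
    found = []
    for r in range(1, len(attrs) + 1):
        for S in combinations(attrs, r):
            if gamma(rows, S, decision) == 1.0 and \
               not any(set(R) <= set(S) for R in found):
                found.append(S)
    return found

# Hypothetical 4-object table: a1 alone classifies everything, a2 is redundant.
rows = [{"a1": 0, "a2": 0, "d": "X"}, {"a1": 1, "a2": 0, "d": "Y"},
        {"a1": 2, "a2": 1, "d": "X"}, {"a1": 3, "a2": 1, "d": "Y"}]
print(reducts(rows, ["a1", "a2"]))              # [('a1',)]
print(significance(rows, ["a1", "a2"], "a1"))   # 1.0: removing a1 drops gamma to 0
```

Since every proper subset of an already-found reduct fails Eq. (3), supersets are filtered out as they appear, which enforces the minimality condition of Eq. (4).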

There may be more than one reduct in any information system, and a reduct with the lowest cardinality is called a minimal reduct. Thus R1 = {a1, a2} is a minimal reduct for Table 1. A set of attributes satisfying Eqs. (3) and (4) is called an 'exact' reduct. A known problem with exact reducts is that they are highly susceptible to noise, and often perform badly in classifying external data [20]. Dynamic reducts, on the other hand, constitute those attributes that appear 'most frequently' in the reducts of a series of samples taken from the data. Dynamic reducts and the general role of samples have been dealt with in greater detail in Section 2.6.

2.3. Discretization and association rules

Data, in most practical cases, is presented as real-valued (continuous) attributes, which per se generate rules with very few objects in their support. Such rules are inefficient in classifying external data. For example, the rule

if Sepal Length is 6.2, then the Iris is Virginica

has a set of two objects {u9, u10} in its support. However, the rule could be further generalized to the form

if Sepal Length ≥ 6, then Virginica    (R1)

which has a three-object support {u8, u9, u10} in its favour. The value 6, in that case, is called a cut on the attribute Sepal Length. It is denoted by the ordered pair (a1, 6).
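The notion of rule support, and the accuracy and coverage measures formalized as Eqs. (5) and (6) later in this section, can be sketched as set-cardinality ratios. The helper below and its mini-universe are hypothetical (the antecedent is assumed to match at least one object), modelled loosely on rule (R1).

```python
# Sketch of rule support and the accuracy/coverage measures of Eqs. (5)-(6):
# antecedent and consequent are modelled as predicates over objects.
def rule_metrics(objects, antecedent, consequent):
    """Return (support, accuracy, coverage) of 'if antecedent then consequent'."""
    phi = {u for u, row in objects.items() if antecedent(row)}
    psi = {u for u, row in objects.items() if consequent(row)}
    support = phi & psi
    return support, len(support) / len(phi), len(support) / len(psi)

# Hypothetical mini-universe mimicking the Sepal Length example around (R1).
U = {
    "u8":  {"SepalLength": 6.0, "cls": "Virginica"},
    "u9":  {"SepalLength": 6.2, "cls": "Virginica"},
    "u10": {"SepalLength": 6.2, "cls": "Virginica"},
    "u5":  {"SepalLength": 5.0, "cls": "Versicolor"},
}
support, acc, cov = rule_metrics(U,
                                 lambda r: r["SepalLength"] >= 6,   # antecedent
                                 lambda r: r["cls"] == "Virginica") # consequent
print(sorted(support), acc, cov)   # ['u10', 'u8', 'u9'] 1.0 1.0
```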

At least three more cuts (shown by the dotted lines in Fig. 1(a)) are required to classify the remaining objects. This results in four more rules

if 5.8 ≤ Sepal Length < 6.0, then Versicolor    (R2)
if 5.6 ≤ Sepal Length < 5.8, then Setosa    (R3)
if 5.3 ≤ Sepal Length < 5.6, then Versicolor    (R4)
if Sepal Length < 5.3, then Setosa    (R5)

The first three of these, (R2), (R3), (R4), are supported by the singletons {u6}, {u2}, {u7}, respectively. (R5), however, has a triplet {u1, u3, u4} in its support, as well as one misclassified object (u5) that diminishes its accuracy to 3/4.

To formalize the concepts: discretization is the division of the range of an attribute's values into a few intervals, and the values at the interval boundaries are called cuts. A set of cuts spanning one or more attributes is said to be consistent if all the objects in the information system can be properly classified using the intervals formed by the cuts. For example, the set of cuts {(a1, 5.3), (a1, 5.6), (a1, 5.8), (a1, 6.0)} is inconsistent, since u5 ∈ C2 is indiscernible from u1, u3, u4 ∈ C1. For any rule "if φ then ψ," the accuracy and coverage [21] may be expressed as

Accuracy = |‖φ‖I ∩ ‖ψ‖I| / |‖φ‖I|    (5)

and

Coverage = |‖φ‖I ∩ ‖ψ‖I| / |‖ψ‖I|    (6)

where ‖φ‖I and ‖ψ‖I are the sets of objects in I matching the antecedent (φ) and consequent (ψ) of the rule, and |·| is the cardinality of a set. The set of objects ‖φ‖I ∩ ‖ψ‖I, matching both the antecedent and the consequent, is the support of the rule.

2.4. Linguistics and semantics

An improvement in the rule set (R) (i.e. (R1)–(R5)) is evident if a second attribute is taken into account. Including Sepal Width along with Sepal Length reduces the number of necessary cuts to three (denoted by the 3 dotted lines in Fig. 1(b)), and all the 10 objects are correctly classified by them. Thus {(a1, 5.8), (a1, 6), (a2, 2.75)} is a consistent set of cuts. The intervals formed by the cuts are often assigned linguistic labels like short, moderate and long, or wide and narrow, to represent the rules in a more familiar form. As such, the process of discretization not only increases the efficacy of classification, the


resulting rules are also simplified, reduced in number, and rendered meaningful from the intuitive aspect. The four rules thus formed are

short, wide Sepals are Iris Setosa    (S1)
short, narrow Sepals are Iris Versicolor    (S2)
moderately long Sepals are Iris Versicolor    (S3)
long Sepals are Iris Virginica    (S4)

where the intervals for Sepal Width are denoted as follows:

a Sepal is narrow if Sepal Width < 2.75 cm, and wide otherwise;

and as regards the Sepal Length, the intervals are:

a Sepal is short if Sepal Length < 5.8 cm, long if Sepal Length ≥ 6 cm, and moderately long otherwise.

The rule set (S) (i.e. (S1)–(S4)) is evidently better than (R), both in terms of classification accuracy and rule semantics. Only the moderately long interval is sarcastically narrow. To take a last example, we consider another pair of attributes, Sepal Length and Petal Length. Together they excellently classify all the 10 objects with just two cuts, one on each attribute: {(a1, 6), (a3, 2)}. The set of rules then becomes extremely simple

short Sepals and short Petals are Iris Setosa    (P1)
short Sepals and long Petals are Iris Versicolor    (P2)
long Sepals and long Petals are Iris Virginica    (P3)

where the thresholds distinguishing short and long are 6 cm for the sepals and 2 cm for the petals, corresponding to the two dotted lines in Fig. 1(c). Evidently the set of rules (P1)–(P3) (or (P)) appears much simpler and is intuitively easier to understand than the rule set (S). On pondering over (P), one may even raise a 'childish' question: why are there no Iris flowers with long sepals and short petals? To a botanist it might lead to some specialized knowledge discovery, but we may rather content ourselves with the simple answer that nature has a certain sense of proportion, which she abhors to disrupt.

2.5. Unification

To sum up, the clarity and classification accuracy of a set of rules are governed by two major factors: (i) choice of attributes (resulting in the reduct), and (ii) selection of cuts (forming the discretization). In order to obtain an optimum balance between them, the interrelation between the two needs to be taken into account. A reduct is the result of eliminating redundant attributes, while redundant cuts are eliminated in discretization. The heuristic for finding reducts (or discretizations) essentially consists of selecting a minimal set of attributes (or cuts) [20]. Whereas a few finely discretized attributes may satisfy Eqs. (3) and (4), the discretization could be coarse if more attributes are considered. In other words, the problems of discretization and finding a reduct are

• self-similar across scales, and
• complementary and inter-dependent.

Considered in unison, the interdependence of the two factors (i) and (ii) very much determines the clarity of the knowledge extracted from the data (expressed by the comprehensibility of the rules), and to some extent influences the classification accuracy. In order to arrive at an optimally discretized set of attributes, discretization and finding reducts need to be merged into a single seamless process. With this unification in view, the following terms may be introduced.

Definition 0. Discreduction is essentially a process that successfully blends the two related processes of discretization and finding a reduct for an information system.

Definition 1. A discreduct is a set of ordered pairs that, at the same time, discretizes and determines a reduct of an information system. It may be expressed as

D = { (ak, t_i^k) | ak ∈ A and t_i^k ∈ V_ak }  for k ∈ {1, 2, . . ., na}, i = 1, 2, . . ., mk

where na is the no. of attributes in A, mk is the no. of cuts on ak, and V_ak = [l_ak, r_ak) ⊂ R is the range of values of ak (R being the set of real numbers). Without loss of generality, it may be assumed that the mk cuts on any attribute ak are arranged in ascending order of their values, i.e.

i < j ⇔ t_i^k < t_j^k,    i, j ∈ {1, 2, . . ., mk}

Definition 2. The reduct of a discreduct is the set of attributes which have at least one cut in the discreduct. In other words, it is the domain of the discreduct

R = dom D = { a | (a, t) ∈ D for some t }.

Definition 3. The discreduced information system ID may be defined as the ordered pair consisting of the same set of objects with a reduced set of attributes R ⊆ A and discrete attribute values

ID = (U, R ∪ {d})

where the attributes in the reduct are transformed as

a_D^k(u) = i ⇔ ak(u) ∈ [t_i^k, t_{i+1}^k),    i = 0, 1, 2, . . ., mk

with t_0^k = l_ak and t_{mk+1}^k = r_ak. The transformed attribute values may be suitably mapped to a set of linguistic labels as discussed in Section 2.4. The set of discreduced attributes in the reduct may also be denoted by RD.

Definition 4. The classification accuracy of a discreduct is the ratio of the number of objects classified by the discreduct to the number of objects classified by all the attributes prior to discreduction

γ(D) = |POS_RD(d)| / |POS_A(d)|
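The transformation of Definition 3 is, in effect, a table lookup: a continuous value is replaced by the index of the half-open interval [t_i^k, t_{i+1}^k) it falls in. The sketch below (an assumed implementation, not the authors' code) uses `bisect` for this; the cuts are the paper's Iris example {(a1, 6), (a3, 2)}, while the `flower` object is hypothetical.

```python
# Sketch of Definition 3's transformation: a value ak(u) is replaced by the
# index i of the interval [t_i, t_{i+1}) it falls in.  bisect_right gives the
# index directly when the cuts are kept sorted; a value equal to a cut lands
# in the upper interval, matching the half-open intervals (cf. rule (R1)).
from bisect import bisect_right

def discretize(value, cuts):
    """Interval index of value w.r.t. sorted cuts: 0 below the first cut, etc."""
    return bisect_right(cuts, value)

def apply_discreduct(obj, discreduct):
    """Map an object's raw values to interval indices, dropping attributes
    that contribute no cut (they lie outside the reduct)."""
    return {a: discretize(obj[a], sorted(cuts)) for a, cuts in discreduct.items()}

D = {"a1": [6.0], "a3": [2.0]}      # Sepal Length cut at 6, Petal Length cut at 2
flower = {"a1": 5.1, "a2": 3.5, "a3": 1.4, "a4": 0.2}   # hypothetical object
print(apply_discreduct(flower, D))  # {'a1': 0, 'a3': 0}  -> 'short' on both
```

Mapping the indices {0, 1} to labels such as short/long then reproduces the linguistic rules of Section 2.4.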

2.6. Role of samples

A consistent set of cuts suffers from two major drawbacks: one quantitative and the other qualitative. First, the amount of computation drastically increases with the size of the data. Secondly, a consistent set of cuts that exactly classifies every object in a given data set (particularly with the minimal set of attributes) suffers from some kind of over-fitting, which decreases its accuracy in predicting unknown external data. The first makes the process time-consuming and costly, especially in case of large data sets with many distinct values of each attribute, while the second renders the final classification or decision rules clumsy, incomprehensible, inaccurate or even incorrect, particularly when the data is imprecise or noisy.


Table 2
Data set properties.

Name of the data set   # of attrib.'s |A|   # of objects |U|   # of classes |U/d|   # of minimal reducts n   |·| of minimal reducts |R|   Avg. # of cuts nc
Iris                   4                    150                3                    4                        3                            10.0
Glass                  9                    214                6                    10                       2                            45.5
Breast                 9                    683                2                    8                        4                            16.7

To resolve the problem, a series of data samples are taken and reducts are found for these samples. The attributes that are present in 'most' of the sample reducts are collected to form a dynamic reduct, which has been shown [20] to improve classification accuracy. But there are some open problems, such as the quantification of 'most,' or the determination of the size and number of samples, so as to ensure that the major trends in the data are sufficiently reflected in the rules, and at the same time noise is satisfactorily filtered out. These are quantities that should be carefully calibrated in order to get optimum results. Unfortunately, there exists no unanimity in the standard literature for determining such thresholds. In this paper we devise a method that uses samples taken from the data to find a dynamic discreduct, which determines (i) the requisite attributes, (ii) the number of cuts on each attribute, and (iii) the value (or position) of the cuts.

3. Behaviour of dynamic discreducts

Two major arguments are made in this section:

• Cuts from more significant attributes are more effective in a discreduct; this means that if we go on choosing the most effective cut in each step (as the MD-heuristic does), more cuts will finally be selected from attributes with a higher significance.
• The number of cuts needed to consistently classify a sample of objects is proportional to the square root of the sample size.

The first claim has no direct bearing on the methodology for finding discreducts. It is a study which warrants that cuts in a discreduct are not chosen randomly from any attribute; rather, each cut chosen from an attribute (using the heuristic method) stresses and signifies the presence of that attribute in the reduct. In other words, it asserts that a discreduct is also a reduct (see Definition 1). The second claim is a lemma that is directly applied in the dynamic discreduct algorithm. Three commonly used data sets, namely Iris, Glass, and Breast [i.e. Wisconsin Breast Cancer (Original)], from the UCI Machine Learning Repository [19] have been used to study these behaviours. The properties of the data sets are given in the first four columns of Table 2. Records with missing values have been removed from the last data set.

3.1. Significance vs. no. of cuts

For the first claim, all the minimal reducts (R1, R2, . . ., Rn) are computed for a data set, and each reduct is then discretized with the MD-heuristic algorithm [20] to get n discreducts (D1, D2, . . ., Dn). The number and cardinality of the minimal reducts for the three data sets are given in columns 5 and 6 of Table 2, while the last column gives the average number of cuts, nc, in the discreducts. Since a tie in choosing the most effective cut (that which discerns the maximum number of object pairs from different decision classes) was resolved by a toss, 10 runs are taken for each reduct. Considering all the discretized minimal reducts, the average number of cuts contributed by an attribute, ν(a), is plotted against the significance of that attribute, σ(a), in Fig. 2, where

ν(a) = (1/n) Σi (cuts contributed by a in Di)

σ(a) = (1/n) Σi σ(Ri,d)(a)

A weighted average ν̄(a) for the number of cuts contributed by an attribute is also calculated. The logic of placing the weight is that if a set of objects could be classified by fewer cuts, each cut should be deemed more effective. Thus the weight carried by a cut varies inversely as the number of cuts in that discreduct, nc/|Di| giving the relative efficiency of every cut in Di

ν̄(a) = (nc/n) Σi (cuts contributed by a in Di) / (total no. of cuts in Di)

where nc = (1/n) Σi |Di| is the average number of cuts in the n minimal reducts. The linear fits in Fig. 2 show that the significance of an attribute is roughly proportional to the number of cuts contributed by the attribute. The correlation coefficient (r) is about 90% for all the data sets, with the weighted average giving a slightly better value.

Fig. 2. (a–c) Significance vs. no. of cuts on different data sets: (a) Iris, (b) Glass, (c) Breast.
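The MD-heuristic cited as [20] greedily picks, at each step, the cut that discerns the largest number of still-undiscerned object pairs from different decision classes. The following is an illustrative re-implementation of that greedy scheme under stated assumptions (midpoint candidate cuts, deterministic tie-breaking by candidate order instead of the paper's coin toss), not the authors' code.

```python
# Greedy sketch in the spirit of the MD-heuristic: repeatedly pick the
# candidate cut discerning the most remaining object pairs from different
# decision classes, until every such pair is discerned.

def md_heuristic(rows, attrs, decision="d"):
    # Candidate cuts: midpoints between consecutive distinct attribute values.
    candidates = []
    for a in attrs:
        vals = sorted({row[a] for row in rows})
        candidates += [(a, (lo + hi) / 2) for lo, hi in zip(vals, vals[1:])]

    def discerns(cut, pair):
        a, t = cut
        u, v = pair
        return (u[a] < t) != (v[a] < t)

    pairs = [(u, v) for i, u in enumerate(rows) for v in rows[i + 1:]
             if u[decision] != v[decision]]
    chosen = []
    while pairs:
        best = max(candidates, key=lambda c: sum(discerns(c, p) for p in pairs))
        if not any(discerns(best, p) for p in pairs):
            break                       # inconsistent data: no cut helps
        chosen.append(best)
        pairs = [p for p in pairs if not discerns(best, p)]
    return chosen

# Hypothetical one-attribute mini-sample; two cuts suffice here.
rows = [{"a1": 4.9, "d": "Setosa"}, {"a1": 5.4, "d": "Setosa"},
        {"a1": 5.9, "d": "Versicolor"}, {"a1": 6.3, "d": "Virginica"}]
print(md_heuristic(rows, ["a1"]))   # two cuts, near 5.65 and 6.1
```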

Fig. 3. (a–c) No. of cuts vs. sample size of different data sets: (a) Iris, (b) Glass, (c) Breast.

3.2. No. of cuts vs. sample size

A sample of size nz is taken from the set of objects U using random numbers generated from the code. nz is varied as

nz = z · |U|,    z = 0.1, 0.2, . . ., 0.9

where, for each of the nine relative sample sizes (z), 100 samples are taken. The consistent set of cuts Ti is determined for the ith sample Ui using the MD-heuristic algorithm [20], and the average number of cuts (nc = Σi=1..100 |Ti|/100) needed to consistently classify Ui is plotted against z. The results in Fig. 3 show that the z–nc plots for the three data sets closely fit the power relation

nc = c · z^p    (7)

with an r² value of 99.6%. Here, c is a constant for any data set denoting the number of cuts needed to consistently classify the entire U (i.e. when z = 1), and the parameter p has a value very near 0.5 for the Glass and Breast data sets. The relation of nc with z thus corresponds to the well-known expression for the standard deviation of a sample, σn ∝ √n, for fairly large sample sizes n ≫ 1. In case of Iris, the value of p is slightly higher (about 0.6). The deviation is caused by the two points in Fig. 3(a), corresponding to the smallest sample sizes (z = 0.1 and 0.2), that clearly fall out of the parabolic fit. The misfit of these two points is explicably because the number of cuts has fallen to an extremely low value (less than 3), in which condition relation (7) does not hold. We thus fit an alternative form of the power relation

nc = b + c · z^p    (8)

leaving out the two smallest sample sizes (z = 0.1 and 0.2). The closeness of fit (r²) then increases to more than 99.9%. The improvement is also graphically evident, where the gray dotted curve representing Eq. (8) passes through all the seven points, whereas the (blue) firm line as per Eq. (7) seems a bit stiff in fitting the points. The value of p also touches the expected value 0.5. The small intercept on the z-axis (about 4.7) may be interpreted as the size of a sample where just one cut (rounding off zero to the next highest integer) will be sufficient to discern all objects in the sample.

To sum up: firstly, an attribute in an information system contributes more cuts in a discreduct if its removal from a reduct greatly reduces the classification accuracy (i.e. it has a high significance as per Eq. (2)); and secondly, the number of cuts required to consistently classify a random sample of objects varies as the square root of the sample size for fairly sized discreducts (with more than 2 cuts).

4. Methodology for finding dynamic discreduct

The algorithm for finding a dynamic discreduct D for a data set is given below. The number and size of samples, ns and nz, should be tuned so that the major trends in the data are sufficiently represented, and at the same time any noise gets eliminated as far as possible. The optimum values of these two parameters are discussed at the end of Section 4.1.

ALGORITHM: Dynamic Discreduct (I, ns, nz)
1   for i = 1 to ns do
2     create Ui, a random sample of nz objects
3     determine Ti, a minimal consistent set of cuts for Ui
4     for k = 1 to na do
5       Tik ← {t ∈ Ti | t is a cut on attribute ak}
6     end for
7   end for
8   D ← ∅, R ← ∅
9   for k = 1 to na do
10    compute mk = (1/ns) Σi=1..ns |Tik|; round off mk
11    if mk > 0 then
12      Fk ← the mk most frequent cuts in Tik, i = 1, 2, . . ., ns
13      D ← D ∪ ({ak} × Fk)
14      R ← R ∪ {ak}
15    end if
16  end for
17  return D, R

where I = (U, A ∪ {d}) is the information system, ns is the number of samples, nz is the size of each sample, and na = |A| is the number of attributes. The set Tik contains the cuts contributed by ak in discretizing the objects in the ith sample Ui. In ns successive samples, the cuts that are chosen with the highest frequency are selected in Fk. The cardinality of Fk is determined by averaging the cardinalities of Tik (over i = 1, 2, . . ., ns). The principle is one of proportional representation: those attributes which can classify more objects get a higher share of cuts.
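The algorithm above can be transcribed almost line for line. The sketch below does so under stated assumptions: a simple greedy midpoint-cut routine stands in for the MD-heuristic of step 3, Python's `round` stands in for the unspecified rounding of step 10, and the 60-object data set is synthetic, built so that a1 carries all the class information while a2 is pure noise.

```python
# Runnable sketch of ALGORITHM: Dynamic Discreduct (illustrative, not the
# authors' code).
import random
from collections import Counter

def consistent_cuts(rows, attrs, decision="d"):
    """Greedy consistent set of cuts for one sample (stand-in for MD-heuristic)."""
    cands = []
    for a in attrs:
        vals = sorted({r[a] for r in rows})
        cands += [(a, (x + y) / 2) for x, y in zip(vals, vals[1:])]
    disc = lambda c, p: (p[0][c[0]] < c[1]) != (p[1][c[0]] < c[1])
    pairs = [(u, v) for i, u in enumerate(rows) for v in rows[i + 1:]
             if u[decision] != v[decision]]
    cuts = []
    while pairs:
        best = max(cands, key=lambda c: sum(disc(c, p) for p in pairs))
        if not any(disc(best, p) for p in pairs):
            break
        cuts.append(best)
        pairs = [p for p in pairs if not disc(best, p)]
    return cuts

def dynamic_discreduct(rows, attrs, ns, nz, decision="d", seed=0):
    rng = random.Random(seed)
    per_attr = {a: [] for a in attrs}      # cuts from T_i^k pooled over samples
    sizes = {a: [] for a in attrs}         # |T_i^k| for each sample i
    for _ in range(ns):
        sample = rng.sample(rows, nz)                   # U_i  (step 2)
        T = consistent_cuts(sample, attrs, decision)    # T_i  (step 3)
        for a in attrs:
            Tk = [t for (b, t) in T if b == a]
            per_attr[a] += Tk
            sizes[a].append(len(Tk))
    D, R = set(), set()
    for a in attrs:
        mk = round(sum(sizes[a]) / ns)     # step 10: average no. of cuts
        if mk > 0:
            for t, _ in Counter(per_attr[a]).most_common(mk):   # F_k (step 12)
                D.add((a, t))
            R.add(a)
    return D, R

# Hypothetical data: class flips at a1 = 3.0; a2 is random noise.
gen = random.Random(1)
rows = [{"a1": i / 10, "a2": gen.random(), "d": "X" if i < 30 else "Y"}
        for i in range(60)]
D, R = dynamic_discreduct(rows, ["a1", "a2"], ns=12, nz=15)
print(R)   # the noisy attribute a2 contributes no cuts on average
```

Because every sample is separated by a single a1 cut, mk averages to 1 for a1 and 0 for a2, so the noise attribute drops out of the reduct, which is exactly the filtering behaviour the section describes.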

4.1. Classification accuracy of dynamic discreduct

In this subsection we explore the optimum size of a sample that could best predict the class of some unknown data in a particular domain. For this, the data set is split into 5 equal subsets (S1, S2, . . ., S5) using random numbers generated from the code. One of them is chosen as the 'test set,' and the remaining are merged into a 'training set.' A dynamic discreduct is found for the training set using the algorithm in Section 4. Rules are derived from the discreduced training set, which are then used to predict the class of every object in the test set. If no rule is applicable to an object in the test set, it is predicted to fall in the widest class (i.e. the one with the maximum number of objects). The percentage of objects correctly

Fig. 4. (a–c) Variation of classification accuracy with number of cuts for different data sets: (a) Iris, (b) Glass, (c) Breast.

classified is the classification accuracy for that test set

γi = (no. of objects in Si correctly classified by the rules from U − Si) / (no. of objects in Si)

The classification accuracies are averaged over the five test sets, each time merging the remaining four subsets into the training set (the fivefold cross-validation scheme [1])

γ̄ = (1/5) Σi=1..5 γi    (9)

The classification accuracy (γ̄) for the three data sets is plotted against the number of cuts (nc) for different values of relative sample size (z) in Fig. 4. The results suggest that 20–30% of the data should be the optimum sample size. The optimum number of cuts can then be found from Eq. (7) according as

nc(optm) = c · √(z(optm))

which is near about half the number of cuts required to consistently classify all the objects in U. But in practice c is quite difficult to determine; rather, the number of cuts needed to consistently classify a sample of size nz is much more readily available. So it is best to set the value of z to 0.25, and the average cardinality of Ti, doubled, would serve as a good estimate for the optimal number of cuts needed to predict the whole data. If it falls below 3, 1 should be added to it, and it is safe to round the value off to the next (higher) integer. The number of samples ns can be determined from the thumb rule that each object in the data should be represented 3 or 4 times. Thus, for z = 0.25, ns would be 12–15.

5. Results

5.1. Classification of UCI data sets

Using the algorithm described in the previous section, the best values of classification accuracy achieved by dynamic discreduction (DD) for the three data sets have been compared with other methods in Table 3. The results denoted as S-ID3 and C4.5 were taken from [20]. The MDLP algorithm was originally proposed by Fayyad and Irani [22], but the original paper bears no numerical results, so other sources [1,23] were resorted to. The results of 1R were taken from the original paper by Holte [24]. The results denoted as RS-D are obtained using classical Rough Set discretization methods [1,20], while LEM2 is a powerful Rough Set algorithm [25]. The results of each reference in Table 3 have been summarized by an average of the existing results, and a comparison has been made with the average of the present (DD avg.) results for the corresponding data sets. It is clear that the method of dynamic discreduction proposed in the present paper outperforms all the prevailing methods.

Table 3
Classification accuracies achieved by different methods.

Data       S-ID3    C4.5     MDLP     1R       RS-D     LEM2     DD
Iris       96.67    94.67    93.93    95.9     95.33    95.3     96.06
Glass      62.79    65.89    63.14    56.4     66.41    –        66.68
Breast     –        –        93.63    –        –        –        95.34
Average    79.73    80.28    83.57    76.15    80.87    95.3     –
DD avg.    81.37    81.37    86.03    81.37    81.37    96.06    –

Fig. 5. Run time vs. sample size of different data sets.

5.1.1. Run time vs. sample size

Experiments were conducted with different sample sizes (z) varying from 10% to 80% of the training data. The number of samples (ns) was set so as to ensure that each object in the training set is represented thrice on an average in the samples, i.e. z · ns ≈ 3. The time taken for discreduction (i.e. selection of cuts) is presented in Table 4. The results suggest that for the larger data sets (excepting Iris) taking samples substantially reduces the computation time. The run time (t) was also plotted against the relative sample size (z) in Fig. 5, and a power fit was tried on the data points for each of the three data sets. Now, the time required for selecting a cut varies as kz · nz, where the number of objects in a sample nz ∝ z, and


Table 4
Run-time for full training set and different sample sizes (z) and no. of samples (ns ≈ 3/z).

(z, ns)    (0.1, 30)  (0.2, 15)  (0.25, 12)  (0.3, 10)  (0.4, 8)  (0.5, 6)  (0.6, 5)  (0.8, 4)  (1, 1)
Iris       0.422      0.242      0.216       0.253      0.188     0.227     0.255     0.354     0.134
Glass      0.888      1.428      1.835       2.244      3.335     4.303     5.597     8.616     3.603
Breast     0.697      0.888      1.060       1.281      1.966     2.372     3.013     4.713     1.950

the number of distinct values in a sample varies as kz ∝ √z. Thus the time taken to discretize a sample should vary as

t ∝ z^1.5    (10)
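Exponents such as the 1.5 of Eq. (10) are conventionally estimated by a log-log least-squares fit: a power law t = c·z^p becomes the straight line ln t = ln c + p·ln z. The sketch below demonstrates the technique on synthetic data generated from p = 1.5 exactly; the data points are hypothetical, not the measured run times of Table 4.

```python
# Log-log least-squares recovery of a power-law exponent, as used for
# fits like Eq. (10).  Synthetic data with p = 1.5 and c = 2.0.
import math

def fit_power_law(zs, ts):
    """Return (c, p) minimizing squared error of ln t = ln c + p ln z."""
    xs = [math.log(z) for z in zs]
    ys = [math.log(t) for t in ts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    p = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    c = math.exp(my - p * mx)
    return c, p

zs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8]
ts = [2.0 * z ** 1.5 for z in zs]      # hypothetical, exact power-law data
c, p = fit_power_law(zs, ts)
print(round(c, 3), round(p, 3))        # 2.0 1.5
```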

The exponents of z in Fig. 5 strongly suggest the expected value 1.5, with r² greater than 99%. The figure also suggests that a sample size less than 40% would be computationally economic, taking the suggested number of samples.

5.2. Mining a real life data set

In this subsection the advantages of using the dynamic discreduct algorithm over the conventional process of finding the reduct and then discretizing the attributes in the reduct are examined. Both methods are applied to a data set relating the composition and process parameters to the mechanical properties of TRIP steel, and the obtained rules are compared as to how well they reveal the underlying TRIP phenomena.

Transformation induced plasticity (TRIP) steels, first reported by Zackay et al. [26], exhibit a superior combination of strength and ductility. This characteristic has made TRIP-aided steel a potential material for various automobile parts requiring high strength with adequate formability. As the phenomenon of TRIP is quite complicated, there is still sufficient scope for exploration. Several attempts have been made to model the TRIP phenomenon from its physical understanding [27,28], but no suitable model exists till date to predict the mechanical properties of TRIP steel directly from the composition and processing parameters. This is mainly due to the lack of precise knowledge about the complex and non-linear role of the independent variables on the properties of the steel. Efforts have been made to develop data-driven models using tools like artificial neural networks and genetic algorithms to predict TRIP steel properties [29–31]. But these models have the inherent complexity and opacity of a black box. In a recent work [32] the properties of TRIP steel have been investigated from the RS approach to deduce rules from the data using the minimal reduct discretization method described in Section 5.2.1.
In Section 5.2.2 the dynamic discreduction algorithm has been used to derive another set of rules from the same TRIP steel data. The two methods may be briefed as below:

1. Determine a minimal reduct, then discretize every attribute therein, and derive a set of rules for the system.
2. Find a dynamic discreduct, and arrive at a set of rules.

Finally, the two sets of rules thus obtained are compared vis-à-vis the existing knowledge on TRIP steel.

The TRIP steel data set contains 90 objects, most of which were collected from the published literature of several workers [30,33–37]. The multiple sources are also an inherent source of noise in the data, which comes through different dimensions of experimental errors. This renders the application of the dynamic discreduction algorithm more relevant here. The ranges of values of the conditional (i.e. compositional and processing) attributes, as well as the decision attribute (UTS), are listed in Table 5. To start with, UTS is discretized into three equal-frequency classes, roughly allotting the same number of objects to each class. Steels below 730 MPa fall in the class of Low-strength

Table 5
Numerical range of the attributes in the TRIP steel data.

Attributes (units)                           Symbol    Min.      Max.
A. Conditional
  (i) Composition
    1. Carbon (wt%)                          C         0.12      0.29
    2. Manganese (wt%)                       Mn        1.00      2.39
    3. Silicon (wt%)                         Si        0.48      2.00
  (ii) Processing
    4. Cold deformation (%)                  d         56.25     77.14
    5. Intercritical annealing temp. (°C)    Ta        750       860
    6. Intercritical annealing time (s)      ta        51        1200
    7. Bainitic transformation temp. (°C)    Tb        350       500
    8. Bainitic transformation time (s)      tb        30        1200
B. Decision
  Ultimate tensile strength (MPa)            UTS       580.72    887.44

steels, those from 730 to 770 MPa in the Medium-strength class, and steels with UTS above 770 MPa fall in the High-strength class (see the last three columns of Table 6). This discretization of UTS has been used in both Sections 5.2.1 and 5.2.2.

5.2.1. Minimal reduct discretization

The results in this subsection were reported in a recent work by the present authors [32]. We re-present them here to draw a comparison with the results of dynamic discreduction presented in the next subsection. The first task is to find a minimal reduct for the TRIP steel data set. Since the number of attributes is quite small, an extensive search was undertaken to see whether a subset of attributes can classify all the objects of the data consistently, starting from the 8 one-attribute subsets, next searching the 28 two-attribute subsets, and so on. At cardinality 4, just one reduct is found; this is of course the only minimal reduct. The 4 attributes in the reduct are then discretized with the MD-heuristic algorithm, yielding intervals to which certain names (or labels) are given, as in Table 6. Rules are derived from the discretized minimal set of attributes. The number of rules with one to four terms in the antecedent came to near about 200. From these, only a handful are selected that actually represent the general patterns in the data. This was done on the basis of two qualifying metrics (Eqs. (5) and (6)). The threshold values of accuracy and coverage for selecting the rules were set to 80% and 15%, respectively, on an ad-hoc basis, so as to limit the set of rules to a handful. The final set of rules is presented in Rule Set 1. The pair of values in square brackets indicates these two values, respectively, for each rule.

Rule Set 1
Rules obtained by minimal reduct discretization

1. if Si = MH ∧ Tb = ML then UTS = L    [100, 21.4]
2. if Si = QH ∧ Ta = L ∧ Tb = MH then UTS = M    [90, 23.1]
3. if Si = QH ∧ tb = L then UTS = M    [86, 15.4]
4. if Si = QH ∧ tb = ML then UTS = M    [100, 17.9]
5. if Si = QH ∧ tb = MH then UTS = M    [100, 17.9]
6. if Si = VL ∧ Ta = H ∧ tb = MH then UTS = H    [83, 21.7]
7. if Si = L ∧ Ta = L ∧ tb = MH then UTS = H    [100, 21.7]
8. if Si = ML ∧ Tb = ML then UTS = H    [100, 21.7]

The absence of intercritical annealing time (ta) and cold deformation (d) in the minimal reduct seems reasonable, as these variables are known to have an insignificant contribution to the final microstructure and properties. On the other hand, it may be noted


Table 6
Intervals on discretizing the four attributes in the minimal reduct of the TRIP steel data.

Si (wt%):   0.48–0.73 VL;  0.73–0.985 QL;  0.985–1.09 L;  1.09–1.19 ML;  1.19–1.22 MH;  1.22–1.40 H;  1.40–1.46 QH;  1.46–2.00 VH
Ta (°C):    750–795 L;  795–810 M;  810–860 H
Tb (°C):    350–375 L;  375–415 ML;  415–440 M;  440–457 MH;  457–500 H
tb (s):     30–45 VL;  45–150 L;  150–230 ML;  230–280 M;  280–450 MH;  450–950 H;  950–1200 VH
UTS (MPa):  580–730 L;  730–770 M;  770–890 H

VL: very low; QL: quite low; L: low; ML: moderately low; M: medium; MH: moderately high; H: high; QH: quite high; VH: very high.

that the only compositional parameter Si (and the processing parameter bainitic transformation time, tb) is present in (almost) every rule. Interestingly, both Si and tb also have more cuts than the other attributes, which in some way verifies the first claim made at the beginning of Section 3. The rules clearly show that a lesser amount of Si and a moderately high bainitic transformation time (tb) are favoured for higher strength of the steel. This can be justified from the fact that TRIP steel with a lower amount of Si may contain carbides in the microstructure leading to high strength, whereas a somewhat higher transformation time favours a good amount of bainite, resulting in an increase in the strength level. But the absence of C and Mn in the reduct cannot be justified, as it is known that these two elements play the most important role in the stability of retained austenite, and consequently in the occurrence of TRIP. Since the data was compiled from various sources (reporting experiments carried out in different situations), there is ample scope for noise in the data. This may have caused the over-fitting, and the under-representation of essential attributes in the reduct.

The resulting discreduct with 10 cuts (compared to 19 from the consistent discretization in the previous section) spans six attributes (instead of four previously). The positions of the cuts and the labels assigned to the respective intervals are shown in Table 7. Five rules obtained from the data set cleared the 80% accuracy and 30% coverage levels; they are presented in Rule Set 2. Two processing attributes (d and ta ) failed to contribute any cut and were regarded as redundant, as in the previous method. On the other hand, the introduction of C and Mn is very significant from the metallurgical point of view, since, in the given range of values, they are known to be quite important parameters in deciding the strength of any steel. C and Mn are the most potent austenite stabilizers and also play a significant role in the hardenability of the retained austenite; thus the TRIP phenomenon in steel is chiefly controlled by C and Mn. From this point of view, the inclusion of these two attributes is a commendable achievement of the proposed algorithm. The introduction of these two compositional attributes triggers another interesting series of events. The additional attributes help to reduce the cuts to a meagre 10 (compared to 19 in the previous set of rules). This confines the composition and processing attributes to two or three discrete classes, which can be described by simple qualifiers like 'Low', 'Medium' or 'High', dispensing with finer intervals like 'Quite High' or 'Moderately Low'. This in turn keeps the coverage of the rules at a higher level, indicating that the most general patterns are represented in the rules. The rules are thus readily interpretable in terms of the TRIP phenomenon, and well suited to applying the extracted knowledge in the further development of TRIP steel. This is evidently a marked improvement achieved by dynamic discreduction over the previous method.
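The 80% accuracy / 30% coverage screening mentioned above can be sketched as a simple filter; the rule records below are hypothetical stand-ins, not the actual output of the rule generator:

```python
# Keep only rules that clear both quality thresholds. The accuracy and
# coverage figures here are invented examples for illustration.
rules = [
    {"rule": "C=L and Mn=L ... => UTS=L", "accuracy": 86, "coverage": 46},
    {"rule": "... => UTS=M",              "accuracy": 78, "coverage": 55},  # fails accuracy
    {"rule": "... => UTS=H",              "accuracy": 92, "coverage": 33},
    {"rule": "... => UTS=H",              "accuracy": 95, "coverage": 12},  # fails coverage
]

selected = [r for r in rules if r["accuracy"] >= 80 and r["coverage"] >= 30]
print(len(selected))  # 2
```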

Rule Set 2
Rules obtained by dynamic discreduction (figures in brackets are [accuracy %, coverage %]).

1. If C = L ∧ Mn = L ∧ Si = M ∧ Ta = L ∧ Tb = L then UTS = L [86, 46]
2. If C = H ∧ Mn = L ∧ Si = H ∧ Ta = L then UTS = M [92, 33]
3. If Mn = H ∧ Tb = L ∧ tb = M then UTS = H [100, 35]
4. If Mn = H ∧ Ta = H ∧ tb = M then UTS = H [89, 39]
5. If C = H ∧ Si = H ∧ tb = M then UTS = H [87, 30]
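A rule set of this form is straightforward to apply to a discretized object: each antecedent is a conjunction of attribute labels, and an object fires every rule whose conditions it satisfies. A minimal sketch, with the rules of Rule Set 2 encoded as dictionaries (the encoding is illustrative, not the paper's implementation):

```python
# Rule Set 2 as (antecedent, UTS label) pairs; an antecedent matches when
# every listed attribute of the object carries the required label.
RULES = [
    ({"C": "L", "Mn": "L", "Si": "M", "Ta": "L", "Tb": "L"}, "L"),
    ({"C": "H", "Mn": "L", "Si": "H", "Ta": "L"},            "M"),
    ({"Mn": "H", "Tb": "L", "tb": "M"},                      "H"),
    ({"Mn": "H", "Ta": "H", "tb": "M"},                      "H"),
    ({"C": "H", "Si": "H", "tb": "M"},                       "H"),
]

def predict_uts(obj):
    """Return the UTS labels of all rules whose antecedent matches obj."""
    return [uts for cond, uts in RULES
            if all(obj.get(attr) == lab for attr, lab in cond.items())]

# A hypothetical discretized steel: fires rules 2 and 5
obj = {"C": "H", "Mn": "L", "Si": "H", "Ta": "L", "Tb": "L", "tb": "M"}
print(predict_uts(obj))  # ['M', 'H']
```

When several rules fire with different conclusions, as here, a voting or accuracy-weighted resolution step would be needed on top of this sketch.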

5.2.2. Dynamic discreduction

Thirty samples were taken from the data set, each containing 30 objects. For each sample a consistent set of cuts was determined using the MD-heuristic algorithm, with all 8 conditional attributes being allowed to contribute cuts. The 10 most frequently occurring cuts were then chosen using a proportional representation of attributes, as described in the algorithm in Section 4.
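The sampling step described above can be sketched as follows. This is a minimal illustration, assuming the data set is a list of objects; `toy_cut_finder` is a hypothetical stand-in for the MD-heuristic, and the proportional representation of attributes is only noted in a comment, not enforced:

```python
# Dynamic discreduction, sampling stage: draw repeated samples, find a
# consistent cut set for each, and keep the globally most frequent cuts.
import random
from collections import Counter

def dynamic_discreduct(data, find_consistent_cuts,
                       n_samples=30, sample_size=30, n_cuts=10):
    """Collect the n_cuts most frequent cuts over repeated random samples."""
    freq = Counter()
    for _ in range(n_samples):
        sample = random.sample(data, sample_size)
        freq.update(find_consistent_cuts(sample))  # cuts as (attribute, value)
    # proportional representation per attribute (Section 4) would further
    # constrain this selection
    return [cut for cut, _ in freq.most_common(n_cuts)]

def toy_cut_finder(sample):
    # stand-in only: midpoints between consecutive distinct values of the
    # single attribute in this toy data -- NOT the MD-heuristic itself
    vals = sorted({x for x, _ in sample})
    return [("a", (u + v) / 2) for u, v in zip(vals, vals[1:])]

random.seed(0)
data = [(i, i % 2) for i in range(100)]      # (attribute value, class) pairs
cuts = dynamic_discreduct(data, toy_cut_finder,
                          n_samples=5, sample_size=10, n_cuts=3)
print(len(cuts))  # 3
```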

5.2.3. Cuts in the dynamic discreducts

Finally, we present two characteristics of the cuts in the 30 samples from which the dynamic discreduct was constructed.

Table 7
Intervals in the dynamic discreduct of the TRIP steel data (Min–Max of each interval, with its label).

C (wt%): 0.12–0.145 (L), 0.145–0.29 (H)
Mn (wt%): 1.0–1.48 (L), 1.48–2.15 (M), 2.15–2.39 (H)
Si (wt%): 0.48–0.73 (L), 0.73–1.4 (M), 1.4–2.0 (H)
Ta (°C): 750–810 (L), 810–860 (H)
Tb (°C): 350–415 (L), 415–457 (M), 457–500 (H)
tb (s): 30–230 (L), 230–450 (M), 450–1200 (H)

L: low; M: medium; H: high.


Fig. 6. (a and b) Sample cuts in the TRIP steel data.

The number of cuts required to consistently classify a sample is plotted as a bar chart in Fig. 6(a), while the share of each of the eight attributes in the total number of cuts in the samples is plotted in another bar chart, shown in Fig. 6(b). Fig. 6(a) is interesting in that it approximately follows a normal distribution, except for a sharp rise at the value 11. A possible explanation is that a few objects contained special information not included in the other objects. This could represent results in a region not covered by other experiments; otherwise it would denote noise in the data, i.e. experimental errors. The average value comes to 8.4, which rounds up to 9; keeping a safe margin, we take 10 cuts as the cardinality of the dynamic discreduct. The share of cuts in the sample discreducts shown in Fig. 6(b) clearly demarcates two attributes (d and ta ) as redundant, each getting less than 5% of the cuts in the samples. Two other attributes, C and Ta , get around 10% of the sample cuts, while the remaining four (viz. Mn, Si, Tb , and tb ) receive 15–20% each. In constructing the dynamic discreduct, C and Ta are thus given one cut each, while each of Mn, Si, Tb , and tb is allotted two cuts (see Table 7).
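The allocation just described can be sketched as a proportional split of the 10-cut budget. The shares below are approximate values read off Fig. 6(b), not exact figures from the paper:

```python
# Drop attributes with under 5% of the sample cuts, then distribute the cut
# budget proportionally among the rest. Shares are approximate illustrations.
shares = {"C": 0.10, "Mn": 0.17, "Si": 0.18, "d": 0.03,
          "Ta": 0.10, "ta": 0.04, "Tb": 0.16, "tb": 0.19}

def allocate_cuts(shares, total=10, drop_below=0.05):
    kept = {a: s for a, s in shares.items() if s >= drop_below}
    norm = sum(kept.values())
    # round each surviving attribute's proportional share of the budget
    return {a: round(total * s / norm) for a, s in kept.items()}

print(allocate_cuts(shares))
# {'C': 1, 'Mn': 2, 'Si': 2, 'Ta': 1, 'Tb': 2, 'tb': 2}
```

With these shares the allocation reproduces the paper's outcome: d and ta are dropped, C and Ta receive one cut each, and Mn, Si, Tb and tb receive two each, for a total of 10.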

6. Conclusion

In the present paper the relation and interdependence between two vital tasks carried out through Rough Set Theory, namely reduction and discretization of attributes, have been investigated. The self-similarity and complementarity of the two processes have been utilized to devise a method that finds the optimally discretized set of attributes. The cuts that are most frequently required to classify all objects in a series of samples taken from the data are collected to form a dynamic discreduct. Those attributes where one or more cuts are placed form the reduct. The processes of discretization and finding reducts are thus merged into a single seamless process, which has been named dynamic discreduction. The efficiency of the algorithm depends on two parameters, viz. the number of cuts and the size of the samples. To obtain optimum values of these two parameters, their effect on classification accuracy has been studied. The method has been applied to some benchmark data sets, and the results clearly outperform all existing methods. A real life data set on TRIP steel has also been analysed, where the rules derived from the dynamic discreduct are found to be simpler, more general, and more appropriate from the metallurgical aspect than the rules derived from discretized minimal reducts.

Acknowledgements

The present research was conducted as part of a Fast Track Scheme for Young Scientists supported by the Department of Science and Technology, Government of India, vide Grant no. SR/FTP/ETA-02/2007. The financial support is duly acknowledged.

References

[1] H.S. Nguyen, S.H. Nguyen, Discretization Methods in Data Mining, vol. 1, Springer Physica-Verlag, 1998, pp. 451–482.
[2] H. Liu, F. Hussain, C.L. Tan, M. Dash, Discretization: an enabling technique, Data Mining and Knowledge Discovery 6 (2002) 393–423.
[3] R. Jin, Y. Breitbart, C. Muoh, Data discretization unification, in: Seventh IEEE International Conference on Data Mining, 2007, pp. 183–192, doi:10.1109/ICDM.2007.35.
[4] Y. Yang, Discretization for Naive-Bayes Learning, PhD thesis, School of Computer Science and Software Engineering, Monash University, 2003.
[5] P. Blajdo, Z.S. Hippe, T. Mroczek, J.W. Grzymala-Busse, M. Knap, L. Piatek, An extended comparison of six approaches to discretization – a Rough Set approach, Fundamenta Informaticae 94 (2009) 121–131.
[6] J. Zhao, Y. Zhou, New heuristic method for data discretization based on Rough Set Theory, The Journal of China Universities of Posts and Telecommunications 16 (2009) 113–120.
[7] Y. He, D. Chen, W. Zhao, Integrated method of compromise-based ant colony algorithm and Rough Set Theory and its application in toxicity mechanism classification, Chemometrics and Intelligent Laboratory Systems 92 (2008) 22–32.
[8] L. Xu, F. Zhang, X. Jin, Discretization algorithm for continuous attributes based on niche discrete particle swarm optimization, Journal of Data Acquisition and Processing 23 (2008) 584–588 (in Chinese: Shuju Caiji Yu Chuli).
[9] M. Boulle, Khiops: a statistical discretization method of continuous attributes, Machine Learning 55 (2004) 53–69.
[10] G. Li, H. Sun, H. Li, X. Jiang, Discretization of continuous attributes based on statistical information, Journal of Computational Information Systems 4 (2008) 1069–1076.
[11] T. Qureshi, D.A. Zighed, Using resampling techniques for better quality discretization, in: 6th International Conference on Machine Learning and Data Mining in Pattern Recognition, MLDM 2009, Leipzig, 2009, pp. 1515–1520.
[12] J. Senthilkumar, D. Manjula, R. Krishnamoorthy, NANO: a new supervised algorithm for feature selection with discretization, in: IEEE International Advance Computing Conference, IACC 2009, 2009, pp. 1515–1520.
[13] L. Tinghui, S. Liang, J. Qingshan, W. Beizhan, Reduction and dynamic discretization of multi-attribute based on Rough Set, in: World Congress on Software Engineering, WCSE 2009, Xiamen, 2009.
[14] F. Min, Q. Liu, C. Fang, Rough Sets approach to symbolic value partition, International Journal of Approximate Reasoning 49 (2008) 689–700.
[15] Y.-Y. Guan, H.-K. Wang, Y. Wang, F. Yang, Attribute reduction and optimal decision rules acquisition for continuous valued information systems, Information Sciences 179 (2009) 2974–2984.
[16] J. Mata, J.-L. Alvarez, J.-C. Riquelme, Discovering numeric association rules via evolutionary algorithm, in: 6th Conference on Knowledge Discovery and Data Mining, 2002, pp. 40–51.
[17] Z. Pawlak, Rough Sets, International Journal of Computer & Information Sciences 11 (1982) 341–356.
[18] G. Cantor, Contributions to the Founding of the Theory of Transfinite Numbers, Dover Publications, 1915.
[19] A. Frank, A. Asuncion, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2010. URL: http://archive.ics.uci.edu/ml.
[20] J. Komorowski, Z. Pawlak, L. Polkowski, A. Skowron, Rough Sets: A Tutorial, 2002. URL: alfa.mimuw.edu.pl/prace/1999/D5/Tutor06 09.ps.
[21] I. Düntsch, G. Gediga, Rough Set data analysis: a road to non-invasive knowledge discovery, Methodos (2000).
[22] U.M. Fayyad, K.B. Irani, On the handling of continuous-valued attributes in decision tree generation, Machine Learning 8 (1992) 87–102.
[23] A. An, N. Cercone, Discretization of Continuous Attributes for Learning Classification Rules, LNAI 1574, Springer-Verlag, Berlin/Heidelberg, 1999, pp. 509–514.
[24] R.C. Holte, Very simple classification rules perform well on most commonly used datasets, Machine Learning 11 (1993) 63–91.
[25] J.W. Grzymala-Busse, LERS – a system for learning from examples based on Rough Sets, in: Intelligent Decision Support – A Handbook of Applications and Advances in the Rough Set Theory, Kluwer Academic Publishers, 1992, pp. 3–18.
[26] V.F. Zackay, E.R. Parker, D. Fahr, R. Bush, The enhancement of ductility in high strength steels, Transactions of the American Society of Metals 60 (1967) 252–259.
[27] J. Bouquerel, K. Verbeken, B.C. De Cooman, Microstructure-based model for the static mechanical behaviour of multiphase steels, Acta Metallurgica Materiala 54 (2006) 1443–1456.
[28] H.N. Han, C.G. Lee, C.-S. Oh, T.-H. Lee, S.-J. Kim, A model for deformation behavior and mechanically induced martensitic transformation of metastable austenitic steel, Acta Metallurgica Materiala 52 (2004) 5203–5214.
[29] S.M.K. Hosseini, A. Zarei-Hanzaki, M.J.Y. Panah, S. Yue, ANN model for prediction of the effects of composition and process parameters on tensile strength and percent elongation of Si–Mn TRIP steels, Materials Science and Engineering A 374 (2004) 122–128.
[30] M. Mukherjee, S.B. Singh, O.N. Mohanty, Neural network analysis of strain induced transformation behaviour of retained austenite in TRIP-aided steels, Materials Science and Engineering A 434 (2006) 237–245.
[31] S. Datta, F. Pettersson, S. Ganguly, H. Saxén, N. Chakraborti, Identification of factors governing mechanical properties of TRIP-aided steel using genetic algorithms and neural networks, Materials and Manufacturing Processes 23 (2008) 130–137.
[32] S. Dey, P. Dey, S. Datta, J. Sil, Rough Set approach to predict the strength and ductility of TRIP steel, Materials and Manufacturing Processes 24 (2009) 150–154.
[33] H.C. Chen, H. Era, M. Shimizu, Effect of phosphorus on the formation of retained austenite and mechanical properties in Si low-carbon steel sheet, Metallurgical Transactions A 20 (1989) 437–445.
[34] Y. Sakuma, O. Matsumura, O. Akisue, Influence of C content and annealing temperature on microstructure and mechanical properties of 400 °C transformed steel containing retained austenite, ISIJ International 31 (1991) 1348–1353.
[35] M.D. Meyer, D. Vanderschueren, B.D. Cooman, The influence of the substitution of Si by Al on the properties of cold rolled C–Mn–Si TRIP steels, ISIJ International 39 (1999) 813–822.
[36] S. Papaefthymiou, W. Bleck, S. Kruijver, J. Sietsma, L. Zhao, S. van der Zwaag, Influence of intercritical deformation on microstructure of TRIP steels containing Al, Materials Science and Technology 20 (2004) 201–206.
[37] N.R. Bandyopadhyay, S. Datta, Effect of manganese partitioning on transformation induced plasticity characteristics in microalloyed dual phase steels, ISIJ International 44 (2004) 927–934.