A Novel Attribute Reduction Based on Rough Set ...

International Journal of Applied Engineering Research, ISSN 0973-4562 Vol. 10 No.80 (2015) © Research India Publications; http://www.ripublication.com/ijaer.htm

Biological Datamining – A Novel Attribute Reduction Based on Rough Set Theory

Dr. P. Venkatesan, Faculty of Research, Department of Community Medicine, Sri Ramachandra Medical College & Research Institute, Sri Ramachandra University, Chennai, India

K. Anitha, Department of Mathematics, S.A. Engineering College, Chennai, India. [email protected]

Abstract: Biological data mining is the process of extracting and analyzing biological information from large biological databases in order to discover new knowledge that can be translated into clinical applications. Feature selection is the process of identifying the most relevant features in a given dataset; it arises in many fields, such as machine learning, image processing, signal processing and pattern recognition. The feature selection process should preserve the information content of the data after reduction. Rough Set theory plays a vital role in feature selection. In this paper, Rough Set attribute reduction through the Quick Reduct algorithm is performed on a public-domain data set available in the UCI repository.

Keywords: Rough Sets, Attribute Reduction, Indiscernibility, Reduct

I. INTRODUCTION
A Rough Set, first described by the Polish computer scientist Zdzislaw Pawlak, is a formal approximation of a crisp set in terms of a pair of sets that give the lower and upper approximations of the original set. Rough Set theory is based on the assumption that some information is associated with every object in the universe. It relies on a similarity relation under which objects carrying the same information are indistinguishable. A rough set is therefore an approximation of a vague concept by a pair of crisp sets, called the lower and upper approximations. The lower approximation is the set of objects that certainly belong to the subset of interest, whereas the upper approximation is the set of all objects that possibly belong to it.

II. TERMINOLOGIES ON ROUGH SETS
A. Information System
An Information System is a pair $IS = (U, A)$, where $U$ is a non-empty finite set of objects called the universe and $A$ is a non-empty finite set of attributes such that $a : U \to V_a$ for every $a \in A$. The set $V_a$ is called the value set of $a$.

The following table (Tab.1) represents an Information System. There are five objects with three attributes.

Tab.1
Objects   Attribute A1 (Age)   Attribute A2 (Head Ache)   Attribute A3 (Fever)
X1        Below 10             No                         Yes
X2        10-15                Yes                        Yes
X3        16-20                Yes                        No
X4        21-25                No                         Yes
X5        26-30                Yes                        No

B. Decision System
From the above Information System, the patients can be classified as ill or healthy. Such a classification is known as a Decision System, and the corresponding attribute is called the Decision Attribute. The Decision System for the table (Tab.1) is given by Tab.2.

Tab.2
Objects   Attribute A1 (Age)   Attribute A2 (Head Ache)   Attribute A3 (Fever)   Decision Attribute D (Disease)
X1        Below 10             No                         Yes                    Yes
X2        10-15                Yes                        Yes                    Yes
X3        16-20                Yes                        No                     No
X4        21-25                No                         No                     No
X5        26-30                Yes                        No                     No

C. Indiscernibility
For any $P \subseteq A$ there is an equivalence relation $IND(P)$, defined as

$IND(P) = \{ (x, y) \in U^2 \mid \forall a \in P,\ a(x) = a(y) \}$   (1)

That is, two objects are equivalent if and only if their values agree on every attribute in $P$.
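As an illustration of the indiscernibility relation in equation (1), the following Python sketch groups the five objects of Tab.1 into equivalence classes for a chosen attribute subset $P$. The names `TABLE_1` and `ind_classes` are hypothetical and introduced only for this example.

```python
from collections import defaultdict

# Rows of Tab.1, transcribed by hand: object -> attribute values.
TABLE_1 = {
    "X1": {"Age": "Below 10", "Head Ache": "No", "Fever": "Yes"},
    "X2": {"Age": "10-15", "Head Ache": "Yes", "Fever": "Yes"},
    "X3": {"Age": "16-20", "Head Ache": "Yes", "Fever": "No"},
    "X4": {"Age": "21-25", "Head Ache": "No", "Fever": "Yes"},
    "X5": {"Age": "26-30", "Head Ache": "Yes", "Fever": "No"},
}


def ind_classes(table, attrs):
    """Return the equivalence classes of IND(attrs) as a list of sets.

    Two objects fall into the same class exactly when they agree on
    every attribute in attrs, as in equation (1).
    """
    groups = defaultdict(set)
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)  # the values that must match
        groups[key].add(obj)
    return list(groups.values())


if __name__ == "__main__":
    for cls in ind_classes(TABLE_1, ["Head Ache", "Fever"]):
        print(sorted(cls))
```

With $P$ = {Head Ache, Fever}, the objects X1 and X4 are indiscernible, as are X3 and X5, while X2 forms a class of its own. Adding Age to $P$ separates all five objects, because every age band in Tab.1 is distinct.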


D. Upper and Lower Approximation
Let $X \subseteq U$. $X$ can be approximated using only the information contained within $P$ by constructing the $P$-lower and $P$-upper approximations of $X$. Equivalence classes contained within $X$ belong to the lower approximation ($\underline{P}X$). Equivalence classes within $X$ and along its border form the upper approximation ($\overline{P}X$). They are expressed as

$\underline{P}X = \{ x \mid [x]_P \subseteq X \}$   (2)

$\overline{P}X = \{ x \mid [x]_P \cap X \neq \emptyset \}$   (3)
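Equations (2) and (3) can be sketched in Python as follows. The example uses the Head Ache and Fever values of Tab.1 as the attribute set $P$; the target set `X = {X1, X2, X3}` is an arbitrary choice made for illustration, and all function names are hypothetical.

```python
from collections import defaultdict

# (Head Ache, Fever) values of the five objects in Tab.1.
ROWS = {
    "X1": ("No", "Yes"),
    "X2": ("Yes", "Yes"),
    "X3": ("Yes", "No"),
    "X4": ("No", "Yes"),
    "X5": ("Yes", "No"),
}


def equivalence_classes(rows):
    """Group objects whose attribute values coincide."""
    groups = defaultdict(set)
    for obj, values in rows.items():
        groups[values].add(obj)
    return list(groups.values())


def lower_approximation(classes, target):
    """Union of the classes entirely contained in target (equation 2)."""
    return set().union(*(c for c in classes if c <= target))


def upper_approximation(classes, target):
    """Union of the classes that intersect target (equation 3)."""
    return set().union(*(c for c in classes if c & target))


if __name__ == "__main__":
    classes = equivalence_classes(ROWS)
    X = {"X1", "X2", "X3"}  # illustrative target set, not from the paper
    print(sorted(lower_approximation(classes, X)))
    print(sorted(upper_approximation(classes, X)))
```

For this choice of `X`, only the class {X2} lies wholly inside the target, so the lower approximation is {X2}. Every class touches the target, so the upper approximation is the whole universe, and the difference between the two is the region where membership cannot be decided from $P$ alone.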

E. Positive, Negative, Boundary Regions and Feature Dependency
Let $P$ and $Q$ be sets of attributes inducing equivalence relations over $U$. The positive, negative and boundary regions are then defined as

$POS_P(Q) = \bigcup_{X \in U/Q} \underline{P}X$   (4)

$NEG_P(Q) = U - \bigcup_{X \in U/Q} \overline{P}X$   (5)

$BND_P(Q) = \bigcup_{X \in U/Q} \overline{P}X - \bigcup_{X \in U/Q} \underline{P}X$   (6)

An important step in data analysis is finding dependencies between attributes. For $P, Q \subset A$, $Q$ is said to depend on $P$ in a degree $k$, where $0 \le k \le 1$, denoted by $P \Rightarrow_k Q$, if

$k = \gamma_P(Q) = \frac{|POS_P(Q)|}{|U|}$   (7)

where $|S|$ is the cardinality of the set $S$. If $k = 1$, $Q$ depends totally on $P$; if $0 < k < 1$, $Q$ depends partially on $P$; and if $k = 0$, $Q$ does not depend on $P$. The significance of a particular feature is estimated by the change in dependency that results when the feature is removed. The higher the change in dependency, the more significant the feature. If the significance value is zero, the feature is dispensable.

F. Reduct
A reduct is a minimal subset $R$ of the initial attribute set $C$ such that, for a given set of attributes $D$,

$\gamma_R(D) = \gamma_C(D)$   (8)

That is, $R$ is a minimal subset if

$\gamma_{R - \{a\}}(D) \neq \gamma_C(D)$ for all $a \in R$   (9)

which means that no attribute can be removed from the subset without affecting the dependency degree.

III. QUICK REDUCT ALGORITHM
One of the popular Rough Set based feature selection algorithms is the Quick Reduct algorithm, in which the dependency, or quality of approximation, of each single attribute is first calculated with respect to the class labels (the decision attribute). After the best attribute has been selected, other attributes are added to it to produce a better quality. The addition of attributes stops when the selected subset has the same quality as the maximum possible quality of the data set, or when the quality of the selected attributes no longer changes. This algorithm is basically dependent on the relative dependency of the attribute set. The efficient reduction of attributes is achieved by comparing the equivalence relations generated by sets of attributes.

QUICK REDUCT($\mathbb{C}$, $\mathbb{D}$)
Input: $\mathbb{C}$, the set of all conditional features; $\mathbb{D}$, the set of decision features.
Output: $R$, the feature subset
(i)    $R \leftarrow \{\}$
(ii)   while $\gamma_R(\mathbb{D}) \neq \gamma_{\mathbb{C}}(\mathbb{D})$
(iii)      $T \leftarrow R$
(iv)       foreach $x \in (\mathbb{C} - R)$
(v)            if $\gamma_{R \cup \{x\}}(\mathbb{D}) > \gamma_T(\mathbb{D})$
(vi)               $T \leftarrow R \cup \{x\}$
(vii)      $R \leftarrow T$
(viii) return $R$

IV. APPLICATION TO BIOMEDICINE
The Dermatology database is taken from School of Medicine, Turkey. This database contains 36 attributes, 33 of which are linear valued and one of which is nominal. The differential diagnosis of erythemato-squamous diseases is a real problem in dermatology. These diseases all share the clinical features of erythema and scaling, with very few differences. The diseases in this group are psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis and pityriasis rubra pilaris. Applying Rough Set attribute reduction with the Quick Reduct algorithm gives the results listed in Tab.3.

Attributes: erythema, scaling, definite_borders, itching, koebner_phenomenon, polygonal_papules, follicular_papules, oral_mucosal_involvement, knee_and_elbow_involvement, scalp_involvement, family_history, melanin_incontinence, eosinophils_in_the_infiltrate, PNL_infiltrate, fibrosis_of_the_papillary_dermis, exocytosis, acanthosis……

Tab.3 RESULTS OBTAINED
Total number of attributes                                              36
Total number of subsets evaluated                                       608
Total number of attributes selected by Rough Set attribute selection   24
Merit for the selection of attributes                                   98.6
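As a small illustration of the dependency degree in equation (7) and of steps (i) to (viii) of Quick Reduct, the following Python sketch runs both on the toy Decision System of Tab.2 rather than on the dermatology data. It is a minimal sketch under stated assumptions, not the implementation used for Tab.3: the names `partition`, `gamma` and `quick_reduct` are hypothetical, exact fractions are used so that the comparisons in steps (ii) and (v) are not affected by rounding, and a guard that stops the loop when no single attribute improves the dependency is an addition that the pseudocode does not contain.

```python
from collections import defaultdict
from fractions import Fraction

# Rows of Tab.2, transcribed by hand.
TABLE_2 = {
    "X1": {"Age": "Below 10", "Head Ache": "No", "Fever": "Yes", "Disease": "Yes"},
    "X2": {"Age": "10-15", "Head Ache": "Yes", "Fever": "Yes", "Disease": "Yes"},
    "X3": {"Age": "16-20", "Head Ache": "Yes", "Fever": "No", "Disease": "No"},
    "X4": {"Age": "21-25", "Head Ache": "No", "Fever": "No", "Disease": "No"},
    "X5": {"Age": "26-30", "Head Ache": "Yes", "Fever": "No", "Disease": "No"},
}


def partition(table, attrs):
    """Equivalence classes of IND(attrs); an empty attrs gives one class."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attrs)].add(obj)
    return list(groups.values())


def gamma(table, cond, dec):
    """Dependency degree |POS_cond(dec)| / |U| from equation (7)."""
    dec_classes = partition(table, dec)
    positive = set()
    for cls in partition(table, cond):
        # A class is in the positive region if it fits inside one decision class.
        if any(cls <= d for d in dec_classes):
            positive |= cls
    return Fraction(len(positive), len(table))


def quick_reduct(table, cond, dec):
    """Greedy forward selection following steps (i) to (viii)."""
    full = gamma(table, cond, dec)
    reduct = []                                   # (i)
    while gamma(table, reduct, dec) != full:      # (ii)
        best = reduct                             # (iii)
        for x in cond:                            # (iv)
            if x in reduct:
                continue
            candidate = reduct + [x]
            if gamma(table, candidate, dec) > gamma(table, best, dec):  # (v)
                best = candidate                  # (vi)
        if best == reduct:
            break  # added guard: no single attribute raises the dependency
        reduct = best                             # (vii)
    return reduct                                 # (viii)


if __name__ == "__main__":
    conditions = ["Age", "Head Ache", "Fever"]
    for a in conditions:
        print(a, gamma(TABLE_2, [a], ["Disease"]))
    print(quick_reduct(TABLE_2, conditions, ["Disease"]))
```

On Tab.2, Age and Fever each give a dependency degree of 1, while Head Ache gives 0. Because step (v) uses a strict comparison, the first attribute that reaches the maximum is kept, so with the attribute order shown the sketch returns Age alone even though Fever would serve equally well. This also shows that the subset returned depends on the order in which the attributes are examined.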


V. CONCLUSION

The experimental results presented here show that the Rough Set attribute reduction technique selects 24 of the 36 attributes with a merit of 98.6% (Tab.3). The subsets obtained by the supervised approach, together with the classification of the reduced data, show that the method selects the features with maximum quality.
