Dynamic Feature Selection with Fuzzy-Rough Sets
Ren Diao, Neil Mac Parthaláin, and Qiang Shen
Department of Computer Science, Aberystwyth University, Aberystwyth, SY23 3DB, UK
Email: {rrd09, ncm, qqs}@aber.ac.uk

Abstract—Various strategies have been exploited for the task of feature selection, in an effort to identify more compact and better quality feature subsets. Most existing approaches focus on selecting from a static pool of training instances with a fixed number of original features. In practice, however, data may be gradually refined, and information regarding the problem domain may be actively added or removed. In this paper, a technique based on fuzzy-rough sets is extended to support dynamic feature selection. The proposed method is capable of carrying out on-line selection with incrementally added features or instances. The cases of feature or instance removal are also investigated. This brings a novel and beneficial addition to the current research in feature selection. Four possible dynamic selection scenarios are considered, with algorithms proposed to handle each individual situation. Simulated experimentation is carried out using real-world benchmark data sets, in order to demonstrate the efficacy of the proposed work.
Index Terms—Feature selection, dynamic selection, fuzzy-rough sets.

I. INTRODUCTION

The main aim of feature selection (FS) is to discover a minimal feature subset for a given problem domain while retaining a suitably high accuracy in representing the original data [1]. When analysing data that has a very large number of features [2], it is difficult to identify and extract patterns or rules, due to the high inter-dependency amongst individual features and the behaviour of combined subsets of features. This is the so-called "curse of dimensionality" [3]. Techniques that perform tasks such as text processing, data classification and systems control [4], [5], [6] can benefit greatly from FS, once the noisy, irrelevant, redundant or misleading features have been removed [7].

Various techniques have been developed in the literature for assessing the quality of feature subsets, several of which rank the features based on a certain measure of importance, e.g., information gain or symmetrical uncertainty [8]. Recent trends focus on evaluating a given subset as a whole, forming an alternative to the aforementioned type of approach. Popular methods include probabilistic consistency-based FS [9], correlation-based FS [10], and those based on fuzzy-rough set theory [11], [12], [13]. These techniques (together with the individual feature-based methods) are often collectively referred to as filter-based approaches, and are independent of any learning algorithm that subsequently makes use of the selected feature subsets. In contrast, wrapper-based [14], [15] and hybrid algorithms [1] employ a learning or data mining algorithm in place of an evaluation metric as used in a filter-based approach.

Dynamic FS [16], [17], also referred to as on-line FS [18], has attracted significant attention recently. Unlike conventional, off-line FS, which is performed when all of the features and instances are present a priori, dynamic FS considers situations where the information regarding a certain problem domain is not fully available at the beginning. The extraction of features from the data samples, or the procedure of collecting new instances, may be difficult and time consuming. Therefore, new sets of features or instances may only be presented in an incremental fashion, and the FS technique needs to be designed to adapt to the new information quickly and accurately. Existing studies in the literature typically work with a classifier learner (wrapper-based) [16], [19], [17], but also involve alternative applications such as prediction [20]. However, little work has been carried out to investigate events of feature or instance removal. Yet such scenarios may be common for applications where data has limited validity [21], [22], or where outdated or incorrect information needs to be deleted to ensure data consistency.

This paper investigates the feasibility of a dynamic, filter-based FS technique using fuzzy-rough sets. Theoretical discussions are carried out with respect to four possible dynamic FS scenarios: feature addition, feature removal, instance addition, and instance removal, with corresponding algorithms proposed to efficiently handle each of these cases. The paper also provides an insight into how nature-inspired meta-heuristics may play a role in more complex situations which involve mixtures of these scenarios, hinting at a potentially more flexible approach.

The remainder of the paper is organised as follows. Section II briefly introduces the theoretical background of the FS technique based on fuzzy-rough sets. The individual dynamic FS scenarios, and a blueprint for a potential combined approach using nature-inspired algorithms, are presented in Section III. Section IV demonstrates the results of several simulated experiments, in order to show the potential efficacy of the proposed methods. Finally, Section V concludes the paper and identifies a number of areas for further investigation and potential extension.

II. FUZZY-ROUGH FEATURE SELECTION

Rough set theory (RST) has been successfully used for the task of FS in order to discover data dependencies and reduce the number of features contained in a data set [23].

Given a data set with discrete feature values, RST can find a subset (termed a reduct) of the original features that is the most informative; all other features can be removed from the data set with minimal information loss. However, it is usually the case that feature values are real-valued, and this is where traditional RST encounters a problem. It is not possible in the theory to say whether two different feature values are similar, nor to what extent they are the same. For example, two close values may only differ as a result of noise, but in the standard RST-based approach they are considered to be as different as two values of a different order of magnitude. Data set discretisation must therefore take place before reduction methods based on crisp rough sets can be applied. This is often still inadequate, however, as the degrees of membership of values to the discretised intervals are not considered, which may result in information loss. In order to combat this, extensions of RST based on fuzzy-rough sets [24] have been developed. A fuzzy-rough set is defined by two fuzzy sets, a fuzzy lower and a fuzzy upper approximation, obtained by extending the corresponding crisp RST notions. In the crisp case, elements either belong to the lower approximation with absolute certainty or not at all; in the fuzzy-rough case, elements may have a membership in the range [0, 1], allowing greater flexibility in handling uncertainty.

Fuzzy-rough FS (FRFS) [12] is concerned with the reduction of information or decision systems through the use of fuzzy-rough sets. Let I = (U, A) be an information system, where U is a non-empty set of finite objects (the universe) and A is a non-empty finite set of attributes such that a : U → V_a for every a ∈ A, where V_a is the set of values that attribute a may take. For decision systems, A = {C ∪ D}, where C is the set of input features and D is the set of decision features.

The fuzzy lower and upper approximations of a concept X ⊆ U, with respect to the fuzzy similarity relation R_P induced by a subset of features P, are defined as:

\mu_{R_P \downarrow X}(x) = \inf_{y \in U} I(\mu_{R_P}(x, y), \mu_X(y)) \quad (1)

\mu_{R_P \uparrow X}(x) = \sup_{y \in U} T(\mu_{R_P}(x, y), \mu_X(y)) \quad (2)

Here, I is a fuzzy implicator and T is a t-norm. R_P is the fuzzy similarity relation induced by the subset of features P:

\mu_{R_P}(x, y) = T_{a \in P}\{\mu_{R_a}(x, y)\} \quad (3)

where \mu_{R_a}(x, y) is the degree to which objects x and y are similar for feature a. Many similarity relations can be constructed for this purpose, for example:

\mu_{R_a}(x, y) = 1 - \frac{|a(x) - a(y)|}{a_{\max} - a_{\min}} \quad (4)

\mu_{R_a}(x, y) = \exp\left(-\frac{(a(x) - a(y))^2}{2\sigma_a^2}\right) \quad (5)

where \sigma_a^2 is the variance of feature a. The choices of I, T, and the fuzzy similarity relation have a great influence upon the resultant fuzzy partitions, and upon the subsequently selected feature subsets.

The fuzzy-rough lower approximation-based QuickReduct algorithm [12], which extends the crisp version [23], is shown in Algorithm 1. It employs a quality measure termed the fuzzy-rough dependency function \gamma_P(Q), which measures the dependency between two sets of attributes P and Q:

\gamma_P(Q) = \frac{\sum_{x \in U} \mu_{POS_{R_P}(Q)}(x)}{|U|} \quad (6)

where the fuzzy positive region, which contains all objects of U that can be classified into classes of U/Q using the information in P, is defined as:

\mu_{POS_{R_P}(Q)}(x) = \sup_{X \in U/Q} \mu_{R_P \downarrow X}(x) \quad (7)

Algorithm 1: FRQuickReduct(C, D)
  Input: C, the set of all conditional features; D, the set of decision features
  R ← ∅, γ_best ← 0, γ_prev ← 0
  repeat
    T ← R
    γ_prev ← γ_best
    foreach x ∈ (C − R) do
      if γ_{R∪{x}}(D) > γ_T(D) then
        T ← R ∪ {x}
        γ_best ← γ_T(D)
    R ← T
  until γ_best = γ_prev
  return R

In this paper, for simplicity, γ may be viewed as a measure of quality for a given feature subset P ⊆ C, with respect to the set of decision features D: 0 ≤ γ_P(D) ≤ 1, with γ_∅(D) = 0. A fuzzy-rough reduct R can then be defined as a subset of features that preserves the dependency degree of the entire data set, i.e., γ_R(D) = γ_C(D). The evaluation of γ_R(D) enables QuickReduct to choose which features to add to the current reduct candidate. Note that the algorithm always selects the feature resulting in the highest improvement of fuzzy-rough dependency, and terminates when the addition of any remaining feature does not result in an increase in dependency. As with the original crisp algorithm, for a dimensionality of n, the worst case data set will result in (n² + n)/2 evaluations of the dependency function. However, as fuzzy-rough set based FS is used for dimensionality reduction prior to any involvement of a given application which will exploit the features belonging to the resultant reduct, this operation has no negative impact upon the run-time efficiency of the application system.
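To make these definitions concrete, the following Python sketch (added for illustration; not part of the original paper) computes the dependency measure of Eqns. (3)–(7) and runs the greedy search of Algorithm 1. It assumes the linear similarity of Eqn. (4), the min t-norm for Eqn. (3) and the Łukasiewicz implicator I(a, b) = min(1, 1 − a + b) for Eqn. (1); crisp decision labels are also assumed, so the positive region reduces to the lower approximation of each object's own decision class. All names (fuzzy_rough_dependency, quickreduct, etc.) are illustrative only.

```python
import numpy as np

def feature_similarity(col):
    """Eqn. (4): pairwise similarity matrix for one real-valued feature column."""
    rng = col.max() - col.min()
    if rng == 0.0:
        return np.ones((len(col), len(col)))
    return 1.0 - np.abs(col[:, None] - col[None, :]) / rng

def fuzzy_rough_dependency(X, y, subset):
    """Eqns. (3), (6), (7): dependency gamma_P(D) of the decision on subset P.

    X      : (n_objects, n_features) real-valued data
    y      : (n_objects,) crisp decision labels
    subset : iterable of feature indices (P)
    """
    subset = list(subset)
    if not subset:
        return 0.0
    n = X.shape[0]
    # Eqn. (3): combine per-feature similarities with the min t-norm.
    rel = np.ones((n, n))
    for a in subset:
        rel = np.minimum(rel, feature_similarity(X[:, a]))
    # Eqn. (1) with the Lukasiewicz implicator, then Eqn. (7): for crisp
    # decision classes the supremum is attained at the object's own class,
    # so POS(x) = inf over y of I(rel(x, y), [y has the same class as x]).
    pos = np.empty(n)
    for i in range(n):
        same_class = (y == y[i]).astype(float)
        implication = np.minimum(1.0, 1.0 - rel[i] + same_class)
        pos[i] = implication.min()
    # Eqn. (6)
    return pos.sum() / n

def quickreduct(X, y):
    """Greedy FRQuickReduct search (Algorithm 1)."""
    remaining = set(range(X.shape[1]))
    reduct, best = [], 0.0
    while True:
        prev, choice = best, None
        for a in remaining:
            score = fuzzy_rough_dependency(X, y, reduct + [a])
            if score > best:
                best, choice = score, a
        if choice is None or best == prev:   # no remaining feature improves gamma
            return reduct, best
        reduct.append(choice)
        remaining.remove(choice)
```

Given a real-valued array X of shape (n_objects, n_features) and a label vector y, quickreduct(X, y) returns the selected feature indices and the achieved dependency; the dynamic sketches in Section III below reuse this kind of dependency evaluator.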

III. DYNAMIC FEATURE SELECTION SCENARIOS

This section describes four common scenarios that may occur in a given dynamic FS task, where features or instances may be added or removed, respectively. The proposed algorithms that deal with these scenarios assume that the previously selected subset of features R_k always satisfies the highest possible fuzzy-rough dependency measure for the previous state of the data. Here R_k denotes the reduct candidate that has been selected with respect to state k of the dynamic data set. Of course, if the information embedded in the data set can fully discern all of the existing instances, then R_k should be a reduct: γ_{R_k}(D) = γ_C(D) = 1. The aim of these dynamic FS algorithms is to compute an R_{k+1} that reflects the changes made to the data set. For the scope of this paper, it is assumed that the possible decision labels are pre-determined and do not change throughout the process: D_{k+1} = D_k = D.

A. Feature Addition

The scenario discussed here considers the case where new features are incrementally added during the FS process, whilst the set of training instances U remains static. The properties of a fuzzy-rough reduct R_k can be exploited here to significantly simplify such a dynamic process. If the existing set of features C_k can already fully discern all of the instances in U with respect to their associated classes, i.e., γ_{C_k}^{U}(D) = 1, any subsequent feature addition will bring no improvement to the overall discernibility of the current data set:

\gamma_{C_k}^{U}(D) = \gamma_{C_{k'}}^{U}(D) = 1, \quad \forall k' > k, \; C_k \subseteq C_{k'} \quad (8)

Therefore no further modification to a previously discovered R_k is necessary. Note that features x ∈ R_k may still be replaced by new, more informative features, meaning that the size |R_k| may be reduced. However, if full dependency was not achieved in the previous step, i.e., γ_{C_k}^{U}(D) < 1, it is then crucial to examine the new features, in order to improve the discernibility of the current subset candidate. Ideally, the previously selected features should also be checked, and removed (or replaced) where the new features prove more informative. This step may be omitted for time-critical applications, with the corresponding risk of a sub-optimal solution (i.e., a possibly non-minimal reduct). The method that modifies Algorithm 1 to handle feature addition is detailed in Algorithm 2.

Algorithm 2: Feature Addition
  if γ_{R_k}^{U}(D) = 1 then return R_k
  γ* ← γ_{R_k}^{U}(D)
  while γ* ≠ γ_{C_{k+1}}^{U}(D) do
    foreach x ∈ C_{k+1} \ C_k do
      if γ_{R_k∪{x}}^{U}(D) > γ* then
        R_{k+1} ← R_k ∪ {x}, γ* ← γ_{R_k∪{x}}^{U}(D)
    R_k ← R_{k+1}
  return R_{k+1}
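The following Python sketch (an illustration added here, not from the original paper) shows one way Algorithm 2 might be realised on top of a generic dependency evaluator, such as a closure around the fuzzy_rough_dependency helper sketched in Section II; all function and parameter names are assumptions.

```python
from typing import Callable, List, Sequence

def update_reduct_feature_addition(
    reduct: List[int],
    features_before: Sequence[int],   # C_k
    features_now: Sequence[int],      # C_{k+1}, a superset of C_k
    dependency: Callable[[List[int]], float],
) -> List[int]:
    """Sketch of Algorithm 2: extend a reduct candidate after new features arrive.

    `dependency(subset)` is assumed to return gamma_subset(D) over the current
    (static) universe U, e.g. lambda s: fuzzy_rough_dependency(X, y, s).
    """
    score = dependency(list(reduct))
    if score == 1.0:                              # R_k already fully discerns the data
        return list(reduct)
    target = dependency(list(features_now))       # gamma_{C_{k+1}}(D)
    arrived = [x for x in features_now if x not in set(features_before)]
    reduct = list(reduct)
    while score < target:
        best_score, best_feature = score, None
        for x in arrived:
            if x in reduct:
                continue
            candidate = dependency(reduct + [x])
            if candidate > best_score:
                best_score, best_feature = candidate, x
        if best_feature is None:                  # no newly arrived feature improves gamma
            break
        reduct.append(best_feature)
        score = best_score
    return reduct
```

As noted above, a periodic full re-selection (e.g. re-running QuickReduct over C_{k+1}) could additionally replace previously selected features, at extra computational cost.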

B. Feature Removal

In contrast to the previous scenario, a particular application may be initialised with an abundance of features, which are subsequently removed throughout the FS process. In this case, the discernibility of a data set may change, owing to the possibility of certain informative features being removed:

\gamma_{C_{k+1}}^{U}(D) \le \gamma_{C_k}^{U}(D), \quad C_{k+1} \subseteq C_k

More importantly, a given feature x of the selected subset R_k may itself be deleted: x ∈ C_k \ C_{k+1}. Substitute features must therefore be found in order to restore the discernibility of the previous reduct. Again, if the application is less time-critical, it may be preferable to periodically conduct a full search in order to locate better subsets, because a greedy backward search may result in sub-optimal solutions. The procedure for handling the feature removal scenario is given in Algorithm 3.

Algorithm 3: Feature Removal
  R_{k+1} ← R_k \ (C_k \ C_{k+1})
  if R_{k+1} = R_k then return R_{k+1}
  R_k ← R_{k+1}
  γ* ← γ_{R_k}^{U}(D)
  while γ* < γ_{C_{k+1}}^{U}(D) do
    foreach x ∈ C_{k+1} \ R_k do
      if γ_{R_k∪{x}}^{U}(D) > γ* then
        R_{k+1} ← R_k ∪ {x}, γ* ← γ_{R_k∪{x}}^{U}(D)
    R_k ← R_{k+1}
  return R_{k+1}

Here the selection process is initiated only if features are removed from R_k, since the remaining unselected features form a subset of the previously unselected features, (C_{k+1} \ R_k) ⊂ (C_k \ R_k), which are either less informative or redundant and can thus be ignored.
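By way of illustration (not from the original paper), a possible rendering of Algorithm 3 in Python is given below, again assuming a dependency callable such as one built on the fuzzy_rough_dependency helper sketched earlier; all names are illustrative.

```python
from typing import Callable, List, Sequence

def update_reduct_feature_removal(
    reduct: List[int],
    features_before: Sequence[int],   # C_k
    features_now: Sequence[int],      # C_{k+1}, a subset of C_k
    dependency: Callable[[List[int]], float],
) -> List[int]:
    """Sketch of Algorithm 3: repair a reduct candidate after features are removed."""
    removed = set(features_before) - set(features_now)
    pruned = [x for x in reduct if x not in removed]
    if len(pruned) == len(reduct):                 # nothing selected was removed
        return pruned
    score = dependency(pruned)
    target = dependency(list(features_now))        # gamma_{C_{k+1}}(D)
    candidates = [x for x in features_now if x not in set(reduct)]
    while score < target:
        best_score, best_feature = score, None
        for x in candidates:
            if x in pruned:
                continue
            candidate = dependency(pruned + [x])
            if candidate > best_score:
                best_score, best_feature = candidate, x
        if best_feature is None:                   # no substitute feature improves gamma
            break
        pruned.append(best_feature)
        score = best_score
    return pruned
```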

C. Instance Addition

Instance addition (while the set of original features C remains constant) may be the most commonly encountered event. Medical diagnosis [4], monitoring-based applications [6], or any procedure that involves time-series (streaming) data are typical examples of such a case. When a new batch of instances is added, it is often necessary to re-evaluate the fuzzy-rough dependency of the selected subset using all of the available instances. This may seem counter-intuitive, since it is commonly expected that only the new instances should need to be checked. It is necessary because the addition of objects will inevitably change the fuzzy positive region µ_{POS_R(D)}(x), as the universe U may now be different. There are exceptional cases, however, if the data set has accumulated a sufficiently large number of samples, and these instances have an almost full coverage of the underlying concept. Any "new" instances are then either the same as, or approximately equivalent to, the objects already analysed (as judged by the fuzzy similarity functions):

U_{k+1}/D \simeq U_k/D, \quad \gamma_{R_k}^{U_{k+1}}(D) \simeq \gamma_{R_k}^{U_k}(D)

Algorithm 4 details the dynamic FRFS process for the case of instance addition. In practical applications, |U_{k+1} \ U_k| ≪ |U_k|, i.e., the number of new objects is very small when compared to the existing pool of instances, and so the number of features required to be further selected (or replaced) is also minimal. To further improve the efficiency of the algorithm, the newly added objects may be checked against the current fuzzy-rough lower and upper approximations of the existing classes, in order to determine whether they can be subsumed by the already established partitions. If a given new object (or a group of objects) does not belong to the existing partitions to a satisfactory degree, this is an indication that modifications of the lower and upper approximations are necessary.

Algorithm 4: Instance Addition
  if γ_{R_k}^{U_{k+1}}(D) ≥ γ_{R_k}^{U_k}(D) then return R_k
  γ* ← γ_{R_k}^{U_k}(D)
  while γ* < γ_{C}^{U_{k+1}}(D) do
    foreach x ∈ C \ R_k do
      if γ_{R_k∪{x}}^{U_{k+1}}(D) > γ* then
        R_{k+1} ← R_k ∪ {x}, γ* ← γ_{R_k∪{x}}^{U_{k+1}}(D)
    R_k ← R_{k+1}
  return R_{k+1}
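A minimal Python sketch of this instance-addition update (added for illustration; not part of the paper) is shown below. It assumes a dependency evaluator that can be applied to the enlarged universe, e.g. the fuzzy_rough_dependency helper from Section II; all names are assumptions.

```python
from typing import Callable, List, Sequence
import numpy as np

def update_reduct_instance_addition(
    reduct: List[int],
    all_features: Sequence[int],                  # C (static)
    X_old: np.ndarray, y_old: np.ndarray,         # U_k
    X_new: np.ndarray, y_new: np.ndarray,         # newly arrived objects
    dependency_fn: Callable[[np.ndarray, np.ndarray, List[int]], float],
) -> List[int]:
    """Sketch of Algorithm 4: revise a reduct after a batch of instances arrives.

    `dependency_fn(X, y, subset)` is assumed to compute gamma_subset(D) over the
    universe given by (X, y), e.g. the fuzzy_rough_dependency sketch above.
    """
    X_all = np.vstack([X_old, X_new])
    y_all = np.concatenate([y_old, y_new])
    score_old = dependency_fn(X_old, y_old, list(reduct))     # gamma on U_k
    score_new = dependency_fn(X_all, y_all, list(reduct))     # gamma on U_{k+1}
    if score_new >= score_old:            # new objects are subsumed; keep R_k
        return list(reduct)
    reduct, score = list(reduct), score_new
    target = dependency_fn(X_all, y_all, list(all_features))  # gamma_C on U_{k+1}
    while score < target:
        best_score, best_feature = score, None
        for x in all_features:
            if x in reduct:
                continue
            candidate = dependency_fn(X_all, y_all, reduct + [x])
            if candidate > best_score:
                best_score, best_feature = candidate, x
        if best_feature is None:
            break
        reduct.append(best_feature)
        score = best_score
    return reduct
```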

D. Instance Removal

The last scenario considered in this paper is that of dynamic instance removal. Many training objects may be available at the beginning, but may have to be removed later, either because the information has become outdated, or simply because a storage limitation has been reached. This is the simplest case for FRFS: since the removal of instances does not necessarily violate the existing fuzzy partitioning of the input space, a previously obtained feature subset will maintain its full discernibility throughout the process:

\gamma_{R_k}^{U_k}(D) = \gamma_{R_k}^{U_{k+1}}(D), \quad U_{k+1} \subset U_k

However, since the fuzzy partitioning of the input space is calculated with respect to all of the objects, the removal of an instance with boundary feature values, i.e., a(x) = a_max or a(x) = a_min, may affect the result of fuzzy similarity calculations such as Eqns. (4) or (5), and in turn cause the overall fuzzy-rough dependency to change. As in the above scenarios, it is also possible to improve the quality of the reducts, since the removed instances may relax the boundaries of the fuzzy positive region given in Eqn. (7), and fewer features may be required to maintain full discernibility [25]. Therefore, Algorithm 5 below suggests a procedure to prune redundant features. This procedure is equally applicable to the scenario of feature addition, since Algorithm 2 avoids further evaluation (ignoring potentially more informative features) provided that the current subset R_k is already a reduct.

Algorithm 5: Instance Removal
  R_{k+1} ← R_k
  foreach x ∈ R_k do
    if γ_{R_{k+1}\{x}}^{U_{k+1}}(D) = γ_{R_{k+1}}^{U_{k+1}}(D) then
      R_{k+1} ← R_{k+1} \ {x}
  return R_{k+1}
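The pruning step of Algorithm 5 is straightforward to express; a small Python sketch (illustrative only, reusing the assumed dependency callable) is given below.

```python
from typing import Callable, List

def prune_redundant_features(
    reduct: List[int],
    dependency: Callable[[List[int]], float],
) -> List[int]:
    """Sketch of Algorithm 5: drop features whose removal leaves gamma unchanged.

    `dependency(subset)` is assumed to evaluate gamma_subset(D) on the reduced
    universe U_{k+1}, e.g. via the fuzzy_rough_dependency helper above.
    """
    pruned = list(reduct)
    for x in list(reduct):
        without_x = [f for f in pruned if f != x]
        if without_x and dependency(without_x) == dependency(pruned):
            pruned = without_x            # x is redundant on the new universe
    return pruned
```

Comparing floating-point dependency values for exact equality may be too strict in practice; a small tolerance could be used instead.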

Fig. 1. Nature-Inspired Dynamic FS Approach

E. Combined Approach using Nature-Inspired Algorithms: An Initial Strategy

In real applications, it may be impractical to assume that only one of the above scenarios occurs; rather, a combination of those previously described is to be expected. A directly combined approach could be derived from Algorithms 2 to 5. However, the subsets would still be modified in a greedy manner, which may lead to sub-optimal solutions, particularly in terms of subset size. A nature-inspired meta-heuristic may be employed as an alternative. Algorithms such as genetic algorithms [17], particle swarm optimisation [26], or harmony search [27] have already been applied successfully to conventional FS problems, and could also be extended to support dynamic FS. In order to adapt to a continuously changing data set, three essential modifications should be made to these heuristic search algorithms, regardless of their actual implementations (a simple sketch follows the list below). Fig. 1 outlines the basic idea underlying such an approach, where the key difference with respect to a static heuristic-based FS algorithm is the iterative interaction between the dynamic data set and the FS mechanism.

1) The originally fixed pool of available features must remain up-to-date at all times. Unlike conventional optimisation problems, where the function to be optimised has several independent variables and each variable has its own range of values, most nature-inspired FS algorithms employ a single, shared variable domain: the pool of all available features. This makes it straightforward to propagate changes to the features into the algorithms' internal mechanisms.

2) If the search method in question keeps a record of past good solutions for future use, e.g., the harmony memory in harmony search, then when features are removed from the data set, these outdated features need to be removed from all of the existing solutions. This also means that any individual in the population must no longer work with these removed features.

3) All (both historical and emerging) solutions maintained by the search heuristic must be re-evaluated every time a change has occurred to the data set; if the solutions are sorted by their fitness values, a re-ordering may also be required. This ensures that the fitness values and the rankings of the solutions are always up-to-date.
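The three adaptations above can be illustrated with a small, framework-agnostic sketch (added here for illustration; not from the paper). It assumes a generic population-based search whose candidate subsets are stored together with their fitness, with the fuzzy-rough dependency as the fitness function; the class and method names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

class DynamicSubsetSearch:
    """Toy archive of candidate feature subsets for a dynamic FS heuristic."""

    def __init__(self, fitness: Callable[[Set[int]], float]):
        self.fitness = fitness                    # e.g. gamma_P(D) on the current data
        self.available: Set[int] = set()          # shared pool of available features
        self.archive: List[Tuple[Set[int], float]] = []   # (subset, fitness) pairs

    def on_features_changed(self, available: Set[int]) -> None:
        # (1) keep the shared feature pool up to date, and
        # (2) strip removed features from every stored solution.
        self.available = set(available)
        self.archive = [(s & self.available, 0.0) for s, _ in self.archive]
        self.re_evaluate()

    def on_instances_changed(self) -> None:
        # (3) instance addition/removal changes gamma, so re-score everything.
        self.re_evaluate()

    def re_evaluate(self) -> None:
        self.archive = sorted(
            ((s, self.fitness(s)) for s, _ in self.archive),
            key=lambda pair: pair[1],
            reverse=True,                         # keep the archive ranked by fitness
        )
```

The fitness callable would typically wrap the fuzzy-rough dependency of the current data state, possibly penalising subset size, while the surrounding search (GA, PSO or harmony search) keeps generating new candidates between data updates.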

IV. EXPERIMENTATION

As this is a preliminary investigation, this paper simulates the dynamic FS scenarios in order to demonstrate the efficacy of the proposed methods. Four real-valued UCI [28] benchmark data sets are employed, all of which are of high dimensionality and contain a large number of objects, thereby presenting reasonably realistic challenges for the proposed approaches. Table I provides a summary of these data sets.

TABLE I
SUMMARY OF DATA SETS

  Data Set      Features  Instances  Classes
  arrhythmia         280        452       16
  handwritten        257       1593       10
  multifeat          650       2000       10
  secom              591       1567        2

Fig. 2 shows the results obtained for the simulated feature addition scenario. Reducts of size 14 are discovered for both the multifeat and the secom data sets at the beginning, and none of the subsequently added features are more informative; thus the reducts remain constant at all times. This observation confirms the theoretical assumption made in Section III-A. For the arrhythmia and the handwritten data sets, the features available at the beginning do not provide full discernibility. As a result, the subsets are incrementally refined as new features are added, and the process stabilises when reducts (with full discernibility) are found.

Fig. 3 demonstrates the performance of the algorithm for feature removal. For all tested data sets, the dependency measure falls each time certain features are removed. This is as expected, largely due to the removal of features which were part of the previously selected subsets, and is also reflected in the reduction of subset size. Since the algorithm is greedy-based, the final subsets at termination are not guaranteed to be minimal.

Similar to the feature addition case, the quality of the subsets in the instance addition scenario (Fig. 4) gradually reaches full discernibility, once a sufficient number of instances have been added. Recall the arguments made in Section III-C: newly added objects also modify the existing fuzzy partitioning of the input space (by a small degree), and therefore new features need to be selected to discern between all of the training objects in their respective classes. During simulation, backward elimination is performed, as suggested in Algorithm 5, in an attempt to reduce the subset size while maintaining discernibility. The effect can be observed for the arrhythmia (at 332 instances) and the secom (at 811 instances) data sets.

Finally, the instance removal results of Fig. 5 show that a constant reduct size is maintained while instances are removed, even when the number of objects removed is rather significant. With the use of stochastic search algorithms, better size reduction is to be expected. Note the minor falls in the evaluation scores for the arrhythmia and the secom data sets, possibly caused by objects at the edge of the feature value range being removed (as discussed in Section III-D). These subsets are subsequently improved by selecting alternative features, effectively performing feature replacement.

TABLE II
C4.5 CLASSIFICATION ACCURACY TRAINED USING DYNAMIC DATA

  Scenario      arrhythmia    handwritten    multifeat     secom
  Feature +     49.67±4.12    58.92±1.97     82.00±0.00    93.63±0.00
  Feature −     44.57±5.35    61.33±4.14     84.75±4.58    94.23±0.15
  Instance +    55.73±4.67    53.20±7.09     90.43±1.66    93.73±1.02
  Instance −    52.46±5.34    56.88±8.32     92.50±1.30    94.27±0.00
  Base          65.97         75.74          94.54         89.56

Before carrying out the dynamic FS, 10% of the objects in each of the original data sets are held out for testing. The test results presented in Table II are obtained via the use of the C4.5 classifier [29], which is also trained dynamically during the simulation, but tested using the 10% held-out samples. The classification accuracy obtained using the full base data sets is also supplied. Apart from the multifeat data set, performing dynamic FS reduces the prediction accuracy attainable by the use of the base data sets, which is as expected. The original arrhythmia data set has the fewest objects, while having the most class labels (16). The loss of information in the dynamic simulations may have caused the reduction in accuracy for this data set in particular.
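For reference, the hold-out protocol described above might be simulated along the following lines (an illustrative sketch, not the authors' code); scikit-learn's DecisionTreeClassifier is used here only as a convenient stand-in for C4.5, and all names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate_dynamic_selection(X, y, reducts_over_time, seed=0):
    """Hold out 10% of the objects, then score the reduct found at each state.

    `reducts_over_time` is a list of feature-index lists, one per simulated
    state of the dynamic data set.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=seed
    )
    accuracies = []
    for reduct in reducts_over_time:
        clf = DecisionTreeClassifier(random_state=seed)   # stand-in for C4.5
        clf.fit(X_train[:, reduct], y_train)
        accuracies.append(clf.score(X_test[:, reduct], y_test))
    return np.mean(accuracies), np.std(accuracies)
```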

Fig. 2. Feature addition results, showing the size of the selected subset (crosses) and its associated evaluation score (squares), plotted against increasing numbers of features

Fig. 3. Feature removal results, showing subset size (crosses) and score (squares), plotted against decreasing numbers of features

Fig. 4. Instance addition results, showing subset size (crosses) and score (squares), plotted against increasing numbers of instances

Fig. 5. Instance removal results, showing subset size (crosses) and score (squares), plotted against decreasing numbers of instances

V. CONCLUSION

This paper has presented a collection of FS techniques based on fuzzy-rough set theory, in an attempt to deal with FS scenarios where features and instances may be dynamically added or removed throughout the training process. Simulations have been carried out in order to demonstrate the efficacy of the proposed algorithms, employing several real-world benchmark data sets. The paper has also suggested a generic dynamic FS framework that may bring filter-based FRFS and nature-inspired meta-heuristics together. Although promising, much could be done in the area of dynamic filter-based FS. A key concern in many practical scenarios is the responsiveness of the feature selector, where a fuzzy-rough evaluator may become less favourable due to its higher computational complexity. To this end, the results gathered from the present experimentation are indeed encouraging, as the reducts found at an early stage are generally resilient against later changes. However, an in-depth investigation of the underlying theory, especially with regard to the choice of fuzzy implicator, t-norm, and fuzzy similarity relation, may reveal further optimised methods. The study of the fuzzy-rough set core [11] of a dynamically modified data set, as well as the existing technique for performing instance selection using fuzzy-rough sets [25], are of particular interest to the further development of this work.

REFERENCES

[1] H. Liu and H. Motoda, Computational Methods of Feature Selection. Chapman & Hall/CRC, 2008.
[2] E. P. Xing, M. I. Jordan, and R. M. Karp, "Feature selection for high-dimensional genomic microarray data," in Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann, 2001, pp. 601–608.
[3] R. Bellman, Dynamic Programming, 1st ed. Princeton, NJ, USA: Princeton University Press, 1957.
[4] N. Mac Parthaláin, R. Jensen, Q. Shen, and R. Zwiggelaar, "Fuzzy-rough approaches for mammographic risk analysis," Intell. Data Anal., vol. 14, no. 2, pp. 225–244, Apr. 2010.
[5] C. Shang and D. Barnes, "Support vector machine-based classification of rock texture images aided by efficient feature selection," in The International Joint Conference on Neural Networks, June 2012, pp. 1–8.
[6] Q. Shen and R. Jensen, "Selecting informative features with fuzzy-rough sets and its application for complex systems monitoring," Pattern Recognition, vol. 37, no. 7, pp. 1351–1363, 2004.
[7] R. Jensen and Q. Shen, "Are more features better? A response to attributes reduction using fuzzy rough sets," IEEE Trans. Fuzzy Syst., vol. 17, no. 6, pp. 1456–1458, 2009.
[8] S. Senthamarai Kannan and N. Ramaraj, "A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm," Know.-Based Syst., vol. 23, no. 6, pp. 580–585, Aug. 2010.
[9] M. Dash and H. Liu, "Consistency-based search in feature selection," Artif. Intell., vol. 151, no. 1-2, pp. 155–176, Dec. 2003.
[10] M. A. Hall, "Correlation-based feature subset selection for machine learning," Ph.D. dissertation, University of Waikato, Hamilton, New Zealand, 1998.
[11] R. Jensen and Q. Shen, Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches. Wiley-IEEE Press, 2008.
[12] R. Jensen and Q. Shen, "New approaches to fuzzy-rough feature selection," IEEE Trans. Fuzzy Syst., vol. 17, no. 4, pp. 824–838, Aug. 2009.
[13] N. Mac Parthaláin, Q. Shen, and R. Jensen, "A distance measure approach to exploring the rough set boundary region for attribute reduction," IEEE Trans. Knowl. Data Eng., vol. 22, no. 3, pp. 305–317, Mar. 2010.
[14] C.-N. Hsu, H.-J. Huang, and D. Schuschel, "The ANNIGMA-wrapper approach to fast feature selection for neural nets," IEEE Trans. Syst., Man, Cybern. B, vol. 32, no. 2, pp. 207–212, 2002.
[15] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1, pp. 273–324, 1997.
[16] R. Santos-Rodriguez and D. Garcia-Garcia, "Cost-sensitive feature selection based on the set covering machine," in 2010 IEEE International Conference on Data Mining Workshops, 2010, pp. 740–746.

[17] W. Zhao, Y. Wang, and D. Li, "A dynamic feature selection method based on combination of GA with k-means," in 2010 2nd International Conference on Industrial Mechatronics and Automation, vol. 2, May 2010, pp. 271–274.
[18] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, "Online feature selection with streaming features," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1178–1192, 2013.
[19] S. C. H. Hoi, J. Wang, P. Zhao, and R. Jin, "Online feature selection for mining big data," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. New York, NY, USA: ACM, 2012, pp. 93–100.
[20] A. Fern, R. Givan, B. Falsafi, and T. N. Vijaykumar, "Dynamic feature selection for hardware prediction," Purdue University, Tech. Rep., 2000.
[21] V. Braverman, R. Ostrovsky, and C. Zaniolo, "Optimal sampling from sliding windows," J. Comput. Syst. Sci., vol. 78, no. 1, pp. 260–272, Jan. 2012.
[22] J. H. Chang and W. S. Lee, "Finding recently frequent itemsets adaptively over online transactional data streams," Inf. Syst., vol. 31, no. 8, pp. 849–869, Dec. 2006.
[23] Q. Shen and A. Chouchoulas, "A rough-fuzzy approach for generating classification rules," Pattern Recognition, vol. 35, no. 11, pp. 2425–2438, 2002.
[24] D. Dubois and H. Prade, "Putting rough sets and fuzzy sets together," in Intelligent Decision Support. Dordrecht: Kluwer Academic Publishers, 1992.
[25] R. Jensen and C. Cornelis, "Fuzzy-rough instance selection," in IEEE International Conference on Fuzzy Systems, 2010, pp. 1–7.
[26] X. Wang, J. Yang, X. Teng, W. Xia, and R. Jensen, "Feature selection based on rough sets and particle swarm optimization," Pattern Recognition Letters, vol. 28, no. 4, pp. 459–471, 2007.
[27] R. Diao and Q. Shen, "Feature selection with harmony search," IEEE Trans. Syst., Man, Cybern. B, vol. 42, no. 6, pp. 1509–1523, 2012.
[28] A. Frank and A. Asuncion, "UCI machine learning repository," 2010.
[29] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., ser. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, June 2005.