
Discrete feature weighting & selection algorithm

Norbert Jankowski
Department of Informatics, Nicolaus Copernicus University
ul. Grudziądzka 5, 87-100 Toruń, Poland
[email protected]

Abstract— A new method of feature weighting, useful also for feature selection, is described. It is efficient and gives accurate results. In general, the weighting algorithm may be used with any kind of learning algorithm. Here the weighting algorithm was used with the k-nearest neighbors model to estimate the optimal feature base for a given distance measure. Results obtained with this algorithm clearly show its superior performance in several benchmark tests.

I. INTRODUCTION

It is well known [1], [2] that the initial feature set of the data sets/databases used for classification or approximation is not an optimal information source, and that analysis of the input features may be very helpful in further processing of the data. Several methods using different strategies of input feature extraction and weighting have already been presented in books and articles [3], [4], [1], [5]. Some of them measure the amount of information carried by a given attribute (or a subset of attributes, although the complexity then grows exponentially with the subset size) using information theory or statistics. Other methods use learning models to observe accuracy changes, which helps to estimate weights in the next phase of the feature weighting (or extraction) process [6], [7], [4]. This type of feature weighting may be used with several types of learning models. The algorithm presented in this article belongs to the second type of feature weighting/extraction algorithms, and will be presented in conjunction with the k-nearest neighbors model [8], although the algorithm itself may be used with any learning model.

II. FEATURE WEIGHTING ALGORITHM

The idea of the discrete quasi-gradient algorithm is to look for an optimal vector of weight changes in a given state of the weighting procedure. The process is repeated as long as any improvement in accuracy can be made. In parallel, some control parameters are changed to stimulate quasi-gradient directions of the weight changes.

Another goal of this algorithm is to create a stable feature weighting algorithm: the new (weighted) set of input features should never significantly decrease the accuracy of the final model. Of course no progress can be expected if the original input feature set is already optimal. Furthermore, repeating the whole weighting procedure may yield a few distinguishable solutions. This is not an error; it is in the nature of some databases. Propagating this information (separate weighted feature sets) to other models may be important and may help to obtain different final models.

The general scheme of this algorithm is as follows: cross-validation for learning calls the FindWeights procedure to estimate a set of weight vectors (one CV part generates one vector of weights). The final weight vector is estimated as an average over all weight vectors from CV learning. The detailed scheme can be found in Algorithm 1 and the description below.

a) Cross-validation for learning: The main loop of the algorithm consists of cross-validation (CV) used as a learning technique. This means that the whole data set is divided into p equal (if possible) parts (folds) S_i, i = 1, ..., p. In the i-th CV iteration the procedure FindWeights is called. FindWeights uses the set Ŝ_i = ∪_{k=1,...,p, k≠i} S_k as a learning set, and S_i as a validation set. As a result, the procedure FindWeights returns a vector of feature weights w^i. Let us define:

    w^i = FindWeights(Ŝ_i, S_i)    (1)
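For concreteness, this CV learning loop can be sketched in Python as follows. The function name cv_weight_vectors, the fold-splitting details, and the find_weights signature are illustrative assumptions of this sketch, not taken from the paper:

```python
import numpy as np

def cv_weight_vectors(X, y, find_weights, p=10, seed=0):
    """Run p-fold CV learning: one weight vector per fold (eq. 1)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, p)           # S_1, ..., S_p
    weight_vectors = []
    for i in range(p):
        val = folds[i]                        # S_i: validation part
        train = np.concatenate([folds[k] for k in range(p) if k != i])  # S_hat_i
        w_i = find_weights(X[train], y[train], X[val], y[val])
        weight_vectors.append(w_i)
    return weight_vectors
```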

Each weight w_j^i of w^i = [w_1^i, ..., w_n^i] is bounded to the interval [0, 1]. Details of the procedure FindWeights will be presented later.

b) Final weights estimation: The vectors w^i consist of weights for each of the features j = 1, ..., n, where n is the number of input features of the data set S. Let w be the sum of the weight vectors w^i obtained in the cross-validation phases:

    w = Σ_{i=1,...,p} w^i    (2)

The final weight vector w∗ is then obtained through normalization:

    w∗ = w / max_{k=1,...,n} w_k    (3)

where w = [w_1, ..., w_n] and w∗ = [w_1∗, ..., w_n∗]. This part of CV learning is really simple: once all CV iterations are completed, the final model is estimated from the intermediate models.

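Continuing the sketch above, the aggregation step of eqs. (2) and (3) is a few lines of numpy:

```python
import numpy as np

def aggregate_weights(weight_vectors):
    """Sum the per-fold weight vectors (eq. 2) and normalize by the
    largest component (eq. 3), so that max_k w*_k = 1."""
    w = np.sum(weight_vectors, axis=0)
    return w / w.max()
```

With these two pieces, the outer FeatureWeighting function amounts to aggregate_weights(cv_weight_vectors(...)).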
Now FindWeights, the heart of the whole algorithm, will be described.

c) Procedure FindWeights: This is the core procedure of each cross-validation phase. First, the initial weight values are assigned: w_k = 1, k = 1, ..., n. In this case the algorithm starts from the full input feature set. It may sometimes be useful to start with weights initialized to zero (w_k = 0), which means that the algorithm starts from an empty feature set and features are added in the course of weighting.

Algorithm 1 Scheme of the feature weighting algorithm

Function FeatureWeighting(S)
  S_1, ..., S_p — p equal parts of S;  Ŝ_i := ∪_{k=1,...,p, k≠i} S_k
  for i = 1 to p do
    w^i := FindWeights(Ŝ_i, S_i)
  end for
  w := w^1 + ... + w^p
  w_max := max_{k=1,...,n} w_k
  w := w / w_max
  return w

Function FindWeights(Ŝ, S)
  lastAcc := −1;  back := false
  repeat
    ∆ := 1
    lastPhaseAcc := lastAcc
    lastAcc := newAcc := validate(w)
    while ∆ ≥ ∆_min do
      repeat
        if back then
          back := false
        else
          lastAcc := newAcc
        end if
        for i = 1 to n do
          v_i^+ := validate(w^+(i))
          v_i^- := validate(w^-(i))
          if v_i^± > lastAcc then
            w_i′ := w_i ± θ∆
            if w_i′ < 0 then w_i′ := 0 end if
            if w_i′ > 1 then w_i′ := 1 end if
          end if
        end for
        newAcc := validate(w′)
        if newAcc > lastAcc then
          w := w′
        else
          back := true
        end if
      until newAcc ≤ lastAcc
      ∆ := ∆/2
    end while
  until lastAcc ≤ lastPhaseAcc
  return w
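Since the listing above is schematic, the following Python sketch gives one possible concrete reading of FindWeights, with a k-NN accuracy used as validate as in the paper (eq. 4 below). The halving schedule for ∆, the stopping tests, the parameter values θ = 0.5, ∆_min = 0.125, k = 5, and the tie-breaking between the +∆ and −∆ directions are assumptions of this sketch; integer class labels are assumed for y.

```python
import numpy as np

def knn_accuracy(X_tr, y_tr, X_val, y_val, w, k=5):
    """validate(w): accuracy of k-NN on the validation set, with both
    sets scaled feature-wise by the weight vector w."""
    A, B = X_tr * w, X_val * w
    # pairwise Euclidean distances between validation and training points
    d = np.linalg.norm(B[:, None, :] - A[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    votes = y_tr[nn]                              # k nearest labels per point
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return np.mean(pred == y_val)

def find_weights(X_tr, y_tr, X_val, y_val, theta=0.5, delta_min=0.125, k=5):
    n = X_tr.shape[1]
    w = np.ones(n)                                # start from the full feature set
    validate = lambda w_: knn_accuracy(X_tr, y_tr, X_val, y_val, w_, k)
    last_phase_acc, last_acc = -1.0, validate(w)
    while last_acc > last_phase_acc:              # repeat while a phase improves
        last_phase_acc = last_acc
        delta = 1.0
        while delta >= delta_min:
            improved = True
            while improved:
                improved = False
                w_new = w.copy()
                for i in range(n):                # probe each feature with +/- delta
                    for sign in (+1.0, -1.0):
                        w_probe = w.copy()
                        w_probe[i] = np.clip(w_probe[i] + sign * delta, 0.0, 1.0)
                        if validate(w_probe) > last_acc:
                            # accepted direction: apply a damped step theta*delta
                            w_new[i] = np.clip(w[i] + sign * theta * delta, 0.0, 1.0)
                new_acc = validate(w_new)
                if new_acc > last_acc:            # keep combined change only if it helps
                    w, last_acc, improved = w_new, new_acc, True
            delta /= 2.0                          # refine the step size
    return w
```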

In each step of the search, two validation accuracies are computed for each feature:

    v_i^+ = validate(w^+(i)),    v_i^- = validate(w^-(i))    (4)

where w^±(i) = [w_1, ..., w_{i−1}, w_i ± ∆, w_{i+1}, ..., w_n]. validate(w) is a function which computes the accuracy on the validation set S_i (see eq. 1) of the given learning model M trained on the data set Ŝ scaled by the vector w. The k-nearest neighbors model [8] was used as M to obtain the results presented in section III.

The vectors v^+ and v^- contain information about which weight changes, plus or minus ∆, may help to improve the validation accuracy. If any scalar v_i^± indicates that a given change may help, the new set of weights is defined according to:

    w_i′ = w_i − θ∆    if  v_i^- > v
    w_i′ = w_i + θ∆    if  v_i^+ > v ∧ v_i^-