Adaptive Nonlinear Auto-Associative Modeling Through Manifold Learning

Junping Zhang^{1,2} and Stan Z. Li^3

1 Intelligent Information Processing Laboratory, Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China. [email protected]
2 The Key Laboratory of Complex Systems and Intelligence Science, Chinese Academy of Sciences
3 National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing 100080, China. [email protected]

Abstract. We propose adaptive nonlinear auto-associative modeling (ANAM), based on the Locally Linear Embedding (LLE) algorithm, for learning the intrinsic principal features of each concept separately and performing recognition thereby. Unlike traditional supervised manifold learning algorithms, the proposed ANAM algorithm has several advantages: 1) it implicitly embodies discriminant information, because the suboptimal parameters of each ANAM are determined by the error rate on a validation set; 2) it avoids the curse of dimensionality without losing accuracy, because recognition is completed in the original space. Experiments on character and digit databases show the advantages of the proposed ANAM algorithm.

1 Introduction

Much manifold learning literature has been published on discovering intrinsic information embedded in high-dimensional spaces [1]. Two major algorithms, LLE and ISOMAP [2, 3], are devoted to discovering intrinsic regularities underlying high-dimensional data. Several algorithms based on manifold learning have also been proposed for supervised learning. However, most supervised manifold learning algorithms assume that all data can be projected into the same subspace for recognition, without considering the properties of the individual concepts [4]. The disadvantage of this approach is that the separability of the data can be impaired, because data from different classes may overlap in the low-dimensional subspace [5]. Based on our observations, we argue that projecting data into different subspaces is more suitable than projecting them into a common subspace when the data contain remarkably distinct concepts, for example, characters and digits.

Assuming that the data manifold of each concept is generated by some intrinsic principal features, we propose adaptive nonlinear auto-associative modeling (ANAM) for learning intrinsic features and performing recognition (Section 2). First, the low-dimensional subspace of each class is obtained with the LLE algorithm. Second, based on the error rates on the validation set, the parameters of each ANAM are adaptively obtained by minimizing the validation error rate. Consequently, an ANAM classifier is developed that does not require the LLE algorithm at recognition time. The proposed ANAM does not cause data to overlap in the low-dimensional subspace, with the corresponding loss of accuracy; therefore, it partially overcomes the curse of dimensionality. Experiments (Section 3) on several character and digit databases show the advantages of the proposed ANAM algorithm.

2 Adaptive Nonlinear Auto-Associative Modeling

To establish the mapping and inverse mapping of ANAM between the observed data and the corresponding low-dimensional data, the locally linear embedding (LLE) algorithm [2] is first used to form the low-dimensional counterpart Y (Y ⊂ R^d) of the training set X (X ⊂ R^N, N ≫ d). The paired data set (X, Y) is then used to build the subsequent ANAM. The main principle of the LLE algorithm is to preserve the local neighborhood relations of the data in both the observed Euclidean space and the intrinsic embedding space: each sample in the observation space is a linearly weighted average of its samples under a neighborhood constraint. Thus we obtain the low-dimensional counterpart Y of the original data X in the embedding space, and the completed set (X, Y) is used for the subsequent modeling of the ANAMs. Mapping an unknown sample within the LLE framework cannot, in our experiments, attain the optimal mapping solution; here it is used only to avoid calculating the parameters of the inverse mapping matrices and mapping matrices of the ANAM simultaneously.

To construct the ANAMs, the forward mapping and inverse mapping matrices need to be estimated. In the proposed algorithm, we use the misclassification rate on the validation set to adjust the model parameters and obtain a suboptimal model. First, the validation set V ⊂ R^N is mapped into its low-dimensional counterpart V_d ⊂ R^d with the LLE mapping idea, avoiding the simultaneous computation of the mapping and inverse mapping matrices. After V_d is obtained, the reconstruction procedure is formulated with the inverse mapping matrices of the ANAM. On the basis of the Weierstrass approximation theorem, the inverse mapping formula of the i-th ANAM is expressed with a nonlinear kernel function as follows:

$$v_x(i) = \sum_{j=1}^{n_i} \beta_j(i)\, k_{rec}(y_j(i), v_y(i)), \qquad y_j(i) \in Y(i),\; v_y(i) \in \mathbb{R}^d,\; v_x(i) \in \mathbb{R}^N \qquad (1)$$

where v_x(i) is the sample reconstructed through the i-th ANAM, n_i is the number of samples used to construct the i-th ANAM, B(i) = {β_j(i)} is the N × n_i weighted inverse mapping (reconstruction) matrix of the i-th ANAM, and v_y(i) is the low-dimensional validation sample obtained with the LLE algorithm. Without loss of generality, let the reconstruction kernel function be the Gaussian kernel:

$$k_{rec}(y_j(i), v_y(i)) = \exp\left(-\| y_j(i) - v_y(i) \|^2 / 2\sigma_{rec}^2(i)\right) \qquad (2)$$
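As a concrete illustration, here is a minimal NumPy sketch of the inverse mapping of Eqs. (1) and (2); the names gaussian_kernel, reconstruct, Y_i, and B_i are ours, not the paper's.

```python
import numpy as np

def gaussian_kernel(a, b, sigma_sq):
    # Eq. (2): k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma_sq))

def reconstruct(v_y, Y_i, B_i, sigma_rec_sq):
    """Inverse mapping of the i-th ANAM (Eq. 1): reconstruct an
    N-dimensional sample from a d-dimensional embedded point v_y.

    Y_i : (n_i, d) array, low-dimensional embedding of concept i
    B_i : (N, n_i) array, weighted inverse mapping matrix
    """
    k = np.array([gaussian_kernel(y_j, v_y, sigma_rec_sq) for y_j in Y_i])
    return B_i @ k  # v_x(i) = sum_j beta_j(i) * k_rec(y_j(i), v_y)
```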


For computational simplicity, the parameters σ²_rec(i) of all concepts are set to the same value in the proposed ANAM algorithm. Once a validation sample has been auto-associated through the different ANAMs, the following similarity measure can be used for recognition:

$$C(v_x) = \arg\max_i \exp\left(-\| v_x - v_x(i) \|\right), \qquad i = 1, \dots, L \qquad (3)$$

where L denotes the number of concepts.
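Continuing the same sketch, the decision rule of Eq. (3) can be written as follows; since exp is monotone, this is equivalent to picking the concept with the smallest reconstruction distance. The name classify is ours.

```python
def classify(v_x, reconstructions):
    # Eq. (3): reconstructions[i] is v_x(i), the sample re-projected
    # through the i-th ANAM; pick the concept that reconstructs v_x best.
    similarities = [np.exp(-np.linalg.norm(v_x - v_x_i))
                    for v_x_i in reconstructions]
    return int(np.argmax(similarities))
```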

The geometrical interpretation of Formula (3) is that a sample re-projected to the original space through the ANAM of its own concept is closer to the original sample than the samples reconstructed through the ANAMs of the other concepts.

Considering the geometrical property of the Gaussian kernel, it is not difficult to see that the suboptimal parameter σ²_opt-rec(i) can be obtained adaptively by searching for the value that yields the minimum recognition error rate on the validation set. After the parameters of the inverse mapping matrices are obtained, the mapping function of the validation set can be formulated as:

$$v_y^*(i) = \sum_{j=1}^{n_i} \alpha_j(i)\, k_{map}(x_j(i), v_x(i)), \qquad x_j(i) \in X(i),\; v_x(i) \in \mathbb{R}^N,\; v_y^*(i) \in \mathbb{R}^d \qquad (4)$$

where A(i) = {α_j(i)} is the d × n_i weighted mapping matrix, and k_map(x_j(i), v_x(i)) denotes the similarity of the datum v_x(i) to the sample x_j(i):

$$k_{map}(x_j(i), v_x(i)) = \exp\left(-\| x_j(i) - v_x(i) \|^2 / 2\sigma_{map}^2(i)\right) \qquad (5)$$
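The forward mapping of Eqs. (4) and (5) mirrors the reconstruction sketch above; map_to_embedding, X_i, and A_i are again our own names.

```python
def map_to_embedding(v_x, X_i, A_i, sigma_map_sq):
    """Forward mapping of the i-th ANAM (Eqs. 4-5): project an
    N-dimensional sample into the d-dimensional subspace of concept i.

    X_i : (n_i, N) array, training samples of concept i
    A_i : (d, n_i) array, weighted mapping matrix
    """
    k = np.array([gaussian_kernel(x_j, v_x, sigma_map_sq) for x_j in X_i])
    return A_i @ k  # v_y*(i) = sum_j alpha_j(i) * k_map(x_j(i), v_x)
```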

The reconstruction matrix is the same as in Eq. (1); the only difference is that v_x(i) and v_y(i) are replaced by v_x^*(i) and v_y^*(i) in Eq. (1) and Eq. (3), and the suboptimal parameters σ²_opt-map(i) of the mapping matrices are adaptively computed from the error rate on the validation set with the reconstruction parameters σ²_opt-rec(i) held fixed. It is worth noting that, given the completed data set (X, Y), the weighted mapping matrix A(i) and the weighted inverse mapping matrix B(i) are calculated as follows:

$$A(i) = Y(i) \cdot \left(k_{map}(x_j(i), x_k(i))\right)^{-1}, \qquad j, k = 1, \dots, n_i,\quad i = 1, \dots, L \qquad (6)$$

$$B(i) = X(i) \cdot \left(k_{rec}(y_j(i), y_k(i))\right)^{-1}, \qquad j, k = 1, \dots, n_i,\quad i = 1, \dots, L \qquad (7)$$
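Eqs. (6) and (7) reduce to inverting a kernel Gram matrix over the training samples of each concept. Below is a sketch under the assumption that samples are stored as rows; the small ridge term reg is our addition to keep the Gram matrices numerically invertible and is not part of the paper.

```python
def kernel_matrix(P, sigma_sq):
    # Pairwise Gaussian kernel matrix between the rows of P.
    sq = np.sum(P ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * P @ P.T
    return np.exp(-d2 / (2.0 * sigma_sq))

def fit_anam(X_i, Y_i, sigma_map_sq, sigma_rec_sq, reg=1e-8):
    """Eqs. (6)-(7): estimate A(i) and B(i) from the paired set (X_i, Y_i).

    X_i : (n_i, N) array, training samples of concept i
    Y_i : (n_i, d) array, their LLE embedding
    """
    n_i = X_i.shape[0]
    K_map = kernel_matrix(X_i, sigma_map_sq) + reg * np.eye(n_i)
    K_rec = kernel_matrix(Y_i, sigma_rec_sq) + reg * np.eye(n_i)
    A_i = Y_i.T @ np.linalg.inv(K_map)  # d x n_i, Eq. (6)
    B_i = X_i.T @ np.linalg.inv(K_rec)  # N x n_i, Eq. (7)
    return A_i, B_i
```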

At this point, a test sample can be projected into the low-dimensional space and reconstructed in the original space with each ANAM, and recognition is completed according to Eq. (3). Unlike the auto-associative neural network proposed by Bourlard [6], the proposed ANAM generalizes the model to high-dimensional nonlinear data and avoids the convergence problems that neural networks often suffer from.
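Putting the pieces of the sketch together, classifying one test sample might look as follows; models is a hypothetical container of per-concept parameters, not a structure defined in the paper.

```python
def anam_predict(v_x, models):
    """models[i] = (X_i, Y_i, A_i, B_i, sigma_map_sq, sigma_rec_sq) for
    concept i. Map the test sample into each concept's subspace (Eq. 4),
    reconstruct it back (Eq. 1), and classify it by Eq. (3)."""
    recons = []
    for X_i, Y_i, A_i, B_i, s_map, s_rec in models:
        v_y = map_to_embedding(v_x, X_i, A_i, s_map)
        recons.append(reconstruct(v_y, Y_i, B_i, s_rec))
    return classify(v_x, recons)
```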

3 Experiments

Experiments are carried out on four databases to evaluate the recognition ability of the proposed ANAM approach. Two sets are the UCI character database [7] and the OCR (optical character recognition) database [8], and the other two sets are the OPTDigits and PENDigits databases from the UCI repository [9, 10]. The details of the four databases and the data partitions are given in Table 1. Note that, except for the OPTDigits database, each dimension of the datasets was linearly scaled to [0, 1] in these experiments. The former two databases are randomly partitioned into three disjoint sets (training set, validation set, and test set), and the final results are the average of 10 repetitions. For the latter two databases, the training and validation sets are randomly partitioned into disjoint sets, and the test set is the one already separated in the original databases.

Table 1. Experimental Databases and Data Partitions

                      UCI      OCR      OPTDigits  PENDigits
Number of samples     20,000   16,280   5,620      10,992
Original dimensions   16       26       64         16
Number of classes     26       26       10         10
Training set          300*26   250*26   300*10     600*10
Validation set        50*26    50*26    823        1,494
Test set              10,900   8,480    1,797      3,498

In our experiments, the training sets of the different concepts are used to build the different low-dimensional subspaces with the LLE algorithm separately, the validation set is used to search for the suboptimal parameters of the ANAMs based on the error rate, and the test set is used to evaluate the generalization performance of the proposed ANAM algorithm. Moreover, several additional parameters need to be predefined. Without loss of generality, the neighborhood parameter K is set to 50 for all four databases. The ranges of the mapping parameter σ²_map and the reconstruction parameter σ²_rec are both set to [10^-5, 10^10] with a multiplicative step of 10^0.5, so that the suboptimal parameters can be searched adaptively (a sketch of this sweep follows Table 2).

The experimental results on these databases are reported in Table 2. To compare the recognition performance of the proposed ANAM with other known state-of-the-art algorithms, experimental results from [8] are cited.

Table 2. Average Error Rates and Standard Deviations of ANAM, K-Nearest Neighbor (K-NN, K=3), and MLP

Classifier  UCI (%)               OCR (%)                OPTDigits (%)  PENDigits (%)
ANAM        6.99 ± 0.34 (10 dim)  10.79 ± 0.56 (10 dim)  1.28 (10 dim)  4.26 (10 dim)
K-NN        10.10                 10.5                   2.00 (K=1)     2.26 (K=1)
MLP         20.7                  23.8                   —              —

Analyzing these results, it is not difficult to see that the proposed ANAM algorithm is comparable with the other algorithms. On the UCI letter and OPTDigits databases, the error rates of the proposed algorithm are the lowest among the compared algorithms. For instance, on the UCI character database, the error rate of ANAM is about 69.20% of that of K-NN and 33.76% of that of MLP. Furthermore, on all four databases, the proposed ANAM uses fewer features (10 dimensions) to model the intrinsic feature spaces.
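The adaptive parameter search described above can be sketched as a sweep over a log-spaced grid; eval_error is a hypothetical callback that measures the validation error rate for a candidate σ² and is not named in the paper.

```python
# Log-spaced grid from 10^-5 to 10^10 with multiplicative step 10^0.5.
sigma_grid = 10.0 ** np.arange(-5.0, 10.5, 0.5)

def search_sigma_sq(eval_error, grid=sigma_grid):
    # Keep the sigma^2 with the lowest validation error rate.
    return min(grid, key=eval_error)
```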

[Fig. 1. The Influence of Training Samples: (a) the ANAM algorithm on the UCI Letter Database; (b) the ANAM algorithm on the OCR Letter Database. Both panels plot error rates (with standard deviations) against the number of training samples per class.]

We also investigated the influence of the number of training samples, with the size of the validation set fixed, on the error rate of the test set (these results are likewise the average of 10 runs). The results on the UCI and OCR databases are illustrated in Figure 1. It can be seen that the error rate gradually decreases as the number of training samples increases. For example, when the number of training samples per class is 400, the error rate and standard deviation on the UCI Letter Database are 6.07% ± 0.49%.

4 Discussions/Conclusions

In this paper, we propose ANAM for modeling different concepts and classifying samples that belong to remarkably distinct concepts. Unlike other supervised manifold learning approaches, the proposed ANAM has the advantage that it overcomes the curse of dimensionality without losing accuracy, and discriminant information is implicitly embodied because the parameters are determined by the error rate on the validation set. In future work, we will consider how to combine clustering algorithms with ANAM to improve the recognition rate.

References

1. J. Zhang, S. Z. Li, and J. Wang, "Manifold Learning and Applications in Recognition," in Intelligent Multimedia Processing with Soft Computing, Y. P. Tan, K. H. Yap, and L. Wang (Eds.), Springer-Verlag, Heidelberg, 2004.
2. S. T. Roweis and L. K. Saul, "Nonlinear Dimensionality Reduction by Locally Linear Embedding," Science, 290, 2000, pp. 2323-2326.
3. J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, 290, 2000, pp. 2319-2323.
4. J. Zhang, H. Shen, and Z.-H. Zhou, "Unified Locally Linear Embedding and Linear Discriminant Analysis Algorithm (ULLELDA) for Face Recognition," in Advances in Biometric Personal Authentication, S. Z. Li, J. Lai, T. Tan, G. Feng, and Y. Wang (Eds.), LNCS 3338, Springer-Verlag, 2004, pp. 209-307.


5. P. M. Baggenstoss, "Class-Specific Classifier: Avoiding the Curse of Dimensionality," IEEE A&E Systems Magazine, Vol. 19, No. 1, 2004, pp. 37-52.
6. H. Bourlard and Y. Kamp, "Auto-Association by Multilayer Perceptrons and Singular Value Decomposition," Biological Cybernetics, 59, 1988, pp. 291-294.
7. P. W. Frey and D. J. Slate, "Letter Recognition Using Holland-Style Adaptive Classifiers," Machine Learning, 6, 1991, pp. 161-182.
8. S. Kumar, J. Ghosh, and M. Crawford, "A Bayesian Pairwise Classifier for Character Recognition," in Cognitive and Neural Models for Word Recognition and Document Processing, N. Mursheed (Ed.), World Scientific Press, 2000.
9. F. Alimoglu and E. Alpaydin, "Methods of Combining Multiple Classifiers Based on Different Representations for Pen-based Handwriting Recognition," Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN 96), June 1996, Istanbul, Turkey. http://www.cmpe.boun.edu.tr/~alimoglu/tainn96.ps.gz
10. C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html