Statistically Independent Feature Extraction for SAR Imagery

Jose C. Principe, Ph.D.
Computational NeuroEngineering Laboratory
University of Florida
Gainesville, FL 32611
[email protected]

1.0 Abstract

This paper reports on the work conducted in the University of Florida Computational NeuroEngineering Laboratory during 1998 under a DARPA grant. We have developed and applied a new feature extraction methodology based on the training of nonlinear mappers using an information theoretic criterion. We constructed a pose estimator with our method and showed the ability to estimate the pose of any vehicle in the MSTAR I and II databases to within 8 degrees. We further proposed a new classifier architecture based on the pose estimator followed by a set of simpler classifiers that are discriminantly trained.

2.0 Introduction

The original goal of this project was to develop a new feature extraction methodology that would improve the performance of automatic target recognition (ATR) algorithms applied to SAR (synthetic aperture radar) imagery. SAR ATR is still largely based on the detection and specific geometric arrangement of point scatterers. Point scatterers are highly dependent upon the geometry of the metallic object, which makes them a plausible candidate for object discrimination. However, they are also highly variable with the pose of the object, so their value for statistical pattern recognition is still unclear. Pattern recognition imposes a metric based on the Mahalanobis distance (Euclidean distance normalized by the covariance), so any feature extraction methodology that produces high variance should be treated with caution. Alternatively, we sought the power of nonlinear mappers trained with an information theoretic criterion to automatically establish the best features for a given goal. Information theory guarantees that the minimum of information is lost (i.e., the maximum of information is transmitted) from the input to the output of the mapper, so in principle this method should yield the best possible classifier. The problem with this reasoning is that no method was available to estimate entropy (and mutual information) directly from examples, as is required in SAR ATR. The first year of this project was therefore dedicated to developing the theory and the algorithms that enable the use of information theory to train arbitrary nonlinear mappers directly from samples. This was successfully accomplished. The second year was devoted to testing the algorithms. Since our approach was new, we decided to attack a problem of intermediate difficulty, pose estimation, before applying the method to the classification of SAR vehicles. We can now report that we have also achieved success in this step: our pose estimation algorithm achieves a precision better than 8 degrees on all the MSTAR I and II database vehicles. Having developed a reliable pose estimator, we proposed a new architecture for ATR which allows discriminant training, unlike present ATR classifiers. In the third year of the grant we will fully develop and test this new classifier. This report highlights the major contributions of the year.
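To make the variance argument above concrete, the following sketch (illustrative numbers only, not taken from the report) shows how the Mahalanobis distance normalizes the Euclidean distance by the feature covariance, so that a feature with high variance contributes less to the distance:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a distribution with the given mean/covariance."""
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# Toy 2-D feature samples (hypothetical, for illustration only).
features = np.array([[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]])
mu = features.mean(axis=0)
cov = np.cov(features, rowvar=False)

# Distance of a new feature vector from the training distribution.
print(mahalanobis(np.array([5.0, 0.5]), mu, cov))
```

A point at the sample mean has distance zero, and displacements along a high-variance feature axis are penalized less than equal displacements along a low-variance axis.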

3.0 Algorithm Development

3.1 Entropy estimation

The first year's research yielded a novel nonparametric method to estimate entropy directly from samples [1]. The crux of the technique came from the realization that the Euclidean distance between the Parzen estimated PDF (probability density function) at the output of a mapper and a known PDF reduces to computing local interactions among pairs of output samples [1]. From these interactions a local error can be calculated and the parameters of a nonlinear mapper (a multilayer perceptron, since it is a universal mapper) adapted using the well-established technique of error backpropagation [2]. Although the method was used to successfully train a MACE filter [3], two small problems needed improvement: the a priori use of the mean square error to measure the difference between PDFs required justification, and a more principled extension to mutual information was needed. This was accomplished during 1998. The theoretical understanding of the method was improved by exploiting alternate definitions of entropy besides Shannon's. Renyi's quadratic entropy, integrated with the Parzen window estimator, produces an entropy estimator that can be used to maximize or minimize entropy without any approximations beyond the nonparametric estimation of the PDF. This work was reported at several conferences [4], [5]. We can say that a novel methodology called information theoretic learning (ITL), as general as the mean square error criterion, was developed to adapt nonlinear mappers (Figure 1). Note that the method can be applied beyond nonlinear regression, since a desired response is not needed: the nonlinear mapper self-organizes with a property of its outputs. This aspect is critical for applications of the technique to feature extraction, where the desired response (the feature) is unknown. Our method of manipulating entropy has an interesting physical interpretation: samples can be thought of as information particles (IPTs).
A group of IPTs creates an information potential (IP) field. Maximizing entropy is equivalent to minimizing the IP. The forces exerted by the IP field on each IPT can be thought of as information forces (IFs). IFs take the role of the injected error in the nonlinear mapper, adapting its parameters such that the IP is minimized. Hence, when we seek a nonparametric sample-by-sample algorithm to manipulate entropy, we end up with a gratifying physical interpretation of the operations, reviving the strong link between information theory and physics. This analogy was recently described in a paper [6].
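The estimator described above can be sketched in a few lines. For one-dimensional samples and a Gaussian Parzen kernel of width sigma, the information potential is V(y) = (1/N^2) sum_ij G(y_i - y_j; 2 sigma^2), Renyi's quadratic entropy is -log V, and the information forces are the gradient of V with respect to each sample (the kernel width and sample values below are illustrative assumptions, not from the report):

```python
import numpy as np

def gauss(d2, var):
    """1-D Gaussian kernel value for squared distance d2 and variance var."""
    return np.exp(-d2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def information_potential(y, sigma=0.5):
    """Parzen estimate of the integral of p(y)^2: V = (1/N^2) sum_ij G(y_i - y_j; 2 sigma^2)."""
    n = len(y)
    d2 = (y[:, None] - y[None, :]) ** 2
    return gauss(d2, 2.0 * sigma ** 2).sum() / n ** 2

def renyi_quadratic_entropy(y, sigma=0.5):
    """H2(Y) = -log V(y): large when samples are spread out, small when clustered."""
    return -np.log(information_potential(y, sigma))

def information_forces(y, sigma=0.5):
    """dV/dy_i: the 'force' the potential field exerts on each information particle."""
    n = len(y)
    diff = y[:, None] - y[None, :]
    g = gauss(diff ** 2, 2.0 * sigma ** 2)
    return (-diff / (2.0 * sigma ** 2) * g).sum(axis=1) * 2.0 / n ** 2

clustered = np.array([0.0, 0.1, 0.2])
spread = np.array([-2.0, 0.0, 2.0])
# Spread samples -> lower potential -> higher entropy.
print(renyi_quadratic_entropy(clustered), renyi_quadratic_entropy(spread))
```

In an ITL training loop these forces replace the injected error of backpropagation: each output sample is pushed by the field, and the resulting per-sample "error" is propagated back through the mapper to adapt its weights.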

(Figure 1 diagram: the input x_i enters the mapping network g(w, [.]); its output y_i feeds the information theoretic criterion, whose error ε_i is back-propagated (BP) to adapt the network.)

Figure 1: Training a mapper (linear or nonlinear) with ITL

3.2 Mutual Information Estimation

The problem of mutual information is slightly more complex and has not yet been solved exactly. Mutual information is related to the "distance" between PDFs. The problem is that when the Parzen window estimator is applied to the Kullback-Leibler divergence, a very complex expression results, which is not amenable to adaptive algorithms. We therefore approximated the Kullback-Leibler divergence with quadratic distances, proposing the Cauchy-Schwarz inequality [4] and the triangular inequality [5]. Both lead to extensions of the information potential (IP) field and can therefore be easily integrated with the Parzen window estimator. The IP fields are now more involved, since there are three types of fields (associated with the joint PDF, the marginals, and the product of the two PDF terms). But because the distances are quadratic, the IPs add up, and an IF can still be derived for each sample, which allows the adaptation of the nonlinear mapper with mutual information [5]. Mutual information (MI) is a much more desirable criterion than entropy for engineering applications. We consider MI a "super-general" learning paradigm that encompasses both supervised and unsupervised learning: depending upon how it is applied, the nonlinear mapper can be trained in supervised or unsupervised mode. For instance, when one minimizes mutual information among the outputs of the mapper, the paradigm is unsupervised; when one maximizes the mutual information between the output of the mapper and an external desired response, the method falls into supervised learning. We were able to show that for the minimization of MI, both the Cauchy-Schwarz and the triangular inequality measures are good approximations to the K-L divergence. This could be expected, since a nonlinear space locally resembles a Euclidean space. The problem resides in the maximization of MI: in this case our L2 measures do not take into consideration the possible curvature of the space, and they may not lead to the best possible results. Experimentally, however, we verified that they yield good results; further theoretical work is needed for the maximization of MI.
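As a minimal sketch of the quadratic-distance idea, the Cauchy-Schwarz divergence between two PDFs p and q is D_CS = -log( (int pq)^2 / (int p^2 int q^2) ), and each integral becomes a sum of pairwise Gaussian kernel interactions under the Parzen estimator. The sample sets and kernel width below are illustrative assumptions (the report applies the same machinery to joint and marginal PDFs, which is more involved):

```python
import numpy as np

def gauss(d2, var):
    """1-D Gaussian kernel value for squared distance d2 and variance var."""
    return np.exp(-d2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def cross_potential(x, y, sigma=0.5):
    """Parzen estimate of int p q: (1/NM) sum_ij G(x_i - y_j; 2 sigma^2)."""
    d2 = (x[:, None] - y[None, :]) ** 2
    return gauss(d2, 2.0 * sigma ** 2).mean()

def cs_divergence(x, y, sigma=0.5):
    """Cauchy-Schwarz divergence -log( V_xy^2 / (V_xx V_yy) ); zero for identical sets."""
    vxy = cross_potential(x, y, sigma)
    vxx = cross_potential(x, x, sigma)
    vyy = cross_potential(y, y, sigma)
    return -np.log(vxy ** 2 / (vxx * vyy))

a = np.array([0.0, 0.2, -0.1, 0.1])
b = a.copy()   # identical sample sets -> divergence 0
c = a + 3.0    # shifted set -> positive divergence
print(cs_divergence(a, b), cs_divergence(a, c))
```

By the Cauchy-Schwarz inequality the ratio inside the log never exceeds one, so the divergence is nonnegative, and every term is again a pairwise interaction, which is what makes a sample-by-sample information force derivable.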

4.0 Experimental Results

4.1 MSTAR I testing

Our work concentrated on the estimation of target pose using the MSTAR I and II databases. The first tests utilized the three targets of MSTAR I, which share the same depression angle. We formulated pose estimation as a maximum likelihood estimation problem: given the target data, we would like to find the angle that provides the best possible match for the pose. We therefore have to estimate, given the image, which pose is most likely, which requires estimating the conditional PDF of the image set for all possible poses. Since the image is very high dimensional, we estimated the conditional PDF at the output of a mapper trained to maximize the mutual information between its output and the known pose (Figure 2). In effect, we trained the mapper to find a projection that best preserved the pose information. We were interested in testing the following variables:
- mapper topology (linear or nonlinear)
- 180 versus 360 degree estimation
- performance of several training criteria
- accuracy and generalization of the estimator

(Figure 2 diagram: images x and angles a are the inputs to the mapper; the outputs y interact in the information potential field, and the resulting forces are back-propagated to adapt the mapper.)

Figure 2. Pose estimation

Our conclusions are the following.

Mapper topology
For pose estimation we tested a multilayer perceptron (MLP) (6400 inputs, 4 hidden nodes, 2 outputs) and a linear mapper (6400x2). The linear mapper performs as well as the MLP, probably because the input space is so high dimensional that for this problem a linear projection that preserves the pose can be found. We therefore conclude that a linear mapper should be utilized, due to its faster and more reliable training. We also experimented with different numbers of output PEs and concluded that 2 PEs are sufficient for 0-180 degree pose estimation. Three PEs may have to be used if 0-360 degree pose is required.

180-360 degree estimation
We concluded that the symmetry of the vehicle creates local minima in training which make the 0-360 pose estimators more difficult to train. It also worsened the overall accuracy of the mapping. However, the performance is not as bad as the variance of the estimated angle may indicate, since the errors, when made, tend to be multiples of 180 degrees (i.e., instead of, say, a 30 degree pose, the estimated value is 210). We therefore conclude that the pose estimator should be trained for 0-180 degrees, with a classifier used to determine whether the vehicle is facing up or down.

Performance of several training criteria
We modified several details of the algorithm and compared performance. We concluded that the triangular inequality is preferable to the Cauchy-Schwarz inequality for pose estimation. The triangular inequality produces an output that stabilizes into a circle when the training vehicle is moved across all the poses (0-180 degrees). This is rather appealing, since the pose can be measured as an angle on a circle, so it is very easy to estimate pose with this training criterion by using a simple Euclidean distance in the output space (Figure 3). When the Cauchy-Schwarz inequality is utilized

(Figure 3 diagram: mapper outputs plotted in the y1-y2 plane.)

Figure 3. System outputs: diamonds training, triangles testing data.

to estimate MI, the output of the mapper stabilizes into a complex 2-D pattern that tends to fill the full output space (it resembles a fractal). This definition thus induces a metric in the output space that is not Euclidean, making it much more difficult to use the output of the mapper as a pose estimator. We also implemented Shannon's definition of mutual information (the difference between the output entropy and the conditional entropy of the input given the output); its performance is good but slightly worse than the triangular inequality. We also tested several definitions for the pose of the training image: since in this framework the pose of the training exemplars is a random variable, there are several possibilities for defining the joint probability. We conclude that the triangular inequality should be utilized for pose estimation, and that a simple coding of the angles of the training exemplars works well.

Accuracy and generalization
The training data has a resolution of 3.5 degrees. Test set performance is shown in Table I. We conclude that the accuracy on the same vehicle is dictated by the training set resolution, while the accuracy on the other vehicle types is worse, with a mean value of approximately 7 degrees.

Note, however, that the system shows good generalization, since it was trained on the BMP2 (a personnel carrier) and provided reasonable results on the tank (T72). When we used two vehicles for training, the accuracy improved overall, as can be seen by comparing the 1-vehicle and 2-vehicle columns of Table I.
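The circular output structure described above suggests a very simple pose readout. The following is a hypothetical sketch (the doubled-angle encoding and the names `encode_pose`/`decode_pose` are our illustrative assumptions, not the report's actual coding of the training angles): a 0-180 degree pose is mapped once around the unit circle, and the estimated pose is recovered from a 2-D mapper output as an angle.

```python
import numpy as np

def encode_pose(deg):
    """Map a 0-180 degree pose to a point on the unit circle (one full turn)."""
    rad = np.deg2rad(2.0 * deg)  # 0-180 degrees spans the whole circle
    return np.array([np.cos(rad), np.sin(rad)])

def decode_pose(y):
    """Recover the pose from a 2-D output as an angle on the circle (atan2)."""
    rad = np.arctan2(y[1], y[0]) % (2.0 * np.pi)
    return np.rad2deg(rad) / 2.0

print(decode_pose(encode_pose(30.0)))
```

Reading the pose as an angle like this is equivalent to the report's "simple Euclidean distance on the output space": the nearest point on the circle to the mapper's output determines the estimated angle.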

Table I. Pose estimation error (degrees) on MSTAR I test data

CLASS/TYPE     1 vehicle    2 vehicles
BMP2/c21*      0.45         0.36
BMP2/c21       2.23         2.39
T72/132*       8.36         0.37
T72/132        5.56         3.28
BMP2/9563      3.20         3.05
BMP2/9566      3.61         3.17
BTR70/c71      3.63         2.92
T72/s7         7.41         3.28

So our conclusion is that the pose estimator using MI is a practical method for several targets at the same depression angle.

4.2 MSTAR II Testing

Testing on the MSTAR II database is more realistic, since it provides more targets and several depression angles. It is also much more time consuming, since our algorithm's training cost grows with the square of the number of training patterns (O(N^2)). We therefore tried to find simpler ways to train our system. Since we found that the output of the network trained with MI always stabilizes onto a circle, we decided to utilize this output as the desired response for our mapper and train it with the well-established MSE criterion, in which case training becomes O(N). Nevertheless, we wanted to experimentally compare mappers trained with the MI criterion and the MSE criterion. We were interested in the effect on performance of the following variables:
- training with MSE versus MI
- depression angle
- target types

MSE versus MI training
Once the desired response for MSE training was selected from the MI-determined response, the two methods should be equivalent. We confirmed this experimentally: the linear mapper trained with MSE and with the MI criterion performed at the same level. We therefore conclude that the MI criterion should first be run on a simple problem to find the response that preserves most of the information of the mapping under the constraint of maximizing MI. This response can then be used as the desired response with the more traditional MSE criterion, and training becomes O(N).

Depression angle
The goal was to find out how well the algorithm generalizes across different depression angles. We found that the best depression angle for training is midway between the extremes (for MSTAR II, 30 degrees). The maximum error across five different target types (trained on a single target) was 10 degrees. We then tested how performance improved when two depression angles (the extremes of 15 and 45 degrees) were used in training: the maximum error improved to 8 degrees.

Target types
Next we trained the algorithm with two target types and three depression angles to compare performance with the simpler training. We conclude that we can obtain an accuracy better than 8 degrees for all the target types and depression angles. See Table II.
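The two-stage procedure described above (run MI once to discover the circular desired response, then train with plain MSE) can be sketched on synthetic data. Everything below is a toy stand-in under stated assumptions: the "images" are random vectors, the circle targets are computed directly from known poses rather than from an MI-trained run, and the MSE fit is solved in closed form with least squares instead of gradient training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 50 'images' (20-D) with known 0-180 degree poses.
n, d = 50, 20
images = rng.normal(size=(n, d))
poses = rng.uniform(0.0, 180.0, size=n)

# Stage 1 (stand-in): desired 2-D responses on a circle. In the report these
# come from a small MI-trained run; here we simply encode the known poses.
rad = np.deg2rad(2.0 * poses)
targets = np.stack([np.cos(rad), np.sin(rad)], axis=1)

# Make the mapping learnable for the demo: mix the targets into the inputs.
images[:, :2] += 3.0 * targets

# Stage 2: fit the linear mapper with plain MSE. Gradient training with a
# fixed desired response costs O(N) per epoch, versus O(N^2) for the MI
# criterion; here we solve the least-squares problem in closed form.
W, *_ = np.linalg.lstsq(images, targets, rcond=None)
pred = images @ W
mse = float(np.mean((pred - targets) ** 2))
print(mse)
```

The point of the sketch is the complexity argument: once the desired response is fixed, each training pattern contributes independently to the MSE gradient, whereas the MI criterion requires all pairwise sample interactions.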

Table II. Pose estimation mean error (degrees) on MSTAR II test data

CLASS/TYPE      1 vehicle/1 angle    2 vehicles/3 angles
T72/a04-15      4.05                 3.68
T72/a05-15      4.74                 4.13
T72/a04-17      3.77                 3.56
T72/a05-17      3.98                 3.75
T72/a64-30      1.06                 2.75
T72/a64-30      3.32                 3.91
T72/a64-45      6.39                 2.84
T72/a64-45      5.15                 4.84
d7-17           10.10                6.97
brdm/2-17       5.24                 4.69
zsu23/4-17      6.17                 6.41
brdm2/c-30      8.70                 5.04
zsu23/4c-30     7.67                 7.59

5.0 Classification Architecture

With the availability of a pose estimator that provides an accuracy better than 8 degrees, it is foreseeable to utilize an alternate architecture for ATR classifiers. The conventional classifier design is based on the matched filter approach: for each target of interest a template is created through training, all the templates are applied to the image chip under analysis, and the chip is assigned to the class providing the largest output [3]. This architecture is sub-optimal in most cases; in fact, the matched filter approach provides optimal classification only if all the classes have the same covariance. In pattern recognition terminology, we say that the classifier is not discriminantly trained. We propose to improve this type of classifier by introducing discriminant training. Training an MLP (or any other type of classifier) on the raw data is not recommended due to the huge input space and the traditional lack of data. Therefore, we propose first to estimate the pose of the target, and then, for each pose, to design a classifier for the classes under test (Figure 4). This architecture divides the input space (and hence the complexity of the task) into subregions for which a smaller feature set should provide sufficient discrimination. We have initiated preliminary tests of this novel classifier architecture.

(Figure 4 diagram: a pose estimator followed by a 180 degree selector that routes the image chip to pose-specific feature + classifier stages.)

Figure 4. A SAR-ATR classifier exploiting the pose
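The proposed pose-gated architecture can be sketched as a simple dispatch: estimate the pose, quantize it into a sector, and route the chip to the classifier trained for that sector. This is a hypothetical sketch; the sector width, the function names, and the stand-in classifiers are our illustrative assumptions, not the report's implementation.

```python
import numpy as np

N_SECTORS = 6  # assumed: 30-degree sectors over the 0-180 degree range

def pose_sector(pose_deg, n_sectors=N_SECTORS):
    """Quantize an estimated 0-180 degree pose into a sector index."""
    return min(int(pose_deg / (180.0 / n_sectors)), n_sectors - 1)

def classify(chip, pose_deg, sector_classifiers):
    """Route the image chip to the classifier for its estimated pose sector."""
    return sector_classifiers[pose_sector(pose_deg)](chip)

# Stand-in per-sector classifiers (a real system would train one per sector,
# each on a smaller feature set sufficient within that pose subregion).
classifiers = {s: (lambda chip, s=s: f"class-from-sector-{s}") for s in range(N_SECTORS)}
print(classify(np.zeros((64, 64)), 42.0, classifiers))
```

Because each sector classifier only ever sees chips from a narrow range of poses, the within-sector variability is much smaller, which is the premise behind the divide-and-conquer design.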

6.0 References

1. Fisher J., Principe J., "Entropy manipulation of arbitrary nonlinear mappings", in Proc. IEEE Workshop NNSP7, 14-23, Amelia Island, 1997.
2. Haykin S., Neural Networks: A Comprehensive Foundation, Macmillan, 1995.
3. Fisher J., Principe J., "Recent advances to nonlinear MACE filters", Optical Eng. 36, #10, 2697-2709, 1998.
4. Xu D., Principe J., Fisher J., Wu H., "A novel measure for independent component analysis", in Proc. ICASSP98, vol. II, 1161-1164, 1998.
5. Xu D., Fisher J., Principe J., "Mutual information approach to pose estimation", accepted in SPIE 98.