Blurred Face Recognition via a Hybrid Network Architecture

Inna Stainvas
Department of Solid Mechanics, Materials, and Structures, Tel-Aviv University, Ramat Aviv 69978, Israel
[email protected]

Nathan Intrator
Computer Science Department, Sackler Faculty of Exact Sciences, Tel-Aviv University, Ramat Aviv 69978, Israel
[email protected]

Amiram Moshaiov
Department of Solid Mechanics, Materials, and Structures, Tel-Aviv University, Ramat Aviv 69978, Israel
[email protected]

June 13, 1999

Abstract

Recognition of blurred images is a challenging problem. We introduce a hybrid recognition/reconstruction architecture that is especially suitable for recognition of images degraded by various forms of blur. This architecture is an ensemble of feed-forward networks that are constrained to reconstruct their inputs as well. The networks are trained on the original images as well as on artificially Gaussian-blurred ones, so as to achieve higher robustness to different blur operators. The networks differ in their initial weights and in the level of the reconstruction constraint, defined by the trade-off parameter λ. Face recognition results are compared with results obtained by unconstrained network architectures and by architectures that exclude the Gaussian-blurred images during training. In addition, a combination of the constrained architecture with state-of-the-art restoration methods is considered. It is shown that the combination of these methods is superior to each of the methods separately. We emphasize that incorporating blurred images in the development of the recognition system is more important than restoration methods applied post factum.

1 Introduction

We have recently shown that imposing reconstruction constraints on a feed-forward architecture improves generalization performance under various image degradations such as salt and pepper noise, partial occlusion, and recognition from compressed images (Stainvas et al., 1998). In this paper we concentrate on ways to improve recognition performance under various forms of image blur. We review several methods for image blurring and image restoration and demonstrate the superiority of our proposed architecture and training methodology compared with such techniques and with other forms of constrained network architectures.

2 Methodology

2.1 Reconstruction constraints and regularization scheme

The hybrid architecture, which combines recognition and reconstruction, is presented in Figure 1. This network attempts to improve the low-dimensional representation by concurrently minimizing the mean squared error (MSE) of the reconstruction and classification outputs. In other words, it attempts to improve the quality of the hidden-layer representation by imposing the selection of "good" features that are useful for both tasks: classification and reconstruction. Clearly, the hidden layer should have a smaller number of units than the input, so as to achieve a bottleneck compression and allow for generalization.

A similar network architecture has been proposed for reliability estimation (Pomerleau, 1993) and saliency map construction (Baluja, 1996) for autonomous navigation. Such an architecture has also been proposed as a model of the hippocampus (Gluck and Myers, 1993). The main difference from our approach is the presence of a trade-off parameter λ between the emphasis on learning to classify and the emphasis on learning to reconstruct, and the integration over a family of values of this trade-off parameter. We have found experimentally that training several networks on different λ values around an optimal value that was found once, and then averaging the results of the different networks, yields performance that is close to that of the optimal (a posteriori) λ and is sometimes even better (see the next section for details). We thus do not regard the need to estimate an appropriate λ as problematic.


Figure 1: A combined recognition/reconstruction architecture. A single hidden layer drives both the classification output layer and the reconstruction.

Generative models are closely related to our hybrid architecture, as they contain a probabilistic model of the recognition and reconstruction tasks. In the binary case, the wake-sleep architecture has been proposed (Hinton et al., 1995), and the Rectified Gaussian Belief Network (Hinton and Ghahramani, 1997) has been proposed for the continuous case. The difference from our architecture is that, instead of two hidden representations (one for the top-down pass and one for the bottom-up pass), a single internal representation is used for both.
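For concreteness, a minimal sketch of such a hybrid network is given below (our own illustration, not the authors' code), with a single shared hidden layer feeding both a classification head and a reconstruction head; treating the combined objective as a λ-weighted sum of the two MSE terms is an assumption about how the trade-off is realized.

```python
# Hypothetical sketch of the hybrid recognition/reconstruction network of Figure 1.
import torch
import torch.nn as nn

class HybridNet(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_classes):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, n_hidden)          # shared bottleneck layer
        self.classify = nn.Linear(n_hidden, n_classes)       # classification head
        self.reconstruct = nn.Linear(n_hidden, n_inputs)     # reconstruction head

    def forward(self, x):
        h = torch.sigmoid(self.hidden(x))
        return self.classify(h), self.reconstruct(h)

def hybrid_loss(class_out, recon_out, targets_onehot, clean_images, lam):
    # Assumed weighting: classification MSE plus lambda times reconstruction MSE;
    # lambda = 0 recovers the unconstrained (pure classification) network.
    cls = nn.functional.mse_loss(torch.sigmoid(class_out), targets_onehot)
    rec = nn.functional.mse_loss(recon_out, clean_images)
    return cls + lam * rec
```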

2.2 Neural Network Ensembles

It is well known that an ensemble of experts is capable of improving on the performance of single experts (Tesauro et al., 1991; Wolpert, 1992). The improvement depends on the level of independence of the errors made by the experts. In artificial neural networks, this independence is enhanced by the non-identifiability of their models, i.e., the existence of many (local) minima of the error surface. The independence reduces the contribution of the variance portion of the error when an ensemble average is used (Raviv and Intrator, 1996). An ensemble classification rule is based on averaging the real-valued outputs of all ensemble members and then producing a decision by a Bayesian classification rule; a sketch is given below. The reduction in the variance portion of the error is greater when the predictors are independent. One conventional way to generate an ensemble is to average networks with different initial weights. In our model, these results correspond to training without reconstruction constraints, i.e., with trade-off parameter λ = 0. It turns out that averaging over ensemble members that have been trained with different values of the trade-off parameter λ provides some additional independence, leading to a useful collective decision. We call this a reconstruction ensemble. Our joined-ensemble results are obtained by integrating over several networks with different λ values and several networks with λ = 0.
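A minimal sketch of this averaging rule, assuming each member outputs real-valued class scores:

```python
import numpy as np

def ensemble_predict(member_outputs):
    """member_outputs: array of shape (n_members, n_samples, n_classes) holding the
    real-valued outputs of the ensemble members."""
    mean_scores = member_outputs.mean(axis=0)   # average the members' outputs
    return mean_scores.argmax(axis=1)           # decide by the largest averaged score
```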

2.3 Training with processed images

Bootstrapping is a well-known method for artificially increasing the size of the training data. Usually the bootstrapped data is taken directly from the original data; however, the smooth bootstrap, which adds noise to the original data, leads to more independent networks and better ensemble behavior. For example, the two-spiral problem can be solved using a conventional neural network with strong injection of noise into the inputs (Raviv and Intrator, 1996).

We take this approach a step further and add blurred images to the training set. It appears that when a single form of blur (Gaussian blur) is added to the training data, this is sufficient to improve the (test) performance on other blur operations as well. However, when training with blurred images, we enforce reconstruction of the original image rather than the blurred one. This biases the hidden units to become insensitive to the blur operation and thus improves the results of the classification system.
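As an illustration of this training-set construction (hypothetical code, not the original pipeline), blurred copies are added to the inputs while the reconstruction target always remains the clean image:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_training_set(images, labels, sigma=2.0):
    """images: (n, h, w) clean training images; returns (inputs, reconstruction targets, labels)."""
    blurred = np.stack([gaussian_filter(im, sigma=sigma) for im in images])
    inputs = np.concatenate([images, blurred])         # originals plus Gaussian-blurred copies
    recon_targets = np.concatenate([images, images])   # always reconstruct the original image
    all_labels = np.concatenate([labels, labels])
    return inputs, recon_targets, all_labels
```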

3 Image degradation

The degradation process is usually modeled as a space-invariant blur with a convolution operator h, followed by corruption with additive noise n: g = h ∗ f + n. The major known causes of blurring are misfocus, camera jitter, object motion and atmospheric turbulence. All of these types of blurring amount to a low-pass operation on the image. Of particular interest is blurring with a band-pass filter, namely a difference of Gaussians (DOG) filter. A third family of image filters is the high-pass filters, which lead to image sharpening; these are quite common in medical imaging, industrial inspection and military applications. The final form of image degradation discussed in this paper is noise. Noise results from image sampling, recording, transmission, etc. We consider two types of additive noise: Gaussian white noise and pulse noise. We limit ourselves to Gaussian noise that acts independently on each pixel, with zero mean and some variance σ². Pulse noise replaces pixel intensities by either the maximum or minimum grey-level value with some probability (Rosenfeld and Kak, 1982), producing isolated high-contrast black and white points. This noise is common in video transmission.
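A small sketch of this degradation model (our own illustration, with a Gaussian PSF standing in for h):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(f, sigma_blur=2.0, sigma_noise=0.05, pulse_prob=0.0, rng=None):
    """g = h * f + n with a Gaussian PSF h, optionally followed by pulse noise."""
    rng = np.random.default_rng() if rng is None else rng
    g = gaussian_filter(f, sigma=sigma_blur)              # space-invariant blur h * f
    g = g + rng.normal(0.0, sigma_noise, size=f.shape)    # additive zero-mean Gaussian noise n
    if pulse_prob > 0:                                    # pulse noise: force pixels to min/max grey level
        mask = rng.random(f.shape) < pulse_prob
        g[mask] = rng.choice([f.min(), f.max()], size=int(mask.sum()))
    return g
```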

3.1 Main filters

The filtering may be done in either the frequency or the spatial domain. Convolution in the spatial domain is equivalent to multiplication of the Fourier transforms of the image and the filter. In each particular case we indicate in which domain the filtering is done and give the point spread function or its Fourier transform (referred to as the transfer function), as required. Examples of images with different degradations are presented in Figure 2.


Figure 2: Blurred and enhanced images. a) Result of Gaussian blur with σ = 2; b) result of the out-of-focus filter with blur radius R = 5; c) motion blur with blur propagation over 7 pixels; d) result of the DOG filter with on- and off-center widths σ₁ = 1 and σ₂ = 2; e) result of the root filter with α = 0.6.

3.1.1 Motion blur

Motion blur is another form of image degradation that may degrade recognition performance. It is common when the imaging system is mounted on a moving vehicle, especially a fast-moving one, but the image of a slowly moving robot may also suffer from motion blur. Assuming that the relative camera motion is horizontal and uniform and that the total displacement during the exposure time T is a, the transfer function H(u, v) (Gonzalez and Wintz, 1993) is given by:

H(u, v) = (T / (πua)) sin(πua) exp(−iπua).   (1)

H vanishes at values of u given by u = n/a, where n is a nonzero integer. In general, the amplitude of H(u, v) is characterized by periodic lines of zeros, which are orthogonal to the direction of motion and are spaced at intervals of 1/a on both sides of the frequency plane.
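In the spatial domain this uniform horizontal motion blur is simply an average over a pixels along the motion direction. The sketch below (our own illustration) applies it with a one-dimensional box kernel, whose transfer function has the same zeros at u = n/a as (1):

```python
import numpy as np
from scipy.ndimage import convolve1d

def horizontal_motion_blur(image, a=7):
    """Uniform horizontal motion blur over a pixels (a box PSF along the rows)."""
    kernel = np.ones(a) / a                 # box PSF; its Fourier transform has zeros at u = n/a
    return convolve1d(image, kernel, axis=1, mode='nearest')
```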

3.1.2 Out-of-focus blur

The point spread function (PSF) of a defocused lens with a circular aperture is approximated by a cylinder whose radius R depends on the extent of the focus defect (Cannon, 1976):

h(x, y) = 1/(πR²) if x² + y² ≤ R², and 0 otherwise,

where R is the "blur radius", which is proportional to the extent of defocusing. The Fourier transform of h(x, y) in this case is H(u, v) ∝ J₁(2πRr)/(2πRr), where r = (u² + v²)^½ and J₁ is the first-order Bessel function, and it is characterized by "almost-periodic" circles on which H(u, v) is zero. This occurs for r satisfying 2πRr = 3.83, 7.02, 10.2, 13.3, 16.5, ... The well-defined structure of the zeros of H(u, v) in the case of motion and misfocus blur is used for identification of the blur parameter (Cannon, 1976; Fabian and Malah, 1991) for the purpose of image restoration.
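A sketch of this out-of-focus blur as a discrete disc kernel (our own illustration):

```python
import numpy as np
from scipy.ndimage import convolve

def defocus_psf(R):
    """Discrete approximation of the cylindrical out-of-focus PSF with blur radius R."""
    y, x = np.mgrid[-R:R + 1, -R:R + 1]
    h = (x**2 + y**2 <= R**2).astype(float)
    return h / h.sum()                      # normalized, i.e. roughly 1/(pi R^2) inside the disc

def defocus_blur(image, R=5):
    return convolve(image, defocus_psf(R), mode='nearest')
```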

3.1.3 Gaussian blur

Gaussian blur, which is often used, can be caused by atmospheric and optical blur. It is known that the lens of the eye blurs images in this manner, and computer tomography images also suffer from Gaussian blur (Kimia and Zucker, 1993). The Gaussian convolution filter, written in polar coordinates in the spatial domain, is given by:

h(r, θ) = C σ⁻² exp(−r² / (2σ²)),   (2)

where C is a normalization constant. The lack of zero crossings of the Gaussian filter in the frequency domain makes its identification very difficult. Moreover, Gaussian deblurring is numerically unstable (Humel et al., 1987; Kimia and Zucker, 1993).

3.1.4 DOG filter

The difference of Gaussians (DOG) filter is a good approximation to the circularly symmetric, Mexican-hat type (center-surround) receptive fields found in early mammalian vision (Marr, 1982). It performs band-pass filtering, approximating the result of applying the Laplacian operator ∇² to an image blurred with a Gaussian filter. The zero crossings of the resulting convolved image are commonly used for edge detection and segmentation. The DOG filter, written in polar coordinates, is described by:

h(r, θ) = C₁ σ₁⁻² exp(−r² / (2σ₁²)) − C₂ σ₂⁻² exp(−r² / (2σ₂²)),   (3)

where σ₁ < σ₂. The standard deviations of the on and off centers (the positive and negative Gaussians) were σ₁ and σ₂, respectively, in the simulations discussed below.
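A short sketch of the DOG filtering of equation (3) using two Gaussian convolutions (our own illustration; scipy's Gaussian kernels are normalized to unit sum, which fixes C₁ and C₂):

```python
from scipy.ndimage import gaussian_filter

def dog_filter(image, sigma1=1.0, sigma2=2.0):
    """Difference-of-Gaussians band-pass filtering, eq. (3): on-center minus off-center response."""
    return gaussian_filter(image, sigma1) - gaussian_filter(image, sigma2)
```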

3.1.5 Root filter

The root filter is commonly used for image enhancement and deblurring (Jain, 1989). It is nonlinear and acts explicitly on the magnitude of the frequency response V of the image, changing it to V̂ with ‖V̂‖ = ‖V‖^α. For small values of α < 1, it acts as a high-pass filter, increasing the ratio between amplitudes at high and low frequencies.
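A sketch of the root filter in the frequency domain (our own illustration): the magnitude of the Fourier transform is raised to the power α while the phase is preserved.

```python
import numpy as np

def root_filter(image, alpha=0.6):
    """Root filtering: |V_hat| = |V|**alpha, with the phase of V preserved."""
    V = np.fft.fft2(image)
    V_hat = (np.abs(V) ** alpha) * np.exp(1j * np.angle(V))
    return np.real(np.fft.ifft2(V_hat))
```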

4 Image restoration

Image restoration refers to the problem of recovering an image from its blurred and noisy version using some a priori knowledge of the degradation phenomenon and of the nature of the image. It is well known that the restoration problem is ill posed (Gonzalez and Wintz, 1993; Jain, 1989; Stark, 1987), i.e., a small amount of noise in the observed image results in an unbounded perturbation of the solution. This instability is often addressed by a regularization approach (Tikhonov and Arsenin, 1977; Katsaggelos, 1989; Sezan and Tekalp, 1990; Rudin et al., 1992; You and Kaveh, 1996), which restricts the set of admissible solutions and introduces some a priori knowledge about the image and the degradation model. In the following sections, we briefly review the restoration techniques used in this work.

4.1 MSE minimization and regularization

Assuming the blur operator H is known, a natural criterion for estimating the original image f from an observed image g, in the absence of any knowledge about the noise, is to minimize the difference between the observed image and a blurred version of the restored image:

min_f M(f) = min_f ‖g − Hf‖².   (4)

Often, gradient or conjugate gradient descent methods are used to minimize M(f) (Katsaggelos, 1989; Sezan and Tekalp, 1990). Applying the gradient method to the minimization problem (4) produces the following iterative scheme:

f_{k+1} = f_k + β (H^T g − H^T H f_k).   (5)

When the blur matrix H is nonsingular and the step size β is sufficiently small, the iterative scheme converges to f̂ = H⁻¹g. This solution is known as the inverse filter method. In the frequency domain, it corresponds to the following estimate of the ideal image frequency response:

F̂(u, v) = G(u, v) / H(u, v).   (6)

As mentioned before, blur such as motion or defocusing leads to a singular matrix H. In this case, the above optimization method yields an iterative scheme that converges to the minimum-norm least-squares solution f⁺ = H⁺g of (4) (Katsaggelos, 1989; Jain, 1989), where H⁺ is the generalized inverse of the matrix H. In the presence of noise, the iterative algorithm converges to f⁺ + H⁺n and thus contains noise filtered by the pseudo-inverse matrix. Since H is often a low-pass filter, the noise is amplified and the obtained solution may be very far from the desired one. To overcome this sensitivity to noise, some a priori information about the noise or the ideal image is often introduced as a quantitative constraint that replaces the ill-posed problem by a well-posed one. This method is called regularization. The most well-known regularization methods (Tikhonov and Arsenin, 1977; Sezan and Tekalp, 1990) have a general formulation as the minimization of the functional:

L(f) = ‖Hf − g‖² + λ‖Cf‖²,

where the regularization operator C is chosen to suppress the energy of the restored image at high frequencies, which is equivalent to smoothness of the original image in the spatial domain. Since H is usually a low-pass filter, the regularization operator C is taken to be a Laplacian in order to obtain a smooth original image. The regularization parameter λ may be known a priori or estimated; theoretically, it is inversely proportional to the signal-to-noise ratio (SNR). Although regularization of the MSE criterion with the smoothness constraint ‖Cf‖ is the basis for most of the work in image restoration, it often leads to unacceptable ringing artifacts around sharp intensity transitions. This effect is due to image blurring around lines and edges. One solution to this problem is given by the following functional minimization (Katsaggelos, 1989):

L(f) = Σ_{x∈Ω} [g(x) − (h ∗ f)(x)]² + λ Σ_{x∈Ω} ω(x) [(c ∗ f)(x)]².   (7)

The first term in (7) represents the fidelity of the restored image with respect to the observation, and the second is a smoothness constraint. Space adaptivity is achieved through the introduction of the weight function ω, which is set to be small around edge areas and larger in smooth areas; in practice it is usually taken as the inverse of the local variance of the image. The space-adaptivity approach has been extended to the case of an unknown blur operator (You and Kaveh, 1996; Chan and Wong, 1996). The method incorporates a priori knowledge about the image and the point spread function (PSF) simultaneously. It proceeds by minimizing a cost function that consists of a restoration error measure and two regularization terms, one for the image and one for the blurring kernel. In (Chan and Wong, 1996) the regularization term has the form ∫|∇f| dx, called the total variation (TV) norm (Rudin et al., 1992). This follows the idea that the image consists of smooth patches, instead of being smooth everywhere, and thus provides better recovery of image edges.
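As a concrete illustration of the iterative scheme (5), the following sketch restores an image blurred by a known Gaussian PSF (our own code; it is a plain gradient iteration, not the blind-deconvolution method used in the experiments). For a symmetric PSF, H^T = H.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def iterative_restore(g, sigma=2.0, beta=1.0, n_iter=50):
    """f_{k+1} = f_k + beta * H^T (g - H f_k) for a known Gaussian blur H (eq. 5)."""
    H = lambda x: gaussian_filter(x, sigma=sigma)   # symmetric PSF, so H^T = H
    f = g.copy()                                    # start from the observed image
    for _ in range(n_iter):
        f = f + beta * H(g - H(f))                  # gradient step on ||g - H f||^2
    return f
```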

4.2 Denoising

Denoising may be considered a particular restoration method in which the PSF of the blur operator is a delta function. Thus, some of the methods described above are appropriate for denoising as well (Rudin et al., 1992; You and Kaveh, 1996). We also consider some rank algorithms (Yaroslavsky and Eden, 1996), which are especially designed for noise reduction. First, we consider an averaging technique called peer group averaging (PGA), in which the central pixel intensity is replaced by the average intensity of some predefined neighboring pixels that are closest to it in intensity value. The number of pixels over which averaging is performed is called the peer group size, and it controls the amount of smoothing. The second method, the median filter, replaces the grey-level intensity of each pixel by the median of its neighboring pixels' intensities. This method is particularly effective when the noise is spike-like; it is nonlinear, very robust, and preserves edge sharpness.
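A sketch of these two denoisers (our own illustration): the median filter via scipy, and a simple peer-group-averaging loop in which each pixel is replaced by the mean of the ng window pixels closest to it in intensity.

```python
import numpy as np
from scipy.ndimage import median_filter

def denoise_median(image, size=3):
    """Median filter: each pixel is replaced by the median of its size x size neighbourhood."""
    return median_filter(image, size=size)

def denoise_pga(image, ng=6, size=3):
    """Peer group averaging: average the ng neighbourhood pixels closest in intensity to the centre."""
    pad = size // 2
    padded = np.pad(image, pad, mode='edge')
    out = np.empty(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            window = padded[i:i + size, j:j + size].ravel()
            order = np.argsort(np.abs(window - image[i, j]))
            out[i, j] = window[order[:ng]].mean()    # peer group of size ng
    return out
```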

5 Data-set description and some implementation details

In our simulations we used a data-set collected locally by the Tel-Aviv University Computer Vision Group (Tankus, 1996). The data-set contains images of 37 male and female faces, with 10 pictures of each person at a resolution of 84 × 56. We split the data into 6 training images and 4 testing images per person and applied a normalization preprocessing. This preprocessing partially removes the variability due to viewpoint by (automatically) setting the eyes to the same position in all images (Tankus et al., 1997). Further preprocessing computes the difference between each image and the average over the whole training set, leading to so-called "caricature" images (Kirby and Sirovich, 1990). The number of hidden units was set to 10 by trial, and the constant learning rate was taken small enough (of order 0.01) to avoid divergence of the gradient descent algorithm. The initial weights were generated using Matlab's random function, with values distributed uniformly in the interval [0, 0.001]. About 5000-10000 training epochs were used.
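A minimal sketch of the "caricature" preprocessing (assuming, as described above, that it amounts to subtracting the training-set mean image):

```python
import numpy as np

def caricature(images, mean_image=None):
    """Subtract the training-set average face from each image ("caricature" representation)."""
    if mean_image is None:
        mean_image = images.mean(axis=0)    # average over the training set
    return images - mean_image, mean_image
```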

6 Neural Network Architectures

We have considered six network architectures. They represent three network ensembles (unconstrained, reconstruction, and joined), each trained either with or without artificially Gaussian-blurred images. The reconstruction ensembles were composed of networks with the trade-off parameter λ varying from 0 to 0.3 in increments of 0.05. The number of networks in the unconstrained ensembles (λ = 0) is 6. The joined ensemble includes the networks of both the unconstrained and the reconstruction ensembles.

7 Results

Table 1 presents misclassification results for all network architectures. It shows that recognition is significantly impaired by blur operators, which lead to an approximately twofold increase in the misclassification error. In the experiments both with and without blurred images, the joined ensembles (the 3rd and 6th columns) are the least sensitive to the blur operations. Incorporating the blurred images in the training process improves recognition, so that the unconstrained ensemble with blurred images (the 4th column) is superior to the joined ensemble trained on the original images only (the 3rd column). Table 2 presents recognition results obtained after the degraded images are first restored with the methods described above. The results demonstrate that restoration methods applied before recognition improve classification: the results become more robust and, in fact, similar to the performance on the non-degraded images. At the same time, we observe that in some cases the joined ensemble with blurred images (the 6th column of Table 1) is superior to the joined ensemble applied to the restored images (the 3rd column of Table 2).


Types of corruption | Unconstrained (λ=0) ensemble | Reconstruction ensemble | Joined ensemble | Unconstrained (λ=0) ensemble, blurred images | Reconstruction ensemble, blurred images | Joined ensemble, blurred images
"Clean data" | 12.8 | 12.8 | 13.5 | 9.5 | 10.8 | 8.8
"Salt and Pepper" noise with d = 0.2 | 25.0 | 20.3 | 20.3 | 20.3 | 18.2 | 14.2
Gaussian noise with snr = 1 | 15.5 | 16.9 | 12.8 | 10.1 | 10.8 | 8.1
Gaussian blur with σ = 2 | 19.6 | 16.2 | 16.9 | 14.2 | 11.5 | 10.8
Blur with pruned Gaussian filter + noise, snr = 100 | 18.9 | 16.2 | 18.2 | 10.8 | 10.1 | 10.8
Horizontal motion blur, d = 5 pixels + Gaussian noise, snr = 10 | 21.6 | 23.6 | 19.6 | 16.9 | 15.5 | 12.2
Horizontal motion blur, d = 7 pixels | 27.0 | 29.1 | 23.6 | 20.3 | 23.0 | 16.2
Out-of-focus blur with R = 5 | 20.9 | 20.9 | 17.6 | 16.2 | 10.8 | 11.5
DOG filter with σ₁ = 1 and σ₂ = 2 | 31.8 | 26.4 | 23.6 | 23.0 | 26.4 | 20.9
Root filter with α = 0.6 | 16.9 | 17.6 | 12.8 | 10.1 | 10.8 | 8.1

Table 1: Misclassification error (in percent) for degraded TAU images. The last three columns correspond to ensembles trained with Gaussian-blurred images.

Types of corruption | Unconstrained (λ=0) ensemble | Reconstruction ensemble | Joined ensemble | Unconstrained (λ=0) ensemble, blurred images | Reconstruction ensemble, blurred images | Joined ensemble, blurred images
"Clean data" | 12.8 | 12.8 | 13.5 | 9.5 | 10.8 | 8.8
Median filter for "Salt and Pepper" noise | 13.5 | 12.8 | 12.8 | 9.5 | 8.8 | 8.8
PGA denoising with ng = 6 for Gaussian noise | 14.9 | 15.5 | 12.8 | 10.8 | 12.8 | 8.8
Root filter with α = 0.8 for Gaussian blur | 13.5 | 14.2 | 14.2 | 12.2 | 12.2 | 8.8
Blind deconvolution for pruned Gaussian | 12.8 | 14.2 | 12.8 | 10.1 | 10.1 | 8.8
Smoothing + deblurring for horizontal motion blur, d = 5 pixels | 14.9 | 14.9 | 14.2 | 9.5 | 10.8 | 9.5
Blind deconvolution for horizontal motion blur, d = 7 pixels | 13.5 | 15.5 | 12.8 | 10.8 | 11.5 | 9.5

Table 2: Misclassification error (in percent) after restoration for degraded TAU images. The last three columns correspond to ensembles trained with Gaussian-blurred images.

8 Conclusions

Two methods for improving the challenging problem of blurred image recognition were proposed: (i) preprocessing the blurred images with blind deconvolution methods before recognition; (ii) applying the regularized reconstruction-constraints technique to a training set expanded by blurred images of some form. These constraints force the reconstruction scheme to become less sensitive to the blur operation. Two training schemes, with and without blurred images, have been compared, and different network ensembles have been considered. The best classification scheme is the one that includes both the hybrid recognition/reconstruction architecture and the use of Gaussian-blurred images. The best network ensemble is the joined ensemble, obtained by merging the unconstrained and the reconstruction ensembles trained with the Gaussian-blurred images. We have shown that the combination of methods (i) and (ii) is superior to each of the methods separately. In addition, we emphasize that incorporating Gaussian-blurred images in the development of the recognition system is more important than restoration methods applied post factum. It should be noted that the scheme we introduce is one of the simplest possible top-down/bottom-up architectures, and we have shown its importance and usefulness in degraded image recognition.

References

Baluja, S. (1996). Expectation-based selective attention. PhD thesis, School of Computer Science, CMU.
Cannon, M. (1976). Blind deconvolution of spatially invariant image blurs with phase. ICASSP, 24:58-63.
Chan, T. F. and Wong, C. K. (1996). Total variation blind deconvolution. UCLA Technical Report 96-45.
Fabian, R. and Malah, D. (1991). Robust identification of motion and out-of-focus blur parameters from blurred and noisy images. CVGIP, 53(5):403-412.
Gluck, M. A. and Myers, C. E. (1993). Hippocampal mediation of stimulus representation: A computational theory. Hippocampus, 3(4):491-516.
Gonzalez, R. C. and Wintz, P. (1993). Digital Image Processing. Addison-Wesley Publishing Company.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science, 268:1158-1161.
Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352:1177-1190.
Humel, R. A., Kimia, B. B., and Zucker, S. W. (1987). Deblurring Gaussian blur. CVGIP, 38(1):66-80.
Intrator, N., Reisfeld, D., and Yeshurun, Y. (1996). Face recognition using a hybrid supervised/unsupervised neural network. Pattern Recognition Letters, 17:67-76.
Jain, A. K. (1989). Fundamentals of Digital Image Processing. Prentice Hall, London.
Katsaggelos, A. K. (1989). Iterative image restoration algorithms. Optical Engineering, 28(7):735-748.
Kimia, B. B. and Zucker, S. W. (1993). Analytic inverse of discrete Gaussian blur. Optical Engineering, 32(1):166-176.
Kirby, M. and Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for characterization of human faces. PAMI, 12(1):103-108.
Marr, D. (1982). Vision. Freeman, New York.
Pomerleau, D. A. (1993). Input reconstruction reliability estimation. In Giles, C. L., Hanson, S. J., and Cowan, J. D., editors, Advances in Neural Information Processing Systems, volume 5, pages 279-286. Morgan Kaufmann.
Raviv, Y. and Intrator, N. (1996). Bootstrapping with noise: An effective regularization technique. Connection Science, Special issue on Combining Estimators, 8:356-372.
Rosenfeld, A. and Kak, A. C. (1982). Digital Picture Processing. Academic Press, New York.
Rudin, L. I., Osher, S., and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D, 60:259-268.
Sezan, M. I. and Tekalp, A. M. (1990). Survey of recent developments in digital image restoration. Optical Engineering, 29(5):393-404.
Stainvas, I., Intrator, N., and Moshaiov, A. (1998). Improving recognition via reconstruction. Preprint, pages 1-15.
Stark, H. (1987). Image Recovery: Theory and Application. Academic Press, San Diego.
Tankus, A. (1996). Automatic face detection and recognition. Master's thesis, Tel-Aviv University.
Tankus, A., Yeshurun, Y., and Intrator, N. (1997). Face detection by direct convexity estimation. In Proceedings of the 1st Intl. Conference on Audio- and Video-based Biometric Person Authentication, Switzerland. Springer.
Tesauro, G., Touretzky, D., and Leen, T., editors (1991). Neural Network Ensembles, Cross Validation, and Active Learning. The MIT Press, London.
Tikhonov, A. N. and Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. V. H. Winston and Sons, Washington.
Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:71-86.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5:241-259.
Yaroslavsky, L. and Eden, M. (1996). Fundamentals of Digital Optics. Birkhäuser, Boston.
You, Y.-L. and Kaveh, M. (1996). A regularization approach to joint blur identification and image restoration. IEEE Transactions on Image Processing, 5(3):416-427.
