S

Sample Quality

The intrinsic characteristic of a biometric signal may be used to determine its suitability for further processing by the biometric system, or to assess its conformance to preestablished standards. The quality of a biometric signal is a numerical value (or a vector) that measures this intrinsic attribute (see also ▶ Biometric Sample Quality).
▶ Biometric Algorithms
▶ Fusion, Quality-Based

Sample Size
▶ Manifold Learning
▶ Performance Evaluation, Overview
▶ Test Sample and Size

Scalability

Scalability is the ability of a biometric system to extend adaptively to a larger population without requiring major changes to its infrastructure.
▶ Performance Evaluation, Overview

Scenario Tests

Scenario tests are those in which biometric systems collect and process data from test subjects in a specified application. An essential characteristic of scenario testing is that the test subject is "in the loop," interacting with capture devices in a fashion representative of a target application. Scenario tests evaluate end-to-end systems, inclusive of capture device, quality validation software, enrollment software, and matching software.
▶ Performance Testing Methodology Standardization

Sampling Frequency

Sampling frequency is the number of samples captured per second from the continuous hand-drawn signal to generate a discrete signal.
▶ Digitizing Tablet

© 2009 Springer Science+Business Media, LLC

Scene Marks

Crime scene marks are generally any physical phenomenon created or left behind at, and in relation to, a crime scene; these can be fingerprints, blood spatter, intentional and unintentional damage, or alteration to objects in the environment of the crime.
▶ Footwear Recognition

Scent Identification Line-Ups

A procedure in which a trained dog matches a sample odor provided by a person to its counterpart in an array (or line-up) of odors from different people, following a fixed protocol. Scent identification line-ups are used in forensic investigations as a tool to match scent traces left by a perpetrator at a crime scene to the odor of a person suspected of that crime. The protocol includes certification of the team involved; collecting and conserving scent samples at crime scenes; collecting, conserving, and presenting suspect and other array odors; working procedures; and reporting. Scent identification line-ups have evolved from the simple line-ups used in human scent tracking/trailing, where a dog has to walk up to the person whose track it has been following and indicate that person through some trained behavior.
▶ Odor Biometrics

Score Fusion
▶ Fusion, Score-Level
▶ Multiple Experts

Score Fusion and Decision Fusion

Score fusion is a paradigm that calculates a similarity score for each of the two modalities, then combines the two scores according to a fusion formula; e.g., the overall score is calculated as the mean of the two modality scores. Decision fusion is a paradigm that makes an accept–reject decision for each of the two modalities, then combines the two decisions according to a fusion rule; e.g., the unknown sample is accepted only if both modalities yield an accept decision.
▶ Multibiometrics, Overview

Score Normalization

Score normalization techniques aim, generally, to reduce score variability in order to facilitate the estimation of a single speaker-independent threshold during the decision step. Most current normalization techniques are based on estimating the impostor score distribution, whose mean μ and standard deviation σ depend on the considered speaker model and/or test utterance. These mean and standard deviation values are then used to normalize any incoming score s using the normalization function

score_N(s) = (s − μ) / σ

Two main score normalization techniques used in speaker recognition are:

1. Znorm. The zero normalization (Znorm) method (and its variants such as Hnorm (Heck, L.P., Weintraub, M.: Handset-dependent background models for robust text-independent speaker recognition. In: ICASSP (1997))) normalizes the score distribution using the claimed speaker's statistics. In other words, the claimed speaker model is tested against a set of impostors, resulting in an impostor similarity score distribution which is then used to estimate the normalization parameters μ and σ. The main advantage of Znorm is that the estimation of these parameters can be performed during the training step.

2. Tnorm. Test normalization (Tnorm) (Auckenthaler, R., Carey, M., Lloyd-Thomas, H.: Score normalization for text-independent speaker verification systems. Digital Signal Processing 10 (2000) 42–54) is another score normalization technique, in which the parameters μ and σ are estimated using the test utterance. Thus, during testing, a set of impostor models is used to calculate impostor scores for the given test utterance, and μ and σ are estimated from these scores. Tnorm is known to improve performance particularly in the low false-alarm region.
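The two techniques above can be sketched as follows (a minimal illustration; the scoring function and impostor cohort are hypothetical placeholders, not part of any particular system):

```python
import numpy as np

def impostor_stats(score_fn, impostor_data):
    """Estimate mu and sigma from impostor scores.

    For Znorm, score_fn is the claimed speaker's model scored against a
    cohort of impostor utterances (so mu and sigma can be precomputed at
    training time). For Tnorm, the test utterance is scored against a
    set of impostor models at test time.
    """
    scores = np.array([score_fn(d) for d in impostor_data])
    return scores.mean(), scores.std()

def normalize(s, mu, sigma):
    """The normalization function: score_N(s) = (s - mu) / sigma."""
    return (s - mu) / sigma
```

The only difference between the two methods, in this sketch, is which side of the comparison supplies the impostor scores and therefore when the statistics can be computed.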


Any of a number of rules for adjusting a raw similarity score in a way that takes into account factors such as the amount of data on which its calculation was based, or the quality of the data. One purpose of score normalization in biometrics is to prevent false matches from arising simply because only a few elements (e.g., biometric features) were available for comparison: an accidental match by chance would then be more like tossing a coin only a few times and producing a perfect run of all heads. Another purpose of score normalization is to make it possible to compare or to fuse different types of measurements, as in multibiometrics. For example, Z-score normalization redefines every observation in units of standard deviations from the mean, thereby allowing incommensurable scores (like height and weight) to become commensurable (e.g., he is 3.2 standard deviations heavier than normal but 2.3 standard deviations taller than normal). Frequently the goal of score normalization is to map samples from different distributions into normalized samples from a universal distribution. For example, in iris recognition a decision is made only after the similarity score (fractional Hamming Distance) has been converted into a normalized score that compensates for the number of bits that were available for comparison, thereby preventing accidental False Matches arising just because of a paucity of visible iris tissue.
▶ Score Normalization Rules in Iris Recognition
▶ Session Effects on Speaker Modeling
▶ Speaker Matching

Score Normalization Rules in Iris Recognition

John Daugman
Cambridge University, Cambridge, UK

Synonyms

Commensurability; Decision criterion adjustment; Error probability non-accumulation; Normalised Hamming Distance

Definition

All biometric recognition systems are based on similarity metrics that enable decisions of "same" or "different" to be made. Such metrics require normalizations in order to make them commensurable across comparison cases that may differ greatly in the quantity of data available, or in the quality of the data. Is a "perfect match" based only on a small amount of data better or worse than a less perfect match based on more data? Another need for score normalization arises when interpreting the best match found after an exhaustive search, in terms of the size of the database searched. The likelihood of a good match arising just by chance between unrelated templates must increase with the size of the search database, simply because there are more opportunities. How should a given "best match" score be interpreted? Addressing these questions on a principled basis requires models of the underlying probability distributions that describe the likelihood of a given degree of similarity arising by chance from unrelated sources. Likewise, if comparisons are required over an increasing range of image orientations because of uncertainty about image tilt, the probability of a good similarity score arising just by chance from unrelated templates again grows automatically, because there are more opportunities. In all these respects, biometric similarity ▶ score normalization is needed, and it plays a critical role in the avoidance of False Matches in the publicly deployed algorithms for iris recognition.

Introduction

Biometric recognition of a person's identity requires converting the observed degree of similarity between presenting and previously enrolled features into a decision of "same" or "different." The previously enrolled features may not be merely a single feature set obtained from a single asserted identity, but may be a vast number of such feature sets belonging to an entire national population, when identification is performed by exhaustively searching a database for a sufficiently good match. The ▶ similarity metrics used for each comparison between samples might be simple correlation statistics, or vector projections, or listings of the features (like fingerprint minutiae coordinates and directions) that agreed and of those that disagreed as percentages of the total set of features extracted. For each pair of feature sets being compared, varying amounts of data may be available, and the sets might need to be compared under various transformations such as image rotations when the orientation is uncertain. An example is seen


in Figure 1, in which only 56% of the annular iris area is visible between the eyelids. Iris images may also have been acquired with a tilted camera (not unusual for handheld cameras), or with the head tilted or the eye rotated (cyclovergence) by an unknown degree, requiring comparisons to be made over a range of configurations for each of the possible identities, and with varying amounts of template data being available in each case. This article is concerned with the methods of ▶ score normalization that are used in iris recognition to make all of those comparison cases ▶ commensurable with each other, preventing the False Match probability from rising simply because there is less data available for comparison or because there are many more candidates and match configurations to be considered.

Score Normalization Rules in Iris Recognition. Figure 1 Illustration of limited data being available in an iris image due to eyelid occlusion, as detected in a segmentation process.

Score Normalisation by the Amount of Iris Visible

The algorithms used in all current public deployments of iris recognition [2] work by a test of statistical independence: a match is declared when two templates fail the test of statistical independence; comparisons between different eyes are statistically guaranteed to pass that test [1]. The test of independence is based on measuring the fraction of bits that disagree between two templates, called ▶ IrisCodes, and so the similarity metric is a ▶ Hamming Distance between 0 and 1. (The method by which an IrisCode is created is described in this encyclopedia in the entry on Iris Encoding and Recognition using Gabor Wavelets.)
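Such a masked fractional Hamming Distance can be sketched in a few lines (a minimal illustration using NumPy boolean arrays; the array representation is an assumption, not the deployed implementation):

```python
import numpy as np

def hd_raw(codeA, codeB, maskA, maskB):
    """Fraction of usable bits that disagree between two IrisCodes.

    XOR detects disagreeing bits; ANDing with both masks discounts bits
    that are occluded or unreliable in either template.
    """
    valid = maskA & maskB                # bits usable in both templates
    n = int(valid.sum())
    if n == 0:
        return None                      # no bits in common to compare
    disagree = (codeA ^ codeB) & valid   # disagreements among usable bits
    return float(disagree.sum()) / n
```

A score near 0.5 indicates statistical independence (different eyes); a score near 0 indicates the same eye.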

If two IrisCodes were derived from different eyes, about half of their bits should agree and half should disagree (since any given bit is equally likely to be 1 or 0), and so a Hamming Distance close to 0.5 is expected. If both IrisCodes were computed from the same eye, then a much larger proportion of the bits should agree since they are not independent, and so a Hamming Distance much closer to 0 is expected. But what is the effect of having varying numbers of bits available for comparison, for example, because of eyelid occlusion? Eyelid boundaries are detected (as illustrated by the spline curve graphics in Figure 1 where each lid intersects the iris), and the parts of the IrisCode that are then unavailable are marked as such by setting masking bits. The box in the lower-left corner of Figure 1 shows Active Contours computed to describe the pupil boundary (lower "snake") and the iris outer boundary (upper snake). As these snakes are curvature maps, a circular boundary would be described by a snake that was flat and straight. The two thick grey regions in the box containing the upper snake represent the limited regions where the iris outer boundary is visible and possesses a large radial gradient (or derivative) in brightness. The gaps that separate the two thick grey regions correspond to parts of the trajectory around the iris where no such boundary is visible, because it is occluded by eyelids. Thus the outer boundary of the iris must be estimated (dotted curve) from two quite limited areas on the left and right sides of the iris where it is visible. In the coordinate system that results, the iris regions obscured by eyelids are marked as such by masking bits. The logic for comparing two IrisCodes to generate a raw Hamming Distance HD_raw is given in Equation (1), where the data parts of the two IrisCodes are denoted {codeA, codeB} and the vectors of corresponding masking bits are denoted {maskA, maskB}:

HD_raw = ‖(codeA ⊗ codeB) ∩ maskA ∩ maskB‖ / ‖maskA ∩ maskB‖   (1)

The symbol ⊗ signifies the logical Exclusive-OR (XOR) operator, which detects disagreement between corresponding pairs of bits; ∩ signifies the logical AND, whereby the masks discount data bits where occlusions occurred; and the norms ‖·‖ count the number of bits that are set in the result.

Bits may be masked for several reasons other than eyelid or eyelash occlusion. They are also deemed

unreliable if specular reflections are detected in the part of the iris they encode, or if the signal-to-noise ratio there is poor, for example, if the local texture energy is so low that the computed wavelet coefficients fall into the lowest quartile of their distribution, or on the basis of low entropy (information density).

The number of bit pairings available for comparison between two IrisCodes, ‖maskA ∩ maskB‖, is usually almost a thousand. But if one of the irises has (say) almost complete occlusion of its upper half by a drooping upper eyelid, and if the other iris being compared with it has almost complete occlusion of its lower half, then the common area available for comparison may be almost nil. How can the test of statistical independence remain a valid and powerful basis for recognition when very few bits are actually being compared? It may well be that a less exact match on a larger quantity of data is better evidence of a match than is a perfect match on less data. An excellent analogy is a test of whether or not a coin is "fair" (i.e., gives unbiased outcomes when tossed): getting a result of 100% "heads" in a few tosses (e.g., 10 tosses) is actually much more consistent with it being a fair coin than getting a result of 60%/40% after 1,000 tosses. (The latter result is 6.3 standard deviations away from expectation, whereas the former result is only 3.2 standard deviations away from expectation; so the 60/40 result is actually much stronger evidence against the hypothesis of a fair coin than is the result of "all heads in 10 tosses.") Similarly, in biometric comparisons, getting perfect agreement between two samples that extracted only ten features may be much weaker evidence of a good match than a finding of 60% agreement among a much larger number of extracted features.

This is illustrated in Table 1 for an actual database of 632,500 IrisCodes computed from different eyes in a border-crossing application in the Middle East [3]. A database of this size allows 200 billion different pair comparisons to be made, yielding a distribution of 200 billion HD_raw similarity scores between different eyes. These HD_raw scores were broken down into seven categories by the number of bits mutually available for comparison (i.e., unmasked) between each pair of IrisCodes; those bins constitute the columns of Table 1, ranging from 400 bits to 1,000 bits being compared. The rows in Table 1 each correspond to a particular decision threshold being applied; for example, the first row is the case that a match is declared if HD_raw is 0.260 or smaller. The cells in the Table give the observed False Match Rate in this database for each decision rule and for each range of numbers of bits being compared when computing HD_raw.

Score Normalization Rules in Iris Recognition. Table 1 False match rate without score normalisation: dependence on number of bits compared and criterion HD_crit

HD_crit   400 bits   500 bits   600 bits   700 bits   800 bits   900 bits   1,000 bits
0.260     2×10⁻⁹     5×10⁻¹⁰    3×10⁻¹⁰    1×10⁻¹⁰    0          0          0
0.265     3×10⁻⁹     8×10⁻¹⁰    5×10⁻¹⁰    2×10⁻¹⁰    4×10⁻¹¹    0          0
0.270     4×10⁻⁹     1×10⁻⁹     9×10⁻¹⁰    5×10⁻¹⁰    2×10⁻¹⁰    0          0
0.275     7×10⁻⁹     2×10⁻⁹     1×10⁻⁹     9×10⁻¹⁰    5×10⁻¹⁰    3×10⁻¹¹    0
0.280     1×10⁻⁸     4×10⁻⁹     2×10⁻⁹     2×10⁻⁹     1×10⁻⁹     2×10⁻¹⁰    0
0.285     2×10⁻⁸     7×10⁻⁹     4×10⁻⁹     3×10⁻⁹     2×10⁻⁹     5×10⁻¹⁰    2×10⁻¹¹
0.290     3×10⁻⁸     1×10⁻⁸     8×10⁻⁹     7×10⁻⁹     4×10⁻⁹     1×10⁻⁹     1×10⁻¹⁰
0.295     4×10⁻⁸     2×10⁻⁸     1×10⁻⁸     1×10⁻⁸     9×10⁻⁹     3×10⁻⁹     4×10⁻¹⁰
0.300     6×10⁻⁸     3×10⁻⁸     3×10⁻⁸     2×10⁻⁸     2×10⁻⁸     7×10⁻⁹     9×10⁻¹⁰
0.305     9×10⁻⁸     6×10⁻⁸     5×10⁻⁸     4×10⁻⁸     4×10⁻⁸     1×10⁻⁸     2×10⁻⁹
0.310     1×10⁻⁷     1×10⁻⁷     8×10⁻⁸     8×10⁻⁸     7×10⁻⁸     3×10⁻⁸     5×10⁻⁹
0.315     2×10⁻⁷     2×10⁻⁷     1×10⁻⁷     2×10⁻⁷     1×10⁻⁷     6×10⁻⁸     1×10⁻⁸
0.320     3×10⁻⁷     3×10⁻⁷     2×10⁻⁷     3×10⁻⁷     3×10⁻⁷     1×10⁻⁷     2×10⁻⁸

Using the findings in Table 1, it is informative to compare performance for two decision criteria: a very conservative criterion of HD_raw = 0.260 (the first row), and a more liberal criterion HD_raw = 0.285 (the sixth row), which allows more bits to disagree (28.5%) while still declaring a match. Now if the False Match Rates are compared in the first and last columns of these rows, namely when only about 400 bits are available for comparison and when about 1,000 bits are compared, it can be seen that, in fact, the more conservative criterion (0.260) actually produces 100 times more False Matches using 400 bits than does the more liberal (0.285) criterion when using 1,000 bits. Moreover, the row corresponding to the HD_raw = 0.285 decision criterion reveals that the False Match Rate is 1,000 times greater when only 400 bits are available for comparison than when 1,000 bits are compared. The numerical data of Table 1 is plotted in Figure 2 as a surface, showing how the logarithm of the False Match Rate decays as a function of both variables. The surface plot reveals that there is a much more rapid attenuation of False Match Rate with increase in the number of bits available for comparison (lower-left axis) than by reduction of the HD_raw decision criterion in the range 0.260–0.320 (lower-right axis). This is to be expected, given that iris recognition works by a test of statistical independence. The observations of

Table 1 and Figure 2 clearly demonstrate the need for similarity scores to be normalized by the number of bits compared when calculating them. A natural choice for the score normalization rule is to rescale all deviations from HD_raw = 0.5 in proportion to the square-root of the number of bits that were compared when obtaining that score. The reason for such a rule is that the expected standard deviation in the distribution of coin-tossing outcomes (expressed as the fraction of the n tosses having a given outcome) is σ = √(pq/n), where p and q are the respective outcome probabilities (both nominally 0.5 in this case). Thus, decision confidence levels can be maintained irrespective of how many bits n were actually compared, by mapping each raw Hamming Distance HD_raw into a normalized score HD_norm using a re-scaling rule such as:

HD_norm = 0.5 − (0.5 − HD_raw) √(n / 911)   (2)

This normalization should transform all samples of scores obtained when comparing different eyes into samples drawn from the same ▶ binomial distribution, whereas the raw scores HD_raw might be samples from many different binomial distributions having standard deviations σ dependent on the number of bits n that were actually available for comparison. This normalization maintains constant confidence levels for decisions using a given Hamming Distance threshold, regardless of the value of n. The scaling parameter 911 is the typical number of bits compared (unmasked) between two different irises. The effect of using this normalization rule ("SQRT") is shown in Figure 3 for the 200 billion comparisons between different irises, plotting the observed False Match Rate as a function of the new HD_norm normalized decision criterion. Also shown for comparison is the unnormalized case (upper curve), and a "hybrid" normalization rule which is a linear combination of the other two, taking into account the number of bits compared only when in a certain range [4]. The benefit of score normalization is profound: it is noteworthy that in this semilogarithmic plot, the ordinate spans a factor of 300,000 to 1.

Score Normalization Rules in Iris Recognition. Figure 2 The data of Table 1 plotted as a surface in semilogarithmic coordinates, showing a range factor of 10,000-to-1 in the False Match Rate as the number of bits compared ranges from 400 to 1,000. This bit count is more influential than is the HD_raw decision criterion for unnormalised scores in the 0.260–0.320 range.
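The SQRT rule of Equation (2) is simple to express; a sketch:

```python
import math

def hd_norm(hd_raw, n, n_typical=911):
    """Equation (2): shrink the deviation from 0.5 by sqrt(n / 911), so
    that scores based on fewer bits are pulled back toward 0.5."""
    return 0.5 - (0.5 - hd_raw) * math.sqrt(n / n_typical)
```

For example, a "perfect" HD_raw = 0 measured on only 10 bits normalizes to about 0.448, a worse (higher) score than an HD_raw of 0.40 measured on the typical 911 bits, mirroring the coin-tossing argument above.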


Score Normalization Rules in Iris Recognition. Figure 3 Comparing the effects of three score normalisation rules on False Match Rate as a function of Hamming Distance.

The price paid for achieving this profound benefit in robustness against False Matches is that the match criterion becomes more demanding when less of the iris is visible. Table 2 shows what fraction of bits HD_raw (column 3) is allowed to disagree while still accepting a match, as a function of the actual number of bits that were available for comparison (column 1) or the approximate percentage of the iris that is visible (column 2). In every case shown in this Table, the probability of making a False Match is about 1 in a million; but it is clear that when only a very small part of two irises can be compared with each other, the degree of match required by the decision rule becomes much more demanding. Conversely, if more than 911 bits (the typical case, corresponding to about 79% of the iris being visible) are available for comparison, then the decision rule becomes more lenient in terms of the acceptable HD_raw while still maintaining the same net confidence level.

Score Normalization Rules in Iris Recognition. Table 2 Effect of score normalisation on the match quality required with various amounts of iris visibility

Number of bits compared   Approximate percent of iris visible (%)   Maximum acceptable fraction of bits disagreeing
200        17      0.13
300        26      0.19
400        35      0.23
500        43      0.26
600        52      0.28
700        61      0.30
800        69      0.31
911        79      0.32
1,000      87      0.33
1,152     100      0.34

Finally, another cost of using this score normalization rule is apparent if one operates in a region of the ROC curve corresponding to a very nondemanding False Match Rate, such as 0.001, which was the basis for NIST ICE (Iris Challenge Evaluation 2006) reporting. The ICE iris database contained many very difficult and corrupted images, often in poor focus, with much eyelid occlusion, motion blur, raster shear, and sometimes with the iris partly outside the image frame. As ROC curves require False Matches, NIST used a much more liberal decision criterion than is used in any actual deployments of iris recognition. As seen in Figure 4, using liberal thresholds that generate False Match Rates (FMR) in the range of 0.001–0.00001, score normalization adversely impacts the ROC curve by increasing the False nonMatch Rate (FnMR). The Equal Error Rate (where FnMR = FMR, indicated by the solid squares) is about 0.001 without score normalization, but 0.002 with it. Similarly, at other nominal points of interest in this region of the ROC curve, as tabulated within Figure 4, the cost of score normalization is roughly a doubling of the FnMR, because marginal valid matches are rejected due to the penalty on fewer bits having been available for comparison. In conclusion, whereas Table 1 and Figures 2 and 3 document the important benefit of score normalization when operating with very large databases that require several orders of magnitude higher confidence against False Matches, Figure 4 shows that in scenarios which are much less demanding of FMR, the FnMR is noticeably penalized by score normalization, and so the ROC curve suffers.
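The third column of Table 2 follows the trend obtained by inverting Equation (2) at a fixed normalized threshold; a sketch (the 0.32 threshold is an illustrative assumption chosen to match the typical-case row, since the published column was derived from a fixed roughly 1-in-a-million False Match confidence, and so the values differ slightly at the extremes):

```python
import math

def max_acceptable_hd_raw(n, hd_norm_threshold=0.32, n_typical=911):
    """Invert Equation (2): the largest raw Hamming Distance that still
    normalizes to at most the fixed threshold, given n bits compared."""
    return 0.5 - (0.5 - hd_norm_threshold) * math.sqrt(n_typical / n)
```

At n = 911 this returns exactly 0.32, and the acceptable HD_raw tightens sharply as n falls, reproducing the behavior described above.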

Adapting Decision Thresholds to the Size of a Search Database

Using the SQRT normalization rule, Figure 5 presents a histogram of all 200 billion cross-comparison similarity scores HD_norm among the 632,500 different irises in the Middle Eastern database [3]. The vast majority of these IrisCodes from different eyes disagreed in roughly 50% of their bits as expected, since the bits are equiprobable and uncorrelated between different eyes [2, 1]. Very few pairings of IrisCodes could disagree in fewer than 35% or more than 65% of their bits, as is evident from the distribution. The form of this distribution needs to be understood, assuming that it is typical and predictive of any other database, in order to understand how to devise decision rules that compensate for the scale of a search. Without this form of score normalization by the scale of the search, or an adaptive decision threshold rule, False Matches would occur simply because large databases provide so many more opportunities for them. The solid curve that fits the distribution data very closely in Figure 5 is a binomial probability density function. This theoretical form was chosen because comparisons between bits from different IrisCodes are Bernoulli trials, or conceptually "coin tosses," and Bernoulli trials generate binomial distributions. If one tossed a coin whose probability of "heads" is p in a series of n independent tosses and counted the number m of "heads" outcomes, and if one tallied this fraction x = m/n in many such repeated runs of n tosses, then the expected distribution of x would be as per the solid curve in Figure 5:

f(x) = n! / (m! (n − m)!) · pᵐ (1 − p)⁽ⁿ⁻ᵐ⁾   (3)
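Equation (3) can be evaluated directly; a sketch using Python's exact integer binomial coefficient:

```python
from math import comb

def binom_pmf(m, n, p=0.5):
    """Equation (3): probability that exactly m of n Bernoulli bit
    comparisons agree, each with agreement probability p."""
    return comb(n, m) * p**m * (1 - p)**(n - m)
```

The rapidly attenuating tails are visible even at modest n: for n = 100 fair tosses, the probability mass at m = 50 exceeds that at m = 10 by many orders of magnitude.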

Score Normalization Rules in Iris Recognition. Figure 4 Adverse impact of score normalisation in ROC regions where high False Match Rates are tolerated (e.g., 0.00001 to 0.001 FMR). In these regions, the False nonMatch Rate is roughly doubled as a result of score normalization.

The analogy between tossing coins and comparing bits between different IrisCodes is deep but imperfect, because any given IrisCode has internal correlations arising from iris features, especially in the radial direction [2]. Further correlations are introduced because the patterns are encoded using 2D Gabor wavelet filters, whose lowpass aspect introduces correlations in amplitude and whose bandpass aspect introduces correlations in phase, both of which linger to an extent that is inversely proportional to the filter bandwidth. The effect of these correlations is to reduce the value of the distribution parameter n to a number significantly smaller than the number of bits that are actually compared between two IrisCodes; n becomes the number of effectively independent bit comparisons. The value of p is very close to 0.5 (empirically 0.499 for this database), because the states of each bit are equiprobable a priori, and so any pair of bits from different IrisCodes is equally likely to agree or disagree.

The binomial functional form that describes so well the distribution of normalized similarity scores for comparisons between different iris patterns is key to the robustness of these algorithms in large-scale search applications. The tails of the binomial attenuate extremely rapidly, because of the dominating central tendency caused by the factorial terms in (3). Rapidly attenuating tails are critical for a biometric system to survive the vast numbers of opportunities to make False Matches without actually making any, when applied in an "all-against-all" mode of searching for any matching or multiple identities, as is contemplated in some national ID projects. The requirements of biometric operation in "identification" mode, by exhaustively searching a large database, are vastly more demanding than operating merely in one-to-one "verification" mode (in which an identity must first be explicitly asserted, which is then verified in a yes/no decision by comparison against just the single nominated template).

If P₁ is the False Match probability for single one-to-one verification trials, then (1 − P₁) is the probability of not making a False Match in a single comparison. The likelihood of successfully avoiding a False Match in each of N independent attempts is therefore (1 − P₁)ᴺ, and so P_N, the probability of making at least one False Match when searching a database containing N different patterns, is:

P_N = 1 − (1 − P₁)ᴺ   (4)
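Equation (4) can be sketched and checked numerically:

```python
def p_false_match_in_search(p1, n_db):
    """Equation (4): probability of at least one False Match when each of
    n_db independent comparisons has False Match probability p1."""
    return 1.0 - (1.0 - p1) ** n_db
```

For small P₁ this is close to N·P₁; e.g., P₁ = 10⁻⁹ over a million comparisons gives roughly a 1-in-a-thousand chance of at least one False Match, which is why identification mode demands far smaller per-comparison error rates than verification mode.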


Score Normalization Rules in Iris Recognition. Figure 5 Binomial distribution of normalised similarity scores in 200 billion comparisons between different eyes. Solid curve is (3).

Observing the approximation that P_N ≈ N P₁ for small P₁

T ððdx=dtÞðdy=dtÞ>0Þ T ððdx=dtÞðdy=dtÞ > > > > > ðkf ½m1Þ > < ð f ½mf ½m1Þ f ½m 1 k f ½m ð6Þ H m ½k ¼ > ð f ½mþ1k Þ > f ½ m k f ½ m > > ð f ½mþ1f ½mÞ > > > > : 0 k > f ½m þ 1; where 0 1 pðxjCk Þ ¼ exp ðx mk Þ S ðx mk Þ : 2 ð2pÞD=2 jSj1=2 1

1

ð3Þ From (1), we have

where

ak ðxÞ ¼ w > k x þ w k0 ;

ð4Þ

w k ¼ S1 mk :

ð5Þ

1 w k0 ¼ m> S1 mk þ ln pðCk Þ: 2 k

ð6Þ

We see that the equal covariance matrices make a_k(x) linear in x, and the resulting decision boundaries will also be linear. As a special case of LDA, the nearest-neighbor classifier can be obtained when Σ = σ²I. If the prior probabilities p(C_k) are equal, we assign a feature vector x to the class C_k with the minimum Euclidean distance ‖x − μ_k‖², which is equivalent to the optimum decision rule based on the maximum posterior probability. Another extension of LDA can be obtained by allowing mixtures of Gaussians for the class-conditional densities instead of a single Gaussian. Mixture discriminant analysis (MDA) [6] incorporates the Gaussian mixture distribution for the class-conditional densities to provide a richer class of density models than the single Gaussian. The class-conditional density for class C_k has the form of the Gaussian mixture model, p(x|C_k) = Σ_{r=1}^{R} π_{kr} N(x | μ_{kr}, Σ), where the mixing coefficients π_{kr} must satisfy π_{kr} ≥ 0 together with Σ_{r=1}^{R} π_{kr} = 1. In this model, the same covariance matrix Σ is used within and between classes. The Gaussian mixture model allows for more complex decision boundaries, although it does not guarantee the global optimum of maximum likelihood estimates.
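Equations (4)–(6) can be sketched directly (a minimal illustration; the two-class data below are invented for the example):

```python
import numpy as np

def lda_score(x, mu_k, sigma_inv, prior_k):
    """a_k(x) = w_k^T x + w_k0, with w_k = Sigma^{-1} mu_k and
    w_k0 = -0.5 mu_k^T Sigma^{-1} mu_k + ln p(C_k)."""
    w_k = sigma_inv @ mu_k
    w_k0 = -0.5 * mu_k @ sigma_inv @ mu_k + np.log(prior_k)
    return w_k @ x + w_k0
```

With Σ = σ²I and equal priors, choosing the class with the largest a_k(x) reduces to the minimum-Euclidean-distance rule mentioned above.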

In parametric approaches to classification, we directly model the class-conditional density with a parametric form of probability distribution (e.g., the multivariate Gaussian). Many parametric methods for classification have been proposed based on different assumptions for p(x|C_k) [3, 4, 5] (see Table 1).

Quadratic Discriminant Analysis

If the covariance matrices Σ_k are not assumed to be equal, then we get quadratic functions of x for a_k(x):

a_k(x) = −(1/2)(x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) − (1/2) ln |Σ_k| + ln p(C_k).   (7)


In contrast to LDA, the decision boundaries of QDA are quadratic, which results from the assumption of different covariance matrices. Owing to the added flexibility of quadratic decision boundaries, QDA often outperforms LDA when the size of the training set is very large. However, when the size of the training set is small compared to the dimension D of the feature space, the large number of parameters of QDA relative to LDA causes over-fitting or ill-posed estimation of the covariance matrices. To solve this problem, various regularization or Bayesian techniques have been proposed to obtain more robust estimates:

1. Regularized discriminant analysis (RDA) [7, 8] employs a regularized form of the covariance matrices, shrinking Σ_k of QDA towards the common covariance matrix Σ of LDA, that is, Σ_k(α) = αΣ_k + (1 − α)Σ for α ∈ [0, 1]. Additionally, the common covariance matrix Σ can be shrunk towards a scalar covariance, Σ(γ) = γΣ + (1 − γ)σ²I for γ ∈ [0, 1]. The pair of parameters is selected by cross-validation based on the classification accuracy on the training set.

2. The leave-one-out covariance estimator (LOOC) [9] finds optimal regularized covariance matrices by mixing four different covariance matrices, Σ_k, diag(Σ_k), Σ, and diag(Σ), where the mixing coefficients are determined by maximizing the average leave-one-out log-likelihood of each class.

3. Bayesian QDA introduces prior distributions over the means μ_k and the covariance matrices Σ_k [10], or over the Gaussian distributions themselves [11]. The expectations of the class-conditional densities are calculated analytically in terms of the parameters. The hyper-parameters of the prior distributions are chosen by cross-validation.
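The RDA shrinkage in item 1 can be sketched as follows (the scalar variance σ² is taken here as the average diagonal scale trace(Σ)/D, which is one common choice and an assumption of this sketch):

```python
import numpy as np

def rda_covariance(sigma_k, sigma, alpha, gamma):
    """Sigma_k(alpha) = alpha*Sigma_k + (1 - alpha)*Sigma(gamma), where
    Sigma(gamma) = gamma*Sigma + (1 - gamma)*sigma^2*I shrinks the
    pooled covariance toward a scalar covariance."""
    d = sigma.shape[0]
    s2 = np.trace(sigma) / d                          # scalar variance
    sigma_g = gamma * sigma + (1.0 - gamma) * s2 * np.eye(d)
    return alpha * sigma_k + (1.0 - alpha) * sigma_g
```

Setting α = 1 recovers QDA's per-class covariance, α = 0 with γ = 1 recovers LDA's pooled covariance, and γ = 0 pushes toward the nearest-mean special case.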

Naive Bayes Classifier

In the naive Bayes classifier, the conditional independence assumption yields factorized class-conditional densities of the form

$p(x \mid C_k) = \prod_{i=1}^{D} p(x_i \mid C_k).$   (8)

The component densities $p(x_i \mid C_k)$ can be modeled with various parametric and nonparametric distributions, including the following:

1. For continuous features, the component densities are chosen to be Gaussian. In this case, the naive Bayes classifier is equivalent to QDA with diagonal covariance matrices for each class.

2. For discrete features, multinomial distributions are used to model the component densities. The multinomial assumption makes $a_k(x)$ and the resulting decision boundaries linear in $x$.

3. For nonparametric approaches, the component densities can be estimated using one-dimensional kernel density or histogram estimates.

The naive Bayes model assumption is useful when the dimensionality D of the feature space is very high, making direct density estimation in the full feature space unreliable. It is also attractive when the feature vector consists of heterogeneous features, including both continuous and discrete ones.

Nonparametric Approaches

One major problem with parametric approaches is that the actual class-conditional density is neither linear nor quadratic in form for many real-world datasets. This causes poor classification performance, since the actual distribution of the data differs from the specified functional form, regardless of the parameters. To solve this problem, one can increase the flexibility of the density model by adding more and more parameters, leading to a model with an effectively infinite number of parameters; this is called nonparametric density estimation. Alternatively, rather than modeling the whole distribution of a class, one can model only the decision boundary that separates one class from the others, since restricting the functional form of the boundary is a weaker assumption than restricting that of the whole distribution of the data. Both options, using a nonparametric density model or modeling a decision boundary directly, are called nonparametric approaches. In this article, only the latter approach is considered. We define a function $a_k(x)$ as a relevancy score of $x$ for $C_k$, such that $a_k(x) > 0$ if $x$ is more likely to be assigned to $C_k$, and $a_k(x) < 0$ otherwise. Then, the surface $a_k(x) = 0$ represents the decision boundary


Supervised Learning. Table 2 Comparison among nonparametric methods for classification

Method | $a_k(x)$ | Number of parameters | Decision boundary
k-NN   | $|\{x^{(i)} \in C_k\}|$ | $k$ | Nonlinear
ANNs   | $f_k^{(L+1)}(x)$ | $\sum_{\ell=0}^{L} (W_\ell + 1) W_{\ell+1}$ | Linear ($L = 0$) or nonlinear ($L > 0$)
SVMs   | $\sum_{\alpha_{ki} > 0} \alpha_{ki} y_{ki}\, k(x_i, x)$ | $O(KN)$ | Linear ($k(x_i, x) = x_i^\top x$) or nonlinear (otherwise)

between $C_k$ and the other classes, and a test point $x$ is assigned to $C_{\hat{k}}$ with $\hat{k} = \arg\max_k a_k(x)$, which is called one-against-all. Many nonparametric methods have been derived from various models for $a_k(x)$. We introduce three representative methods [12, 13, 14] (see Table 2):

1. The k-nearest neighbor algorithm (k-NN) chooses the k data points in the training set that are closest to $x$; $a_k(x)$ is then the number of those selected points belonging to $C_k$.

2. Artificial neural networks (ANNs) represent $a_k(x)$ as a multilayered feed-forward network. The $\ell$th layer consists of $W_\ell$ nodes, where the jth node in the layer sends a (non)linear function value $f_j^{(\ell)}(x)$ as a signal to the nodes in the $(\ell+1)$th layer. Then, $a_k(x)$ is the signal of the kth node in the final layer, $f_k^{(L+1)}(x)$.

3. Support vector machines (SVMs) choose some "important" training points, called support vectors, and represent $a_k(x)$ as a linear combination of them. SVMs are among the best-performing supervised learning methods on many real-world datasets.
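Item 1 above, the k-NN rule, is simple enough to sketch directly. The following brute-force version (our own illustrative code, not from the cited references) computes all distances and takes a majority vote:

```python
# Brute-force k-NN: majority vote among the k training points nearest to x.
# This is the O(ND) baseline; approximate methods speed it up in practice.
import math
from collections import Counter

def knn_predict(X, y, x, k=3):
    """Return the majority label among the k training points nearest to x."""
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))
    votes = Counter(y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(X, y, [0.1, 0.1], k=3))  # -> 'a'
print(knn_predict(X, y, [5.0, 5.1], k=3))  # -> 'b'
```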

k-NN is widely used in biometrics, especially for computer vision applications such as face recognition and pose estimation, where both the number of images N and the dimension of the data D are quite large. However, traditional k-NN takes O(ND) time to compute the distances between a test point x and all training points $x_1, \ldots, x_N$, which is too inefficient for practical use. Thus, extensive research has focused on fast approximations based on hashing and embedding [15].

k-Nearest Neighbor Algorithm

Given a set of data points $X = \{x_1, x_2, \ldots, x_N\}$ and a set of corresponding labels $Y = \{y_1, y_2, \ldots, y_N\}$, k-NN assigns a label to a test data point $x$ by majority voting, that is, by choosing the most frequently occurring label in $\{y_{(1)}, y_{(2)}, \ldots, y_{(k)}\}$, where $x_{(i)}$ denotes the ith nearest point to $x$ in $X$ and $y_{(i)}$ is the label of $x_{(i)}$. That is, we have

$a_k(x) = |\{x^{(i)} \in C_k\}|,$   (9)

where $|\cdot|$ denotes the number of elements in a set. The decision boundary is not restricted to a specific functional form; it depends only on the local distribution of neighbors and on the choice of k. A larger k makes the decision boundary smoother.

Artificial Neural Networks

In ANNs, the signal of the jth node in the $(\ell+1)$th layer is determined by the signals from the $\ell$th layer:

$f_j^{(\ell+1)}(x) = g\big( w_j^{(\ell)\top} f^{(\ell)}(x) + w_{j0}^{(\ell)} \big),$   (10)

where $w_j^{(\ell)} = [w_{j1}^{(\ell)}, w_{j2}^{(\ell)}, \ldots, w_{jW_\ell}^{(\ell)}]^\top$ and $f^{(\ell)}(x) = [f_1^{(\ell)}(x), f_2^{(\ell)}(x), \ldots, f_{W_\ell}^{(\ell)}(x)]^\top$. The input layer, $f^{(0)}(x)$, is simply $x$. $g(\cdot)$ is a nonlinear, nondecreasing mapping, causing ANNs to yield a nonlinear decision boundary. Two popular mappings are (1) the sigmoid, $g(x) = 1/(1 + \exp\{-x\})$, and (2) the hyperbolic tangent, $g(x) = \tanh(x)$. More nodes and layers increase the nonlinearity of the decision boundary obtained by ANNs. However, it is difficult to train ANNs with many nodes and layers, since the model can easily fall into poor solutions, called local minima. Radial basis function (RBF) networks [16] are another type of ANN, having the form

$a_k(x) = w_k^\top \Phi(x) + w_{k0}.$   (11)

That is, RBF networks contain only one hidden layer, denoted by $\Phi(x) = [f_1(x), f_2(x), \ldots, f_W(x)]$, and the network output is simply a linear combination of the hidden nodes. The main difference between RBF networks and ANNs with $L = 1$ is the mapping from the


input to the hidden layer. In RBF networks, each $f_j(\cdot)$ is a nonlinear function similar to a Gaussian density:

$f_j(x) = \exp\{ -\beta_j \|x - c_j\|^2 \},$   (12)

for some $\beta_j > 0$ and center vector $c_j$. That is, each hidden node represents a local region centered at $c_j$, and its signal is stronger when $x$ and $c_j$ are closer. In general, $c_j$ is fixed to one of the training points and $\beta_j$ is chosen by hand; the global optimum of $w_k$ and $w_{k0}$ can then be found simply by least-squares fitting.
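The feed-forward and RBF computations above can be sketched as follows; the weights, centers, and widths below are made-up illustrative values, not trained parameters:

```python
# Forward pass of a small feed-forward network per Eq. (10), plus an RBF
# hidden layer per Eq. (12). Illustrative sketch with hand-picked weights.
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def layer(f_prev, W, b, g=sigmoid):
    """One layer: f_j = g(w_j . f_prev + b_j), as in Eq. (10)."""
    return [g(sum(w * v for w, v in zip(w_j, f_prev)) + b_j)
            for w_j, b_j in zip(W, b)]

def rbf_features(x, centers, betas):
    """RBF hidden layer, Eq. (12): f_j(x) = exp(-beta_j * ||x - c_j||^2)."""
    return [math.exp(-b * sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
            for c, b in zip(centers, betas)]

x = [0.5, -1.0]
hidden = layer(x, W=[[1.0, -1.0], [0.5, 0.5]], b=[0.0, 0.1])  # layer 1
output = layer(hidden, W=[[1.0, 1.0]], b=[-0.5])              # layer 2
phi = rbf_features(x, centers=[[0.0, 0.0], [1.0, -1.0]], betas=[1.0, 1.0])
print(len(hidden), len(output), len(phi))  # 2 1 2
```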

Support Vector Machines

Similar to RBF networks, SVMs obtain a linear decision boundary in a transformed space: $a_k(x) = w_k^\top \Phi(x) + w_{k0}$, where $\Phi(\cdot)$ is an arbitrary mapping, either linear or nonlinear. The difference between SVMs and ANNs is the optimality of the decision boundary. In SVMs, the optimal decision boundary is the one that maximizes the distance between the boundary and the closest point to it, called the margin:

$\max_{w_k, w_{k0}} \; \min_i \; \frac{|a_k(x_i)|}{\|w_k\|}.$   (13)

This optimization problem always converges to the global solution, the maximum-margin boundary. Figure 1 shows the motivation for SVMs intuitively. One can expect that the generalization error of the maximum-margin boundary is less than that of other boundaries. Theoretically, the generalization power of SVMs is guaranteed by Vapnik–Chervonenkis theory [17]. Training SVMs can be rewritten as the following convex optimization problem:

$\min_{w_k, w_{k0}} \|w_k\| \quad \text{subject to} \quad y_{ki}\, a_k(x_i) \ge 1 \text{ for all } i,$   (14)

where $y_{ki} = 1$ if $x_i \in C_k$ and $-1$ otherwise. At the optimum, $a_k(x)$ has the form

$a_k(x) = \sum_{i=1}^{n} \alpha_{ki} y_{ki} \Phi(x_i)^\top \Phi(x),$   (15)

where $\alpha_{ki} \ge 0$ is the Lagrange multiplier of the ith constraint, $y_{ki}\, a_k(x_i) \ge 1$. If a data point $x_i$ lies exactly on the margin, i.e., $y_{ki}\, a_k(x_i) = 1$, then $x_i$ is called a support vector and $\alpha_{ki} > 0$; otherwise, $\alpha_{ki} = 0$ and $y_{ki}\, a_k(x_i) > 1$. Hence, $a_k(x)$ depends only on the support vectors. To compute $\Phi(x_i)^\top \Phi(x)$, a function of the form $k(x_i, x)$ representing the inner product in the feature space can be used, without computing the mapping $\Phi(\cdot)$ explicitly. Such a function is called a kernel function [18]. Two popular kernel functions are (1) the polynomial kernel, $k(x_i, x) = (x_i^\top x + c)^p$ for some $c$ and $p > 0$, and (2) the Gaussian kernel (also called the RBF kernel), $k(x_i, x) = \exp\{-\frac{1}{2\sigma^2}\|x_i - x\|^2\}$ for some $\sigma > 0$. Various algorithms and implementations have been developed to train SVMs efficiently. The two most popular software packages are LIBSVM [19] and SVMlight [20], both

Supervised Learning. Figure 1 (Left) Possible solutions obtained by neural networks. (Right) SVMs give one global solution, the maximum margin boundary.


implement several techniques such as working set selection, shrinking heuristics, and LRU caching to speed up optimization, and provide various kernel functions together with automatic selection of appropriate parameters for those functions (automatic model selection). Two recent extensions of SVMlight, SVMstruct for structured data and SVMperf for training with more than hundreds of thousands of data points, are also popular in biometrics.
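Once training has produced the support vectors and multipliers, evaluating the classifier reduces to Eq. (15). The sketch below assumes a hypothetical solver output: the support vectors, multipliers, and labels are hand-picked for illustration, not the result of an actual QP solve.

```python
# Decision function of a trained SVM per Eq. (15): a linear combination of
# kernel evaluations at the support vectors. Support vectors, multipliers,
# and labels below are made-up stand-ins for the output of a real QP solver.
import math

def gaussian_kernel(u, v, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * sigma ** 2))

def decision(x, support_vectors, alphas, labels, bias=0.0, kernel=gaussian_kernel):
    """a_k(x) = sum_i alpha_i * y_i * k(x_i, x) + b over the support vectors."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias

svs = [[0.0, 0.0], [2.0, 2.0]]   # hypothetical support vectors
alphas = [1.0, 1.0]
labels = [+1, -1]
print(decision([0.1, 0.1], svs, alphas, labels) > 0)  # True: nearer the +1 SV
print(decision([1.9, 1.9], svs, alphas, labels) > 0)  # False: nearer the -1 SV
```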

Related Entries


Related Entries

▶ Classifier Design
▶ Machine-Learning
▶ Probability Distribution

References

1. Jain, A.K., Ross, A., Prabhakar, S.: An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 14(1), 4–20 (2004)
2. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 4–37 (2000)
3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, New York (2001)
4. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
5. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2001)
6. Hastie, T., Tibshirani, R.: Discriminant analysis by Gaussian mixtures. J. R. Stat. Soc. Ser. B 58, 158–176 (1996)
7. Friedman, J.H.: Regularized discriminant analysis. J. Am. Stat. Assoc. 84, 165–175 (1989)
8. Ye, J., Wang, T.: Regularized discriminant analysis for high dimensional, low sample size data. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Philadelphia, PA (2006)
9. Hoffbeck, J.P., Landgrebe, D.A.: Covariance matrix estimation and classification with limited training data. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 763–767 (1996)
10. Geisser, S.: Predictive Inference: An Introduction. Chapman & Hall, New York (1993)
11. Srivastava, S., Gupta, M.R., Frigyik, B.A.: Bayesian quadratic discriminant analysis. J. Mach. Learn. Res. 8, 1277–1305 (2007)
12. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory IT-13, 21–27 (1967)
13. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
14. Boser, B.E., Guyon, I., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
15. Shakhnarovich, G., Darrell, T., Indyk, P.: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, Cambridge, MA (2006)
16. Moody, J., Darken, C.J.: Fast learning in networks of locally tuned processing units. Neural Comput. 1, 281–294 (1989)
17. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
18. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)
19. Chang, C.C., Lin, C.J.: LIBSVM – A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm (2000)
20. Joachims, T.: SVMlight, http://svmlight.joachims.org (2004)

Supervisor

A generic term for a method or a system that is able to output an aggregated opinion.

▶ Multiple Experts

Supervisor Opinion

The output of the supervisor, which can be a strict score (0 or 1) or a graded score (in [0, 1]), representing the belief of the supervisor in an identity claim, obtained by aggregating expert opinions.

▶ Multiple Experts

Support Vector Machine

Mathias M. Adankon, Mohamed Cheriet
University of Quebec ETS, Montreal, Canada

Synonyms SVM; Margin classifier; Maximum margin classifier; Optimal hyperplane


Definition

Support vector machines (SVMs) are linear ▶ classifiers based on the margin maximization principle. They perform ▶ structural risk minimization, which controls the complexity of the classifier with the aim of achieving excellent ▶ generalization performance. The SVM accomplishes the classification task by constructing, in a higher-dimensional space, the hyperplane that optimally separates the data into two categories.

Introduction

Considering a two-category classification problem, a linear classifier separates the space with a hyperplane into two regions, each of which is called a class. Before the creation of SVMs, the popular algorithm for determining the parameters of a linear classifier was the single-neuron perceptron. The perceptron algorithm uses an updating rule to generate a separating surface for a two-class problem. The procedure is guaranteed to converge when the ▶ training data are linearly separable; however, there exists an infinite number of hyperplanes that correctly classify these data (see Fig. 1). The idea behind the SVM is to select the hyperplane that provides the best generalization capacity. The SVM algorithm therefore attempts to find the

maximum margin between the two data categories and then determines the hyperplane that lies in the middle of that maximum margin. Thus, the points nearest the decision boundary are located at the same distance from the optimal hyperplane. In machine learning theory, it has been demonstrated that the margin maximization principle provides the SVM with good generalization capacity, because it minimizes the structural risk related to the complexity of the SVM [1].

SVM Formulation

Consider a dataset $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. SVM training attempts to find the parameters $w$ and $b$ of the linear decision function $f(x) = w \cdot x + b$ defining the optimal hyperplane. The points near the decision boundary define the margin. Considering two points $x_1, x_2$ on opposite sides of the margin with $f(x_1) = 1$ and $f(x_2) = -1$, the margin equals $[f(x_1) - f(x_2)]/\|w\| = 2/\|w\|$. Thus, maximizing the margin is equivalent to minimizing $\|w\|/2$ or $\|w\|^2/2$. To find the optimal hyperplane, the SVM solves the following optimization problem:

$\min_{w,b} \; \tfrac{1}{2} w^\top w \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 \quad \forall i = 1, \ldots, \ell.$   (1)

Support Vector Machine. Figure 1 Linear classifier: In this case, there exists an infinite number of solutions. Which is the best?


The transformation of this optimization problem into its corresponding dual gives the following quadratic problem:

$\max_{\alpha} \; \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$   (2)

$\text{s.t.} \quad \sum_{i=1}^{\ell} y_i \alpha_i = 0, \quad \alpha_i \ge 0 \quad \forall i = 1, \ldots, \ell,$

where $w^\top$ denotes the transpose of $w$. The solution of this problem gives the parameter $w = \sum_{i=1}^{\ell} y_i \alpha_i x_i$ of the optimal hyperplane. Thus, the decision function becomes $f(x) = \sum_{i=1}^{\ell} \alpha_i y_i (x_i \cdot x) + b$ in the dual space. Note that the value of the bias $b$ does not appear in the dual problem. Using the constraints of the primal problem, the bias is given by $b = -\tfrac{1}{2}[\max_{y=-1}(w \cdot x_i) + \min_{y=1}(w \cdot x_i)]$. It can be shown from the Karush-Kuhn-Tucker conditions that only the examples $x_i$ that satisfy $y_i(w \cdot x_i + b) = 1$ have non-zero $\alpha_i$. These examples are called support vectors (see Fig. 2).
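To make the margin and support-vector definitions concrete, the toy below checks the primal constraints for an assumed optimal hyperplane; $w$ and $b$ are hand-chosen for this particular data, not computed by a solver.

```python
# For a canonical separating hyperplane f(x) = w.x + b (scaled so that
# min_i y_i f(x_i) = 1), the margin equals 2/||w|| and the support vectors
# are exactly the points with y_i f(x_i) = 1. Hyperplane assumed, not solved.
import math

w, b = [1.0, 1.0], -3.0          # assumed optimal hyperplane for this toy data
X = [[1.0, 1.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [-1, -1, +1, +1]

f = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
margins = [yi * f(xi) for xi, yi in zip(X, y)]
print(all(m >= 1 for m in margins))                              # constraints hold
print(2 / math.hypot(*w))                                        # margin 2/||w||
print([i for i, m in enumerate(margins) if abs(m - 1) < 1e-9])   # support vectors
```

Here points 0 and 2 touch the margin on opposite sides, so they are the support vectors, and the geometric margin is $2/\sqrt{2} = \sqrt{2}$.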

SVM in Practice

In real-world problems, the data are not linearly separable, and so a more sophisticated SVM is used to solve


them. First, slack variables are introduced in order to relax the margin (this is called soft margin optimization). Second, the kernel trick is used to produce nonlinear boundaries [2]. The idea behind kernels is to map the training data nonlinearly into a higher-dimensional feature space via a mapping function $\phi$ and to construct a separating hyperplane which maximizes the margin (see Fig. 3). The construction of the linear decision surface in this feature space only requires the evaluation of dot products $\phi(x_i) \cdot \phi(x_j) = k(x_i, x_j)$, where the application $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called the kernel function [3, 4]. The decision function given by an SVM is

$y(x) = \mathrm{sign}[w^\top \phi(x) + b],$   (3)

where $w$ and $b$ are found by solving the following optimization problem, which expresses the maximization of the margin $2/\|w\|$ and the minimization of the training error:

$\min_{w,b,\xi} \; \tfrac{1}{2} w^\top w + C \sum_{i=1}^{\ell} \xi_i \;\; \text{(L1-SVM)} \quad \text{or} \quad \min_{w,b,\xi} \; \tfrac{1}{2} w^\top w + C \sum_{i=1}^{\ell} \xi_i^2 \;\; \text{(L2-SVM)}$   (4)

$\text{subject to:} \quad y_i[w^\top \phi(x_i) + b] \ge 1 - \xi_i \quad \forall i = 1, \ldots, \ell$   (5)


Support Vector Machine. Figure 2 SVM principle: illustration of the unique and optimal hyperplane in a two-dimensional input space based on margin maximization.


Support Vector Machine. Figure 3 Illustration of the kernel trick: The data are mapped into a higher-dimensional feature space, where a separating hyperplane is constructed using the margin maximization principle. The hyperplane is computed using the kernel function without the explicit expression of the mapping function. (a) Nonlinearly separable data in the input space. (b) Data in the higher-dimensional feature space.

$\xi_i \ge 0 \quad \forall i = 1, \ldots, \ell.$   (6)

By applying the Lagrangian differentiation theorem to the corresponding dual problem, the following decision function is obtained:

$y(x) = \mathrm{sign}\Big[\sum_{i=1}^{\ell} \alpha_i y_i\, k(x_i, x) + b\Big],$   (7)

with $\alpha$ a solution of the dual problem. The dual problem for the L1-SVM is the following quadratic optimization problem:

maximize: $W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$   (8)

subject to: $\sum_{i=1}^{\ell} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C,\; i = 1, \ldots, \ell.$   (9)

Support Vector Machine. Table 1 Common kernels used with the SVM

Gaussian (RBF):          $k(x, y) = \exp(-\|x - y\|^2/\sigma^2)$
Polynomial:              $k(x, y) = (a\, x \cdot y + b)^n$
Laplacian:               $k(x, y) = \exp(-a\|x - y\| + b)$
Multiquadratic:          $k(x, y) = (a\|x - y\| + b)^{1/2}$
Inverse multiquadratic:  $k(x, y) = (a\|x - y\| + b)^{-1/2}$
KMOD:                    $k(x, y) = a\left[\exp\!\left(\dfrac{\gamma^2}{\|x - y\|^2 + \sigma^2}\right) - 1\right]$
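Several of the kernels in Table 1 can be written directly as functions of two vectors. The sketch below is illustrative: parameter values are arbitrary defaults, and the minus signs follow the usual conventions for these kernels.

```python
# A few kernels from Table 1, as plain functions of two vectors.
# Parameter names (a, b, n, sigma, gamma) follow the table; values are
# illustrative defaults.
import math

def sq_dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def gaussian(x, y, sigma=1.0):
    return math.exp(-sq_dist(x, y) / sigma ** 2)

def polynomial(x, y, a=1.0, b=1.0, n=2):
    return (a * sum(xi * yi for xi, yi in zip(x, y)) + b) ** n

def laplacian(x, y, a=1.0, b=0.0):
    return math.exp(-a * math.sqrt(sq_dist(x, y)) + b)

def kmod(x, y, a=1.0, gamma=1.0, sigma=1.0):
    return a * (math.exp(gamma ** 2 / (sq_dist(x, y) + sigma ** 2)) - 1)

x, y = [1.0, 0.0], [0.0, 1.0]
print(round(gaussian(x, y), 4))   # exp(-2), rounded
print(polynomial(x, y))           # (0 + 1)^2 = 1.0
```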

In practice, the L1-SVM is used most of the time, and its popular implementation developed by Joachims [5] is very fast and scales to large datasets. This implementation, called SVMlight, is available at svmlight.joachims.org.

Using the L2-SVM, the dual problem becomes:

maximize: $W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j \left[ k(x_i, x_j) + \frac{1}{2C}\,\delta_{ij} \right]$   (10)

subject to: $\sum_{i=1}^{\ell} \alpha_i y_i = 0$ and $0 \le \alpha_i,\; i = 1, \ldots, \ell,$   (11)

where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise.

SVM Model Selection

To achieve good SVM performance, optimal values must be chosen for the kernel parameters and for the hyperparameter C. The latter is a regularization parameter controlling the trade-off between training error minimization and margin maximization. The kernel parameters define the kernel function used to map the data into a higher-dimensional feature space (see Table 1). Popular kernel functions include the Gaussian kernel $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2/\sigma^2)$ with parameter $\sigma$ and


Support Vector Machine. Figure 4 (a) and (b) show the impact of SVM hyperparameters on classifier generalization, while (c) illustrates the influence of the choice of kernel function.

the polynomial kernel $k(x_i, x_j) = (a\, x_i^\top x_j + b)^d$ with parameters a, b, and d. The task of selecting the hyperparameters that yield the best performance of the machine is called model selection [6, 7, 8, 9]. As an illustration, Fig. 4a shows the variation of the error rate on a validation set versus the Gaussian kernel parameter with a fixed value of C, and Fig. 4b shows the variation of the error rate on the validation set versus the hyperparameter C with a fixed value of the RBF kernel parameter. In each case, the binary problem described by the "Thyroid" data taken from the UCI benchmark is solved. Clearly, the best performance is achieved with an optimal choice of the kernel parameter and of C. With the SVM, as with other kernel classifiers, the choice of kernel corresponds to choosing a function space for learning: the kernel determines the functional form of all possible solutions. Thus, the choice of kernel is very important in the construction of a good machine. So, in order to obtain good performance from an SVM classifier, one first needs to design or choose a type of kernel, and then optimize the SVM's hyperparameters to improve the generalization capacity of the classifier. Figure 4c illustrates the influence of the kernel choice, where the RBF and polynomial kernels are compared on datasets taken from the challenge website on model selection and prediction organized by Isabelle Guyon.
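Hold-out model selection over a $(\sigma, C)$ grid can be sketched as below. Since a full SVM trainer is beyond this snippet, `train_and_score` (our own name) uses a simple kernel nearest-centroid rule as a stand-in classifier; with a real SVM it would train with each $(C, \sigma)$ pair and report the validation error.

```python
# Hold-out grid search over (sigma, C). The classifier inside
# train_and_score is a stand-in (mean kernel similarity per class), NOT an
# SVM; C is therefore unused here, though a real SVM trainer would use it.
import math

def gaussian_kernel(u, v, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma ** 2)

def train_and_score(train_X, train_y, val_X, val_y, sigma, C):
    errors = 0
    for x, label in zip(val_X, val_y):
        scores = {}
        for c in set(train_y):
            ks = [gaussian_kernel(x, xi, sigma)
                  for xi, yi in zip(train_X, train_y) if yi == c]
            scores[c] = sum(ks) / len(ks)
        if max(scores, key=scores.get) != label:
            errors += 1
    return errors / len(val_X)

train_X = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]]
train_y = [0, 0, 1, 1]
val_X = [[0.1, 0.0], [2.9, 3.1]]
val_y = [0, 1]

grid = [(s, C) for s in (0.1, 1.0, 10.0) for C in (0.1, 1.0, 10.0)]
best = min(grid, key=lambda p: train_and_score(train_X, train_y, val_X, val_y, *p))
print(best)
```

In a real setting, the grid would be evaluated by cross-validation rather than a single hold-out split, as described above.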

Resolution of Multiclass Problems with the SVM

The SVM is formulated for the binary classification problem. However, there are several techniques used to

combine several binary SVMs in order to build a system for the multiclass problem (e.g., a 10-class digit recognition problem). Two popular methods are presented here:

One Versus the Rest: The idea of one versus the rest is to construct as many SVMs as there are classes, where each SVM is trained to separate one class from the rest. Thus, for a c-class problem, c SVMs are built and combined to perform multiclass classification according to the maximal output. The ith SVM is trained with all the examples in the ith class given positive labels, and all the other examples given negative labels. This is also known as the One-Against-All method.

Pairwise (or One Against One): The idea of pairwise is to construct c(c-1)/2 SVMs for a c-class problem, each SVM being trained on one of the possible pairs of classes. A common way to make a decision with the pairwise method is by voting: a rule for discriminating between every pair of classes is constructed, and the class with the largest vote is selected.
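The two combination schemes can be sketched as follows; the per-class scorers and pairwise deciders below are trivial stand-ins for trained binary SVMs, and the 1D "centers" dataset is hypothetical.

```python
# Combining binary classifiers for a c-class problem: one-vs-rest picks the
# class with maximal decision value; pairwise builds c(c-1)/2 deciders and
# votes. Scorers below are stand-ins for trained binary SVMs.
from itertools import combinations
from collections import Counter

def one_vs_rest(scorers, x):
    """scorers[c](x) is the decision value of the 'class c vs rest' machine;
    pick the class with maximal output."""
    return max(scorers, key=lambda c: scorers[c](x))

def pairwise(binary, classes, x):
    """binary[(i, j)](x) returns i or j; the class with most votes wins."""
    votes = Counter(binary[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy 1D example with three classes centered at 0, 5 and 10.
centers = {0: 0.0, 1: 5.0, 2: 10.0}
scorers = {c: (lambda x, m=m: -abs(x - m)) for c, m in centers.items()}
binary = {(i, j): (lambda x, i=i, j=j:
                   i if abs(x - centers[i]) < abs(x - centers[j]) else j)
          for i, j in combinations(centers, 2)}

print(one_vs_rest(scorers, 4.2))             # -> 1
print(pairwise(binary, list(centers), 9.0))  # -> 2
```

Note that the pairwise scheme builds $c(c-1)/2 = 3$ deciders for $c = 3$ classes, as described above.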

SVM Variants

The least squares SVM (LS-SVM) is a variant of the standard SVM, and constitutes a response to the following question: how much can the SVM formulation be simplified without losing any of its advantages? Suykens and Vandewalle [10] proposed the LS-SVM, whose training algorithm solves a convex problem like that of the SVM. In addition, training the LS-SVM is simplified, since a linear problem is solved instead of a quadratic problem as in the SVM case.


The transductive SVM (TSVM) is an interesting version of the SVM, which uses transductive inference. In this case, the TSVM attempts to find the hyperplane and the labels of the test data that maximize the margin with minimum error. Thus, the labels of the test data are obtained in one step. Vapnik [1] proposed this formulation to reinforce the classifier on the test set by adding the minimization of the error on the test set to the training process. This formulation has recently been used for training semi-supervised SVMs.

Applications

The SVM is a powerful classifier which has been used successfully in many pattern recognition problems, and it has also been shown to perform well in biometric recognition applications. For example, in [11], an iris recognition system for human identification is proposed, in which the extracted iris features are fed into an SVM for classification. The experimental results show that the performance of the SVM as a classifier is far better than that of a classifier based on an artificial neural network. In another example, Yao et al. [12], in a fingerprint classification application, used recursive neural networks to extract a set of distributed features of the fingerprint which can be integrated into the SVM. Many other SVM applications, such as handwriting recognition [8, 13], can be found at www.clopinet.com/isabelle/Projects/SVM/applist.html.

Related Entries

▶ Classifier
▶ Generalization
▶ Structural Risk
▶ Training

References

1. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
2. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
3. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press (2000)
5. Joachims, T.: Making large-scale support vector machine learning practical. In: Schölkopf, Burges, Smola (eds.) Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA (1998)
6. Chapelle, O., Vapnik, V.: Model selection for support vector machines. In: Advances in Neural Information Processing Systems (1999)
7. Ayat, N.E., Cheriet, M., Suen, C.Y.: Automatic model selection for the optimization of SVM kernels. Pattern Recognit. 38(10), 1733–1745 (2005)
8. Adankon, M.M., Cheriet, M.: Optimizing resources in model selection for support vector machines. Pattern Recognit. 40(3), 953–963 (2007)
9. Adankon, M.M., Cheriet, M.: New formulation of SVM for model selection. In: IEEE International Joint Conference on Neural Networks 2006, pp. 3566–3573. Vancouver, BC (2006)
10. Suykens, J.A.K., Van Gestel, T., De Brabanter, J., De Moor, B., Vandewalle, J.: Least Squares Support Vector Machines. World Scientific, Singapore (2002)
11. Roy, K., Bhattacharya, P.: Iris recognition using support vector machine. In: IAPR International Conference on Biometric Authentication (ICBA), Hong Kong, January 2006. Springer Lecture Notes in Computer Science (LNCS), vol. 3882, pp. 486–492 (2006)
12. Yao, Y., Marcialis, G.L., Pontil, M., Frasconi, P., Roli, F.: Combining flat and structured representations for fingerprint classification with recursive neural networks and support vector machines. Pattern Recognit. 36(2), 397–406 (2003)
13. Matic, N., Guyon, I., Denker, J., Vapnik, V.: Writer adaptation for on-line handwritten character recognition. In: IEEE Second International Conference on Pattern Recognition and Document Analysis, pp. 187–191. Tsukuba, Japan (1993)

Surface Curvature

Measurements of the curvature of a surface are commonly used in 3D biometrics. The normal curvature at a point p on the surface is defined as the curvature of the curve formed by the intersection of the surface with the plane containing the normal vector and one of the tangent vectors at p. Thus, the normal curvature is a function of the tangent vector direction. The minimum and maximum values of this function are the principal curvatures $k_1$ and $k_2$ of the surface


at p. Other measures of surface curvature are the Gaussian curvature, defined as the product of the principal curvatures; the mean curvature, defined as the average of the principal curvatures; and the shape index, given by

$SI = \frac{2}{\pi} \arctan\!\left( \frac{k_2 + k_1}{k_2 - k_1} \right).$
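The measures above follow directly from the principal curvatures; the sketch below assumes the usual arctan form of the shape index (which is undefined at umbilic points, where $k_1 = k_2$):

```python
# Curvature measures from the principal curvatures k1 <= k2:
# Gaussian curvature K = k1*k2, mean curvature H = (k1+k2)/2, and the
# shape index SI = (2/pi) * arctan((k2+k1)/(k2-k1)); atan2 is used to
# avoid an explicit division.
import math

def curvature_measures(k1, k2):
    K = k1 * k2
    H = (k1 + k2) / 2.0
    SI = (2.0 / math.pi) * math.atan2(k2 + k1, k2 - k1)
    return K, H, SI

K, H, SI = curvature_measures(-1.0, 1.0)   # a saddle-like point
print(K, H, SI)  # -1.0 0.0 0.0
```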

Computation of surface curvature on discrete surfaces, such as those captured with 3D scanners, is usually accomplished by locally fitting low-order surface patches (e.g., biquadratic surfaces or splines) around each point. The above curvature features may then be computed analytically.

▶ Finger Geometry, 3D

Surface Matching

3D biometrics work by computing the similarity between the 3D surfaces of objects belonging to the same class. The majority of techniques measure the similarity among homologous salient geometric features on the surfaces (e.g., based on curvature). The localization of these features is usually based on prior knowledge of the surface class (e.g., face, hand), and thus specialized feature detectors may be used. The geometric attributes extracted are selected so that they are invariant to transformations such as rotation, translation, and scaling. In case knowledge-based feature detection is difficult, a correspondence among the surfaces may be established by randomly selecting points on the two surfaces and then trying to find pairs of points with similar geometric attributes. Several such techniques have been developed for rigid surface matching (e.g., Spin Images), and they may be extended for matching non-rigid or articulated surfaces. Another technique for establishing correspondences is fitting a parameterized deformable model to the points of each surface. Since the fitted models are deformations of the same surface, correspondence is automatically determined. Creation of such deformable models requires, however, a large number of annotated training data.

▶ Finger Geometry, 3D


Surveillance

Rama Chellappa, Aswin C. Sankaranarayanan
University of Maryland, College Park, MD, USA

Synonyms Monitoring; Surveillance

Definition

Surveillance refers to the monitoring of a scene, along with analysis of the behavior of the people and vehicles in it, for the purpose of maintaining security or keeping watch over an area. Typically, traditional surveillance involves monitoring a scene using one or more closed-circuit television (CCTV) cameras, with personnel watching and making decisions based on the video feeds obtained from the ▶ cameras. There is a growing trend towards building systems that are completely automated or operate with minimal human supervision. Biometric acquisition and processing is by far the most important component of any automated surveillance system. There are many challenges and variates that arise in the acquisition of biometrics for robust verification. Further, in surveillance, behavioral biometrics are also of potential use in many scenarios. Using the patterns observed in a scene (such as faces, speech, and behavior), the system decides on a set of actions to perform. These actions could involve access control (allowing/denying access to facilities), alerting to the presence of intruders or abandoned luggage, and a host of other security-related tasks.

Introduction

Surveillance refers to monitoring a scene using sensors for the purpose of enhanced security. Surveillance systems are becoming ubiquitous, especially in urban areas, with the growing deployment of CCTV cameras providing security in public areas such as banks and shopping malls. It is estimated that the UK alone has more than four million CCTV cameras. Surveillance technologies are also becoming common for


other applications [1] such as traffic monitoring, where it is mainly used for detecting violations and monitoring traffic. Typically, video cameras find use in detecting congestion and accidents, and in adaptive switching of traffic lights. Other typical surveillance tasks include portal control, monitoring for shoplifting, and suspect tracking, as well as post-event analysis [2]. A traditional surveillance system involves little automation. Most surveillance systems have a set of cameras monitoring a scene of interest. Data collected from these sensors are used for two purposes:

1. Real-time monitoring of the scene by human personnel.
2. Archiving of data for retrieval in the future.

In most cases, the archived data are retrieved only after an incident has occurred. This, however, is changing with the introduction of many commercial surveillance technologies that provide more automation, thereby reducing or removing the involvement of humans in the decision-making process [3]. Simultaneously, the focus has also been on visualization tools for better depiction of the data collected by the sensors, and on fast retrieval of archived data for quick forensic analysis. Surveillance systems that can detect elementary events in the video streams acquired by several cameras are commercially available today. A very general surveillance system is shown schematically in Fig. 1.

Biometrics form a critical component in all (semi-)automated surveillance systems, given the obvious need to acquire, validate, and process biometrics in various surveillance tasks. Such tasks include:

1. Verification. Validating a person's identity is useful in access control. Typically, verification can be done in a controlled manner, and can use active biometrics such as iris, face (controlled acquisition), speech, and finger/hand prints. The system is expected to use the biometrics to confirm that the person is truly whom he/she claims to be.

2. Recognition. Recognition of identity arises in tasks of intruder detection and screening, which find use in a wide host of scenarios from scene monitoring to home surveillance. This involves cross-checking the acquired biometrics against a list to obtain a match. Typically, passive acquisition methods are preferred for such tasks, making face and gait biometrics useful.

3. Abnormality detection. Behavioral biometrics find use in surveillance of public areas, such as airports and malls, where the abnormal or suspicious behavior exhibited by a single individual or a group forms the biometric of interest.

Biometrics thus finds application across a wide range of surveillance tasks. We next discuss the variates and trade-offs involved in using biometrics for surveillance.

Surveillance. Figure 1 Inputs from sensors are typically stored on capture. The relevant information is searched and retrieved only after incidents. However, in more automated systems the inputs are pre-processed for events. The system monitors for certain patterns to occur, which initiate the appropriate action. When multiple sensors are present, data across sensors might be fused for additional robustness.

Surveillance

Biometrics and Surveillance The choice of biometric to be used in a particular task depends on the match between the acquisition and processing capability of the biometric and the requirements of the task. Such characteristics include the discriminative power of the biometric, ease of acquisition, the permanence of the biometric, and miscellaneous considerations such as acceptability of its use and ▶ privacy concerns [4, 5]. Towards this end, we discuss some of the important variates that need to be considered in biometric surveillance. 1. Cooperative acquisition. Ease of acquisition is probably the most important consideration for the use of a particular biometric. Consider the task of home surveillance, where the system tries to detect intruders by comparing the acquired biometric signature to a database of individuals. It is not possible in such a task to use iris as a biometric, because acquisition of the iris pattern requires the cooperation of the subject. Similarly, for the same task, it is also unreasonable to use controlled face recognition (with known pose and illumination) as a possible biometric, for similar reasons. Using the cooperation of the subject as a basis allows us to classify biometrics into two kinds: cooperative and non-cooperative. Fingerprints, hand prints, speech (controlled), face (controlled), iris, ear, and DNA are biometrics that need the active cooperation of the subject for acquisition. These biometrics, given the cooperative nature of acquisition, can be collected reliably under a controlled setup. Such controls could be a known sentence for speech, or a known pose and favorable illumination for face. Further, the subject could be asked for multiple samples of the same biometric for increased robustness to acquisition noise and errors. In return, it is expected that the biometric performs with increased reliability, with lower false alarms and fewer mis-detections.
However, the cooperative nature of acquisition makes these biometrics unusable for a variety of operating tasks. Nonetheless, such biometrics are extremely useful for a wide range of tasks, such as secure access control, and for controlled verification tasks such as those related to passports and other identification-related documents. In contrast, acquisition of the biometric without the cooperation of the subject(s) is necessary for surveillance of regions with partially or completely unrestricted access, wherein the sheer number of subjects


involved does not merit the use of active acquisition. Non-cooperative biometrics are also useful in surveillance scenarios requiring the use of behavioral biometrics, as with behavioral biometrics the use of active acquisition methods might inherently affect the very behavior that we want to detect. Face and gait are probably the best examples of such biometrics. 2. Inherent capability of discrimination. Each biometric, depending on its inherent variations across subjects and the intra-subject variations for each individual, has limitations on the size of the dataset with which it can be used before its operating characteristics (false alarm and mis-detection rates) degrade below acceptable limits. DNA, iris, and fingerprint provide robust discrimination even when the number of individuals in the database is in the tens of thousands. Face (under controlled acquisition) can robustly recognize, with low false alarms and mis-detections, up to datasets containing many hundreds of individuals. However, the performance of face as a biometric degrades steeply with uncontrolled pose, illumination, and other effects such as aging, disguise, and emotions. Gait as a biometric provides performance capabilities similar to those of face under uncontrolled acquisition. However, as stated above, both face and gait can be captured without the cooperation of the subject, which makes them invaluable for certain applications. Their use also critically depends on the size of the database that is used. 3. Range of operation. Another criterion that becomes important in the practical deployment of systems using biometrics is the range at which acquisition can be performed. Gait, as an example, works with the human silhouette as the basic building block, which can be reliably captured at ranges up to 100 m (assuming a common deployment scenario). In contrast, fingerprint needs contact between the subject and the sensor. Similarly, iris requires the subject to be at much closer proximity than what is required for face. 4. Miscellaneous considerations. There exist a host of other considerations that decide the suitability of a biometric for a particular surveillance application. These include the permanence of the biometric, security considerations such as the ease of imitating or tampering, and privacy considerations in its acquisition and use [4, 5]. For example, the permanence of face as a biometric depends on the degradation of its discriminating capabilities as the subject ages [6, 7].


Similarly, the wear of fingerprints with use becomes an issue for consideration. Finally, privacy considerations play an important role in the acceptability of the biometrics’ use in commercial systems.

Behavioral Biometrics in Surveillance Behavioral biometrics are very important for surveillance, especially towards identifying critical events before or as they happen. In general, the visual modality (cameras) is most useful for capturing behavioral information, although there has been some preliminary work on using motion sensors for similar tasks. In the presence of a camera, the processing of data to obtain such biometrics falls under the category of event detection. In the context of surveillance systems, these can be broadly divided into those that model actions of single objects and those that handle multi-object interactions. In the case of single objects, an understanding of the activity (behavior) being performed is of immense interest. Typically, the object is described in terms of a feature vector [8] whose representation is suitable for identifying the activities while marginalizing nuisance parameters such as the identity of the object or the view and illumination. Stochastic models such as Hidden Markov Models and Linear Dynamical Systems have been shown to be efficient in modeling activities. In these, the temporal dynamics of the activity are captured using state-space models, which form a generative model for the activity. Given a test activity, it is possible to evaluate the likelihood of the test sequence arising from the learnt model. Capturing the behavioral patterns exhibited by multiple actors is of immense importance in many surveillance scenarios. Examples of such interactions include an individual exiting a building and driving

a car, or an individual casing vehicles. Many other scenarios, such as abandoned vehicles and dropped objects, fit under this category. Such interactions can be modeled using context-free grammars [9, 10] (Fig. 2). Detection and tracking data are typically parsed by the rules describing the grammar, and the likelihood of the particular sequence of tracking information conforming to the grammar is estimated. Other approaches rely on motion analysis of the humans accompanying the abandoned objects. The challenges in using behavioral biometrics for surveillance tasks lie in making algorithms robust to variations in pose, illumination, and identity. There is also the need to bridge the gap between the tools for representation and processing used for identifying biometrics exhibited by individuals and those exhibited by groups of people. In this context, motion sensors [11] provide an alternate way of capturing behavioral signatures of groups. Motion sensors register time-instants when the sensor observes motion in its range. While this information is very sparse, without any ability to recognize people or disambiguate between multiple targets, a dense deployment of motion sensors along with cameras can be very powerful.
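The single-object likelihood evaluation described above can be illustrated with a toy discrete-observation HMM. All models, probabilities, and the quantized feature sequence below are invented for illustration (not from any cited system); the sketch uses the forward algorithm to score a test sequence against two hypothetical learnt activity models:

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm (per-step scaling for stability)."""
    alpha = pi * B[:, obs[0]]              # belief over hidden states at t=0
    log_lik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # predict with A, weight by emission
        log_lik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return log_lik

# Two hypothetical activity models, e.g., "walking" vs. "loitering".
pi = np.array([0.6, 0.4])                      # initial state probabilities
A_walk = np.array([[0.9, 0.1], [0.1, 0.9]])    # persistent state dynamics
A_loiter = np.array([[0.5, 0.5], [0.5, 0.5]])  # memoryless dynamics
B = np.array([[0.8, 0.2], [0.2, 0.8]])         # emission probabilities

seq = [0, 0, 0, 1, 1, 1]                       # quantized feature-vector sequence
ll_walk = forward_log_likelihood(seq, pi, A_walk, B)
ll_loiter = forward_log_likelihood(seq, pi, A_loiter, B)
# The test activity is assigned to the model with the larger likelihood.
```

Here the persistent sequence scores higher under the "sticky" transition model, which is the essence of classifying an activity by model likelihood.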

Conclusion In summary, biometrics are an important component of automated surveillance, and help in the tasks of recognition and verification of a target’s identity. Such tasks find application in a wide range of surveillance applications. The use of a particular biometric for a surveillance application depends critically on the match between the properties of the biometrics and the needs of the application. In particular, attributes

Surveillance. Figure 2 Example frames from a detected casing incident in a parking lot. The algorithm described in [10] was used to detect the casing incident.


such as ease of acquisition, range of acquisition, and discriminating power form important considerations in the choice of the biometric used. In surveillance, behavioral biometrics are useful in identifying suspicious behavior, and find use in a range of scene monitoring applications.

Related Entries ▶ Border Control ▶ Law Enforcement ▶ Physical Access Control ▶ Face Recognition, Video Based


References

1. Remagnino, P., Jones, G.A., Paragios, N., Regazzoni, C.S.: Video-Based Surveillance Systems: Computer Vision and Distributed Processing. Kluwer, Dordrecht (2001)
2. Zhao, W., Chellappa, R., Phillips, P., Rosenfeld, A.: Face recognition: a literature survey. ACM Comput. Surv. 35, 399–458 (2003)
3. Shu, C., Hampapur, A., Lu, M., Brown, L., Connell, J., Senior, A., Tian, Y.: IBM smart surveillance system (S3): an open and extensible framework for event based surveillance. In: IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 318–323 (2005)
4. Prabhakar, S., Pankanti, S., Jain, A.: Biometric recognition: security and privacy concerns. IEEE Security & Privacy 1, 33–42 (2003)
5. Liu, S., Silverman, M.: A practical guide to biometric security technology. IT Professional 3, 27–32 (2001)
6. Ramanathan, N., Chellappa, R.: Face verification across age progression. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (2005)
7. Ling, H., Soatto, S., Ramanathan, N., Jacobs, D.: A study of face recognition as people age. In: IEEE 11th International Conference on Computer Vision (ICCV), pp. 1–8 (2007)
8. Veeraraghavan, A., Roy-Chowdhury, A.K., Chellappa, R.: Matching shape sequences in video with applications in human movement analysis. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1896–1909 (2005)
9. Moore, D., Essa, I.: Recognizing multitasked activities from video using stochastic context-free grammar. In: Workshop on Models versus Exemplars in Computer Vision (2001)
10. Joo, S., Chellappa, R.: Recognition of multi-object events using attribute grammars. In: IEEE International Conference on Image Processing, pp. 2897–2900 (2006)
11. Wren, C., Ivanov, Y., Leigh, D., Westhues, J.: The MERL motion detector dataset. In: Workshop on Massive Datasets (2007), Technical report

SVM ▶ Support Vector Machine

SVM Supervector An SVM (Support Vector Machine) is a two-class classifier. It is constructed as a sum of kernel functions K(·,·):

f(x) = Σ_{i=1}^{L} a_i t_i K(x, x_i) + d    (1)

where the t_i are the ideal outputs (−1 for one class and +1 for the other class) and Σ_{i=1}^{L} a_i t_i = 0 (a_i > 0). The vectors x_i are the support vectors (belonging to the training vectors) and are obtained by using an optimization algorithm. A class decision is based upon the value of f(x) with respect to a threshold. The kernel function is constrained to verify the Mercer condition: K(x, y) = b(x)^t b(y), where b(x) is a mapping from the input space (containing the vectors x) to a possibly infinite-dimensional SVM expansion space. In the case of speaker verification, given a universal background model (GMM-UBM):

g(x) = Σ_{i=1}^{M} ω_i N(x; m_i, Σ_i),    (2)

where ω_i are the mixture weights, N(·) is a Gaussian, and (m_i, Σ_i) are the means and covariances of the Gaussian components. A speaker (s) model is a GMM obtained by adapting the UBM using the MAP procedure (only the means are adapted: m_i^s). In this case the kernel function can be written as:

K(s1, s2) = Σ_{i=1}^{M} (√ω_i Σ_i^{−1/2} m_i^{s1})^t (√ω_i Σ_i^{−1/2} m_i^{s2}).    (3)

The kernel of the above equation is linear in the GMM supervector space and hence it satisfies the Mercer condition. ▶ Session Effects on Speaker Modeling
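As a hedged numerical sketch of Eq. (3): the kernel reduces to an ordinary inner product between stacked, rescaled mean vectors ("supervectors"). All weights, covariances, and adapted means below are toy values chosen for illustration, with diagonal covariances assumed:

```python
import numpy as np

def supervector(weights, diag_covs, means):
    """Stack each adapted mean m_i scaled by sqrt(w_i) * Sigma_i^(-1/2),
    as in Eq. (3) (diagonal covariances assumed)."""
    parts = [np.sqrt(w) * m / np.sqrt(c)
             for w, c, m in zip(weights, diag_covs, means)]
    return np.concatenate(parts)

weights = [0.3, 0.7]                                       # mixture weights (toy)
diag_covs = [np.array([1.0, 4.0]), np.array([2.0, 2.0])]   # diagonal covariances
means_s1 = [np.array([0.5, 1.0]), np.array([-1.0, 0.2])]   # speaker 1 adapted means
means_s2 = [np.array([0.4, 1.1]), np.array([-0.9, 0.1])]   # speaker 2 adapted means

sv1 = supervector(weights, diag_covs, means_s1)
sv2 = supervector(weights, diag_covs, means_s2)
# Eq. (3) is a plain inner product in supervector space, hence a Mercer kernel:
k = sv1 @ sv2
```

Because the kernel is linear in supervector space, symmetry and positive semi-definiteness hold by construction.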


Sweep Sensor It refers to a fingerprint sensor across which the finger must be swept over the platen during capture. Its capture area is very small, comprising only a few pixel lines. ▶ Fingerprint, Palmprint, Handprint and Soleprint Sensor

Synthetic Fingerprint Generation ▶ Fingerprint Sample Synthesis ▶ SFinGe

Synthetic Fingerprints ▶ Fingerprint Sample Synthesis

Synthesis Attack Synthesis attack is similar to replay attack in that it also involves the recording of voice samples from a legitimate client. However, these samples are used to build a model of the client’s voice, which can in turn be used by a text-to-speech synthesizer to produce speech that is similar to the voice of the client. The text-to-speech synthesizer could then be controlled by an attacker, for example, by using the keyboard of a notebook computer, to produce any words or sentences that may be requested by the authentication system in the client’s voice in order to achieve false authentication. ▶ Liveness Assurance in Face Authentication ▶ Liveness Assurance in Voice Authentication ▶ Security and Liveness, Overview

Synthetic Biometrics ▶ Biometric Sample Synthesis

Synthetic Iris Images ▶ Iris Sample Synthesis

Synthetic Voice Creation ▶ Voice Sample Synthesis

System-on-card A system-on-card smartcard contains a complete biometric verification system, which includes data acquisition, processing, and matching. ▶ On-Card Matching


Score Fusion ▶ Fusion, Score-Level ▶ Multiple Experts

Score Fusion and Decision Fusion Score fusion is a paradigm which calculates similarity scores for each of the two modalities, then combines the two scores according to a fusion formula, e.g., the overall score is calculated as the mean of the two modality scores. Decision fusion is a paradigm which makes an accept–reject decision for each of the two modalities, then combines the two decisions according to a fusion rule, e.g., the unknown sample is accepted only if both modalities yield an accept decision. ▶ Multibiometrics, Overview
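The two paradigms can be sketched in a few lines. The mean rule and the AND rule are the examples given in the entry; real systems may use other fusion formulas:

```python
def score_fusion(score_a, score_b):
    """Mean-rule score fusion: combine two modality similarity scores."""
    return (score_a + score_b) / 2.0

def decision_fusion(accept_a, accept_b):
    """AND-rule decision fusion: accept only if both modalities accept."""
    return accept_a and accept_b

fused_score = score_fusion(0.8, 0.6)           # overall similarity of 0.7
fused_decision = decision_fusion(True, False)  # rejected: one modality failed
```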

Score Normalization The score normalization techniques aim, generally, to reduce score variabilities in order to facilitate the estimation of a unique speaker-independent threshold during the decision step. Most of the current normalization techniques are based on the estimation of the impostor score distribution, where the mean, μ, and the standard deviation, v, depend on the considered speaker model and/or test utterance. These mean and standard deviation values will then be used to normalize any incoming score s using the normalization function:

score_N(s) = (s − μ) / v

Two main score normalization techniques used in speaker recognition are: 1. Znorm. The zero normalization (Znorm) method (and its variants like Hnorm (Heck, L.P., Weintraub, M.: Handset-dependent background models for robust text-independent speaker recognition. In: ICASSP (1997))) normalizes the score distribution using the claimed speaker statistics. In other words, the claimed speaker model is tested against a set of impostors, resulting in an impostor similarity score distribution which is then used to estimate the normalization parameters μ and v. The main advantage of Znorm is that the estimation of these parameters can be performed during the training step. 2. Tnorm. The test normalization (Tnorm) (Auckenthaler, R., Carey, M., Lloyd-Thomas, H.: Score normalization for text-independent speaker verification systems. Digital Signal Processing 10, 42–54 (2000)) is another score normalization technique in which the parameters μ and v are estimated using the test utterance. Thus, during testing, a set of impostor models is used to calculate impostor scores for the given test utterance; μ and v are estimated using these scores. Tnorm is known to improve performance particularly in the region of low false alarms.
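A minimal sketch of the normalization arithmetic above. The cohort scores are invented for illustration; only the source of the impostor scores distinguishes Znorm (claimed speaker's model vs. impostor utterances, precomputable at enrollment) from Tnorm (impostor models vs. the test utterance, computed at test time):

```python
import numpy as np

def score_norm(score, impostor_scores):
    """Normalize a raw score by impostor-distribution statistics:
    score_N(s) = (s - mu) / v."""
    mu = np.mean(impostor_scores)
    v = np.std(impostor_scores)
    return (score - mu) / v

cohort = np.array([0.5, 0.9, 1.1, 0.7, 0.8])  # invented impostor scores
z = score_norm(2.1, cohort)                    # 6.5 std devs above impostor mean
```

After normalization, a single speaker-independent threshold can be applied to scores from different speaker models or test conditions.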

Score Normalization Rules in Iris Recognition

Any of a number of rules for adjusting a raw similarity score in a way that takes into account factors such as the amount of data on which its calculation was based, or the quality of the data. One purpose of score normalization in biometrics is to prevent false matches from arising simply because only a few elements (e.g., biometric features) were available for comparison. An accidental match by chance would then be more like tossing a coin only a few times and producing a perfect run of all heads. Another purpose of score normalization is to make it possible to compare or to fuse different types of measurements, as in multibiometrics. For example, Z-score normalization redefines every observation in units of standard deviation from the mean, thereby allowing incommensurable scores (like height and weight) to become commensurable (e.g., he is 3.2 standard deviations heavier than normal but 2.3 standard deviations taller than normal). Frequently the goal of score normalization is to map samples from different distributions into normalized samples from a universal distribution. For example, in iris recognition a decision is made only after the similarity score (fractional Hamming Distance) has been converted into a normalized score that compensates for the number of bits that were available for comparison, thereby preventing accidental False Matches just because of a paucity of visible iris tissue. ▶ Score Normalization Rules in Iris Recognition ▶ Session Effects on Speaker Modeling ▶ Speaker Matching
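The height/weight example can be made concrete with a small sketch; the populations and the subject's measurements below are invented to echo the figures in the text:

```python
import numpy as np

def z_score(x, population):
    """Re-express an observation in standard deviations from the population mean."""
    return (x - np.mean(population)) / np.std(population)

# Toy populations (illustrative values only):
heights_cm = np.array([165.0, 170.0, 175.0, 180.0, 185.0])
weights_kg = np.array([60.0, 70.0, 80.0, 90.0, 100.0])

taller = z_score(191.0, heights_cm)    # about 2.3 standard deviations taller
heavier = z_score(125.0, weights_kg)   # about 3.2 standard deviations heavier
# Height and weight are now commensurable and can be compared or fused.
```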

Score Normalization Rules in Iris Recognition JOHN DAUGMAN Cambridge University, Cambridge, UK

Synonyms Commensurability; Decision criterion adjustment; Error probability non-accumulation; Normalised Hamming Distance

Definition All biometric recognition systems are based on similarity metrics that enable decisions of ‘‘same’’ or ‘‘different’’ to be made. Such metrics require normalizations in order to make them commensurable across comparison cases that may differ greatly in the quantity of data available, or in the quality of the data. Is a ‘‘perfect match’’ based only on a small amount of data better or worse than a less perfect match based on more data? Another need for score normalization arises when interpreting the best match found after an exhaustive search, in terms of the size of the database searched. The likelihood of a good match arising just by chance between unrelated templates must increase with the size of the search database, simply because there are more opportunities. How should a given ‘‘best match’’ score be interpreted? Addressing these questions on a principled basis requires models of the underlying probability distributions that describe the likelihood of a given degree of similarity arising by chance from unrelated sources. Likewise, if comparisons are required over an increasing range of image orientations because of uncertainty about image tilt, the probability of a good similarity score arising just by chance from unrelated templates again grows automatically, because there are more opportunities. In all these respects, biometric similarity ▶ score normalization is needed, and it plays a critical role in the avoidance of False Matches in the publicly deployed algorithms for iris recognition.

Introduction Biometric recognition of a person’s identity requires converting the observed degree of similarity between presenting and previously enrolled features into a decision of ‘‘same’’ or ‘‘different.’’ The previously enrolled features may not be merely a single feature set obtained from a single asserted identity, but may be a vast number of such feature sets belonging to an entire national population, when identification is performed by exhaustively searching a database for a sufficiently good match. The ▶ similarity metrics used for each comparison between samples might be simple correlation statistics, or vector projections, or listings of the features (like fingerprint minutiae coordinates and directions) that agreed and of those that disagreed as percentages of the total set of features extracted. For each pair of feature sets being compared, varying amounts of data may be available, and the sets might need to be compared under various transformations such as image rotations when the orientation is uncertain. An example is seen


Score Normalization Rules in Iris Recognition. Figure 1 Illustration of limited data being available in an iris image due to eyelid occlusion, as detected in a segmentation process.

in Figure 1, in which only 56% of the annular iris area is visible between the eyelids. Iris images may have also been acquired with a tilted camera (not unusual for handheld cameras), or with the head tilted or the eye rotated (cyclovergence) by an unknown degree, requiring comparisons to be made over a range of configurations for each of the possible identities, and with varying amounts of template data being available in each case. This article is concerned with the methods of ▶ score normalization that are used in iris recognition to make all of those comparison cases ▶ commensurable with each other, preventing False Match probability from rising simply because there is less data available for comparison or because there are many more candidates and match configurations to be considered.

Score Normalisation by the Amount of Iris Visible The algorithms used in all current public deployments of iris recognition [2] work by a test of statistical independence: A match is declared when two templates fail the test of statistical independence; comparisons between different eyes are statistically guaranteed to pass that test [1]. The test of independence is based on measuring the fraction of bits that disagreed between two templates, called ▶ IrisCodes, and so the similarity metric is a ▶ Hamming Distance between 0 and 1. (The method by which an IrisCode is created is described in this encyclopedia in the entry on Iris Encoding and Recognition using Gabor Wavelets.)

If two IrisCodes were derived from different eyes, about half of their bits should agree and half should disagree (since any given bit is equally likely to be 1 or 0), and so a Hamming Distance close to 0.5 is expected. If both IrisCodes were computed from the same eye, then a much larger proportion of the bits should agree since they are not independent, and so a Hamming Distance much closer to 0 is expected. But what is the effect of having varying numbers of bits available for comparison, for example, because of eyelid occlusion? Eyelid boundaries are detected (as illustrated by the spline curve graphics in Figure 1 where each lid intersects the iris), and the parts of the IrisCode that are then unavailable are marked as such by setting masking bits. The box in the lower-left corner of Figure 1 shows Active Contours computed to describe the pupil boundary (lower ‘‘snake’’) and the iris outer boundary (upper snake). As these snakes are curvature maps, a circular boundary would be described by a snake that was flat and straight. The two thick grey regions in the box containing the upper snake represent the limited regions where the iris outer boundary is visible and possesses a large radial gradient (or derivative) in brightness. The gaps that separate the two thick grey regions correspond to parts of the trajectory around the iris where no such boundary is visible, because it is occluded by eyelids. Thus the outer boundary of the iris must be estimated (dotted curve) by two quite limited areas on the left and right sides of the iris where it is visible. In the coordinate system that results, the iris regions obscured by eyelids are marked as such by masking bits. The logic for comparing two IrisCodes to generate a raw Hamming Distance HDraw is given in Equation (1), where the data parts of the two IrisCodes are denoted {codeA, codeB} and the vectors of corresponding masking bits are denoted {maskA, maskB}:

HDraw = ‖(codeA ⊗ codeB) ∩ maskA ∩ maskB‖ / ‖maskA ∩ maskB‖    (1)
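Equation (1) can be sketched directly with bit arrays. The random 2048-bit codes and the occlusion pattern below are purely illustrative:

```python
import numpy as np

def hd_raw(codeA, codeB, maskA, maskB):
    """Fractional Hamming distance of Eq. (1): the fraction of disagreeing
    bits among those unmasked (valid) in both IrisCodes."""
    valid = maskA & maskB                 # bits usable in both codes (AND)
    disagree = (codeA ^ codeB) & valid    # XOR detects bit disagreement
    return disagree.sum() / valid.sum()

rng = np.random.default_rng(0)
codeA = rng.integers(0, 2, 2048, dtype=np.uint8)   # toy 2048-bit IrisCodes
codeB = rng.integers(0, 2, 2048, dtype=np.uint8)
mask = np.ones(2048, dtype=np.uint8)

hd_unrelated = hd_raw(codeA, codeB, mask, mask)    # near 0.5 for different eyes
half = mask.copy()
half[:1024] = 0                                    # e.g., eyelid occlusion
hd_occluded = hd_raw(codeA, codeB, mask, half)     # uses only ~1024 bit pairings
```

Note how masking shrinks the denominator: fewer valid bit pairings mean a noisier estimate of the true disagreement fraction, which is what motivates the normalization discussed below.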

The symbol ⊗ signifies the logical Exclusive-OR (XOR) operator which detects disagreement between bits; ∩ signifies logical AND, whereby the masks discount data bits where occlusions occurred; and the norms ‖·‖ count the number of bits that are set in the result. Bits may be masked for several reasons other than eyelid or eyelash occlusion. They are also deemed


unreliable if specular reflections are detected in the part of the iris they encode, or if the signal-to-noise ratio there is poor, for example, if the local texture energy is so low that the computed wavelet coefficients fall into the lowest quartile of their distribution, or on the basis of low entropy (information density). The number of bit pairings available for comparison between two IrisCodes, ‖maskA ∩ maskB‖, is usually almost a thousand. But if one of the irises has (say) almost complete occlusion of its upper half by a drooping upper eyelid, and if the other iris being compared with it has almost complete occlusion of its lower half, then the common area available for comparison may be almost nil. How can the test of statistical independence remain a valid and powerful basis for recognition when very few bits are actually being compared? It may well be that a less exact match on a larger quantity of data is better evidence of a match than is a perfect match on less data. An excellent analogy is a test of whether or not a coin is ‘‘fair’’ (i.e., gives unbiased outcomes when tossed): Getting a result of 100% ‘‘heads’’ in few tosses (e.g., 10 tosses) is actually much more consistent with it being a fair coin than getting a result of 60% / 40% after 1,000 tosses. (The latter result is 6.3 standard deviations away from expectation, whereas the former result is only 3.2 standard deviations away from expectation; so the 60/40 result is actually much stronger evidence against the hypothesis of a fair coin than is the result of ‘‘all heads in


10 tosses’’.) Similarly, in biometric comparisons, getting perfect agreement between two samples that extracted only ten features may be much weaker evidence of a good match than a finding of 60% agreement among a much larger number of extracted features. This is illustrated in Table 1 for an actual database of 632,500 IrisCodes computed from different eyes in a border-crossing application in the Middle East [3]. A database of this size allows 200 billion different pair comparisons to be made, yielding a distribution of 200 billion HDraw similarity scores between different eyes. These HDraw scores were broken down into seven categories by the number of bits mutually available for comparison (i.e., unmasked) between each pair of IrisCodes; those bins constitute the columns of Table 1, ranging from 400 bits to 1,000 bits being compared. The rows in Table 1 each correspond to a particular decision threshold being applied; for example, the first row is the case that a match is declared if HDraw is 0.260 or smaller. The cells in the Table give the observed False Match Rate in this database for each decision rule and for each range of numbers of bits being compared when computing HDraw. Using the findings in Table 1, it is informative to compare performance for two decision criteria: a very conservative criterion of HDraw = 0.260 (the first row), and a more liberal criterion HDraw = 0.285 (the sixth row) which allows more bits to disagree (28.5%) while still declaring a match. Now if the False Match Rates

Score Normalization Rules in Iris Recognition. Table 1 False Match Rate without score normalisation: dependence on number of bits compared and criterion HDCrit

HDCrit   400 bits   500 bits   600 bits   700 bits   800 bits   900 bits   1,000 bits
0.260    2×10^-9    5×10^-10   3×10^-10   1×10^-10   0          0          0
0.265    3×10^-9    8×10^-10   5×10^-10   2×10^-10   4×10^-11   0          0
0.270    4×10^-9    1×10^-9    9×10^-10   5×10^-10   2×10^-10   0          0
0.275    7×10^-9    2×10^-9    1×10^-9    9×10^-10   5×10^-10   3×10^-11   0
0.280    1×10^-8    4×10^-9    2×10^-9    2×10^-9    1×10^-9    2×10^-10   0
0.285    2×10^-8    7×10^-9    4×10^-9    3×10^-9    2×10^-9    5×10^-10   2×10^-11
0.290    3×10^-8    1×10^-8    8×10^-9    7×10^-9    4×10^-9    1×10^-9    1×10^-10
0.295    4×10^-8    2×10^-8    1×10^-8    1×10^-8    9×10^-9    3×10^-9    4×10^-10
0.300    6×10^-8    3×10^-8    3×10^-8    2×10^-8    2×10^-8    7×10^-9    9×10^-10
0.305    9×10^-8    6×10^-8    5×10^-8    4×10^-8    4×10^-8    1×10^-8    2×10^-9
0.310    1×10^-7    1×10^-7    8×10^-8    8×10^-8    7×10^-8    3×10^-8    5×10^-9
0.315    2×10^-7    2×10^-7    1×10^-7    2×10^-7    1×10^-7    6×10^-8    1×10^-8
0.320    3×10^-7    3×10^-7    2×10^-7    3×10^-7    3×10^-7    1×10^-7    2×10^-8


are compared in the first and last columns of these rows, namely when only about 400 bits are available for comparison and when about 1,000 bits are compared, it can be seen that, in fact, the more conservative criterion (0.260) actually produces 100 times more False Matches using 400 bits than does the more liberal (0.285) criterion when using 1,000 bits. Moreover, the row corresponding to the HDraw = 0.285 decision criterion reveals that the False Match Rate is 1,000 times greater when only 400 bits are available for comparison than when 1,000 bits are compared. The numerical data of Table 1 is plotted in Figure 2 as a surface, showing how the logarithm of the False Match Rate decays as a function of both variables. The surface plot reveals that there is a much more rapid attenuation of False Match Rate with increase in the number of bits available for comparison (lower-left axis) than by reduction of the HDraw decision criterion in the range of 0.260–0.320 (lower-right axis). This is to be expected, given that iris recognition works by a test of statistical independence. The observations of

Table 1 and Figure 2 clearly demonstrate the need for similarity scores to be normalized by the number of bits compared when calculating them. A natural choice for the score normalization rule is to rescale all deviations from HDraw = 0.5 in proportion to the square root of the number of bits that were compared when obtaining that score. The reason for such a rule is that the expected standard deviation in the distribution of coin-tossing outcomes (expressed as a fraction of the n tosses having a given outcome) is σ = √(pq/n), where p and q are the respective outcome probabilities (both nominally 0.5 in this case). Thus, decision confidence levels can be maintained irrespective of how many bits n were actually compared, by mapping each raw Hamming Distance HDraw into a normalized score HDnorm using a re-scaling rule such as:

HDnorm = 0.5 − (0.5 − HDraw) √(n/911)    (2)

This normalization should transform all samples of scores obtained when comparing different eyes into

Score Normalization Rules in Iris Recognition. Figure 2 The data of Table 1 plotted as a surface in semilogarithmic coordinates, showing a range factor of 10,000-to-1 in the False Match Rate as the number of bits compared ranges from 400 to 1,000. This bit count is more influential than is the HDraw decision criterion for unnormalised scores in the 0.260 - 0.320 range.


samples drawn from the same ▶ binomial distribution, whereas the raw scores HDraw might be samples from many different binomial distributions having standard deviations s dependent on the number of bits n that were actually available for comparison. This normalization maintains constant confidence levels for decisions using a given Hamming Distance threshold, regardless of the value of n. The scaling parameter 911 is the typical number of bits compared (unmasked) between two different irises. The effect of using this normalization rule (‘‘SQRT’’) is shown in Figure 3 for the 200 billion comparisons between different irises, plotting the observed False Match Rate as a function of the new HDnorm normalized decision criterion. Also shown for comparison is the unnormalized case (upper curve), and a ‘‘hybrid’’ normalization rule which is a linear combination of the other two, taking into account the number of bits compared only when in a certain range [4]. The benefit of score normalization is profound: it is noteworthy that in this semilogarithmic plot, the ordinate spans a factor of 300,000 to 1.
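The SQRT rescaling rule of Eq. (2) is straightforward to apply in code; a minimal sketch in plain Python, with the typical bit count 911 from the text as the default scaling parameter:

```python
import math

def normalize_hd(hd_raw, n_bits, n_typical=911):
    """Rescale a raw Hamming distance by sqrt(n/911) (Eq. 2), so that scores
    obtained from different numbers of unmasked bits share one binomial
    impostor distribution. Fewer compared bits pull the score toward 0.5,
    i.e., the match criterion becomes more demanding."""
    return 0.5 - (0.5 - hd_raw) * math.sqrt(n_bits / n_typical)
```

For example, a raw score of 0.30 computed from only 400 bits normalizes to about 0.37, while the same raw score over 911 bits is left unchanged.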


The price paid for achieving this profound benefit in robustness against False Matches is that the match criterion becomes more demanding when less of the iris is visible. Table 2 shows what fraction of bits HDraw (column 3) is allowed to disagree while still accepting a match, as a function of the actual number of bits that were available for comparison (column 1) or the approximate percent of the iris that is visible (column 2). In every case shown in this Table, the probability of making a False Match is about 1 in a million; but it is clear that when only a very little part of two irises can be compared with each other, the degree of match required by the decision rule becomes much more demanding. Conversely, if more than 911 bits (the typical case, corresponding to about 79% of the iris being visible) are available for comparison, then the decision rule becomes more lenient in terms of the acceptable HDraw while still maintaining the same net confidence level. Finally, another cost of using this score normalization rule is apparent if one operates in a region of the ROC curve corresponding to a very nondemanding


Score Normalization Rules in Iris Recognition. Figure 3 Comparing the effects of three score normalisation rules on False Match Rate as a function of Hamming Distance.


Score Normalization Rules in Iris Recognition. Table 2 Effect of score normalisation on the match quality required with various amounts of iris visibility

Number of bits compared   Approximate percent of iris visible (%)   Maximum acceptable fraction of bits disagreeing
200                       17                                        0.13
300                       26                                        0.19
400                       35                                        0.23
500                       43                                        0.26
600                       52                                        0.28
700                       61                                        0.30
800                       69                                        0.31
911                       79                                        0.32
1,000                     87                                        0.33
1,152                     100                                       0.34

False Match Rate, such as 0.001, which was the basis for NIST ICE (Iris Challenge Evaluation 2006) reporting. The ICE iris database contained many very difficult and corrupted images, often in poor focus, with much eyelid occlusion, with motion blur, raster shear, and sometimes with the iris partly outside the image frame. As ROC curves require False Matches, NIST used a much more liberal decision criterion than is used in any actual deployments of iris recognition. As seen in Figure 4, using liberal thresholds that generate False Match Rates (FMR) in the range of 0.001–0.00001, score normalization adversely impacts the ROC curve by increasing the False nonMatch Rate (FnMR). The Equal Error Rate (where FnMR = FMR, indicated by the solid squares) is about 0.001 without score normalization, but 0.002 with the normalization. Similarly, at other nominal points of interest in this region of the ROC curve, as tabulated within Figure 4, the cost of score normalization is roughly a doubling of the FnMR, because marginal valid matches are rejected due to the penalty on fewer bits having been available for comparison. In conclusion, whereas Table 1 and Figures 2 and 3 document the important benefit of score normalization when operating with very large databases that require several orders of magnitude higher confidence against False Matches, Figure 4 shows that in scenarios which are much less demanding for FMR, the FnMR is noticeably penalized by score normalization, and so the ROC curve suffers.

Adapting Decision Thresholds to the Size of a Search Database

Using the SQRT normalization rule, Figure 5 presents a histogram of all 200 billion cross-comparison similarity scores HDnorm among the 632,500 different irises in the Middle Eastern database [3]. The vast majority of these IrisCodes from different eyes disagreed in roughly 50% of their bits, as expected, since the bits are equiprobable and uncorrelated between different eyes [2, 1]. Very few pairings of IrisCodes could disagree in fewer than 35% or more than 65% of their bits, as is evident from the distribution. The form of this distribution needs to be understood, assuming that it is typical and predictive of any other database, in order to understand how to devise decision rules that compensate for the scale of a search. Without this form of score normalization by the scale of the search, or an adaptive decision threshold rule, False Matches would occur simply because large databases provide so many more opportunities for them. The solid curve that fits the distribution data very closely in Figure 5 is a binomial probability density function. This theoretical form was chosen because comparisons between bits from different IrisCodes are Bernoulli trials, or conceptually "coin tosses," and Bernoulli trials generate binomial distributions. If one tossed a coin whose probability of "heads" is p in a series of n independent tosses, counted the number m of "heads" outcomes, and tallied the fraction x = m/n in many such repeated runs of n tosses, then the expected distribution of x would be as per the solid curve in Figure 5:

f(x) = n!/(m!(n − m)!) · p^m (1 − p)^(n−m)    (3)
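Equation (3) can be evaluated directly, and summing its lower tail gives the impostor False Match Rate at a given Hamming Distance criterion. A minimal sketch in plain Python; note that, as the surrounding text explains, the effective number n of independent bit comparisons is an assumption supplied by the caller, and is smaller than the raw bit count because of correlations:

```python
import math

def binomial_pmf(m, n, p=0.5):
    """Probability of exactly m 'disagree' outcomes among n effectively
    independent bit comparisons (Eq. 3)."""
    return math.comb(n, m) * p ** m * (1 - p) ** (n - m)

def impostor_fmr(criterion, n, p=0.5):
    """Lower-tail probability P(HD <= criterion): the False Match Rate
    for impostor comparisons at a given HD decision criterion."""
    return sum(binomial_pmf(m, n, p) for m in range(int(criterion * n) + 1))
```

The rapidly attenuating binomial tails discussed below are visible numerically: lowering the criterion by a small amount reduces the tail sum by orders of magnitude.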

The analogy between tossing coins and comparing bits between different IrisCodes is deep but imperfect, because any given IrisCode has internal correlations arising from iris features, especially in the radial direction [2]. Further correlations are introduced because the patterns are encoded using 2D Gabor wavelet filters, whose lowpass aspect introduces correlations in amplitude, and whose bandpass aspect introduces correlations in phase, both of which linger to an extent that is inversely proportional to the filter bandwidth. The effect of these correlations is to reduce the value of the distribution parameter n to a number significantly smaller than the number of bits that are actually


Score Normalization Rules in Iris Recognition. Figure 4 Adverse impact of score normalisation in ROC regions where high False Match Rates are tolerated (e.g., 0.00001 to 0.001 FMR). In these regions, the False nonMatch Rate is roughly doubled as a result of score normalization.

compared between two IrisCodes; n becomes the number of effectively independent bit comparisons. The value of p is very close to 0.5 (empirically 0.499 for this database), because the states of each bit are equiprobable a priori, and so any pair of bits from different IrisCodes is equally likely to agree or disagree. The binomial functional form that describes so well the distribution of normalized similarity scores for comparisons between different iris patterns is key to the robustness of these algorithms in large-scale search applications. The tails of the binomial attenuate extremely rapidly, because of the dominating central tendency caused by the factorial terms in (3). Rapidly attenuating tails are critical for a biometric system to survive the vast numbers of opportunities to make False Matches without actually making any, when applied in an "all-against-all" mode of searching for any matching or multiple identities, as is contemplated in

some national ID projects. The requirements of biometric operation in "identification" mode, exhaustively searching a large database, are vastly more demanding than operating merely in one-to-one "verification" mode (in which an identity must first be explicitly asserted, and is then verified in a yes/no decision by comparison against just the single nominated template). If P_1 is the False Match probability for a single one-to-one verification trial, then (1 − P_1) is the probability of not making a False Match in a single comparison. The probability of avoiding a False Match in each of N independent comparisons is therefore (1 − P_1)^N, and so P_N, the probability of making at least one False Match when searching a database containing N different patterns, is:

P_N = 1 − (1 − P_1)^N    (4)
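Equation (4) is a one-liner; the sketch below also makes its small-P_1 behavior concrete, where P_N is very nearly N·P_1:

```python
def search_false_match_prob(p1, n_db):
    """P_N = 1 - (1 - P_1)^N (Eq. 4): probability of at least one False
    Match when exhaustively searching a database of n_db unrelated
    templates, each compared independently."""
    return 1.0 - (1.0 - p1) ** n_db
```

For example, with P_1 = 10⁻⁶, searching 1,000 templates gives P_N ≈ 10⁻³, illustrating how the required single-comparison confidence scales with database size.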


Score Normalization Rules in Iris Recognition. Figure 5 Binomial distribution of normalised similarity scores in 200 billion comparisons between different eyes. Solid curve is (3).

Note the approximation P_N ≈ N·P_1 for small P_1: the False Match Rate of a search grows in proportion to the size N of the database, so searching large databases without False Matches requires an extremely small single-comparison P_1.

Linear Discriminant Analysis

In linear discriminant analysis (LDA), each class-conditional density is modeled as a multivariate Gaussian with its own mean μ_k and a covariance matrix Σ shared by all classes:

p(x|C_k) = 1/((2π)^(D/2) |Σ|^(1/2)) exp{ −(1/2)(x − μ_k)ᵀ Σ⁻¹ (x − μ_k) }.    (3)

From (1), we have

a_k(x) = w_kᵀ x + w_k0,    (4)

where

w_k = Σ⁻¹ μ_k,    (5)

w_k0 = −(1/2) μ_kᵀ Σ⁻¹ μ_k + ln p(C_k).    (6)

We see that the equal covariance matrices make a_k(x) linear in x, and the resulting decision boundaries will also be linear. As a special case of LDA, the minimum-distance (nearest class mean) classifier can be obtained when Σ = σ²I. If the prior probabilities p(C_k) are equal, we assign a feature vector x to the class C_k with the minimum Euclidean distance ||x − μ_k||², which is equivalent to the optimum decision rule based on the maximum posterior probability. Another extension of LDA can be obtained by allowing mixtures of Gaussians for the class-conditional densities instead of a single Gaussian. Mixture discriminant analysis (MDA) [6] incorporates the Gaussian mixture distribution for the class-conditional densities to provide a richer class of density models than the single Gaussian. The class-conditional density for class C_k has the form of the Gaussian mixture model, p(x|C_k) = Σ_{r=1}^{R} π_kr N(x|μ_kr, Σ), where the mixing coefficients π_kr must satisfy π_kr ≥ 0 together with Σ_{r=1}^{R} π_kr = 1. In this model, the same covariance matrix Σ is used within and between classes. The Gaussian mixture model allows for more complex decision boundaries, although it does not guarantee the global optimum of the maximum likelihood estimates.
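The linear discriminant of Eqs. (4)–(6) can be sketched directly; a minimal plain-Python version for a D-dimensional input, taking the inverse of the shared covariance matrix as an argument (a hypothetical toy setting, not a full LDA trainer):

```python
import math

def lda_discriminant(x, mean, cov_inv, prior):
    """a_k(x) = w_k^T x + w_k0 (Eq. 4), with w_k = Sigma^-1 mu_k (Eq. 5)
    and w_k0 = -0.5 mu_k^T Sigma^-1 mu_k + ln p(C_k) (Eq. 6).
    cov_inv is the inverse of the shared covariance matrix."""
    d = len(x)
    w = [sum(cov_inv[i][j] * mean[j] for j in range(d)) for i in range(d)]
    w0 = -0.5 * sum(mean[i] * w[i] for i in range(d)) + math.log(prior)
    return sum(w[i] * x[i] for i in range(d)) + w0
```

A test point is assigned to the class whose discriminant value is largest; with Σ = I and equal priors this reduces to the nearest class mean rule described above.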

In parametric approaches to classification, we directly model the class-conditional density with a parametric form of probability distribution (e.g., a multivariate Gaussian). Many parametric methods for classification have been proposed based on different assumptions for p(x|C_k) [3, 4, 5] (see Table 1).

Supervised Learning. Table 1 Comparison among parametric methods for classification

Method                            Decision boundary
Linear discriminant analysis      Linear
Quadratic discriminant analysis   Quadratic
Naive Bayes classifier            Linear

Quadratic Discriminant Analysis

If the covariance matrices Σ_k are not assumed to be equal, then a_k(x) becomes a quadratic function of x:

a_k(x) = −(1/2)(x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) − (1/2) ln|Σ_k| + ln p(C_k).    (7)


Supervised Learning

In contrast to LDA, the decision boundaries of QDA are quadratic, which results from the assumption of different covariance matrices. Owing to the added flexibility of quadratic decision boundaries, QDA often outperforms LDA when the training set is very large. However, when the size of the training set is small compared to the dimension D of the feature space, the larger number of parameters of QDA relative to LDA causes over-fitting or ill-posed estimation of the covariance matrices. To solve this problem, various regularization or Bayesian techniques have been proposed to obtain more robust estimates:

1. Regularized discriminant analysis (RDA) [7, 8] employs a regularized form of the covariance matrices, shrinking the Σ_k of QDA toward the common covariance matrix Σ of LDA, that is, Σ_k(α) = αΣ_k + (1 − α)Σ for α ∈ [0, 1]. Additionally, the common covariance matrix Σ can be shrunk toward a scalar covariance, Σ(γ) = γΣ + (1 − γ)σ²I for γ ∈ [0, 1]. The pair of parameters is selected by cross-validation based on the classification accuracy on the training set.
2. The leave-one-out covariance estimator (LOOC) [9] finds optimal regularized covariance matrices by mixing four different covariance matrices, Σ_k, diag(Σ_k), Σ, and diag(Σ), where the mixing coefficients are determined by maximizing the average leave-one-out log-likelihood of each class.
3. Bayesian QDA introduces prior distributions over the means μ_k and the covariance matrices Σ_k [10], or over the Gaussian distributions themselves [11]. The expectations of the class-conditional densities are calculated analytically in terms of the parameters. The hyper-parameters of the prior distributions are chosen by cross-validation.
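The RDA shrinkage of item 1 is a simple convex combination of matrices; a minimal sketch (plain Python, covariance matrices as nested lists; selecting α by cross-validation is left out):

```python
def rda_covariance(cov_k, cov_pooled, alpha):
    """RDA shrinkage: Sigma_k(alpha) = alpha*Sigma_k + (1 - alpha)*Sigma.
    alpha = 1 recovers QDA's class covariance, alpha = 0 recovers LDA's
    pooled covariance; intermediate values trade flexibility for stability."""
    d = len(cov_k)
    return [[alpha * cov_k[i][j] + (1.0 - alpha) * cov_pooled[i][j]
             for j in range(d)] for i in range(d)]
```

The same one-parameter interpolation pattern applies to the second shrinkage step, Σ(γ) = γΣ + (1 − γ)σ²I.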

Naive Bayes Classifier

In the naive Bayes classifier, the conditional independence assumption factorizes the class-conditional densities:

p(x|C_k) = ∏_{i=1}^{D} p(x_i|C_k).    (8)

The component densities p(x_i|C_k) can be modeled with various parametric and nonparametric distributions, including the following:

1. For continuous features, the component densities are chosen to be Gaussian. In this case, the naive Bayes classifier is equivalent to QDA with diagonal covariance matrices for each class.
2. For discrete features, multinomial distributions are used to model the component densities. The multinomial assumption makes a_k(x), and hence the resulting decision boundaries, linear in x.
3. For nonparametric approaches, the component densities can be estimated using one-dimensional kernel density or histogram estimates.

The naive Bayes model assumption is useful when the dimensionality D of the feature space is very high, making direct density estimation in the full feature space unreliable. It is also attractive if the feature vector consists of heterogeneous features, including both continuous and discrete features.
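The factorization in Eq. (8) with Gaussian component densities (item 1) can be sketched as follows (plain Python; a full classifier would also multiply by the class prior and compare across classes):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def naive_bayes_density(x, means, sigmas):
    """Factorized class-conditional density of Eq. (8): the product of
    independent one-dimensional Gaussian component densities."""
    p = 1.0
    for xi, mu, s in zip(x, means, sigmas):
        p *= gaussian_pdf(xi, mu, s)
    return p
```

Only D one-dimensional densities are estimated per class, which is what makes the approach tractable in high dimensions.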

Nonparametric Approaches

One major problem of parametric approaches is that, for many real-world data, the actual class-conditional density has neither a linear nor a quadratic form. This causes poor classification performance, since the actual distribution of the data differs from the specified functional form regardless of the parameters. To solve this problem, one can increase the flexibility of the density model by adding more and more parameters, leading to a model with an unbounded number of parameters, called nonparametric density estimation. Alternatively, rather than modeling the whole distribution of a class, one can model only the decision boundary that separates one class from the others, since restricting the functional form of the boundary is a weaker assumption than restricting that of the whole distribution of the data. Both using a nonparametric density model and modeling a decision boundary directly are called nonparametric approaches. In this article, only the latter approach is considered. We define a function a_k(x) as a relevancy score of x for C_k, such that a_k(x) > 0 if x is more likely to be assigned to C_k, and a_k(x) < 0 otherwise. Then, the surface a_k(x) = 0 represents the decision boundary


Supervised Learning. Table 2 Comparison among non-parametric methods for classification

Method   a_k(x)                              Number of parameters               Decision boundary
k-NN     |{x^(i) ∈ C_k}|                     k                                  Nonlinear
ANNs     f_k^(L+1)(x)                        Σ_{ℓ=0}^{L} (W_ℓ + 1) W_{ℓ+1}      Linear (L = 0) or nonlinear (L > 0)
SVMs     Σ_{α_ki > 0} α_ki y_ki k(x_i, x)    O(K·N)                             Linear (k(x_i, x) = x_iᵀx) or nonlinear (otherwise)

between C_k and the other classes, and a test point x is assigned to C_k* where k* = arg max_k a_k(x); this strategy is called one-against-all. Many nonparametric methods have been derived from various models for a_k(x). We introduce three representative methods [12, 13, 14] (see Table 2):

1. The k-nearest neighbor algorithm (k-NN) chooses the k data points in the training set closest to x; a_k(x) is then the number of those selected points belonging to C_k.
2. Artificial neural networks (ANNs) represent a_k(x) as a multilayered feed-forward network. The ℓth layer consists of W_ℓ nodes, where the jth node in the layer sends a (non)linear function value f_j^(ℓ)(x) as a signal to the nodes in the (ℓ+1)th layer. Then, a_k(x) is the signal of the kth node in the final layer, f_k^(L+1)(x).
3. Support vector machines (SVMs) choose some "important" training points, called support vectors, and represent a_k(x) as a linear combination of them. SVMs are widely regarded as among the most accurate supervised learning methods for real-world data.

k-NN is widely used in biometrics, especially for computer vision applications such as face recognition and pose estimation, where both the number of images N and the dimension of the data D are quite large. However, traditional k-NN takes O(ND) time to compute the distances between a test point x and all training points x_1, ..., x_N, which is too inefficient for practical use. Thus, extensive research has focused on fast approximations based on hashing, embeddings, and related techniques [15].
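The brute-force baseline that these approximations accelerate can be sketched as follows (plain Python, squared Euclidean distance, majority vote among the k nearest labels):

```python
from collections import Counter

def knn_classify(x, points, labels, k):
    """Brute-force k-NN: rank all training points by squared Euclidean
    distance to x, then majority-vote among the k nearest labels. This is
    the O(ND) computation that hashing/embedding methods approximate."""
    order = sorted(range(len(points)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], x)))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```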

k-Nearest Neighbor Algorithm

Given a set of data points X = {x_1, x_2, ..., x_N} and a set of corresponding labels Y = {y_1, y_2, ..., y_N}, k-NN assigns a label to a test data point x by majority voting, that is, by choosing the most frequently occurring label in {y^(1), y^(2), ..., y^(k)}, where x^(i) denotes the ith nearest point to x in X and y^(i) is the label of x^(i). That is, we have

a_k(x) = |{x^(i) ∈ C_k}|,    (9)

where |·| denotes the number of elements in a set. The decision boundary is not restricted to a specific functional form. It depends only on the local distribution of neighbors and on the choice of k; larger k makes the decision boundary smoother.

Artificial Neural Networks

In ANNs, the signal of the jth node in the (ℓ+1)th layer is determined by the signals from the ℓth layer:

f_j^(ℓ+1)(x) = g( w_j^(ℓ)ᵀ f^(ℓ)(x) + w_j0^(ℓ) ),    (10)

where w_j^(ℓ) = [w_j1^(ℓ), w_j2^(ℓ), ..., w_jW_ℓ^(ℓ)]ᵀ and f^(ℓ)(x) = [f_1^(ℓ)(x), f_2^(ℓ)(x), ..., f_W_ℓ^(ℓ)(x)]ᵀ. The input layer, f^(0)(x), is simply x. g(·) is a nonlinear, nondecreasing mapping, which causes ANNs to yield a nonlinear decision boundary. Two popular mappings are (1) the sigmoid, g(x) = 1/(1 + exp{−x}), and (2) the hyperbolic tangent, g(x) = tanh(x). More nodes and layers increase the nonlinearity of the decision boundary obtained by ANNs. However, it is difficult to train ANNs having many nodes and layers, since the model can easily fall into poor solutions, called local minima. Radial basis function (RBF) networks [16] are another type of ANN, having the form

a_k(x) = w_kᵀ Φ(x) + w_k0.    (11)

That is, RBF networks contain only one hidden layer, denoted by Φ(x) = [φ_1(x), φ_2(x), ..., φ_W(x)]ᵀ, and the network output is simply a linear combination of the hidden nodes. The main difference between RBF networks and ANNs with L = 1 is the mapping from the


input to the hidden layer. In RBF networks, each φ_j(·) is a nonlinear function similar to a Gaussian density:

φ_j(x) = exp{ −β_j ||x − c_j||² },    (12)

for some β_j > 0 and center vector c_j. That is, each hidden node represents a local region whose center is c_j, and its signal is stronger when x and c_j are closer. In general, c_j is fixed to one of the training points and β_j is chosen by hand, so the global optimum of w_k and w_k0 can be found simply by least-squares fitting.
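The hidden-layer mapping of Eq. (12) is easy to sketch; a minimal plain-Python version using a common width β for all centers (a simplifying assumption, since the text allows a β_j per center):

```python
import math

def rbf_features(x, centers, beta):
    """Hidden-layer activations phi_j(x) = exp(-beta * ||x - c_j||^2)
    (Eq. 12). The network output a_k(x) of Eq. (11) is then a linear
    combination of these activations plus a bias."""
    return [math.exp(-beta * sum((a - c) ** 2 for a, c in zip(x, cj)))
            for cj in centers]
```

Because only the output weights remain to be learned, fitting them reduces to ordinary least squares on these fixed features.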

Support Vector Machines

Similar to RBF networks, SVMs obtain a linear decision boundary in a transformed space: a_k(x) = w_kᵀ Φ(x) + w_k0, where Φ(·) is an arbitrary mapping, either linear or nonlinear. The difference between SVMs and ANNs is the optimality of the decision boundary. In SVMs, the optimal decision boundary is the one that maximizes the distance between the boundary and the closest point to it, called the margin:

max_{w_k, w_k0} min_i |a_k(x_i)| / ||w_k||.    (13)

This optimization problem always converges to the global solution, the maximum margin boundary. Figure 1 shows the motivation for SVMs intuitively. One can expect that the generalization error of the maximum margin boundary is less than that of other boundaries. Theoretically, the generalization power of SVMs is guaranteed by Vapnik–Chervonenkis theory [17]. Training SVMs can be rewritten as the following convex optimization problem:

min_{w_k, w_k0} ||w_k||,  subject to  y_ki a_k(x_i) ≥ 1 for all i,    (14)

where y_ki = 1 if x_i ∈ C_k and −1 otherwise. At the optimum, a_k(x) has the form

a_k(x) = Σ_{i=1}^{n} α_ki y_ki Φ(x_i)ᵀ Φ(x),    (15)

where α_ki ≥ 0 is the Lagrangian multiplier of the ith constraint, y_ki a_k(x_i) ≥ 1. If a data point x_i is exactly on the margin, i.e., y_ki a_k(x_i) = 1, then x_i is called a support vector and α_ki > 0. Otherwise, α_ki = 0 and y_ki a_k(x_i) > 1. Hence, a_k(x) only depends on the support vectors. To compute Φ(x_i)ᵀΦ(x), a function of the form k(x_i, x) representing the inner product in the feature space can be used, without computing the mapping Φ(·) explicitly. Such a function is called a kernel function [18]. Two popular kernel functions are (1) the polynomial kernel, k(x_i, x) = (x_iᵀx + c)^p for some c and p > 0, and (2) the Gaussian kernel (also called the RBF kernel), k(x_i, x) = exp{ −(1/(2σ²)) ||x_i − x||² } for some σ > 0. Various algorithms and implementations have been developed to train SVMs efficiently. The two most popular software packages are LIBSVM [19] and SVMlight [20]; both

Supervised Learning. Figure 1 (Left) Possible solutions obtained by neural networks. (Right) SVMs give one global solution, the maximum margin boundary.


implement several techniques such as working-set selection, shrinking heuristics, and LRU caching to speed up optimization, and provide various kernel functions while choosing appropriate parameters of those functions automatically (automatic model selection). Two recent extensions of SVMlight – SVMstruct for structured data, and SVMperf for training with more than hundreds of thousands of data points – are also popular in biometrics.
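The kernel-substituted decision value of Eq. (15) can be sketched as follows (a hypothetical toy setting: in practice the support vectors, multipliers α_i, and bias b come from a trainer such as LIBSVM):

```python
def svm_decision(x, support_vectors, sv_labels, alphas, b, kernel):
    """Decision value sum_i alpha_i y_i k(x_i, x) + b, i.e. Eq. (15) with
    the inner product Phi(x_i)^T Phi(x) replaced by a kernel; the sign of
    the returned value gives the predicted class."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, sv_labels, support_vectors)) + b

def linear_kernel(u, v):
    """k(u, v) = u^T v, which recovers a linear decision boundary."""
    return sum(ui * vi for ui, vi in zip(u, v))
```

Swapping `linear_kernel` for a Gaussian or polynomial kernel changes the boundary from linear to nonlinear without touching the decision rule itself.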

Related Entries

▶ Classifier Design
▶ Machine-Learning
▶ Probability Distribution

References

1. Jain, A.K., Ross, A., Prabhakar, S.: An introduction to biometric recognition. IEEE Trans. Circuits Syst. Video Technol. 14(1), 4–20 (2004)
2. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 4–37 (2000)
3. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, New York (2001)
4. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
5. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2001)
6. Hastie, T., Tibshirani, R.: Discriminant analysis by Gaussian mixtures. J. R. Stat. Soc. Ser. B 58, 158–176 (1996)
7. Friedman, J.H.: Regularized discriminant analysis. J. Am. Stat. Assoc. 84, 165–175 (1989)
8. Ye, J., Wang, T.: Regularized discriminant analysis for high dimensional, low sample size data. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Philadelphia, PA (2006)
9. Hoffbeck, J.P., Landgrebe, D.A.: Covariance matrix estimation and classification with limited training data. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 763–767 (1996)
10. Geisser, S.: Predictive Inference: An Introduction. Chapman & Hall, New York (1993)
11. Srivastava, S., Gupta, M.R., Frigyik, B.A.: Bayesian quadratic discriminant analysis. J. Mach. Learn. Res. 8, 1277–1305 (2007)
12. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory IT-13, 21–27 (1967)
13. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by back-propagating errors. Nature 323, 533–536 (1986)
14. Boser, B.E., Guyon, I., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
15. Shakhnarovich, G., Darrell, T., Indyk, P.: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, Cambridge, MA (2006)
16. Moody, J., Darken, C.J.: Fast learning in networks of locally tuned processing units. Neural Comput. 1, 281–294 (1989)
17. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
18. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)
19. Chang, C.C., Lin, C.J.: LIBSVM – A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm (2000)
20. Joachims, T.: SVMlight, http://svmlight.joachims.org (2004)

Supervisor

A generic term for a method or a system that is able to output an aggregated opinion.

▶ Multiple Experts

Supervisor Opinion The output of the supervisor which can be a strict score (0 or 1) or a graded score (2 [0, 1]) representing the belief of the supervisor on an identity claim by aggregating expert opinions. ▶ Multiple Experts

Support Vector Machine

Mathias M. Adankon, Mohamed Cheriet
University of Quebec ETS, Montreal, Canada

Synonyms SVM; Margin classifier; Maximum margin classifier; Optimal hyperplane


Definition Support vector machines (SVMs) are particular linear ▶ classifiers which are based on the margin maximization principle. They perform ▶ structural risk minimization, which improves the complexity of the classifier with the aim of achieving excellent ▶ generalization performance. The SVM accomplishes the classification task by constructing, in a higher dimensional space, the hyperplane that optimally separates the data into two categories.

Introduction

Considering a two-category classification problem, a linear classifier separates the space with a hyperplane into two regions, each of which is also called a class. Before the creation of SVMs, the popular algorithm for determining the parameters of a linear classifier was the single-neuron perceptron. The perceptron algorithm uses an updating rule to generate a separating surface for a two-class problem. The procedure is guaranteed to converge when the ▶ training data are linearly separable; however, there exist infinitely many hyperplanes that correctly classify these data (see Fig. 1). The idea behind the SVM is to select the hyperplane that provides the best generalization capacity. The SVM algorithm therefore attempts to find the

maximum margin between the two data categories and then determines the hyperplane that is in the middle of the maximum margin. Thus, the points nearest the decision boundary are located at the same distance from the optimal hyperplane. In machine learning theory, it is demonstrated that the margin maximization principle provides the SVM with a good generalization capacity, because it minimizes the structural risk related to the complexity of the SVM [1].

SVM Formulation

Consider a dataset {(x_1, y_1), ..., (x_ℓ, y_ℓ)} with x_i ∈ R^d and y_i ∈ {−1, 1}. SVM training attempts to find the parameters w and b of the linear decision function f(x) = w·x + b defining the optimal hyperplane. The points nearest the decision boundary define the margin. Considering two points x_1, x_2 on opposite sides of the margin with f(x_1) = 1 and f(x_2) = −1, the margin equals [f(x_1) − f(x_2)]/||w|| = 2/||w||. Thus, maximizing the margin is equivalent to minimizing ||w||/2 or ||w||²/2. Then, to find the optimal hyperplane, the SVM solves the following optimization problem:

min_{w,b} (1/2) wᵀw
s.t. y_i(wᵀx_i + b) ≥ 1, ∀ i = 1, ..., ℓ    (1)
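The quantities in problem (1) can be checked numerically; a minimal sketch (plain Python) of the margin 2/||w|| and the primal feasibility constraints:

```python
import math

def margin(w):
    """Geometric margin 2/||w|| between the two supporting hyperplanes."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def feasible(w, b, points, labels):
    """Check the constraints y_i (w.x_i + b) >= 1 of problem (1)."""
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1.0
               for x, y in zip(points, labels))
```

Shrinking ||w|| enlarges the margin but eventually violates the constraints, which is exactly the trade-off that problem (1) optimizes.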

Support Vector Machine. Figure 1 Linear classifier: In this case, there exists an infinite number of solutions. Which is the best?


The transformation of this optimization problem into its corresponding dual problem gives the following quadratic problem:

max_α  Σ_{i=1}^{ℓ} α_i − (1/2) Σ_{i,j=1}^{ℓ} α_i α_j y_i y_j x_i·x_j    (2)
s.t.  Σ_{i=1}^{ℓ} y_i α_i = 0,  α_i ≥ 0, ∀ i = 1, ..., ℓ,

where wᵀ denotes the transpose of w. The solution of the previous problem gives the parameter w = Σ_{i=1}^{ℓ} y_i α_i x_i of the optimal hyperplane. Thus, the decision function becomes f(x) = Σ_{i=1}^{ℓ} α_i y_i (x_i·x) + b in the dual space. Note that the value of the bias b does not appear in the dual problem. Using the constraints of the primal problem, the bias is given by b = −(1/2)[max_{y=−1}(w·x_i) + min_{y=1}(w·x_i)]. It follows from the Karush–Kuhn–Tucker conditions that only the examples x_i that satisfy y_i(w·x_i + b) = 1 have non-zero corresponding α_i. These examples are called support vectors (see Fig. 2).

SVM in Practice

In real-world problems, the data are not linearly separable, and so a more sophisticated SVM is used to solve


them. First, slack variables are introduced in order to relax the margin (this is called soft margin optimization). Second, the kernel trick is used to produce nonlinear boundaries [2]. The idea behind kernels is to map the training data nonlinearly into a higher-dimensional feature space via a mapping function Φ and to construct a separating hyperplane which maximizes the margin (see Fig. 3). The construction of the linear decision surface in this feature space only requires the evaluation of dot products φ(x_i)·φ(x_j) = k(x_i, x_j), where the application k : R^d × R^d → R is called the kernel function [3, 4]. The decision function given by an SVM is:

y(x) = sign[wᵀφ(x) + b],    (3)

where w and b are found by solving the following optimization problem, which expresses the maximization of the margin 2/||w|| and the minimization of the training error:

min_{w,b,ξ} (1/2) wᵀw + C Σ_{i=1}^{ℓ} ξ_i   (L1-SVM), or
min_{w,b,ξ} (1/2) wᵀw + C Σ_{i=1}^{ℓ} ξ_i²  (L2-SVM),    (4)

subject to: y_i[wᵀφ(x_i) + b] ≥ 1 − ξ_i, ∀ i = 1, ..., ℓ,    (5)


Support Vector Machine. Figure 2 SVM principle: illustration of the unique and optimal hyperplane in a two-dimensional input space based on margin maximization.


Support Vector Machine. Figure 3 Illustration of the kernel trick: The data are mapped into a higher-dimensional feature space, where a separating hyperplane is constructed using the margin maximization principle. The hyperplane is computed using the kernel function without the explicit expression of the mapping function. (a) Nonlinearly separable data in the input space. (b) Data in the higher-dimensional feature space.

$$\xi_i \geq 0 \quad \forall i = 1, \ldots, \ell. \qquad (6)$$

By applying the Lagrangian differentiation theorem to the corresponding dual problem, the following decision function is obtained:

$$y(x) = \mathrm{sign}\left[\sum_{i=1}^{\ell} \alpha_i y_i k(x_i, x) + b\right], \qquad (7)$$

with $\alpha$ a solution of the dual problem. The dual problem for the L1-SVM is the following quadratic optimization problem:

$$\text{maximize:} \quad W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \qquad (8)$$

$$\text{subject to:} \quad \sum_{i=1}^{\ell} \alpha_i y_i = 0 \quad \text{and} \quad 0 \leq \alpha_i \leq C, \ i = 1, \ldots, \ell. \qquad (9)$$

Support Vector Machine. Table 1 Common kernels used with the SVM

Gaussian (RBF): $k(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$
Polynomial: $k(x, y) = (a\, x \cdot y + b)^n$
Laplacian: $k(x, y) = \exp(-a\|x - y\| + b)$
Multi-quadratic: $k(x, y) = (a\|x - y\| + b)^{1/2}$
Inverse multi-quadratic: $k(x, y) = (a\|x - y\| + b)^{-1/2}$
KMOD: $k(x, y) = a\left[\exp\left(\dfrac{\gamma^2}{\|x - y\|^2 + \sigma^2}\right) - 1\right]$

Using the L2-SVM, the dual problem becomes:

$$\text{maximize:} \quad W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j \left[ k(x_i, x_j) + \frac{1}{2C}\delta_{ij} \right] \qquad (10)$$

$$\text{subject to:} \quad \sum_{i=1}^{\ell} \alpha_i y_i = 0 \quad \text{and} \quad 0 \leq \alpha_i, \ i = 1, \ldots, \ell, \qquad (11)$$

where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise.

In practice, the L1-SVM is used most of the time, and its popular implementation developed by Joachims [5] is very fast and scales to large datasets. This implementation, called SVMlight, is available at svmlight.joachims.org.

SVM Model Selection

To achieve good SVM performance, optimum values for the kernel parameters and for the hyperparameter C must be chosen. The latter is a regularization parameter controlling the trade-off between training error minimization and margin maximization. The kernel parameters define the kernel function used to map data into a higher-dimensional feature space (see Table 1). Examples of kernel functions are the Gaussian kernel $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2/\sigma^2)$ with parameter $\sigma$ and


Support Vector Machine. Figure 4 (a) and (b) show the impact of SVM hyperparameters on classifier generalization, while (c) illustrates the influence of the choice of kernel function.

the polynomial kernel $k(x_i, x_j) = (a\, x_i' x_j + b)^d$ with parameters a, b, and d. The task of selecting the hyperparameters that yield the best performance of the machine is called model selection [6, 7, 8, 9]. As an illustration, Fig. 4a shows the variation of the error rate on a validation set versus the variation of the Gaussian kernel parameter with a fixed value of C, and Fig. 4b shows the variation of the error rate on the validation set versus the variation of the hyperparameter C with a fixed value of the RBF kernel parameter. In each case, the binary problem described by the "Thyroid" data taken from the UCI benchmark is resolved. Clearly, the best performance is achieved with an optimum choice of the kernel parameter and of C. With the SVM, as with other kernel classifiers, the choice of kernel corresponds to choosing a function space for learning, since the kernel determines the functional form of all possible solutions. Thus, the choice of kernel is very important in the construction of a good machine. In order to obtain good performance from the SVM classifier, one first needs to design or choose a type of kernel, and then optimize the SVM's hyperparameters to improve the generalization capacity of the classifier. Figure 4c illustrates the influence of the kernel choice, where the RBF and polynomial kernels are compared on datasets taken from the challenge website on model selection and prediction organized by Isabelle Guyon.
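The validation-based selection of C and the kernel parameter described above can be sketched as a grid search; the `scores` dictionary below is a hypothetical stand-in for actual SVM training and validation runs:

```python
import itertools

def grid_search(train_and_eval, Cs, sigmas):
    """Return the (C, sigma) pair with maximal validation accuracy.
    train_and_eval(C, sigma) is assumed to train an SVM with those
    hyperparameters and return its accuracy on a held-out validation set."""
    return max(itertools.product(Cs, sigmas),
               key=lambda p: train_and_eval(*p))

# Hypothetical validation accuracies standing in for real training runs:
scores = {(1, 0.5): 0.85, (1, 2.0): 0.88, (10, 0.5): 0.95, (10, 2.0): 0.90}
best_C, best_sigma = grid_search(lambda C, s: scores[(C, s)], [1, 10], [0.5, 2.0])
print(best_C, best_sigma)  # 10 0.5
```

In practice the grid is often searched on a logarithmic scale for both C and $\sigma$, exactly as suggested by the error-rate curves of Fig. 4a and 4b.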

Resolution of Multiclass Problems with the SVM

The SVM is formulated for the binary classification problem. However, there are some techniques used to

combine several binary SVMs in order to build a system for the multiclass problem (e.g., a 10-class digit recognition problem). Two popular methods are presented here:

One Versus the Rest: The idea of one versus the rest is to construct as many SVMs as there are classes, where each SVM is trained to separate one class from the rest. Thus, for a c-class problem, c SVMs are built and combined to perform multiclass classification according to the maximal output. The ith SVM is trained with all the examples in the ith class with positive labels, and all the other examples with negative labels. This is also known as the One-Against-All method.

Pairwise (or One Against One): The idea of pairwise is to construct c(c − 1)/2 SVMs for a c-class problem, each SVM being trained for every possible pair of classes. A common way to make a decision with the pairwise method is by voting. A rule for discriminating between every pair of classes is constructed, and the class with the largest vote is selected.
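The two decision rules can be sketched as follows; the score vector and the pairwise winners are hypothetical inputs, as would be produced by already-trained binary SVMs:

```python
import numpy as np

def one_vs_rest_predict(scores):
    """scores[i] = real-valued output of the SVM trained as 'class i vs rest';
    predict the class with maximal output."""
    return int(np.argmax(scores))

def pairwise_predict(pair_winners, c):
    """pair_winners[(i, j)] in {i, j}: the class chosen by the SVM trained
    on the pair i < j. Predict by majority vote over the c(c-1)/2 SVMs."""
    votes = [0] * c
    for winner in pair_winners.values():
        votes[winner] += 1
    return votes.index(max(votes))

print(one_vs_rest_predict([-1.2, 0.3, -0.5]))                  # class 1
print(pairwise_predict({(0, 1): 1, (0, 2): 2, (1, 2): 1}, 3))  # class 1
```

For c = 3 the pairwise scheme needs 3(3 − 1)/2 = 3 binary SVMs, matching the three dictionary entries above.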

SVM Variants

The least squares SVM (LS-SVM) is a variant of the standard SVM, and constitutes the response to the following question: How much can the SVM formulation be simplified without losing any of its advantages? Suykens and Vandewalle [10] proposed the LS-SVM, whose training algorithm solves a convex problem like the SVM. In addition, the training algorithm of the LS-SVM is simplified, since a linear system is solved instead of the quadratic problem of the standard SVM.
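A minimal sketch of LS-SVM training as a single linear system, following the commonly used formulation of Suykens and Vandewalle; the toy data, the RBF kernel, and the parameter values are illustrative assumptions:

```python
import numpy as np

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """LS-SVM training reduces to one linear system (no quadratic program).
    Returns (alpha, b) for the classifier sign(sum_i alpha_i y_i k(x, x_i) + b)."""
    n = len(y)
    # RBF Gram matrix.
    K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2) / sigma ** 2)
    Omega = np.outer(y, y) * K
    # Bordered system: [[0, y'], [y, Omega + I/gamma]] [b; alpha] = [0; 1].
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(n))))
    return sol[1:], sol[0]

def lssvm_predict(x, X, y, alpha, b, sigma=1.0):
    kx = np.exp(-np.sum((X - x) ** 2, axis=1) / sigma ** 2)
    return np.sign(np.sum(alpha * y * kx) + b)

X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [1.9, 2.2]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha, b = lssvm_train(X, y)
print(lssvm_predict(np.array([0.1, 0.0]), X, y, alpha, b))  # -1.0
print(lssvm_predict(np.array([2.1, 2.0]), X, y, alpha, b))  # 1.0
```

Note the simplification relative to the standard SVM: one call to a linear solver replaces the quadratic program, at the cost of losing sparsity (every training point receives a non-zero $\alpha_i$).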


The Transductive SVM (TSVM) is an interesting version of the SVM, which uses transductive inference. In this case, the TSVM attempts to find the hyperplane and the labels of the test data that maximize the margin with minimum error. Thus, the label of the test data is obtained in one step. Vapnik [1] proposed this formulation to reinforce the classifier on the test set by adding the minimization of the error on the test set during the training process. This formulation has been used elsewhere recently for training semi-supervised SVMs.

Applications

The SVM is a powerful classifier which has been used successfully in many pattern recognition problems, and it has also been shown to perform well in biometric recognition applications. For example, in [11], an iris recognition system for human identification was proposed, in which the extracted iris features are fed into an SVM for classification. The experimental results show that the performance of the SVM as a classifier is far better than that of a classifier based on an artificial neural network. In another example, Yao et al. [12], in a fingerprint classification application, used recursive neural networks to extract a set of distributed features of the fingerprint which can be integrated into the SVM. Many other SVM applications, like handwriting recognition [8, 13], can be found at www.clopinet.com/isabelle/Projects/SVM/applist.html.

Related Entries

▶ Classifier
▶ Generalization
▶ Structural Risk
▶ Training

References

1. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998)
2. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
3. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA (2002)
4. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press (2000)
5. Joachims, T.: Making large-scale support vector machine learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA (1998)
6. Chapelle, O., Vapnik, V.: Model selection for support vector machines. In: Advances in Neural Information Processing Systems (1999)
7. Ayat, N.E., Cheriet, M., Suen, C.Y.: Automatic model selection for the optimization of SVM kernels. Pattern Recognit. 38(10), 1733–1745 (2005)
8. Adankon, M.M., Cheriet, M.: Optimizing resources in model selection for support vector machines. Pattern Recognit. 40(3), 953–963 (2007)
9. Adankon, M.M., Cheriet, M.: New formulation of SVM for model selection. In: IEEE International Joint Conference on Neural Networks 2006, pp. 3566–3573. Vancouver, BC (2006)
10. Suykens, J.A.K., Van Gestel, T., De Brabanter, J., De Moor, B., Vandewalle, J.: Least Squares Support Vector Machines. World Scientific, Singapore (2002)
11. Roy, K., Bhattacharya, P.: Iris recognition using support vector machine. In: IAPR International Conference on Biometric Authentication (ICBA), Hong Kong, January 2006. Springer Lecture Notes in Computer Science (LNCS), vol. 3882, pp. 486–492 (2006)
12. Yao, Y., Marcialis, G.L., Pontil, M., Frasconi, P., Roli, F.: Combining flat and structured representations for fingerprint classification with recursive neural networks and support vector machines. Pattern Recognit. 36(2), 397–406 (2003)
13. Matic, N., Guyon, I., Denker, J., Vapnik, V.: Writer adaptation for on-line handwritten character recognition. In: IEEE Second International Conference on Pattern Recognition and Document Analysis, pp. 187–191. Tsukuba, Japan (1993)

Surface Curvature

Measurements of the curvature of a surface are commonly used in 3D biometrics. The normal curvature at a point p on the surface is defined as the curvature of the curve that is formed by the intersection of the surface with the plane containing the normal vector and one of the tangent vectors at p. Thus the normal curvature is a function of the tangent vector direction. The minimum and maximum values of this function are the principal curvatures k1 and k2 of the surface


at p. Other measures of surface curvature are the Gaussian curvature, defined as the product of the principal curvatures, the mean curvature, defined as the average of the principal curvatures, and the shape index given by

$$SI = \frac{2}{\pi} \arctan\left(\frac{k_2 + k_1}{k_2 - k_1}\right)$$

Computation of surface curvature on discrete surfaces such as those captured with 3D scanners is usually accomplished by locally fitting low order surface patches (e.g. biquadratic surfaces, splines) over each point. Then the above curvature features may be computed analytically. ▶ Finger Geometry, 3D
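Once a quadratic patch $z = ax^2 + bxy + cy^2$ has been fitted at a point (tangent plane at the origin, normal along z, as described above), the principal curvatures at that point are the eigenvalues of the Hessian of z; a small sketch under that assumption, with the saddle patch as a hypothetical example:

```python
import numpy as np

def principal_curvatures_quadratic(a, b, c):
    """For a fitted patch z = a x^2 + b x y + c y^2 (tangent plane z = 0 at
    the origin), the principal curvatures at the origin are the eigenvalues
    of the Hessian of z: [[2a, b], [b, 2c]]."""
    k = np.linalg.eigvalsh(np.array([[2 * a, b], [b, 2 * c]]))
    return k[0], k[1]  # sorted so that k1 <= k2

def shape_index(k1, k2):
    """Shape index SI = (2/pi) arctan((k2 + k1) / (k2 - k1)), for k2 > k1."""
    return (2.0 / np.pi) * np.arctan((k2 + k1) / (k2 - k1))

# Saddle patch z = (x^2 - y^2)/2: k1 = -1, k2 = 1, shape index 0.
k1, k2 = principal_curvatures_quadratic(0.5, 0.0, -0.5)
print(k1, k2, shape_index(k1, k2))  # -1.0 1.0 0.0
```

The Gaussian curvature k1·k2 and the mean curvature (k1 + k2)/2 follow directly from the same pair of eigenvalues.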

Surface Matching

3D biometrics work by computing the similarity between 3D surfaces of objects belonging to the same class. The majority of the techniques used measure the similarity among homologous salient geometric features on the surfaces (e.g., based on curvature). The localization of these features is usually based on prior knowledge of the surface class (e.g., face, hand) and thus, specialized feature detectors may be used. The geometric attributes extracted are selected so that they are invariant to transformations such as rotation, translation, and scaling. In the case that knowledge-based feature detection is difficult, a correspondence among the surfaces may be established by randomly selecting points on the two surfaces and then trying to find pairs of points with similar geometric attributes. Several such techniques have been developed for rigid surface matching (e.g., Spin Images), which may be extended for matching non-rigid or articulated surfaces. Another technique for establishing correspondences is fitting a parameterized deformable model to the points of each surface. Since the fitted models are deformations of the same surface, correspondence is automatically determined. Creation of such deformable models, however, requires a large number of annotated training data.

▶ Finger Geometry, 3D


Surveillance

Rama Chellappa, Aswin C. Sankaranarayanan
University of Maryland, College Park, MD, USA

Synonyms

Monitoring; Surveillance

Definition

Surveillance refers to monitoring of a scene along with analysis of the behavior of the people and vehicles in it, for the purpose of maintaining security or keeping watch over an area. Typically, traditional surveillance involves monitoring of a scene using one or more closed-circuit television (CCTV) cameras, with personnel watching and making decisions based on the video feeds obtained from the ▶ cameras. There is a growing need for systems that are completely automated or operate with minimal human supervision. Biometric acquisition and processing is by far the most important component of any automated surveillance system. There are many challenges and variates that show up in the acquisition of biometrics for robust verification. Further, in surveillance, behavioral biometrics is also of potential use in many scenarios. Using the patterns observed in a scene (such as faces, speech, behavior), the system decides on a set of actions to perform. These actions could involve access control (allowing/denying access to facilities), alerting to the presence of intruders/abandoned luggage, and a host of other security-related tasks.

Introduction

Surveillance refers to monitoring a scene using sensors for the purposes of enhanced security. Surveillance systems are becoming ubiquitous, especially in urban areas, with the growing deployment of cameras and CCTV for providing security in public areas such as banks, shopping malls, etc. It is estimated that the UK alone has more than four million CCTV cameras. Surveillance technologies are also becoming common for


other applications [1] such as traffic monitoring, wherein they are mainly used for detecting violations. Typically, video cameras are finding use for detecting congestion and accidents, and in adaptive switching of traffic lights. Other typical surveillance tasks include portal control, monitoring shoplifting, and suspect tracking, as well as post-event analysis [2]. A traditional surveillance system involves little automation. Most surveillance systems have a set of cameras monitoring a scene of interest. Data collected from these sensors are used for two purposes:

1. Real-time monitoring of the scene by human personnel.
2. Archiving of data for retrieval in the future.

In most cases, the archived data is only retrieved after an incident has occurred. This, however, is changing with the introduction of many commercial surveillance technologies that introduce more automation, thereby alleviating the need for or reducing the involvement of humans in the decision-making process [3]. Simultaneously, the focus has also been on visualization tools for better depiction of the data collected by the sensors, and on fast retrieval of archived data for quick forensic analysis. Surveillance systems that can detect elementary events in the video streams acquired by several cameras are commercially available today. A very general surveillance system is shown schematically in Fig. 1.

Biometrics form a critical component in all (semi-)automated surveillance systems, given the obvious need to acquire, validate, and process biometrics in various surveillance tasks. Such tasks include:

1. Verification. Validating a person's identity is useful in access control. Typically, verification can be done in a controlled manner, and can use active biometrics such as iris, face (controlled acquisition), speech, or finger/hand prints. The system is expected to use the biometrics to confirm whether the person is truly whom he/she claims to be.

2. Recognition. Recognition of identity shows up in tasks of intruder detection and screening, which find use in a wide host of scenarios from scene monitoring to home surveillance. This involves cross-checking the acquired biometrics across a list to obtain a match. Typically, for such tasks, passive acquisition methods are preferred, making face and gait biometrics useful.

3. Abnormality detection. Behavioral biometrics find use in surveillance of public areas, such as airports and malls, where the abnormal/suspicious behavior exhibited by a single individual or a group of individuals forms the biometric of interest.

Biometrics thus finds application across a wide range of surveillance tasks. We next discuss the variates and trade-offs involved in using biometrics for surveillance.

Surveillance. Figure 1 Inputs from sensors are typically stored on capture. The relevant information is searched and retrieved only after incidents. However, in more automated systems the inputs are pre-processed for events. The system monitors for certain patterns to occur, which initiate the appropriate action. When multiple sensors are present, data across sensors might be fused for additional robustness.


Biometrics and Surveillance

The choice of biometric to be used in a particular task depends on the match between the acquisition and processing capability of the biometric and the requirements of the task. Such characteristics include the discriminative power of the biometric, ease of acquisition, the permanence of the biometric, and miscellaneous considerations such as acceptability of its use and ▶ privacy concerns [4, 5]. Towards this end, we discuss some of the important variates that need to be considered in biometric surveillance.

1. Cooperative acquisition. Ease of acquisition is probably the most important consideration for the use of a particular biometric. Consider the task of home surveillance, where the system tries to detect intruders by comparing the acquired biometric signature to a database of individuals. It is not possible in such a task to use the iris as a biometric, because acquisition of the iris pattern requires the cooperation of the subject. Similarly, for the same task, it is also unreasonable to use controlled face recognition (with known pose and illumination) as a possible biometric, for similar reasons. Using the cooperation of the subject as a basis allows us to classify biometrics into two kinds: cooperative and non-cooperative. Fingerprints, hand prints, speech (controlled), face (controlled), iris, ear, and DNA are biometrics that need the active cooperation of the subject for acquisition. These biometrics, given the cooperative nature of acquisition, can be collected reliably under a controlled setup. Such controls could be a known sentence for speech, or a known pose and favorable illumination for face. Further, the subject could be asked for multiple samples of the same biometric for increased robustness to acquisition noise and errors. In return, it is expected that the biometric performs with increased reliability, with lower false alarms and fewer mis-detections.
However, the cooperative nature of acquisition makes these biometrics unusable for a variety of operating tasks. Nonetheless, such biometrics are extremely useful for a wide range of tasks, such as secure access control, and for controlled verification tasks such as those related to passports and other identification-related documents. In contrast, acquisition of the biometric without the cooperation of the subject(s) is necessary for surveillance of regions with partially or completely unrestricted access, wherein the sheer number of subjects


involved does not merit the use of active acquisition. Non-cooperative biometrics are also useful in surveillance scenarios requiring the use of behavioral biometrics, as the use of active acquisition methods might inherently affect the very behavior that we want to detect. Face and gait are probably the best examples of such biometrics.

2. Inherent capability of discrimination. Each biometric, depending on its inherent variations across subjects and its intra-subject variations for each individual, has limitations on the size of the dataset with which it can be used before its operating characteristics (false alarm and mis-detection rates) go below acceptable limits. DNA, iris, and fingerprint provide robust discrimination even when the number of individuals in the database is in the tens of thousands. Face (under controlled acquisition) can recognize robustly, with low false alarms and mis-detections, up to datasets containing many hundreds of individuals. However, the performance of face as a biometric degrades steeply with uncontrolled pose, illumination, and other effects such as aging, disguise, and emotions. Gait as a biometric provides performance capabilities similar to those of face under uncontrolled acquisition. However, as stated above, both face and gait can be captured without the cooperation of the subject, which makes them invaluable for certain applications. Their use also critically depends on the size of the database that is used.

3. Range of operation. Another criterion that becomes important in the practical deployment of systems using biometrics is the range at which acquisition can be performed. Gait, as an example, works with the human silhouette as the basic building block, and can be reliably captured at ranges up to 100 m (assuming a common deployment scenario). In contrast, fingerprint needs contact between the subject and the sensor. Similarly, iris requires the subject to be at much closer proximity than what is required for face.

4. Miscellaneous considerations. There exist a host of other considerations that decide the suitability of a biometric for a particular surveillance application. These include the permanence of the biometric, security considerations such as the ease of imitating or tampering with it, and privacy considerations in its acquisition and use [4, 5]. For example, the permanence of face as a biometric depends on the degradation of its discriminating capabilities as the subject ages [6, 7].


Similarly, the wear of fingerprints with use becomes an issue for consideration. Finally, privacy considerations play an important role in the acceptability of the use of biometrics in commercial systems.

Behavioral Biometrics in Surveillance

Behavioral biometrics are very important for surveillance, especially towards identifying critical events before or as they happen. In general, the visual modality (cameras) is most useful for capturing behavioral information, although there has been some preliminary work on using motion sensors for similar tasks. In the presence of a camera, the processing of data to obtain such biometrics falls under the category of event detection. In the context of surveillance systems, these can be broadly divided into those that model actions of single objects and those that handle multi-object interactions. In the case of single objects, an understanding of the activity (behavior) being performed is of immense interest. Typically, the object is described in terms of a feature vector [8] whose representation is suitable for identifying the activities while marginalizing nuisance parameters such as the identity of the object or the view and illumination. Stochastic models such as Hidden Markov Models and Linear Dynamical Systems have been shown to be efficient in modeling activities. In these, the temporal dynamics of the activity are captured using state-space models, which form a generative model for the activity. Given a test activity, it is possible to evaluate the likelihood of the test sequence arising from the learnt model. Capturing the behavioral patterns exhibited by multiple actors is of immense importance in many surveillance scenarios. Examples of such interactions include an individual exiting a building and driving

a car, or an individual casing vehicles. Many other scenarios, such as abandoned vehicles and dropped objects, fit under this category. Such interactions can be modeled using context-free grammars [9, 10] (Fig. 2). Detection and tracking data are typically parsed by the rules describing the grammar, and the likelihood of the particular sequence of tracking information conforming to the grammar is estimated. Other approaches rely on motion analysis of the humans accompanying the abandoned objects. The challenges in the use of behavioral biometrics for surveillance tasks lie in making algorithms robust to variations in pose, illumination, and identity. There is also the need to bridge the gap between the tools for representation and processing used for identifying biometrics exhibited by individuals and those exhibited by groups of people. In this context, motion sensors [11] provide an alternate way of capturing behavioral signatures of groups. Motion sensors register time instants when the sensor observes motion in its range. While this information is very sparse, without any ability to recognize people or disambiguate between multiple targets, a dense deployment of motion sensors along with cameras can be very powerful.
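The likelihood evaluation for a learnt activity model, described above for Hidden Markov Models, can be sketched with discrete observations and the scaled forward algorithm; the two "activity" models below are illustrative assumptions, not from the text:

```python
import numpy as np

def forward_log_likelihood(A, B, pi, obs):
    """Scaled forward algorithm: log p(obs | HMM), with state transition
    matrix A, discrete emission matrix B, and initial distribution pi."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_p += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_p

# Two hypothetical two-state activity models sharing the same dynamics;
# model 1 has informative emissions, model 2 emits symbols uniformly.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
B1 = np.array([[0.9, 0.1], [0.1, 0.9]])
B2 = np.array([[0.5, 0.5], [0.5, 0.5]])
obs = [0, 0, 0, 0, 1, 1, 1, 1]

ll1 = forward_log_likelihood(A, B1, pi, obs)
ll2 = forward_log_likelihood(A, B2, pi, obs)
print(ll1, ll2)  # the structured model explains the sequence better
```

Classifying a test sequence then amounts to picking the learnt activity model with the highest log-likelihood.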

Conclusion

In summary, biometrics are an important component of automated surveillance, and help in the tasks of recognition and verification of a target's identity. Such tasks find application in a wide range of surveillance applications. The use of a particular biometric for a surveillance application depends critically on the match between the properties of the biometric and the needs of the application. In particular, attributes

Surveillance. Figure 2 Example frames from a detected casing incident in a parking lot. The algorithm described in [10] was used to detect the casing incident.


such as ease of acquisition, range of acquisition, and discriminating power form important considerations in the choice of biometric used. In surveillance, behavioral biometrics are useful in identifying suspicious behavior, and find use in a range of scene monitoring applications.

Related Entries

▶ Border Control
▶ Law Enforcement
▶ Physical Access Control
▶ Face Recognition, Video Based


References

1. Remagnino, P., Jones, G.A., Paragios, N., Regazzoni, C.S.: Video-Based Surveillance Systems: Computer Vision and Distributed Processing. Kluwer, Dordrecht (2001)
2. Zhao, W., Chellappa, R., Phillips, P., Rosenfeld, A.: Face recognition: a literature survey. ACM Comput. Surv. 35, 399–458 (2003)
3. Shu, C., Hampapur, A., Lu, M., Brown, L., Connell, J., Senior, A., Tian, Y.: IBM smart surveillance system (S3): an open and extensible framework for event based surveillance. In: IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 318–323 (2005)
4. Prabhakar, S., Pankanti, S., Jain, A.: Biometric recognition: security and privacy concerns. IEEE Security & Privacy Magazine 1, 33–42 (2003)
5. Liu, S., Silverman, M.: A practical guide to biometric security technology. IT Professional 3, 27–32 (2001)
6. Ramanathan, N., Chellappa, R.: Face verification across age progression. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2 (2005)
7. Ling, H., Soatto, S., Ramanathan, N., Jacobs, D.: A study of face recognition as people age. In: IEEE 11th International Conference on Computer Vision (ICCV 2007), pp. 1–8 (2007)
8. Veeraraghavan, A., Roy-Chowdhury, A.K., Chellappa, R.: Matching shape sequences in video with applications in human movement analysis. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1896–1909 (2005)
9. Moore, D., Essa, I.: Recognizing multitasked activities from video using stochastic context-free grammar. In: Workshop on Models versus Exemplars in Computer Vision (2001)
10. Joo, S., Chellappa, R.: Recognition of multi-object events using attribute grammars. In: IEEE International Conference on Image Processing, pp. 2897–2900 (2006)
11. Wren, C., Ivanov, Y., Leigh, D., Westhues, J.: The MERL motion detector dataset: 2007 Workshop on Massive Datasets (Technical report)

SVM

▶ Support Vector Machine

SVM Supervector

An SVM (Support Vector Machine) is a two-class classifier. It is constructed from sums of a kernel function K(·,·):

$$f(x) = \sum_{i=1}^{L} \alpha_i t_i K(x, x_i) + d, \qquad (1)$$

where the $t_i$ are the ideal outputs ($-1$ for one class and $+1$ for the other class) and $\sum_{i=1}^{L} \alpha_i t_i = 0$ ($\alpha_i > 0$). The vectors $x_i$ are the support vectors (belonging to the training vectors) and are obtained by using an optimization algorithm. A class decision is based upon the value of f(x) with respect to a threshold. The kernel function is constrained to verify the Mercer condition:

$$K(x, y) = b(x)'\, b(y),$$

where b(x) is a mapping from the input space (containing the vectors x) to a possibly infinite-dimensional SVM expansion space. In the case of speaker verification, given a universal background model (GMM-UBM):

$$g(x) = \sum_{i=1}^{M} \omega_i\, \mathcal{N}(x; m_i, \Sigma_i), \qquad (2)$$

where the $\omega_i$ are the mixture weights, $\mathcal{N}(\cdot)$ is a Gaussian, and $(m_i, \Sigma_i)$ are the means and covariances of the Gaussian components. A speaker (s) model is a GMM obtained by adapting the UBM using the MAP procedure (only the means are adapted: $m_i^s$). In this case the kernel function can be written as:

$$K(s_1, s_2) = \sum_{i=1}^{M} \left( \sqrt{\omega_i}\, \Sigma_i^{-1/2} m_i^{s_1} \right)' \left( \sqrt{\omega_i}\, \Sigma_i^{-1/2} m_i^{s_2} \right). \qquad (3)$$

The kernel of the above equation is linear in the GMM supervector space and hence it satisfies the Mercer condition.

▶ Session Effects on Speaker Modeling
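The kernel of Eq. (3) can be computed by stacking the weighted, covariance-normalized adapted means into supervectors and taking their inner product; the two-component, 2-D toy GMM below (diagonal covariances) is an illustrative assumption:

```python
import numpy as np

def supervector(weights, covs, means):
    """Stack sqrt(w_i) Sigma_i^(-1/2) mu_i over all mixture components
    (diagonal covariances, given as vectors of variances)."""
    return np.concatenate([np.sqrt(w) * m / np.sqrt(c)
                           for w, c, m in zip(weights, covs, means)])

def supervector_kernel(weights, covs, means1, means2):
    """Linear GMM-supervector kernel: inner product of the two supervectors."""
    return supervector(weights, covs, means1) @ supervector(weights, covs, means2)

# Hypothetical UBM weights/covariances; only the means are speaker-adapted.
w = [0.4, 0.6]
covs = [np.array([1.0, 4.0]), np.array([2.0, 2.0])]  # diag of Sigma_i
mu_s1 = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
mu_s2 = [np.array([2.0, 0.0]), np.array([1.0, 1.0])]

print(supervector_kernel(w, covs, mu_s1, mu_s2))  # 1.1
```

Because the kernel is an ordinary inner product in the supervector space, a standard linear SVM can be trained directly on the stacked supervectors.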


Sweep Sensor

It refers to a fingerprint sensor on which the finger has to sweep over the platen during capture. Its capture area is very small, represented by a few pixel lines.

▶ Fingerprint, Palmprint, Handprint and Soleprint Sensor

Synthesis Attack

Synthesis attack is similar to replay attack in that it also involves the recording of voice samples from a legitimate client. However, these samples are used to build a model of the client's voice, which can in turn be used by a text-to-speech synthesizer to produce speech that is similar to the voice of the client. The text-to-speech synthesizer could then be controlled by an attacker, for example, by using the keyboard of a notebook computer, to produce any words or sentences that may be requested by the authentication system in the client's voice in order to achieve false authentication.

▶ Liveness Assurance in Face Authentication
▶ Liveness Assurance in Voice Authentication
▶ Security and Liveness, Overview

Synthetic Biometrics

▶ Biometric Sample Synthesis

Synthetic Fingerprint Generation

▶ Fingerprint Sample Synthesis
▶ SFinGe

Synthetic Fingerprints

▶ Fingerprint Sample Synthesis

Synthetic Iris Images

▶ Iris Sample Synthesis

Synthetic Voice Creation

▶ Voice Sample Synthesis

System-on-card

A smartcard that has a complete biometric verification system, including data acquisition, processing, and matching.

▶ On-Card Matching