SIViP DOI 10.1007/s11760-015-0834-9

ORIGINAL PAPER

Automatic computation of histogram threshold for lip segmentation using feedback of shape information

Ashley D. Gritzman1 · Vered Aharonson1,2 · David M. Rubin1 · Adam Pantanowitz1

Received: 12 July 2015 / Revised: 30 September 2015 / Accepted: 16 October 2015 © Springer-Verlag London 2015

Abstract Threshold-based segmentation methods provide a simple and efficient way to implement lip segmentation. However, automatic computation of robust thresholds presents a major challenge. This research proposes an adaptive method for selecting the histogram threshold, based on feedback of shape information. The proposed method reduces unnecessary overhead by first comparing the initial segmentation to a reference lip shape model to decide if optimisation is required. In cases where optimisation is required, the algorithm adjusts the threshold until the segmentation is sufficiently similar to a reference shape model. The algorithm is tested on the AR Face Database by comparing the segmentation accuracy before and after optimisation. The proposed method increases the number of segmentations classified as 'good' (overlap above 90 %) by 7.1 % absolute, and significantly improves the segmentation in challenging cases containing facial hair.

Keywords Lip segmentation · Shape model · Histogram threshold · Adaptive threshold optimisation · Automatic lip-reading · Facial analysis

Corresponding author: Ashley D. Gritzman, [email protected]

1 Biomedical Engineering Research Group, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, South Africa

2 Department of Electrical Engineering, Afeka, Tel Aviv Academic College of Engineering, Tel Aviv, Israel

1 Introduction

The shape and movement of the human lips convey valuable visual information which is used in various applications including automatic lip-reading (ALR), emotion recognition, biometric speaker identification, and virtual face animation. The image processing to extract visual information from the lips typically involves three stages: face detection, location of the region of interest (ROI), and lip segmentation. This research focuses on lip segmentation only, as the accuracy of this component is crucial to the performance of the overall system [1].

Lip segmentation presents a challenging image processing problem arising from three levels of variability. First, the inherent challenge of lip segmentation is variability in the speaker profile, including skin colour, lip colour, lip shape, facial hair, and makeup. Second, the contents of the ROI are not static, and the visibility of the teeth, tongue, and oral cavity changes as the lips move to form facial expressions and speech sounds. Finally, non-ideal environmental conditions including lighting, speaker orientation, and background create a third layer of complexity.

Lip segmentation techniques can be classified into two broad categories: colour-based techniques and model-based techniques. Colour-based techniques operate at pixel or neighbourhood level, and attempt to differentiate between lip and skin pixels based on colour features. Lip segmentation algorithms usually start by transforming the RGB image to an intensity image by applying a suitable colour transform. The comparison presented in our previous work [2] evaluates 33 colour transforms for enhancing the lip-skin contrast, including HSV, YCbCr, YIQ, CIELUV, CIELAB, and LUX. Wark et al. [3] and Chiou and Hwang [4] segment the lips by applying upper and lower limits to threshold the R/G channel. In a similar manner, Coianiz et al. [5] and Zhang and Mersereau [6] apply fixed thresholds to the H channel. While these techniques are simple and efficient, the major limitation is the automatic computation of robust thresholds [7]. Fixed thresholds cannot be generalised due to variability in speaker appearance and lighting conditions; hence, the threshold parameters must be calibrated for the specific speaker and environment. Furthermore, even after the initial calibration, appearance of the teeth, tongue, and oral cavity during movement of the mouth can significantly alter the image histogram and affect the threshold parameters. Another colour-based approach uses gradient filters (e.g. Sobel, Canny–Deriche, or the Prewitt operator) to extract the lip contour due to their efficacy in boundary detection [8–10]. However, the segmentation of gradient-based techniques is susceptible to false boundary edges.

The model-based approach uses prior knowledge of the lip shape to construct a lip model. The lip model is iteratively matched to the image by optimising a cost function. The three main techniques used to build lip models are: deformable templates [11,12], active shape models [13], and active contour models ('snakes') [14,15]. Active appearance models (AAMs) include both shape and appearance features in the statistical lip model [16,17]. Model-based techniques are usually invariant to translation, rotation, scale, and illumination; however, since the model is predefined, variation in speaker appearance and speaker sound formation can be challenging [18]. In addition, minimising a cost function can be computationally expensive, which may affect real-time performance [19].

Functional lip segmentation algorithms often combine elements from colour-based and model-based categories in so-called hybrid techniques [19]. These techniques attempt to reduce the computational burden of model-based techniques by first obtaining a rough estimate using colour-based techniques, followed by fitting a lip shape model. Werda et al. [9] threshold the R channel and analyse the vertical and horizontal projections to detect the corners of the mouth. These points are then used to position the subsequent parametric lip model. Bouvier et al. [20] threshold the lip membership map to estimate the lip region. This region is then used to initialise upper and lower lip snakes. Although the hybrid approach is somewhat less sensitive to the threshold value, selection of the correct threshold remains essential for the initial estimate and subsequent model fitting. Despite the benefits of colour-based techniques, efficiency in particular, the central challenge of robust threshold computation is somewhat prohibitive, and as a result model-based techniques prevail in the current literature.

This research addresses the challenge of automatic threshold computation for lip segmentation. Solving this challenge will facilitate accurate and efficient lip segmentation using colour-based methods, while avoiding the drawbacks of more complex techniques. The remainder of this paper is organised as follows: Sect. 2 introduces a novel technique called Adaptive Threshold Optimisation (ATO), which uses feedback of shape information to select the threshold. Section 3 describes the dataset and metrics used to develop and test the algorithm. ATO is designed to augment an existing threshold-based algorithm, so Sect. 4 describes the implementation of a typical base algorithm. Section 5 details the structure and operation of ATO, and finally, Sect. 6 evaluates the lip segmentation performance.

2 Algorithm overview

At a high level, the algorithm consists of two components: the base algorithm, which represents a typical threshold-based segmentation algorithm, and the adaptive threshold optimisation (ATO) algorithm. Figure 1 presents an overview of the base algorithm, with the ATO algorithm shown in the shaded blocks. ATO is not a stand-alone segmentation algorithm; rather, it is designed to augment an existing threshold-based algorithm. ATO aims to improve the segmentation of the base algorithm by optimising the threshold parameter.

Fig. 1 The base algorithm is augmented with adaptive threshold optimisation (ATO) shown in the shaded blocks


Fig. 2 Sample of images in the dataset: the ROI is cropped from full-face AR Face images [22], and the ground truth is created by interpolating the manual markings [23]

This research uses a simple threshold-based segmentation algorithm as the base, comprising colour transform, thresholding, morphological processing, and contour smoothing. The base algorithm uses Otsu's method [21] to select the default threshold value. However, ATO does not depend on the specific structure of the base algorithm or the method of computing the default threshold, and various threshold-based algorithms may be substituted (e.g. [3–6]).

In simple cases, the segmentation produced by the base algorithm using the default threshold is adequate, and it is not necessary to optimise the threshold. Therefore, to preserve the overall efficiency, ATO first determines whether or not the initial segmentation is satisfactory in the validation stage. Only if the initial segmentation is deemed unsatisfactory does the algorithm proceed to recompute the threshold in the optimisation stage.

The principle of ATO is to use feedback of shape information to guide selection of the threshold (see Fig. 1). ATO incorporates shape information by constructing a lip shape model from statistical analysis of labelled images. The validation stage compares the output of the base algorithm to the shape model, and determines whether to accept or reject the segmentation. If the segmentation is rejected, then the optimisation stage iteratively adjusts the threshold value to minimise the difference between the segmentation and the model. Section 4 describes the implementation of the base algorithm, and Sect. 5 presents the details of ATO.
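To make the control flow concrete, the following minimal Python sketch mirrors Fig. 1. The helper functions (otsu_threshold, base_segment, distance_to_lsm, optimise_threshold) are hypothetical stand-ins for the components detailed in Sects. 4 and 5, and the 2σ cut-off anticipates the value set empirically in Sect. 6.1.

```python
# Sketch of the ATO control flow in Fig. 1. The helper functions are
# hypothetical stand-ins for the components described in Sects. 4 and 5.

DISTANCE_CUTOFF = 2.0  # in sigma units; set empirically in Sect. 6.1

def segment_with_ato(roi_image, lsm):
    # Base algorithm with the default threshold (Otsu's method, Sect. 4.3).
    threshold = otsu_threshold(roi_image)
    segmentation = base_segment(roi_image, threshold)

    # Validation stage: measure the shape distance D between the initial
    # segmentation and the lip shape model (LSM), and accept or reject it.
    if distance_to_lsm(segmentation, lsm) <= DISTANCE_CUTOFF:
        return segmentation  # default threshold accepted

    # Optimisation stage: search for the threshold that minimises D.
    threshold = optimise_threshold(roi_image, lsm)
    return base_segment(roi_image, threshold)
```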

3 Dataset and metrics

3.1 Dataset

The dataset comprises 895 images from the AR Face Database [22], including 112 subjects (58 males, 54 females) and four different facial expressions: neutral, smile, anger, and scream. The AR Face Database includes significant variability in speaker profile, featuring a diverse range of racial and ethnic groups, as well as varying degrees of facial hair and makeup. Since the technologies to locate the face and mouth region are well established (e.g. the Viola–Jones detector [24] implemented in OpenCV), the full-face AR images are cropped to a rectangular ROI enclosing the lips, oral cavity, and surrounding skin.

Ding and Martinez [23] obtain manual markings of the facial features from three independent human judges. The outer lip contour is labelled with 20 points, and the within-judge variability is 3.8 pixels or 1.2 %. The ground truth A_G is created by interpolating the outer lip contour using two cubic smoothing splines, as shown in Fig. 2. The image database is divided into training and testing datasets: 40 % of the images are used to build the lip shape model in Sect. 5, and 60 % of the images are used to test the ATO algorithm in Sect. 6.

3.2 Metrics

Wang et al. [25] define two types of error to quantify the quality of lip segmentation: outer lip error (OLE) is the number of non-lip pixels classified as lip pixels (false positives), and inner lip error (ILE) is the number of lip pixels classified as non-lip pixels (false negatives). Liew et al. [26] use these error measures to develop two metrics to quantify the overall lip segmentation accuracy: percentage overlap (OL) and segmentation error (SE). OL in (1) measures the percentage overlap between the segmented lip region A and the ground truth A_G. SE in (2) measures the segmentation error relative to the number of lip pixels in the ground truth, TL. Perfect segmentation of the lips returns OL = 100 % and SE = 0 %.

$$\mathrm{OL} = \frac{2 \times (A \cap A_G)}{A + A_G} \times 100\,\% \tag{1}$$

$$\mathrm{SE} = \frac{\mathrm{OLE} + \mathrm{ILE}}{2 \times \mathrm{TL}} \times 100\,\% \tag{2}$$
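Both metrics follow directly from the binary masks. A minimal NumPy sketch (the function and variable names are ours, not the paper's):

```python
import numpy as np

def overlap_and_error(seg, gt):
    """OL per (1) and SE per (2), from boolean lip masks:
    seg = segmented lip region A, gt = ground truth A_G."""
    seg = seg.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(seg, gt).sum()
    ol = 2.0 * intersection / (seg.sum() + gt.sum()) * 100.0
    ole = np.logical_and(seg, ~gt).sum()  # non-lip pixels classified as lip
    ile = np.logical_and(~seg, gt).sum()  # lip pixels classified as non-lip
    tl = gt.sum()                         # lip pixels in the ground truth
    se = (ole + ile) / (2.0 * tl) * 100.0
    return ol, se
```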

4 The base algorithm

The base algorithm is a simple threshold-based segmentation algorithm comprising preprocessing, colour transform, thresholding, morphological processing, and contour smoothing. The implementation of these components is discussed in this section.

4.1 Preprocessing

The preprocessing component comprises two steps: first, a 3×3 Gaussian low-pass filter (LPF) is applied to each channel of the RGB image to remove high-frequency noise; second, a luminance correction is applied to each channel of the RGB image to compensate for varying illumination conditions. The luminance correction is shown in (3) and enhances chromatic information such that colours are brought out of shadowy areas [27]. L is the luminance; a and b control the weight of the luminance correction (a = 0.4, b = 0.8).

$$X_{\mathrm{comp}} = \frac{X}{X + (b - a)L + a} \tag{3}$$
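A sketch of the two preprocessing steps, assuming RGB values scaled to [0, 1]. The paper does not specify how the luminance L is computed, so the channel mean used below is an assumption:

```python
import cv2

def preprocess(rgb, a=0.4, b=0.8):
    """3x3 Gaussian LPF followed by the luminance correction in (3).
    Assumes rgb is a float array in [0, 1]. L is taken as the mean of the
    three channels, which is an assumption; the paper does not define L."""
    smoothed = cv2.GaussianBlur(rgb, (3, 3), 0)     # per-channel low-pass filter
    L = smoothed.mean(axis=2, keepdims=True)        # luminance estimate (assumed)
    return smoothed / (smoothed + (b - a) * L + a)  # Eq. (3), per channel
```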

4.2 Colour transform

The RGB image is transformed to an intensity image using a colour transform to enhance the lip-skin contrast. Gritzman et al. [2] use Otsu's discriminant [21] to quantify the separability of lip and skin pixels after applying different colour transforms. The Q channel from YIQ and the modified I3 channel from [28] are shown to be most effective in separating lip and skin pixels. The two transforms are used in combination to improve contrast and reduce artefacts.
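As an illustration, the Q channel can be computed from the standard RGB-to-YIQ transform; the modified I3 channel and the exact rule for combining the two transforms follow [2,28] and are not reproduced here:

```python
def q_channel(rgb):
    """Q channel of the standard RGB -> YIQ transform (coefficients vary
    slightly between references). The modified I3 channel and the
    combination of the two transforms follow [2,28]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.211 * r - 0.523 * g + 0.312 * b
```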

4.3 Histogram threshold

After the colour transform, a binary histogram threshold is applied to the intensity image. The threshold used in the base algorithm is referred to as the default threshold, and this value is the focus of the optimisation stage of ATO. The default threshold is computed using Otsu's method [21], which chooses the threshold to minimise the intra-class variance of black and white pixels.
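A sketch of the default threshold computation using OpenCV's implementation of Otsu's method; the intensity image is rescaled to 8 bits because cv2 applies Otsu to uint8 images:

```python
import cv2
import numpy as np

def default_threshold(intensity):
    """Otsu's method [21] applied to the colour-transformed intensity
    image; returns the threshold level and the resulting binary mask."""
    img8 = cv2.normalize(intensity, None, 0, 255,
                         cv2.NORM_MINMAX).astype(np.uint8)
    t, binary = cv2.threshold(img8, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return t, binary > 0
```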

4.4 Morphological processing

After thresholding, nine morphological operations are performed to consolidate the lip region and to remove artefacts: (i) pruning; (ii) hole fill; (iii) majority filter; (iv) opening; (v) remove isolated regions; (vi) clear border; (vii) dilation; (viii) convex hull to join multiple regions; (ix) hole fill.
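A sketch covering a subset of these operations with scipy.ndimage and scikit-image; pruning and the majority filter are omitted, and the structuring-element sizes are our assumptions since the paper does not specify them:

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import convex_hull_image
from skimage.segmentation import clear_border

def consolidate(mask):
    """Subset of the morphological pipeline in Sect. 4.4 (steps ii, iv-ix).
    Structuring-element sizes are assumptions, not taken from the paper."""
    mask = ndimage.binary_fill_holes(mask)                 # (ii) hole fill
    mask = ndimage.binary_opening(mask, np.ones((3, 3)))   # (iv) opening
    labels, n = ndimage.label(mask)                        # (v) keep the largest
    if n > 1:                                              # connected region
        sizes = ndimage.sum(mask, labels, range(1, n + 1))
        mask = labels == (np.argmax(sizes) + 1)
    mask = clear_border(mask)                              # (vi) clear border
    mask = ndimage.binary_dilation(mask, np.ones((3, 3)))  # (vii) dilation
    mask = convex_hull_image(mask)                         # (viii) convex hull
    return ndimage.binary_fill_holes(mask)                 # (ix) hole fill
```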

4.5 Contour smoothing

The lip contour is extracted from the outer boundary of the lip region, and two cubic splines are used to smooth the top and bottom contours, respectively. The error measure is weighted (w) to ensure that the top and bottom lip contours meet at the corners of the mouth. Given a set of co-ordinates (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), the smoothing spline S minimises the spline objective function J shown in (4). For the top lip contour λ_t = 0.003, and for the bottom lip contour λ_b = 0.006.

$$J = \lambda \sum_{i=1}^{n} w_i (y_i - S_i)^2 + (1 - \lambda) \int_{x_1}^{x_n} \left[ S''(x) \right]^2 \, \mathrm{d}x \tag{4}$$

where S_i = S(x_i); λ is the smoothing parameter, λ ∈ [0, 1]; and w = [100, 1, 1, ..., 1, 100].
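The objective in (4) matches the parameterisation of MATLAB's csaps. In Python, scipy.interpolate.make_smoothing_spline minimises Σ wᵢ(yᵢ − S(xᵢ))² + lam·∫(S″)², which has the same minimiser as (4) when lam = (1 − λ)/λ; a sketch under that mapping:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

def smooth_contour(x, y, lam_paper):
    """Weighted cubic smoothing spline minimising (4). Requires x to be
    strictly increasing along the contour. The paper's lambda maps to
    scipy's lam = (1 - lambda) / lambda, since the two objectives differ
    only by a positive scale factor."""
    w = np.ones_like(y, dtype=float)
    w[0] = w[-1] = 100.0  # heavy weights pin the mouth-corner points
    return make_smoothing_spline(x, y, w=w,
                                 lam=(1.0 - lam_paper) / lam_paper)

# Top lip: smooth_contour(x, y_top, 0.003); bottom lip: smooth_contour(x, y_bot, 0.006)
```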


5 Adaptive threshold optimisation

Adaptive threshold optimisation (ATO), shown in Fig. 1, uses feedback of shape information to validate and optimise the histogram threshold. The default threshold is computed using Otsu's method [21], and is used by the base algorithm to produce the initial segmentation. The initial segmentation is compared to a predefined lip shape model in the validation stage, which either accepts or rejects the segmentation. If the default threshold is accepted, then the initial segmentation becomes the final segmentation and no further processing takes place. If the default threshold is rejected, then ATO proceeds to the optimisation stage.

5.1 Lip shape model

The lip shape model (LSM) is constructed from statistical analysis of the ground truth images in the training dataset. The LSM is made up of four submodels, one for each expression in the AR Face Database: neutral, smile, anger, and scream. The process to construct each submodel is described below.

5.1.1 Feature extraction

The feature vector for the lip shape model comprises fourteen geometric features: (1) width; (2) height; (3) position x; (4) position y; (5) area; (6) perimeter; (7) eccentricity; (8) orientation; (9) centroid x; (10) centroid y; (11) major axis length; (12) minor axis length; (13) distance transform mean μ; (14) distance transform standard deviation σ. The features are normalised relative to the ROI, and standardised by calculating the z score.
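One way to compute the fourteen features with scikit-image. The exact normalisation scheme relative to the ROI is not specified in the paper, so the divisors below are assumptions; z-scoring against the training statistics is applied afterwards:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.measure import label, regionprops

def lip_features(mask):
    """Fourteen geometric features of the lip region in a binary ROI mask
    (assumed to contain a single region). The per-feature normalisation
    by ROI dimensions is our assumption, not taken from the paper."""
    h, w = mask.shape
    p = regionprops(label(mask))[0]
    minr, minc, maxr, maxc = p.bbox
    dist = distance_transform_edt(mask)[mask.astype(bool)]
    return np.array([
        (maxc - minc) / w,         # 1  width
        (maxr - minr) / h,         # 2  height
        minc / w,                  # 3  position x
        minr / h,                  # 4  position y
        p.area / (h * w),          # 5  area
        p.perimeter / (h + w),     # 6  perimeter (normalisation assumed)
        p.eccentricity,            # 7  eccentricity
        p.orientation,             # 8  orientation (radians)
        p.centroid[1] / w,         # 9  centroid x
        p.centroid[0] / h,         # 10 centroid y
        p.major_axis_length / w,   # 11 major axis length
        p.minor_axis_length / h,   # 12 minor axis length
        dist.mean() / min(h, w),   # 13 distance transform mean
        dist.std() / min(h, w),    # 14 distance transform std
    ])
```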

5.1.2 Dimensional reduction

Principal component analysis (PCA) is used to reduce the dimensionality of the feature vector. The algorithm is set to retain at least 80 % of the total variance. In the training dataset, the first three eigenvectors account for 83.2 % of the variance; thus, the remaining eleven eigenvectors are dropped. Figure 3 shows the training images projected in the two-dimensional plane comprising eigenvector 1 (EV1) and eigenvector 2 (EV2). The expressions for smile and scream occur in distinct groups, while the expressions for neutral and anger overlap. The overlap between neutral and anger is expected, as the expression for anger is formed by frowning with the eyebrows, while little change occurs in the mouth region.

Fig. 3 Projection of training images in the two-dimensional plane comprising eigenvector 1 (EV1) and eigenvector 2 (EV2)
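A sketch of the dimensionality reduction with scikit-learn, where X_train is an assumed variable holding the standardised 14-dimensional training feature vectors:

```python
from sklearn.decomposition import PCA

# X_train: (n_images, 14) standardised feature vectors from the training set.
# A float n_components keeps the smallest number of principal components
# explaining at least that fraction of the variance: 80 % here, which on
# this dataset retains the first three eigenvectors (83.2 % of the variance).
pca = PCA(n_components=0.80)
reduced_train = pca.fit_transform(X_train)
transform_matrix = pca.components_  # reused in the validation stage
```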

5.1.3 Submodel statistics

The centroid of each submodel is calculated by finding the arithmetic mean of all points comprising the submodel. The submodel statistics are based on the Euclidean distance between the centroid and the corresponding data points. The following statistics are calculated and incorporated into the LSM: mean distance, standard deviation of distance, and maximum distance. The final LSM comprises the submodel centroids and the submodel statistics shown in Table 1.

Table 1 Parameters of lip shape model (LSM) constructed from training dataset

           Centroid                     Distance statistics
           EV1      EV2      EV3       Mean    Std     Max
Neutral     2.27     0.25     0.09     0.78    0.66    2.49
Smile      -1.36    -1.87    -0.01     1.15    0.85    3.64
Anger       2.27     0.02    -0.03     0.99    0.77    4.15
Scream     -3.65     1.47    -0.07     1.14    0.89    4.24

5.2 Validation stage

In the validation stage, the 14-dimensional feature vector is extracted from the initial segmentation, standardised, and multiplied by the transformation matrix obtained from PCA. The distance D from the initial segmentation to the nearest centroid in the feature space is calculated in (5). If D is below a specific distance cut-off (i.e. the candidate region is sufficiently similar to one of the submodels), then the default threshold is accepted. If D is above the distance cut-off, then the default threshold is rejected and ATO proceeds to the optimisation stage. The distance cut-off is set empirically in Sect. 6.1.

$$D = \frac{|FV - Ctrd|}{\sigma} \tag{5}$$

where FV is the reduced feature vector of the initial segmentation, Ctrd is the centroid of the nearest submodel, and σ is the distance standard deviation of the nearest submodel.
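A sketch of the validation test, where centroids holds the four submodel centroids (the EV1-EV3 columns of Table 1) and sigmas the corresponding Std column; the 2σ default anticipates the cut-off set in Sect. 6.1:

```python
import numpy as np

def validate(fv, centroids, sigmas, cutoff=2.0):
    """Distance D per (5) from the reduced feature vector fv to the
    nearest submodel centroid, normalised by that submodel's distance
    standard deviation. Returns the accept/reject decision and D."""
    eucl = np.linalg.norm(centroids - fv, axis=1)  # |FV - Ctrd| per submodel
    nearest = np.argmin(eucl)
    d = eucl[nearest] / sigmas[nearest]            # Eq. (5)
    return d <= cutoff, d
```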

5.3 Optimisation stage

The optimisation loop in Fig. 1 shows the objective function comprising thresholding, morphological processing, smoothing, and computation of the distance D. The input to the objective function is the threshold level, and the output is the distance D to the nearest submodel. The goal of the optimisation is to find the threshold level that minimises the distance between the segmentation and the LSM. The objective function is minimised using the golden section search [29].
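A sketch of the optimisation stage using SciPy's golden section search. Here segment_and_distance is a hypothetical helper wrapping thresholding, morphological processing, smoothing, and the distance computation of (5), and the search bracket around the default threshold is our assumption:

```python
from scipy.optimize import minimize_scalar

def optimise_threshold(intensity, lsm, t_default):
    """Find the threshold level minimising the distance D between the
    resulting segmentation and the LSM, via golden section search [29]."""
    def objective(t):
        # segment_and_distance: threshold at level t, apply morphology and
        # smoothing, extract features, and return D per Eq. (5).
        return segment_and_distance(intensity, t, lsm)

    res = minimize_scalar(objective, method='golden',
                          bracket=(0.5 * t_default, 1.5 * t_default))
    return res.x
```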

6 Results and analysis

ATO is tested in two phases: first, the validation stage is tested to determine whether it is effective in detecting poor segmentation by the base algorithm; second, the optimisation stage is tested to quantify the improvement in segmentation accuracy.

6.1 Validation results

The aim of the validation stage is to assess the initial segmentation (i.e. the output of the base algorithm using the default threshold) and identify poor lip segmentation results.

Fig. 4 The blue histogram shows the distribution of the distance D from the segmentation to the lip shape model (LSM). The red line plot shows the segmentation error (SE) for each distance bin

The validation stage assumes that the distance D can be used to infer the quality of the segmentation. Hence, the validation stage is tested by quantifying the correlation between D and SE. The base algorithm alone (without ATO) is used to segment the lips and skin from the 895 mouth region images, producing the initial segmentations. D is calculated by measuring the distance between each initial segmentation and the LSM. SE is calculated by measuring the difference between each initial segmentation and the ground truth. Finally, D is compared to SE to quantify the performance of the validation stage.

Figure 4 shows the efficacy of the validation algorithm in identifying poor lip segmentation, using D as an indicator. The x-axis is the distance D between the segmentation and the nearest centroid. The left y-axis and the blue histogram show the distribution of the segmented lip images as a function of D. The base algorithm obtains D below 1σ for 51 images. The majority (63.4 %) of lip segmentation images are within 1σ–3σ of the LSM. The right y-axis and the red line plot show the mean SE for each distance bin. The mean SE in bin 0–1 is 6.0 %, which increases gradually up to 7.0 % at bin 3–4. However, above 4σ, SE increases sharply and rises to 20 % at bin 7–8. The critical point occurs at 4σ: the mean SE below the critical point is 6.6 %, while the mean SE above the critical point is almost double at 13.1 %.

The high correlation between D and SE confirms that the distance can be used to infer the segmentation accuracy. A distance below 2σ is indicative of very good segmentation by the base algorithm, as the mean SE in this range is only 6.2 %. Based on this result, the validation cut-off is implemented as follows:


D ≤ 2σ: accept initial segmentation
D > 2σ: reject initial segmentation, and proceed to optimisation

6.2 Optimisation results

If the initial segmentation is rejected during the validation stage, then ATO proceeds to the optimisation stage. Figure 5 shows several examples of the lip segmentation results before and after ATO. The threshold selected by ATO significantly improves the segmentation in several scenarios in which the base algorithm with the default threshold does not perform well, including: facial hair; low contrast between the lips and skin; thin lips; and reflections caused by moisture on the skin.

Fig. 5 Segmentation results before and after adaptive threshold optimisation (ATO)

Figure 6 shows a histogram of OL before and after ATO. The optimisation stage decreases the number of images between 75 and 92 % OL, and increases the number of images between 92 and 100 % OL. The peak at 95 % is significantly greater after optimisation at almost 180 images, compared to 145 images before optimisation. ATO increases the number of segmentations considered 'good' (OL ≥ 90 %) by 7.1 % absolute, from 78.2 to 85.2 %.

Fig. 6 Histogram of percentage overlap (OL) before and after adaptive threshold optimisation (ATO)

Table 2 shows a summary of the optimisation results. The initial segmentation passed validation in 186 images (D ≤ 2σ), and failed validation in 351 images. ATO only optimises the threshold for images that fail validation, which results in an absolute OL improvement of 1.8 %, from 91.1 % before optimisation to 92.9 % after optimisation. Comparing the images that passed and failed validation, the mean SE for images that passed validation was 6.2 %, while the mean SE for images that failed validation was 2.2 % absolute higher at 8.4 %. This again illustrates the ability of the validation stage to discriminate between accurate and poor segmentation. In total, ATO improved OL by 1.15 % absolute up to 93.2 %, and decreased SE by 0.6 % absolute to 7.1 %.

Table 2 Percentage overlap (OL) and segmentation error (SE) of images that pass and fail validation. Images that fail validation are optimised, and the before/after results are shown

          Pass      Fail validation       Total
                    Before     After      Before     After
# imgs    186       351                   537
OL        93.7 %    91.1 %     92.9 %     92.0 %     93.2 %
SE        6.2 %     8.4 %      7.6 %      7.7 %      7.1 %

The overall impact of improving the segmentation accuracy is obviously dependent on the system and application; however, to provide some indication, Potamianos and Neti [1] report that in one experiment, a 2.4 % increase in the face detection accuracy resulted in almost half the word error rate (from 30.7 to 17.6 %).

7 Conclusion

Automatic computation of robust thresholds is a major challenge for colour-based lip segmentation. This paper proposes a novel method called adaptive threshold optimisation (ATO), which uses feedback of shape information to guide automatic selection of the threshold. ATO incorporates shape information by constructing a lip shape model from statistical analysis of labelled images. The validation stage of ATO compares the output of the base algorithm to the shape model, and determines whether to accept or reject the segmentation. If the segmentation is rejected, then the optimisation stage of ATO iteratively adjusts the threshold value to minimise the difference between the segmentation and the reference model.

The performance of the base lip segmentation algorithm is tested with and without ATO. Using ATO, the number of segmentations considered 'good' (OL ≥ 90 %) increased by 7.1 % absolute up to 85.3 %. The threshold selected by ATO significantly improved the segmentation in challenging cases containing facial hair and low lip-skin contrast.

Currently, the lip shape model is constructed from ground truth images (positive examples). ATO operates by measuring the similarity between the segmentation and these positive examples. However, the lip shape model may benefit from also incorporating negative examples which characterise typical failures of the base algorithm. This would allow ATO to optimise both the similarity to the positive examples and the difference from the negative examples.

Acknowledgments The financial assistance of the National Research Foundation (NRF) of South Africa towards this research is hereby acknowledged (Grant No. 97742).

References

1. Potamianos, G., Neti, C.: Audio-visual speech recognition in challenging environments. In: European Conference on Speech Communication and Technology (EUROSPEECH 2003), Geneva, Switzerland, pp. 1293–1296 (2003)
2. Gritzman, A.D., Rubin, D.M., Pantanowitz, A.: Comparison of colour transforms used in lip segmentation algorithms. Signal Image Video Process. 9(4), 168–173 (2014)
3. Wark, T., Sridharan, S., Chandran, V.: An approach to statistical lip modelling for speaker identification via chromatic feature extraction. In: Fourteenth International Conference on Pattern Recognition (ICPR 1998), vol. 1, pp. 123–125. IEEE (1998)
4. Chiou, G.I., Hwang, J.-N.: Lipreading from color video. IEEE Trans. Image Process. 6(8), 1192–1195 (1997)
5. Coianiz, T., Torresani, L., Caprile, B.: 2D deformable models for visual speech analysis. In: Stork, G., Hennecke, M.E. (eds.) NATO Advanced Study Institute: Speechreading by Man and Machine, pp. 391–398. Springer, Berlin (1996)
6. Zhang, X., Mersereau, R.: Lip feature extraction towards an automatic speechreading system. In: 2000 International Conference on Image Processing (ICIP 2000), vol. 3, pp. 226–229. IEEE (2000)
7. Caplier, A., Stillittano, S., Bouvier, C., Coulon, P.: Lip modelling and segmentation. In: Liew, A.W.-C., Wang, S. (eds.) Visual Speech Recognition: Lip Segmentation and Mapping, pp. 70–127. Information Science Reference (an imprint of IGI Global), USA (2009)
8. Pardàs, M., Sayrol, E.: Motion estimation based tracking of active contours. Pattern Recogn. Lett. 22(13), 1447–1456 (2001)
9. Werda, S., Mahdi, W., Hamadou, A.B.: Colour and geometric based model for lip localisation: application for lip-reading system. In: 14th International Conference on Image Analysis and Processing (ICIAP 2007), pp. 9–14. IEEE (2007)
10. Eveno, N., Caplier, A., Coulon, P.: Accurate and quasi-automatic lip tracking. IEEE Trans. Circuits Syst. Video Technol. 14(5), 706–715 (2004)
11. Yuille, A.L., Hallinan, P.W., Cohen, D.S.: Feature extraction from faces using deformable templates. Int. J. Comput. Vis. 8(2), 99–111 (1992)
12. Kaucic, R., Dalton, B., Blake, A.: Real-time lip tracking for audiovisual speech recognition applications. Comput. Vis. ECCV 1996, 376–387 (1996)
13. Luettin, J., Thacker, N.A.: Speechreading using probabilistic models. Comput. Vis. Image Underst. 65(2), 163–178 (1997)
14. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int. J. Comput. Vis. 1(4), 321–331 (1988)
15. Eveno, N., Caplier, A., Coulon, P.-Y.: Jumping snakes and parametric model for lip segmentation. In: International Conference on Image Processing (ICIP 2003), vol. 2, pp. 867–870. IEEE (2003)
16. Matthews, I., Cootes, T.F., Bangham, J.A., Cox, S., Harvey, R.: Extraction of visual features for lipreading. IEEE Trans. Pattern Anal. Mach. Intell. 24(2), 198–213 (2002)
17. Zheng, Z., Jiong, J., Chunjiang, D., Liu, X., Yang, J.: Facial feature localization based on an improved active shape model. Inf. Sci. 178(9), 2215–2223 (2008)
18. Wang, S., Lau, W., Leung, S., Yan, H.: A real-time automatic lipreading system. In: International Symposium on Circuits and Systems (ISCAS 2004), vol. 2, pp. II-101–II-104. IEEE (2004)
19. Saeed, U., Dugelay, J.L.: Combining edge detection and region segmentation for lip contour extraction. In: Articulated Motion and Deformable Objects, pp. 11–20 (2010)
20. Bouvier, C., Coulon, P.Y., Maldague, X.: Unsupervised lips segmentation based on ROI optimisation and parametric model. In: 2007 IEEE International Conference on Image Processing, vol. 4, pp. IV-301–IV-304. IEEE (2007)
21. Otsu, N.: A threshold selection method from gray-level histograms. Automatica 11(285–296), 23–27 (1975)
22. Martinez, A.: The AR face database. CVC Technical Report, vol. 24 (1998)
23. Ding, L., Martinez, A.: Features versus context: an approach for precise and detailed detection and delineation of faces and facial features. IEEE Trans. Pattern Anal. Mach. Intell. 32(11), 2022–2038 (2010)
24. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, pp. I-511. IEEE (2001)
25. Wang, S., Leung, S., Lau, W.: Lip segmentation by fuzzy clustering incorporating with shape function. In: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), vol. 1, pp. I-1077. IEEE (2002)
26. Liew, A.-C., Leung, S.H., Lau, W.H.: Segmentation of color lip images by spatial fuzzy clustering. IEEE Trans. Fuzzy Syst. 11(4), 542–549 (2003)
27. Eveno, N., Caplier, A., Coulon, P.Y.: New color transformation for lips segmentation. In: 2001 IEEE Fourth Workshop on Multimedia Signal Processing, pp. 3–8 (2001)
28. Canzler, U., Dziurzyk, T.: Extraction of non manual features for video-based sign language recognition. In: IAPR Workshop on Machine Vision Applications (IAPR MVA 2002), pp. 318–321 (2002)
29. Kiefer, J.: Sequential minimax search for a maximum. Proc. Am. Math. Soc. 4(3), 502–506 (1953)