
USING HUMAN EXPERTS’ GAZE DATA TO EVALUATE IMAGE PROCESSING ALGORITHMS

Preethi Vaidyanathan1, Jeff Pelz1, Rui Li2, Sai Mulpuru1, Dong Wang1, Pengcheng Shi2, Cara Calvelli3, Anne Haake2

1 Chester F Carlson Center for Imaging Science, College of Science
2 Golisano College of Computing and Information Sciences
3 College of Health Sciences and Technology
Rochester Institute of Technology, 102 Lomb Memorial Dr., Rochester, NY 14623 USA

(pxv1621, jeff.pelz, rxl5604, sxm2813, dxw1481, pengcheng.shi, cara.calvelli, anne.haake)@rit.edu

ABSTRACT

Understanding the capabilities of the human visual system with respect to image understanding, in order to inform image processing, remains a challenge. The visual attention deployment strategies of experts can serve as an objective measure to help us understand their learned perceptual and conceptual processes. Understanding these processes will inform and direct the selection and use of image processing algorithms for domain-specific images, such as the dermatological images used in our study. The goal of our research is to extract and utilize the tacit knowledge of domain experts towards building a pipeline of image processing algorithms that closely parallels the underlying cognitive processes. In this paper we use medical experts’ eye movement data, primarily fixations, as a metric to evaluate the correlation of perceptually relevant regions with individual clusters identified through k-means clustering. This test case demonstrates the potential of this approach to determine whether a particular image processing algorithm is useful in identifying image regions with high visual interest and whether it could be a component of a processing pipeline.

Index Terms— Image understanding, dermatology, eye tracking

1. INTRODUCTION

Image understanding is a limiting factor in advancing computer systems for various applications. This is particularly true for medical images, which involve a large degree of domain knowledge [1]. Researchers have achieved immense success in developing image-based, feature-driven approaches, but semantic understanding remains a major challenge, since most approaches do not incorporate domain knowledge or contextual information.



It is well known that human perception results from a combination of data-driven (bottom-up) and model-driven (top-down) processes. Computer vision algorithms that rely on bottom-up approaches such as color features [2] fail to bridge the semantic gap between feature-based visual similarity and perceptual similarity [3, 4]. In knowledge-rich domains like medical imaging there is value in working with domain experts to better understand their perceptual and conceptual processes. Shyu et al. [5] used perceptual categories, as described by physicians, for recognizing lung pathology in high-resolution computed tomography (CT) and used them to improve the initial feature extraction algorithm.

Understanding and using domain knowledge is complicated by the fact that it is generally considered to be tacit knowledge [6]. The difficulty of transmitting it explicitly suggests that we need more effective means by which to elicit and codify domain knowledge. Perceptual expertise, expected to reside in highly trained physicians in fields such as radiology and dermatology, is a form of expertise that may be studied by following observers’ eye movements. Our research focuses on an objective representation of perceptual expertise, using eye tracking as a tool. This could help us learn the perceptually and semantically relevant visual content of medical images, specifically dermatological images [7].

The brain uses various modalities to obtain information from the outside world and process it, one of them being the human visual system. When a stationary observer views a static image (as when a physician examines a medical image), the observer makes rapid eye movements called saccades that shift the point of regard several times per second. The saccades are separated by fixations, periods of retinal image stability during which visual information is obtained. These saccades and intervening fixations can make a useful contribution to understanding the observer’s thought process.


Yarbus [8] showed that eye movements are influenced by the specific task and concluded that the eyes are directed to the perceptually most useful regions. Moreover, fixations on an image are guided more by semantic information than by inherent structural information [9]. Eye tracking of experts has also been used to define a common reference model in CT lung image analysis so as to elucidate differences in intrinsic behavior between experts and novices [10]. This human-centered technique not only provides more objective evaluations of the images, but also reveals users’ cognitive processing, which is important for image understanding. The goal of this work is to address the following question: how can eye movement data be used to inform and direct the pipeline of image processing approaches on dermatology images?

2. METHODS

2.1. Eye-tracking session

Sixteen observers were eye tracked: twelve Board-certified dermatologists (‘attending’ dermatologists) and four ‘resident’ dermatologists. Physician assistant students from the RIT Physician Assistant Program served as “trainees”. A set of 50 dermatological images (provided by Logical Images Inc., Rochester, NY, and Dr. Calvelli), each representing a different diagnosis, was selected for the study. Each image was presented on a 22” LCD monitor (1680x1050 pixels) approximately 70 cm from the observer. The full display subtended approximately 38 x 22 degrees of visual angle at that distance, though most images did not fill the full field. The image in Figure 1, for example, subtended approximately 32 x 32 degrees. Eye movements were collected with a SensoMotoric Instruments (SMI) RED remote eye tracker running at 50 Hz. The RED eye tracker monitors the position of an observer’s point of regard on the image without any headgear, so it is nonintrusive. A nine-point calibration, followed by a four-point validation, was performed at the start of each trial and repeated every ten images. A two-computer setup was used: one computer ran the Experiment Center 2.3 software that presented the images to the observer, and the other ran the iView X gaze tracking software; the two were connected over a UDP/IP socket. Each observer was instructed to “examine and describe each image verbally as if teaching the trainee to make a diagnosis based on the image.”


Figure 1: An example of the experimental setup showing the eye tracker.

2.2. Intersection map creation

Eye tracking yields many different metrics, such as fixation coordinates, fixation durations, and saccade amplitudes, among others. In this study the focus was on using the fixation coordinates of the group of medical participants in a meaningful way, i.e., to identify perceptual regions of interest (pROIs) that were used during the image inspection process. Using the x-y coordinate data, we created intersection maps (50 in this case), which are binary maps depicting the fixation locations that were shared by at least 80% of the observers. The intuition is that these regions of the image received the highest number of perceptual “votes”.

For a separate analysis, another set of intersection maps was created by dropping a Gaussian kernel on every fixation. The standard deviation of this kernel was 1 degree (36 pixels) in the horizontal direction and 1.5 degrees (54 pixels) in the vertical direction; the difference reflects the larger eye tracker error in the vertical direction. The kernel was applied to every fixation and thresholded at 0.05 in both directions. These kernels were then summed over all the fixations on an image and normalized. The area under this final kernel that was shared by 80% of the observers yielded the second set of intersection maps.
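A minimal sketch of how such intersection maps could be computed from per-observer fixation coordinates is shown below. It uses NumPy and SciPy; the function names, array layout, and the exact handling of the Gaussian thresholding are our assumptions (one plausible reading of the procedure above), not the authors’ implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def binary_intersection_map(fixations_per_observer, shape, share=0.8):
    """Binary map of pixels fixated by at least `share` of the observers.
    fixations_per_observer: one (N_i, 2) array of integer (row, col)
    fixation coordinates per observer; shape: (rows, cols) of the image."""
    votes = np.zeros(shape, dtype=float)
    for fix in fixations_per_observer:
        hit = np.zeros(shape, dtype=bool)
        hit[fix[:, 0], fix[:, 1]] = True        # each observer votes once per pixel
        votes += hit
    return votes >= share * len(fixations_per_observer)

def gaussian_intersection_map(fixations_per_observer, shape, share=0.8,
                              sigma=(54.0, 36.0), cutoff=0.05):
    """Variant in which every fixation is replaced by an anisotropic Gaussian
    (sigma of roughly 1.5 deg vertically and 1 deg horizontally, in pixels)
    that is thresholded before the per-observer votes are accumulated."""
    votes = np.zeros(shape, dtype=float)
    for fix in fixations_per_observer:
        impulses = np.zeros(shape, dtype=float)
        np.add.at(impulses, (fix[:, 0], fix[:, 1]), 1.0)
        blurred = gaussian_filter(impulses, sigma=sigma)   # sum of kernels
        if blurred.max() > 0:
            blurred /= blurred.max()                       # normalize
        votes += blurred > cutoff
    return votes >= share * len(fixations_per_observer)
```

In both variants the 80% criterion is applied per pixel across the observers; they differ only in how each observer’s support is spread around the recorded fixation coordinates.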

2.3. Segmentation algorithm

To understand and illustrate the usefulness of eye fixations, we used k-means clustering as an exploratory analysis tool. A value of k = 4, which performed well in segmenting out the lesion in most of the images, was selected. We used the CIELAB color space to represent the images for segmentation. In this color space the red, green, and blue channels are transformed into three channels: a luminance channel (L), a red-green opponent channel (a), and a yellow-blue opponent channel (b). According to [11], CIELAB is an appropriate color space for dermatology images because the (L) and (b) components relate to melanin and the (a) component relates to hemoglobin. Although [12] questioned the use of CIELAB, finding RGB to be the most discriminative space between lesioned and normal skin, the use of CIELAB is supported by [2], who compared the effectiveness of ten different color representations for dermatology images in the context of content-based image retrieval. They compared the color of healthy skin and skin lesions and measured the retrieval rate; the best results were obtained with CIELAB. Our work in differentiating between various types of skin lesions was consistent with [2] in terms of achieving the highest separability among the lesions.
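As a reference point, the segmentation step can be sketched as below. The use of scikit-image for the CIELAB conversion and scikit-learn for k-means is our choice for illustration; the paper does not specify the implementation.

```python
import numpy as np
from skimage import color
from sklearn.cluster import KMeans

def kmeans_ab_segmentation(rgb_image, k=4, random_state=0):
    """Cluster an RGB image into k segments using only the CIELAB
    a and b (opponent-color) channels."""
    lab = color.rgb2lab(rgb_image)              # rows x cols x 3 array (L, a, b)
    ab = lab[:, :, 1:].reshape(-1, 2)           # drop L; one (a, b) pair per pixel
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=random_state).fit_predict(ab)
    return labels.reshape(rgb_image.shape[:2])  # segmentation map of cluster ids
```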


2.4. Fixation ratio

To produce a metric representing the match between the observers’ gaze and the segmented image regions, a ‘fixation ratio’ was generated for each image as follows (a code sketch follows the list):

1. The original RGB image was converted into CIELAB, and the a-b vectors were used as input to the k-means algorithm, dividing each image into 4 clusters. This generated 50 segmentation maps, each with four clusters.
2. The intersection map for each image was then overlaid on the corresponding segmentation map to obtain the number of fixations falling in each cluster.
3. These fixations were then normalized by the total number of fixations in the intersection map, giving the relative fixation per cluster.
4. Similarly, the relative area per cluster of the segmentation map was obtained.
5. The fixation ratio for every cluster was obtained by dividing the relative fixation from step 3 by the relative area from step 4.
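A minimal sketch of steps 2-5, assuming a cluster-label segmentation map and an intersection map as produced above (variable and function names are ours); it treats the pixels of the intersection map as the fixation support for each cluster:

```python
import numpy as np

def fixation_ratios(segmentation_map, intersection_map, k=4):
    """Per-cluster fixation ratio: relative fixation support in the
    intersection map divided by the cluster's relative area (steps 2-5)."""
    total_support = intersection_map.sum()
    total_area = segmentation_map.size
    ratios = np.zeros(k)
    for c in range(k):
        in_cluster = segmentation_map == c
        rel_fix = intersection_map[in_cluster].sum() / total_support   # step 3
        rel_area = in_cluster.sum() / total_area                       # step 4
        ratios[c] = rel_fix / rel_area if rel_area > 0 else 0.0        # step 5
    return ratios

# The cluster with the largest ratio is the one selected as most
# perceptually relevant, e.g. best_cluster = int(np.argmax(ratios)).
```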


Figure 2: (a) original image with its fixations (blue dots); (b) segmentation map of (a); (c) cluster picked using fixations.




3. RESULTS AND DISCUSSION

Human eye-tracking studies have shown that there is a viewing bias towards the center of an image [13]. Viewer center bias correlates strongly with photographer center bias and is also influenced by viewing strategy at scene onset. Viewer center bias may preclude identifying true regions of interest and may influence computational models of visualization by artificially inflating model performance at the center of the image. Therefore, when using eye movement data for image analysis it is important to determine whether viewer center bias is a contributing factor. Our data indicate that observers fixate near the center of the images for the first few fixations, due to scene onset, but are then very quickly drawn towards the regions of interest, which may or may not be centered due to photographer bias (data not shown). Moreover, taken together with the image segmentations, the intersection maps show a strong connection between the lesion location, regardless of its position in the image, and the 80% intersection fixation region, supporting the notion that the effect of viewer center bias is drastically reduced during image inspection.
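The onset-related center bias analysis is not shown in the paper; one simple way such a check could be run over the recorded scanpaths is sketched below (entirely our illustration, with assumed variable names and normalization):

```python
import numpy as np

def center_bias_profile(fixation_sequences, image_shape, max_index=10):
    """Mean normalized distance of the n-th fixation from the image center,
    averaged over all trials.  A profile that starts near zero and rises
    quickly is consistent with an onset-driven center bias that dissipates
    as observers move to the regions of interest."""
    cy, cx = (image_shape[0] - 1) / 2.0, (image_shape[1] - 1) / 2.0
    half_diag = np.hypot(cy, cx)            # normalizer: center-to-corner distance
    sums = np.zeros(max_index)
    counts = np.zeros(max_index)
    for fix in fixation_sequences:          # fix: (N, 2) array of (row, col)
        n = min(len(fix), max_index)
        d = np.hypot(fix[:n, 0] - cy, fix[:n, 1] - cx) / half_diag
        sums[:n] += d
        counts[:n] += 1
    return sums / np.maximum(counts, 1)
```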


Figure 3: (a) original image with its fixations (blue dots); (b) segmentation map of (a); (c) cluster picked using fixations.

Next, we used the fixation ratio to capture the cluster that most effectively segments the lesion in the image. Our analysis shows that the fixations are related to the lesion even in the presence of photographer center bias, and can help in selecting the prime regions of interest according to the experts’ eye movements. For example, as shown in Figures 2(b, c), visualizing the intersection fixation data on the segmentation map illustrates that k-means was effective in isolating a lesion with high visual interest to our expert observers (Figure 2(a)). The high relative fixation on the cluster shown in 2(c), combined with its low relative area, results in a high fixation ratio, which allows us to select this cluster as the most perceptually relevant of the four clusters in the image. In fact, the image shown in 2(a) scored the highest fixation ratio among all 50 images in our database, indicating a strong relation between the fixated lesion and the corresponding cluster.

Image in figure:           2(a)     3(a)     4(a)     5(a)
Area of cluster (%)        8.26     13.59    19.2     41.47
Fixation in cluster (%)    86.12    92.03    66.1     65.03
Fixation ratio             10.42    6.81     3.38     1.56

Table 1: relative area, relative fixation, and fixation ratio for the images shown in Figures 2, 3, 4, and 5.

Similarly, for image 3(a), where the lesion does not have a well-defined geometry, there is an obvious correspondence between the lesion and the cluster selected using the fixation ratio, as shown in 3(c). The high fixation ratio is consistent with the visual inference obtained by overlaying the fixation data on the segmentation map. Taken together with the strong correlation with the perceptual regions of interest (pROIs), this suggests that for images like 2(a) and 3(a), k-means may be useful as a step in an algorithm pipeline. Image 4(a) is another case where the lesion is not confined to a well-defined geometric boundary.



Figure 4: (a) original image with its fixations (blue dots); (b) segmentation map of (a); (c) cluster picked using fixations.

In this case, k-means is not effective in completely isolating the lesion and misses some regions that might be perceptually important. Table 1 shows that the relative fixation was only 66% while the area was 19%. The resulting low fixation ratio indicates that the segmentation was not effective in isolating the perceptual region of interest, the lesion. Thus the fixation data help determine whether the segmentation captures the lesion for this type of image. The image shown in Figure 5(a) had the smallest fixation ratio among the 50 images in our study.


Figure 5: (a) original image with its fixations (blue dots); (b) segmentation map of (a); (c) cluster picked using fixations.

It is an example of a skin disease characterized by multiple similar lesions, where there could be more than one pROI. In this case, although the segmentation appears to capture the lesions, the cluster selected by the fixation data was not the one containing the lesions. This may be due to two factors. First, some error in the eye tracker (reported as 0.5 degrees of visual angle for the SMI eye tracker) needs to be considered when interpreting the fixation maps. This error corresponds to approximately 17 pixels in all directions at the distance our observers sat from the screen, so some fixations may fall into a different cluster when the reported position is off by even a few pixels. Second, and more importantly, although fixations are reported as individual pixels (x-y coordinates) on an image, the high-acuity visual information gained from a fixation covers a larger area, since the fovea is not a single receptor but a collection of receptors spanning about 1 degree (approximately 35 pixels in our study). Moreover, during a fixation we gain information not only from the fovea but also from the periphery. Thus, an observer might be acquiring information about the lesions (black circles in 5(c)) while the x-y coordinates of the center of the fovea fall on normal skin (white area in 5(c)), especially in cases like 5(a) with small, multiple lesions. Even though the incorrect cluster was voted best in this case, the cluster containing the lesions scored second highest among the four clusters.

To account for the above two factors, a new set of fixation ratios was calculated using the second set of intersection maps, obtained by dropping Gaussian kernels on every fixation as described in Section 2.2. The segmentation maps were unchanged. Table 2 shows the fixation ratios corresponding to the images in Figures 2, 3, 4, and 5(a) computed with the new intersection maps.


Even though most of these ratio values are lower than the values obtained without the Gaussian correction, the cluster ranking for each image was the same. The interesting observation is for Figure 5(a): the ratio for the cluster without the lesions increased by a factor of 1.2, while the confidence over the second-best cluster, which contained the lesions, decreased. This is an interesting case in which multiple small lesions pose a special challenge in understanding the expert’s region(s) of interest.


Image in figure:           2(a)     3(a)     4(a)     5(a)
Area of cluster (%)        8.26     13.59    19.2     41.47
Fixation in cluster (%)    52.79    72.41    42.02    80.40
Fixation ratio             8.7      6.5      3.14     1.94

Table 2: relative area, relative fixation, and fixation ratio (computed with the Gaussian-kernel intersection maps) for the images shown in Figures 2, 3, 4, and 5.

Figure 6: graph showing the fixation ratio for every image in the database.

Figure 7: graph showing the fixation ratio generated with the new intersection map for every image in the database.

The graph in Figure 6 shows the fixation ratio value for every image in our study. Many images score higher than the average fixation ratio of 3.38; 22 out of 50 images have a value above the average. This suggests a prospective way of evaluating the segmentation used (k-means in this test case). It will be important to combine quantitative data (such as the fixation ratios) with qualitative inspection (such as visualizing the fixations on the segmentation maps) in order to definitively evaluate the effectiveness of segmentation algorithms and to select segmented regions for further analysis. For example, in some cases the fixation intersection map may fail to generate a high fixation ratio because the fixations are concentrated on a small lesion that falls within a large segment; in such cases it might be advantageous to choose a different segmentation algorithm. Similarly, in other cases center bias may obscure useful results. Figure 7 shows a graph similar to Figure 6 but with the intersection maps generated using the Gaussian kernels. The ratio values decreased, but the rank of each cluster for every image was still the same.
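The two summary statistics discussed here (the count of above-average images and the agreement of cluster rankings between the two intersection-map variants) could be recomputed from the per-image, per-cluster ratios along these lines; the array names and shapes are assumptions for illustration:

```python
import numpy as np

def summarize(ratios_plain, ratios_gauss):
    """ratios_plain, ratios_gauss: arrays of shape (n_images, k) holding the
    per-cluster fixation ratios for the binary and the Gaussian-kernel
    intersection maps, respectively."""
    best_plain = ratios_plain.max(axis=1)          # selected cluster's ratio per image
    above_average = int((best_plain > best_plain.mean()).sum())

    # number of images whose cluster ordering is unchanged by the Gaussian correction
    same_ranking = int((np.argsort(-ratios_plain, axis=1) ==
                        np.argsort(-ratios_gauss, axis=1)).all(axis=1).sum())
    return above_average, same_ranking
```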


Forty images out of fifty had the correct cluster voted as best, and several others had it as second best. This supports the good performance of k-means as a segmentation algorithm for this image set and demonstrates that the fixation ratio is a useful metric for assessing image processing algorithms in the context of perceptual expertise. This preliminary analysis establishes the value of using perceptual regions of interest to evaluate image processing algorithms for image understanding. In the cases where k-means (with k=4) is not effective at producing a perceptually relevant segmentation, we have a promising way of evaluating and improving other image processing algorithms for those images. Although there is a great deal of variability when different experts explicitly mark images, by eye tracking them we are able to find regions with a high degree of perceptual relevance, which could be more accurate than the explicit markings. This is just one step in our work to develop a pipeline of image processing algorithms that could be used to annotate and classify images in a way that closely parallels the expert process.

4. FUTURE WORK

Our ongoing work is to explore visual-ROI-driven image feature selection and reduction through machine learning techniques, and to combine semantic and image information using probabilistic modeling. Along with the perceptual data, we have captured verbal data that will help in developing a multi-modal image-understanding approach that incorporates domain knowledge.

5. ACKNOWLEDGMENTS

The authors wish to thank Dr. Karen Evans, Thomas Kinsman, Shagan Shah, and Bo Ding for helpful suggestions. This material is based upon work supported by the National Science Foundation under Grant No. IIS-0941452 and by Grant No. R21 LM010039-01 from the National Library of Medicine, NIH. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF) or the official views of the NLM or the National Institutes of Health.

6. REFERENCES

[1] E.A. Krupinski, “The importance of perception research in medical imaging,” Radiation Medicine, 18(6), pp. 329-334, 2000.

[2] H.H.W.J. Bosman, N. Petkov, and M.F. Jonkman, “Comparison of color representations for content-based image retrieval in dermatology,” Skin Research and Technology, 16(1), pp. 109-113, 2009.

[3] M.S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges,” ACM Transactions on Multimedia Computing, Communications and Applications, 2(1), pp. 1-19, 2006.

[4] H. Muller, N. Michoux, D. Bandon, and A. Geissbuhler, “A Review of Content-Based Image Retrieval Systems in Medical Applications: Clinical Benefits and Future Directions,” International Journal of Medical Informatics, 73(1), pp. 1-23, 2004.

[5] C.R. Shyu, A. Kak, and C.E. Brodley, “Testing of Human Perceptual Categories in a Physician-in-the-loop CBIR System for Medical Imagery,” Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL’99), pp. 102-108, 1999.

[6] Y-J. Lee and P. Bajcsy, “An information gathering system for medical image inspection,” Medical Imaging 2005: PACS and Imaging Informatics, Proc. SPIE, 5748, pp. 374-381, 2005.

[7] R. Li, P. Vaidyanathan, S. Mulpuru, J. Pelz, P. Shi, C. Calvelli, and A. Haake, “Human-Centric Approaches to Image Understanding and Retrieval,” Western New York Image Processing Workshop (WNYIPW’10), pp. 62-65, 2010.

[8] A.F. Yarbus, Eye Movements and Vision, Plenum Press, New York, 1967.

[9] J.M. Henderson and A. Hollingworth, “Eye movements during scene viewing: an overview,” in G. Underwood (Ed.), Eye Guidance in Reading and Scene Perception, pp. 269-293, New York: Elsevier, 1998.

[10] L. Dempere-Marco, X-P. Hu, and G-Z. Yang, “Visual Search in Chest Radiology: Definition of Reference Anatomy for Analyzing Visual Search Patterns,” Proc. 4th Annual IEEE Conference on Information Technology Applications in Biomedicine, 2003.

[11] H. Takiwaki, “Measurement of skin color: practical applications and theoretical considerations,” Journal of Medical Investigation, 44, pp. 121-126, 1998.

[12] M.C. Shin, K.I. Chang, and V. Tsap, “Does color space transformation make any difference on skin detection?,” IEEE Workshop on Applications of Computer Vision, p. 275, 2002.

[13] P-H. Tseng, R. Carmi, I.G.M. Cameron, and L. Itti, “Quantifying center bias of observers in free viewing of dynamic natural scenes,” Journal of Vision, 2009.