On the Quality Assessment of Enhanced Images: A Database, Analysis, and Strategies for Augmenting Existing Methods

Cuong T. Vu, Thien D. Phan, Punit S. Banga, and Damon M. Chandler
Laboratory of Computational Perception and Image Quality
School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078 USA
{cuong.vu, thien.phan, punit.singh.banga, damon.chandler}@okstate.edu

Abstract—Most methods of image quality assessment (QA) have been designed for QA of degraded images. This paper presents the results of a study designed to investigate whether existing QA methods can be adapted to succeed on enhanced images. We developed a database containing digitally enhanced images and associated subjective quality ratings. Next, we analyzed the efficacy of select QA methods and their reverse-mode versions in predicting the ratings. Because an enhanced image makes the original image appear degraded, we tested both normal and reverse-mode versions, where the latter were implemented by specifying the enhanced image as the reference and the original image as the "degraded" image. Our results demonstrate that this reverse-mode approach improves QA of enhanced images. We present a strategy for further improving the QA methods by using measures of contrast, sharpness, and color saturation.

I. INTRODUCTION

Algorithms designed to estimate image and video quality play a crucial role in the design and operation of numerous processing, coding, and analysis systems. The vast majority of quality assessment (QA) algorithms have been designed for degraded images, often operating under the assumption that a high-quality image is one which is most visually similar to the original (reference) image (e.g., [1], [2], [3]). However, many systems yield images whose quality is enhanced relative to the original; for such enhanced images, the notion of similarity is less applicable and a different QA tactic may be needed.

Image enhancement is one of the most fundamental operations in digital photography and image editing. Enhancement or retouching is a required step that every professional photographer performs after acquiring photos. Although there is no standard rule to follow when editing an image, most photographers implement several steps such as auto-cropping, noise reduction, sharpening, and contrast, white-balance, and color adjustments. These and other forms of processing (e.g., demosaicing, super-resolution, computational photography) yield images that are dissimilar to, but of superior visual quality compared to, the original images.

Unfortunately, QA of enhanced images is challenging due to the fact that the changes can often be subtle and can affect the artistic impression of the image. Nonetheless, it is

still possible to perform QA of enhanced images based on changes in low-level attributes. The work of Fairchild and Johnson [4] begins to address this issue by using contrast- and color-appearance models. In addition, VIF [5] can yield a value larger than unity (denoting quality greater than the original) when the "degraded" image contains linear contrast enhancement. Another major roadblock that hinders research on QA of enhanced images is the fact that there currently exists no database of subjective ratings for enhanced images.

One very consistent finding which we observed when viewing an original image and its enhanced version is that the enhanced image makes the original image appear degraded (as long as the enhanced image is not over-enhanced such that it appears artificial). Given this fact, we asked whether it is possible to perform QA of enhanced images by operating existing degradation-based QA methods in reverse and then reinterpreting the results. Specifically, given an original image and its enhanced version, the original image can be thought of as a degraded version of the enhanced image. In this way, existing degradation-based QA techniques may be applicable.

In this paper, we present the results of a study designed to investigate whether existing QA methods can be adapted to succeed on images enhanced via standard retouching techniques. Our specific contributions are three-fold: First, we present a new database containing digitally enhanced images and associated subjective quality ratings. Next, we analyze the efficacy of existing QA methods and their reverse-mode implementations in predicting these quality ratings. Finally, we present a QA algorithm designed specifically for enhanced images; our algorithm measures perceived contrast, sharpness, and color, and then combines these measures with an additional stage which estimates quality based on local statistical differences between log-Gabor coefficients of the original and enhanced images.
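The reverse-mode idea amounts to swapping the roles of the two images before handing them to any asymmetric full-reference metric. As a minimal sketch (the SNR-style score below is our own toy stand-in, not VSNR, VIF, or MAD; it is asymmetric because the signal power is taken from whichever image plays the reference role):

```python
import numpy as np

def snr_db(reference, test):
    # Toy asymmetric full-reference score: the signal power comes from
    # the image designated as the reference, so swapping the roles of
    # the two images changes the result (as with VSNR, VIF, and MAD).
    ref = reference.astype(float)
    mse = np.mean((ref - test.astype(float)) ** 2)
    return 10.0 * np.log10(ref.var() / mse)

def normal_mode(original, enhanced):
    return snr_db(original, enhanced)   # original treated as reference

def reverse_mode(original, enhanced):
    return snr_db(enhanced, original)   # enhanced treated as reference
```

Symmetric metrics such as MSE or SSIM would return the same value in both modes, which is why the study is restricted to metrics that distinguish the reference from the degraded image.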
This paper is organized as follows: Section II provides the details of the QA database for enhanced images. In Section III we analyze the performance of current QA methods on our database and provide details of the augmentation techniques. General conclusions are provided in Section IV.

II. A QA DATABASE FOR ENHANCED IMAGES

This project was supported by the National Science Foundation, “Content-Based Strategies of Image and Video Quality Assessment,” Award No. 0917014.

In this section, we provide details of the QA database for enhanced images. We first describe the images in the database, and then the experiments to obtain subjective ratings. The results of the experiments are provided in Section II-D.

A. Images and Enhancements

Twenty-six color images of size 512 × 512 pixels containing commonplace subject matter were obtained from the Kodak and CSIQ [3] databases. Using Adobe Photoshop, each of these 26 images was modified by the first author to generate three enhanced versions of varying quality. The enhancements were made by editing contrast, sharpness, brightness, color, or combinations of these properties. In total, 104 images were used in this study (26 original images and 26×3 = 78 enhanced versions of these originals). Some examples of the enhanced images are shown in Figure 1; the entire database and ratings are available online [6].

B. Apparatus and Subjects

Images were displayed on a professional-grade, wide-gamut LaCie 324 24-inch LCD monitor (1920×1200 at 60 Hz; 92% NTSC color gamut). The display yielded minimum and maximum luminances of 0.80 and 259 cd/m², respectively, with a luminance gamma of 2.2. Images were viewed binocularly through natural pupils in a darkened room at a distance of 60 cm. Nine adult subjects (the second author and eight naive college students) took part in the experiment. Subjects ranged in age from 23 to 34 years.

C. Methods

Obtaining reliable subjective ratings of quality for enhanced images is more difficult than for degraded images due to the fact that the changes are often subtle. Based on several pilot experiments, we employed a three-step procedure: (1) within-image ranking, then (2) within-image rating constrained by the ranks, and then (3) across-image rating constrained by the within-image ratings.

1) Within-Image Ranking: First, using a pairwise-comparison paradigm, within-image rankings were obtained. For each original image, subjects were shown every possible pairwise combination of the original and its three enhanced versions (C(4,2) = 6 comparison pairs).
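The six comparison pairs for one original and its three enhanced versions can be enumerated directly (illustrative only; the version names below are placeholders):

```python
from itertools import combinations

# The four versions shown to subjects for one original image
versions = ["original", "enhanced1", "enhanced2", "enhanced3"]

# All unordered pairs: C(4,2) = 6 pairwise comparisons
pairs = list(combinations(versions, 2))
```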
For each pair, subjects indicated which image was of superior quality or whether the images were of equivalent quality. Subjects were unaware of which image was the original. However, in all cases, the originals received either the lowest ranks or were tied for the lowest ranks.

2) Within-Image Rating: Next, using a multiple-stimulus continuous quality evaluation (MSCQE) paradigm, within-image ratings were obtained. For each original image, subjects were simultaneously shown the original and its three enhanced versions. Based on the ranks obtained in the previous step, the highest-ranked and lowest-ranked images were fixed at the right-hand and left-hand sides of the screen, respectively. Subjects were instructed to position the two

remaining images such that the horizontal displacement to the right was linearly proportional to each image's quality relative to the other images. Thus, an image placed further to the right was considered to have superior quality. This MSCQE paradigm permitted simultaneous comparison of multiple versions of the same image, which allowed subjects to actively correct previous ratings. However, the placement of the images was constrained in order to maintain the rank-orders obtained in the previous step. To assist subjects, a numerical score corresponding to the position of each image was displayed during the experiment; the leftmost (worst) image had a fixed score of 0 and the rightmost (best) image had a fixed score of 10.

3) Across-Image Rating: Finally, across-image ratings were obtained by again using an MSCQE paradigm. Based on the ratings obtained in the previous step, subjects were simultaneously shown 52 images: the 26 original images and the 26 highest-rated versions of each original. The 26 original images were stacked vertically and fixed in position at the left-hand side of the screen (a vertical scrollbar was used for paging). Each of the 26 highest-rated versions was initially placed to the right of its original. Subjects were instructed to displace each of these latter 26 images such that its horizontal displacement to the right corresponded to increasing visual quality relative to its original image. Again, a numerical score corresponding to the position of each image was displayed to assist subjects. From these scores, we computed scaling factors by which to scale the within-image ratings to obtain across-image ratings.

D. Results

Agreement between subjects was quite good. In terms of rank-ordering, every subject agreed with the average ranking on at least 84% of the images.
For the final across-image ratings, the linear correlation coefficient between the scores of each subject on the 78 images and the average scores of all other subjects ranged from 0.810 to 0.953. The final across-image ratings were converted to z-scores, and then the z-scores from all subjects were averaged and scaled to span the range [0, 1], where a score of 1 corresponded to the image of greatest quality relative to its original image. Because the original images received the lowest ranks, and because all originals were fixed in position during the across-image rating stage, all originals received a score of 0 after the rescaling. The resulting average scores thus represent DMOS values (difference mean-opinion scores) relative to the originals.

Figure 1 shows two original images, two enhanced versions of each, and the corresponding DMOS values. In the first column, image boating1 was generated by sharpening the original boating image, and image boating2 was generated by increasing contrast and saturation. Even though boating2 appears more colorful than boating1, the subjective ratings suggest that subjects prefer the sharpened image. Similarly,

Figure 1. Examples of enhanced images with their subjective ratings and their original images. First column: original image boating; image boating1, rating score 0.51; image boating2, rating score 0.37. Second column: original image show; image show1, rating score 0.63; image show2, rating score 0.49.

Figure 2. First row: original image flower on the left and its enhanced version, which received the absolute best subjective rating (DMOS = 1). Second row: original image redwood on the left and its enhanced version, which received the absolute worst subjective rating (DMOS = 0.17).

Table I. Performances of select QA methods on the enhanced-image database. MADd = detection-based MAD output; MADa = appearance-based MAD output; see [3]. Note that MADa yields the same output for normal and reverse modes. (See Section III-B for descriptions of the last two rows.)

in the second column, subjects assigned a higher score to the sharpened image show1 than to the color-enhanced image show2, despite the fact that both images were contrast-enhanced by the same amount. Figure 2 shows the absolute best- and worst-rated images in the database along with their corresponding originals. The original image flower appears low in contrast, sharpness, and colorfulness. Its enhanced version, which received the highest rating, was enhanced in contrast, sharpened, and locally color-corrected (for the stamen of the flower and the leaves). However, for image redwood in the second row, there is little room for enhancement. The enhanced version of this image, shown in Figure 2, was enhanced only in terms of contrast; other enhanced versions of this image received similarly low DMOS values.

III. QA ALGORITHMS ON ENHANCED IMAGES

A. Results of Existing QA Methods

When comparing the enhanced images in Figures 1 and 2 with their respective originals, it is quite apparent that

Method            CC (normal / reverse)   SROCC (normal / reverse)   RMSE (normal / reverse)
VSNR              0.358 / 0.391           0.362 / 0.458              1.904 / 1.880
VIF               0.678 / 0.830           0.668 / 0.808              1.499 / 1.138
MAD               0.700 / 0.727           0.686 / 0.712              1.461 / 1.400
MADd              0.225 / 0.262           0.372 / 0.384              1.987 / 1.968
MADa              0.814                   0.768                      1.185
Feature diff. d   0.843                   0.841                      1.083
d × MADa          0.884                   0.864                      0.954

the enhanced images make the originals appear degraded. Given this fact, we asked whether it is possible to perform QA of enhanced images by operating existing degradation-based QA methods in reverse. To test this, we applied three full-reference QA algorithms, VSNR [2], VIF [5], and MAD [3], in two modes: (1) operating the algorithms in their normal modes; and (2) operating each algorithm in reverse mode, where the enhanced image was specified as the reference and the original image was specified as the degraded image. These particular algorithms were chosen because they require knowledge of which of the two images is the reference and which is the degraded image; if the roles of these images are swapped, the algorithms will yield a different result. The first five rows of Table I show the performances of these algorithms in terms of Pearson correlation coefficient

(CC), Spearman rank-order correlation coefficient (SROCC), and RMSE between the predicted scores and DMOS values. Before computing CC and RMSE, we applied a VQEG-recommended logistic transformation to the predicted scores [7]. Also shown in Table I are the individual performances of the detection-based and appearance-based stages of MAD. The detection-based stage operates based on a model of masking, whereas the appearance-based stage compares the local statistics of log-Gabor coefficients of the degraded and reference images.

The results in Table I indicate that VSNR and MADd, both of which operate based on models of visual masking, fail when applied to enhanced images. This result is not unexpected given that the masking models were designed for visual detection of distortion. VIF (both normal and reverse-mode) and MADa perform quite well on the enhanced images. These results suggest that statistical differences between local frequency coefficients, which are used in both VIF and MADa, can be an effective strategy for capturing enhancement-related appearance changes, as long as the results are interpreted to assume positive changes in quality.

B. Augmenting Existing QA Methods

Based on the fact that the images were enhanced mainly in terms of contrast, sharpness, and color saturation, we tested whether measurements of these features could improve the performance of MADa. Below, we describe our augmented version of MADa. (We are currently investigating the proper technique to augment VIF.)

Let X denote an N1 × N2 RGB color image. For contrast, X was converted into luminance assuming sRGB display conditions. Next, the luminance image was divided into blocks of size 8 × 8 with 50% overlap, and then the RMS contrast [8] was measured for each block to obtain a local contrast map. For sharpness, we used our S3 algorithm, which yields a local sharpness map [9]. For saturation, X was converted to an HSV image, and then the S layer was used as the saturation map.
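A rough sketch of the contrast feature map follows (our own illustration: it uses the common std/mean form of RMS contrast and a simple Rec. 601 luma in place of the full sRGB-to-luminance conversion, so details may differ from the paper's implementation):

```python
import numpy as np

def luminance(rgb):
    # Rec. 601 luma as a simple stand-in for the sRGB luminance conversion
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def rms_contrast_map(lum, block=8, overlap=0.5):
    # Slide a block x block window with 50% overlap (step of 4 pixels)
    # and record the RMS contrast (std of luminance over its mean)
    # of each block to build a local contrast map.
    step = max(1, int(block * (1.0 - overlap)))
    rows = range(0, lum.shape[0] - block + 1, step)
    cols = range(0, lum.shape[1] - block + 1, step)
    cmap = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            patch = lum[r:r + block, c:c + block]
            mu = patch.mean()
            cmap[i, j] = patch.std() / mu if mu > 0 else 0.0
    return cmap
```

The sharpness and saturation maps plug into the same pipeline: S3 [9] produces the local sharpness map, and the S channel of the HSV conversion serves directly as the saturation map.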
Let f_i(X) denote one of the above-mentioned feature maps of X, where i ∈ {contrast, sharpness, saturation}. Let X_ref and X_enh denote the reference and enhanced images, respectively. Let d_i denote a scalar which indicates the extent to which X_ref and X_enh differ in terms of feature i:

    d_i = ||f_i(X_enh)||_2 − ||f_i(X_ref)||_2,        (1)

where ||·||_2 denotes the L2-norm. We next take a simple linear combination of the d_i values for i ∈ {contrast, sharpness, saturation} to determine the overall change d:

    d = d_contrast + d_sharpness + d_saturation,        (2)

where a value of d > 0 indicates that the enhanced image has greater contrast, sharpness, and/or saturation than the reference image.

Finally, the augmented version of MADa is computed as the product of MADa and d:

    Augmented MADa = MADa × d.        (3)
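Equations (1)–(3) reduce to a few lines; a minimal sketch follows (the MADa score and the three feature maps are assumed to be computed elsewhere):

```python
import numpy as np

def feature_diff(feat_enh, feat_ref):
    # Eq. (1): d_i = ||f_i(X_enh)||_2 - ||f_i(X_ref)||_2
    return np.linalg.norm(feat_enh.ravel()) - np.linalg.norm(feat_ref.ravel())

def augmented_mad_a(mad_a, feats_enh, feats_ref):
    # feats_* are the (contrast, sharpness, saturation) maps of each image.
    # Eq. (2): d = d_contrast + d_sharpness + d_saturation
    d = sum(feature_diff(fe, fr) for fe, fr in zip(feats_enh, feats_ref))
    # Eq. (3): augmented score = MAD_a x d
    return mad_a * d
```

Note that because d can be negative (when the "enhanced" image has less contrast, sharpness, and saturation than the reference), the product in Eq. (3) also carries the direction of the change.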

The last two rows of Table I show the results of d alone and d × MADa on the database. On its own, d is quite effective at predicting the quality ratings; this result is not surprising considering that d takes into account most of the features which were modified to generate the enhanced images. (See [6] for the individual performances of each d_i on the database.) When d is combined with MADa, a further improvement is observed. A similar combination applied to VIF does not show as great an improvement. We are currently testing other methods for augmenting VIF; we expect similar improvements and perhaps even better performance than d × MADa.

IV. CONCLUSIONS

This paper presented the results of a study designed to investigate whether existing QA methods can be adapted to succeed on enhanced images. Our analysis suggests that a future QA algorithm designed specifically for enhanced images may benefit from a strategy that involves both enhanced-feature measurements and statistical comparisons of local frequency coefficients. Although we believe this study represents an important first step, it is also important to note that a correlation of approximately 0.88 still leaves much room for improvement; on degraded images, correlations of 0.95 are not uncommon. Furthermore, the approaches presented here have not been tested on over-enhanced or degraded images. We are currently in the process of expanding the database to contain both over-enhanced images and images which contain combinations of enhancements and degradations.

REFERENCES

[1] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, pp. 600–612, 2004.
[2] D. M. Chandler and S. S. Hemami, "VSNR: A wavelet-based visual signal-to-noise ratio for natural images," IEEE Transactions on Image Processing, vol. 16, 2007.
[3] E. C. Larson and D. M. Chandler, "Most apparent distortion: full-reference image quality assessment and the role of strategy," Journal of Electronic Imaging, vol. 19, no. 1, 2010.
[4] M. D. Fairchild and G. M. Johnson, "iCAM framework for image appearance, differences, and quality," Journal of Electronic Imaging, vol. 13, pp. 126–138, 2004.
[5] H. R. Sheikh and A. C. Bovik, "Image information and visual quality," IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, 2006.
[6] http://vision.okstate.edu/index.php?loc=driq
[7] VQEG, "Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment, Phase II," August 2003, http://www.vqeg.org.
[8] B. Moulden, F. A. A. Kingdom, and L. F. Gatley, "The standard deviation of luminance as a metric for contrast in random-dot images," Perception, vol. 19, pp. 79–101, 1990.
[9] C. Vu, T. Phan, and D. Chandler, "S3: A spectral and spatial measure of local perceived sharpness in natural images," IEEE Transactions on Image Processing, vol. 21, no. 3, 2011.