
Quantitative and Qualitative Performance Evaluations of Artificially-Produced Images

Nursuriati Jamil, Zainab Abu Bakar

Tengku Mohd Tengku Sembok

Faculty of Computer and Mathematical Sciences MARA University of Technology Shah Alam, Selangor, Malaysia [email protected], [email protected]

Deputy Vice Chancellor (Academic and Internalisation) National Defense University of Malaysia Kuala Lumpur, Malaysia [email protected]

Abstract—Although content-based image retrieval is one of the most intensively studied areas of information retrieval, research has focused almost entirely on retrieving natural images or photographs. Artificially-produced images are stylized forms of natural, real-life objects. Research on these images has received less attention; this paper therefore attempts to intensify work in this area by carrying out quantitative and qualitative performance evaluations of the retrieval of artificially-produced images. Fifty selected images are used as sample queries and are retrieved based on a fusion of geometric shape descriptors. Similarity between images is measured using Euclidean distance, and retrieval performance is evaluated using both quantitative and qualitative tests. Quantitative tests using the recall-precision rate and the hit-miss ratio showed that geometric shape descriptors are effective in retrieving the images. However, qualitative test results show that agreement between the system and the interviewed people in the assignment of similarity ranks is only at an average level.

Keywords-performance evaluation, image retrieval, artificially-produced images, geometric shape descriptor

I. INTRODUCTION

Research in image and video retrieval is probably one of the most intensive research areas in the information retrieval field. In 2002, the IEEE alone published more than 700 papers concerning retrieval [7]. The image retrieval survey in [5] also stated that content-based image retrieval (CBIR) as a field has grown tremendously since the year 2000 in terms of the people involved and the papers published. However, almost all CBIR research focuses on retrieving natural images or photographs using different combinations of low-level features. Techniques for the retrieval of artificially-produced images have received less attention and remain inadequate [16]. Artificially-produced or figurative images may be stylized forms of natural, real-life objects such as animals, flora or humans. They can also be purely abstract designs consisting of traditional or contemporary patterns. Examples of artificially-produced images are motifs, logos, clip art, trademark images, sketches and caricatures. Although these images may be colored or non-colored, shape is the main characteristic that distinctly describes them [2]. Therefore, work on intensifying the retrieval of these images should be pursued, as it may provide an ideal vehicle for the development of improved shape retrieval techniques, which could be applicable to a much wider domain of images [16].

Performance evaluation of retrievals in CBIR remains a crucial problem. According to [13], the problems arise because diversified groups work with their own sets of specialized images. There is neither a common image collection nor a common way to obtain relevance judgments or a common evaluation scheme. Past and recent works on shape-based retrieval of artificially-produced images commonly used the quantitative measures of recall and precision to evaluate retrieval performance. For example, [3][15] used recall-precision graphs to measure the retrieval of binary marine creatures and artificially-produced leaf images; [4][8] used the precision-recall rate to evaluate the retrieval of clip art and trademark images, respectively; [17] employed the recall rate to measure performance in retrieving figurative fish images, while [9] used the precision rate to evaluate the retrieval of hand-drawn sketches.

978-1-4244-5651-2/10/$26.00 ©2010 IEEE

II. SONGKET MOTIFS RETRIEVAL SYSTEM

As in most content-based image retrieval systems, the image retrieval system for songket motifs has two basic components, a database module and a retrieval module, shown graphically in Fig. 1. Images of songket patterns are captured using a digital camera, and preprocessing steps such as motif extraction, contrast enhancement, noise filtering, binarization and morphological operations are performed to produce binary songket motifs [10][11][12]. Five geometric shape features are then extracted from these binary images and stored along with the motif images in the database [14]. Fifty selected motif images are used as the query images in this paper. For each query image, the experiment retrieves relevant shapes from the database and displays them in decreasing order of similarity to the query shape. Similarity measures are computed between the features of the query motif image and those of the motif images in the database. The computation of the similarity measure should not only be efficient; the degree of similarity calculated should also produce a visual ranking of shapes that is similar to human perception [14]. In this paper, each image is represented by a feature vector, so Euclidean distance is used to measure the similarity between the query and target motif images. Each query shape is compared against every shape in the database of more than 300 motifs.

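Figure 1. Components of the Songket Motif Retrieval System. Database Module: songket pattern, image preprocessing (extract motif, enhance contrast, filter noise, threshold, morphology), feature extraction (compactness, eccentricity, rectangularity, solidity, convexity). Retrieval Module: similarity measure

$$d(Q, I) = \sqrt{\sum_{j=1}^{n} \left( f_j^{Q} - f_j^{I} \right)^2}$$

where $f_j^{Q}$ and $f_j^{I}$ are the j-th features of the query image Q and the database image I, and n is the number of features.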
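The similarity measure of the retrieval module can be sketched in a few lines of Python. This is only a minimal illustration, assuming each motif is stored as a dictionary of the five descriptor values; the names `FEATURES`, `euclidean_distance` and `rank_by_similarity`, as well as the toy values, are hypothetical and are not taken from the system described here. The `use` argument selects which descriptors are fused, so that different combinations of descriptors can be compared.

```python
import math

# Descriptor names as listed in Fig. 1; storing them as dict keys is an assumption.
FEATURES = ["compactness", "eccentricity", "rectangularity", "solidity", "convexity"]


def euclidean_distance(query, target, use=FEATURES):
    """Euclidean distance between two feature dicts over the selected descriptors."""
    return math.sqrt(sum((query[name] - target[name]) ** 2 for name in use))


def rank_by_similarity(query, database, use=FEATURES):
    """Return (motif_id, distance) pairs, most similar (smallest distance) first."""
    scored = [(motif_id, euclidean_distance(query, feats, use))
              for motif_id, feats in database.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy data (hypothetical values, not measurements from the songket database).
query = dict.fromkeys(FEATURES, 0.5)
database = {
    "motif_a": dict(query),
    "motif_b": {**query, "compactness": 0.8},
    "motif_c": {**query, "eccentricity": 0.9, "solidity": 0.2},
}

# Fusion of all five descriptors versus a fusion of three descriptors.
ranking_all = rank_by_similarity(query, database)
ranking_three = rank_by_similarity(
    query, database, use=["eccentricity", "rectangularity", "solidity"])
```

In this toy example, `motif_b` differs from the query only in compactness, so it becomes indistinguishable from the query once compactness is left out of the fusion. This is one way to see why the choice of descriptors, and not only their number, affects the ranking.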

The purpose of the retrieval experiments in this paper is to identify the fewest shape descriptors needed to characterize a motif shape adequately, so that it can be unambiguously retrieved or identified. Three experiments are conducted to test the efficiency and accuracy of the five geometric descriptors. Experiment 1 tests the fusion of three shape descriptors, Experiment 2 tests the fusion of four shape descriptors, and Experiment 3 tests the combination of all five shape descriptors.

III. DEVELOPMENT OF GROUND TRUTH DATABASE

Since this paper works with specialized images, a new image collection and ground truth database were developed. Ground truth was acquired by setting up human retrieval evaluation experiments to gather grounded data for the query-by-image task. Fifteen people, consisting of ten students majoring in Textile Design and five lecturers from various disciplines, participated in the experiments. Each participant was given a different number of query images to evaluate, depending on his or her willingness. For each query image, fifteen to fifty motif images were presented, from which the participant chose the ones considered similar to the query image (see Fig. 2). If the participant thought that none of the result images was similar, he or she proceeded to the next query. Participants were given very few guidelines for making their selection. They were only told to choose the result images they felt were similar to the query image based on its shape. The main difficulty in setting up the ground-truth database was choosing query-result pairs. The manual process was very labor-intensive and subject to human perception. In total, 308 query-result pairs were evaluated for the purpose of the experiments.

Figure 2. Sample Ground Truth Acquisition (panel label: Query Image)

IV. PERFORMANCE EVALUATION STRATEGIES

Ref. [1] stated that retrieval accuracy concerns the effectiveness of shape retrieval at both the quantitative and the qualitative level. At the quantitative level, precision, recall and the hit-miss ratio are computed. At the qualitative level, the degree of system agreement with human perception is measured.

A. Quantitative Measurements

For quantitative evaluation in CBIR, recall and precision are the most commonly used measures. Recall and precision, however, assume that all motifs have been retrieved and examined at the time of calculation. In practice, the retrieved motif images are first sorted in ascending order of distance, that is, from most to least similar to the query. In this situation, recall and precision vary as the images are examined from the top to the bottom of the ranked list. Thus, proper evaluation of the retrieval performance requires plotting a recall versus precision curve. For each query, the precision of the retrieval at each level of recall is obtained. To evaluate the retrieval performance over all queries, the average precision at each recall level is calculated as follows:

$$\bar{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q} \qquad (1)$$

where $\bar{P}(r)$ is the average precision at recall level r, $N_q$ is the number of queries used, and $P_i(r)$ is the precision at recall level r for the i-th query. Another approach to analyzing the retrieval rate is to look at the hit and miss ratio for the top ten retrieved motifs for a query. A hit is defined as the number of correctly retrieved motif images and a miss is defined as the number of incorrectly retrieved motif images during a query.

B. Qualitative Measurements

The qualitative test performed in this paper, known as the effectiveness test, is adapted from [6]. The purpose of this test is to measure to what extent the system's answers conform to those provided by human testers. In the test, 25 sample images of songket motifs are selected from the database and three motifs are selected as the queries (see Fig. 3). Twenty-five people, comprising lecturers and students from science and arts fields, are asked to rank the 25 sample images with reference to their similarity to the three query motifs. For each sample motif image and each query motif, they assign a value ranging from 0 to 1 indicating the perceived similarity between the two.
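The averaging in Equation (1) of the quantitative measurements can be illustrated with a short Python sketch. It assumes that the precision at a recall level is read off at the first rank where that recall is reached, which is one common convention and not necessarily the one used in the experiments here. The function names and the toy queries are hypothetical. Recall levels are passed as integer percentages to avoid floating-point rounding when computing how many relevant items are needed.

```python
def precision_at_recall_levels(ranked, relevant, levels_percent):
    """Precision at the first rank where recall reaches each level (in percent)."""
    relevant = set(relevant)
    precisions = []
    for level in levels_percent:
        # Number of relevant items needed to reach this recall level (integer ceiling).
        needed = max(1, -(-level * len(relevant) // 100))
        found, precision = 0, 0.0
        for position, item in enumerate(ranked, start=1):
            if item in relevant:
                found += 1
                if found >= needed:
                    precision = found / position
                    break
        precisions.append(precision)
    return precisions


def average_precision_curve(runs, levels_percent):
    """Equation (1): mean of P_i(r) over the N_q queries, for each recall level r."""
    per_query = [precision_at_recall_levels(ranked, relevant, levels_percent)
                 for ranked, relevant in runs]
    n_q = len(per_query)
    return [sum(p[k] for p in per_query) / n_q for k in range(len(levels_percent))]


# Two toy queries: (ranked result list, set of relevant items). Hypothetical data.
runs = [
    (["a", "x", "b", "y"], {"a", "b"}),
    (["x", "a", "b"], {"a", "b"}),
]
curve = average_precision_curve(runs, [50, 100])
```

For the first toy query the precision is 1 at 50% recall and 2/3 at 100% recall; for the second it is 1/2 and 2/3. Averaging over the two queries gives 0.75 and 2/3, which are the values held in `curve`.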

Figure 3. (a) Query Motifs (Query 1 to Query 3) (b) Test Set of 25 Images (Image 1 to Image 25)

For each of the 25 images, three statistical functions $p_j(i)$, representing the ranking of the i-th image with reference to query image j in the similarity list, are derived. For each function $p_j(i)$, a mean value $\bar{p}_j(i)$ and a standard deviation $\sigma_j(i)$ are derived, representing respectively the average ranking of the i-th image for a given query image j, and a measure of the agreement about a ranking close to the $\bar{p}_j(i)$-th rank. Finally, for each image i and rank k, a function $Q_j(i, k)$ with values in [0, 100] is considered, representing the percentage of people who ranked the i-th image in the k-th position with reference to query image j.

To measure the system performance while taking into account the variability and shape dependency of human judgment, the percentage of people who rank an image in the same position as the system, or in its close neighborhood, is considered. For each motif image i and query motif j, a window of width $\sigma_j(i)$ centered on the similarity rank $P_j(i)$ assigned by the system is used. Table I shows the similarity ranking $P_j(i)$ for the three query motifs derived by the system using all five geometric features. The agreement between the system and human similarity rankings for a query image j and a test image i is represented by the sum of the percentages of people $Q_j(i, k)$ who ranked the i-th image in a position between $P_j(i) - \lceil \sigma_j(i)/2 \rceil$ and $P_j(i) + \lceil \sigma_j(i)/2 \rceil$. This is defined as the function $S_j(i)$:

$$S_j(i) = \sum_{k = P_j(i) - \lceil \sigma_j(i)/2 \rceil}^{P_j(i) + \lceil \sigma_j(i)/2 \rceil} Q_j(i, k) \qquad (2)$$

TABLE I. SIMILARITY RANKING PRODUCED BY THE SYSTEM

Image No.  Query 1  Query 2  Query 3
1          7        16       6
2          20       17       4
3          18       7        23
4          2        6        17
5          21       12       24
6          14       9        2
7          16       19       22
8          8        18       13
9          12       21       3
10         10       15       9
11         5        11       12
12         6        4        14
13         19       3        19
14         25       25       1
15         23       20       25
16         13       8        10
17         15       1        15
18         11       22       5
19         24       23       21
20         1        13       8
21         3        5        18
22         4        10       20
23         9        2        16
24         17       14       11
25         22       24       7
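The agreement measure of Equation (2) can be sketched as follows. This is a minimal illustration under stated assumptions: the window runs from the system rank minus the ceiling of half the standard deviation to the system rank plus the same amount, as described in the text, and the human ranks are given as one integer per person. The function names and the toy ranks are hypothetical.

```python
import math
from collections import Counter


def rank_percentages(human_ranks):
    """Q_j(i, k): percentage of people who placed the image at each rank k."""
    counts = Counter(human_ranks)
    n_people = len(human_ranks)
    return {rank: 100.0 * count / n_people for rank, count in counts.items()}


def agreement(human_ranks, system_rank, sigma):
    """S_j(i): sum of Q_j(i, k) over the window centered on the system rank."""
    q = rank_percentages(human_ranks)
    half = math.ceil(sigma / 2)  # half-width of the window (assumed ceiling of sigma / 2)
    window = range(system_rank - half, system_rank + half + 1)
    return sum(q.get(k, 0.0) for k in window)


# Toy example: five people rank one image; the system placed it at rank 2.
human_ranks = [1, 1, 2, 3, 5]
score_wide = agreement(human_ranks, system_rank=2, sigma=1.5)
score_exact = agreement(human_ranks, system_rank=2, sigma=0.0)
```

With a standard deviation of 1.5 the window covers ranks 1 to 3, so four of the five people fall inside it and the agreement is 80%. With a zero standard deviation only the person who chose rank 2 exactly is counted, giving 20%.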

V. RESULTS AND DISCUSSIONS

The results are presented in two sections: 1) quantitative measurements using the recall-precision rate and the hit-miss ratio, and 2) qualitative measurements using the effectiveness test.

A. Quantitative Performance Analysis

Table II and Fig. 4 present the performance results of Experiment 1 through Experiment 3 using recall and precision evaluations. Experiment 1 combined eccentricity, rectangularity and solidity. Experiment 2 combined rectangularity, compactness, convexity and solidity, and Experiment 3 combined all five geometric shape descriptors. As can be seen from Table II, Experiment 3 has the highest precision at the 10% recall level (97.7 percent), followed by Experiment 1 (96.6 percent) and Experiment 2 (91.7 percent). Experiment 3, which integrated all five shape descriptors, also has the highest average precision of the three combinations. This outcome seems to indicate that the more descriptors are used, the higher the precision.

However, Experiment 1, which combined three shape descriptors, shows higher precision than Experiment 2, which combined four descriptors. A previous experiment [14] indicated that the eccentricity feature has the highest recall-precision value, followed by rectangularity, compactness, convexity and solidity. Ref. [10] also showed that, in an experiment that combined three shape descriptors, fusions that include eccentricity always performed better than fusions that exclude it. Here again, Experiment 2, which excludes the eccentricity feature, performed comparatively poorly even though four features are used in retrieving the motifs. Therefore, it can be concluded that using more features in the retrieval does not necessarily improve the retrieval precision for songket motifs. The correct combination of geometric features plays an important role in determining the best fusion for retrieving the motifs.

TABLE II. PRECISION RATE AT EVERY 10% RECALL

Recall   Experiment 1  Experiment 2  Experiment 3
10%      0.966333      0.916509      0.977264
20%      0.681023      0.637763      0.701356
30%      0.420569      0.368662      0.429474
40%      0.34717       0.27408       0.365254
50%      0.296107      0.220191      0.298644
60%      0.228125      0.16515       0.242633
70%      0.184289      0.115926      0.196718
80%      0.140523      0.08912       0.153559
90%      0.097136      0.068099      0.106755
100%     0.082033      0.057759      0.091711
Average  0.344331      0.291326      0.356337

Figure 4. Recall-precision Curve of Retrieval Experiments (x-axis: Recall, 10 to 100; y-axis: Precision, 0 to 1; curves: Ecct+Rect+Solid, Rect+Comp+Conv+Solid, All)

Another quantitative measure of retrieval accuracy is the hit and miss ratio for the top ten retrieved motifs for selected queries, shown in Table III. As expected, Experiment 3 performed better in general than the other two experiments, with an average hit ratio of 52%. The first motif shows an equal hit ratio for all three experiments. The rectangularity and solidity features, which are common to all three experiments, seem to contribute significantly to the retrieval of motifs of this nature. Experiment 2 has the lowest hit ratio for the second and fifth motifs. These motifs are elongated, so the absence of the eccentricity feature in Experiment 2 may be the cause of the poor performance. Experiment 1, however, performed worst for the third and fourth motifs. These motifs are less dense, so the solidity feature does not have a significant effect on the hit ratio.

TABLE III. HIT AND MISS RATIO OF SELECTED QUERIES

Query            Experiment 1  Experiment 2  Experiment 3
1                70            70            70
2                60            30            60
3                10            20            30
4                30            50            60
5                40            10            40
Average Hit (%)  42            36            52
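The hit and miss ratio for the top ten retrieved motifs can be computed as in the sketch below. The function name `hit_miss_ratio` and the toy result list are hypothetical. The per-query hit ratios are the values read from Table III, and they are included only to show that the column means reproduce the reported average hit ratios.

```python
def hit_miss_ratio(ranked, relevant, top_n=10):
    """Return (hit %, miss %) among the top_n retrieved motifs."""
    top = ranked[:top_n]
    hits = sum(1 for item in top if item in relevant)
    return 100.0 * hits / len(top), 100.0 * (len(top) - hits) / len(top)


# Toy query (hypothetical): 12 retrieved motifs, 7 of the top ten are relevant.
ranked = [f"m{i}" for i in range(1, 13)]
relevant = {"m1", "m2", "m3", "m5", "m6", "m8", "m9", "m11"}
hit, miss = hit_miss_ratio(ranked, relevant)

# Hit ratios (%) for the five selected queries, as read from Table III.
table_iii = {
    "Experiment 1": [70, 60, 10, 30, 40],
    "Experiment 2": [70, 30, 20, 50, 10],
    "Experiment 3": [70, 60, 30, 60, 40],
}
average_hit = {name: sum(vals) / len(vals) for name, vals in table_iii.items()}
```

Here `m11` is relevant but falls outside the top ten, so it does not count as a hit. The toy query therefore has a hit ratio of 70% and a miss ratio of 30%.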

B. Effectiveness Test

The results of the ranking test on the 25 sample images in Fig. 3, performed by 25 people, showed that perceived similarity varies considerably between people, and the range of variability confirmed that similarity is a subjective measure. With varying backgrounds and experiences, the answers changed from shape to shape and from person to person. Total agreement among all persons is achieved only when a motif image is a perfect match to the query image. Table IV shows part of the similarity values assigned by the 25 people to the 25 test images against Query 1.

TABLE IV. SIMILARITY VALUES OF THE 25 TEST IMAGES AGAINST QUERY 1 PERCEIVED BY TWENTY-FIVE PERSONS

Img No.  P1    P2    P3    …  P21  P22   P23  P24  P25
1        0.5   0.3   0.32  …  0.5  0.85  0.5  0.5  0.5
2        0.3   0.4   0.43  …  0.3  0.3   0.3  0.3  0.3
:        :     :     :        :    :     :    :    :
20       1     1     1     …  1    1     1    1    1
21       0.52  0.3   0.21  …  0.7  0.5   0.6  0.5  0.5
22       0.89  0.92  0.95  …  0.9  0.9   0.9  0.9  0.95
23       0.78  0.89  0.21  …  0.1  0.5   0.9  0.6  0.5
24       0.85  0.75  0.8   …  0.9  0.9   0.9  0.5  0.88
25       0.63  0.85  0.32  …  0.3  0.5   0.6  0.6  0.7

After the 25 images are ranked, three statistical functions $p_j(i)$, representing the ranking of the i-th image with reference to query image j in the similarity list, are derived. Sample results for one image are presented in Table V, arranged by the image i, where 1 ≤ i ≤ 25. Only a small number of the images tested produced a low standard deviation, indicating strong agreement among participants regarding the ranking. The majority, however, show a relatively high standard deviation, signifying that agreement among the participants about the image ranking is low.

TABLE V. RANKING OF THE i-TH MOTIF IMAGE WITH REFERENCE TO THE j-TH QUERY IMAGE

j  P1  P2  P3  …  P23  P24  P25  p̄j(i)  σj(i)
1  23  17  11  …  8    16   20   18     4
2  23  3   23  …  17   23   9    15     7
3  23  23  18  …  5    6    8    16     5

Fig. 5 to Fig. 7 present plots of $S_j(i)$ as a function of the rankings $P_j(i)$. The figures show the agreement between the interviewed people and the system in ranking the i-th motif image in the $P_j(i)$-th position, for the three query motifs. Only ranks from one to six are shown, since they represent the agreement on the most similar motifs. As can be seen from the plots, agreement between the system and the interviewed people in the assignment of similarity ranks is at an average level. For certain motifs the agreement is large, while for others it is small. The average agreement (shown as a dotted line) is 44.7% for Query 1, 51.3% for Query 2 and 42.7% for Query 3. The agreement between human and system similarity rankings sometimes decays, possibly because not many similar motif images exist in the test set for precise ranks to be assigned to the query motifs. This is especially true for Query 3.

Figure 5. Values of agreement between system and human similarity for Query 1. (Plot of $S_j(i)$, from 0 to 100, against system rank $P_j(i)$, from 1 to 6; Min: 8, Max: 100, Ave: 44.7.)
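The ranking statistics used in the effectiveness test, namely the per-person rank of an image together with its mean and standard deviation across people, can be derived from the raw similarity values as sketched below. This is an illustration only: the way ties are broken (by original order) and the use of the population standard deviation are assumptions, and the function names and toy similarity values are hypothetical.

```python
from statistics import mean, pstdev


def ranks_from_similarity(values):
    """Map image id -> rank (1 = most similar) from one person's similarity values."""
    ordered = sorted(values, key=lambda image: values[image], reverse=True)
    return {image: position for position, image in enumerate(ordered, start=1)}


def rank_statistics(people, image_id):
    """Mean and population standard deviation of one image's rank across people."""
    ranks = [ranks_from_similarity(person)[image_id] for person in people]
    return mean(ranks), pstdev(ranks)


# Toy data: two people score three images against one query (values in [0, 1]).
people = [
    {"a": 1.0, "b": 0.5, "c": 0.2},
    {"a": 0.9, "b": 0.1, "c": 0.6},
]
mean_rank_b, sigma_b = rank_statistics(people, "b")
```

In the toy data, image `b` is ranked second by the first person and third by the second, giving a mean rank of 2.5 and a standard deviation of 0.5. A small standard deviation of this kind corresponds to strong agreement among participants about where an image belongs in the similarity list.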

Figure 6. Values of agreement between system and human similarity for Query 2. (Plot of $S_j(i)$, from 0 to 100, against system rank $P_j(i)$, from 1 to 6; Min: 8, Max: 100, Ave: 51.3.)

Figure 7. Values of agreement between system and human similarity for Query 3. (Plot of $S_j(i)$, from 0 to 100, against system rank $P_j(i)$, from 1 to 6; Min: 16, Max: 100, Ave: 42.7.)

VI. CONCLUSIONS

Based on the experiments conducted, the fusion of all five features has the highest retrieval rate compared to the other fusions of features. However, this does not mean that using more features in the retrieval will necessarily improve the retrieval rate for songket motifs. The correct combination of geometric features plays an important role in determining the best fusion for retrieving the motifs. Eccentricity is the most effective feature in retrieving the songket motifs, and solidity has a significant contributing effect when combined with other features. When the effect of each feature in retrieving the songket motifs is compared, the hit ratio differs depending on the motif’s shape. For elongated motifs, eccentricity has the highest hit rate; for dense and solid motifs, solidity has the highest hit rate; and for irregularly shaped motifs, compactness and rectangularity have the lowest hit rate. The effectiveness test comparing the system’s performance with human testers produced an average agreement ranging from 42.7 percent to 51.3 percent. Individual differences among people with various backgrounds and experiences may have largely influenced the outcome of this test.

ACKNOWLEDGMENT

The authors would like to thank the Malaysian Ministry of Science, Technology and Innovation for providing the Intensification of Research in Priority Areas (IRPA) grant to fund this research.

REFERENCES

[1] S. Berretti, A. del Bimbo and P. Pala, “Retrieval by shape similarity with perceptual distance and effective indexing,” IEEE Transactions on Multimedia, vol. 2, no. 4, pp. 225-239, 2000.
[2] M. Bouet, A. Khenchaf and H. Briand, “Shape representation for image retrieval,” ACM Multimedia, vol. 2, pp. 1-4, 1999.
[3] G. Bordogna, L. Ghilardi, S. Milesi and M. Pagan, “A flexible retrieval system of shapes in binary images,” 30th ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, pp. 745-746.
[4] C-G. Choi, S. Cheong, Y. Chang and S.H. Kim, “Clipart image retrieval system using shape information,” in J.X. Yu, X. Lin, H. Lu and Y. Zhang (eds.), APWeb 2004, LNCS 3007, Springer, Heidelberg, 2004, pp. 765-771.
[5] R. Datta, D. Joshi, J. Li and J.Z. Wang, “Image retrieval: Ideas, influences and trends of the new age,” ACM Computing Surveys, vol. 40, no. 2, article 5, pp. 1-60, 2008.
[6] A. del Bimbo and P. Pala, “Visual image retrieval by elastic matching of user sketches,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 121-132, 1997.
[7] H. Eidenberger, “New perspectives on visual information retrieval,” Proc. SPIE Electronic Imaging Symposium, San Jose, CA, 2004, pp. 133-144.
[8] O. El Badawy and M. Kamel, “Shape-based image retrieval applied to trademark images,” International Journal of Image and Graphics, vol. 2, no. 3, pp. 375-393, 2002.
[9] H.S. Horace, A.K.Y. Cheng and W.Y.F. Wong, “Affine invariant retrieval of shapes based on hand-drawn sketches,” Proc. 16th Conf. on Pattern Recognition, Washington, DC, 2002, pp. 794-797.
[10] N. Jamil and Z. Abu Bakar, “Shape-based image retrieval of songket motifs,” 19th Annual Conf. of the National Advisory Committee on Computing Qualifications, Wellington, New Zealand, 2006, pp. 213-219.
[11] N. Jamil and T.M. Tengku Sembok, “Gradient-based edge detection of songket motifs,” in T.M.T. Sembok, H.B. Zaman, H. Chen, S.R. Urs and S.H. Myaeng (eds.), Digital Libraries: Technology and Management of Indigenous Knowledge, Springer-Verlag, Berlin, 2003, pp. 456-467.
[12] N. Jamil, Z. Abu Bakar and T.M. Tengku Sembok, “Noise removal and enhancement of binary images using morphological operations,” Proc. 2nd International Symposium on Information Technology, 2008, pp. 1-6.
[13] H. Müller, W. Müller, D.M. Squire, S. Marchand-Maillet and T. Pun, “Performance evaluation in content-based image retrieval: Overview and proposals,” Pattern Recognition Letters, vol. 22, no. 5, pp. 593-601, Elsevier Science Inc., New York, 2001.
[14] N. Jamil, Z. Abu Bakar and T.M. Tengku Sembok, “Retrieval of songket motifs using geometric shape descriptors,” International Journal of Pattern Recognition and Machine Intelligence, vol. 1, no. 4, pp. 108-113, 2006.
[15] M. Sarfraz and A.M. Ridha, “Content-based image retrieval using multiple shape descriptors,” Proc. IEEE/ACS International Conference on Computer Systems and Applications, 2007, pp. 730-737.
[16] J. Schietse, J.P. Eakins and R.C. Veltkamp, “Practice and challenges in trademark image retrieval,” Proc. 6th ACM Conf. on Image and Video Retrieval, ACM Press, New York, 2007, pp. 518-524.
[17] L. Wenyin, T. Wang and H. Jiang, “A hierarchical characterization scheme for image retrieval,” Proc. International Conf. on Image Processing, Washington, DC, 2000, pp. 42-45.