Multimed Tools Appl (2016) 75:2419–2434 DOI 10.1007/s11042-015-2645-y

Sketch4Image: a novel framework for sketch-based image retrieval based on product quantization with coding residuals

Qiang Li 1 & Yahong Han 1,2 & Jianwu Dang 1,2,3

Received: 24 September 2014 / Revised: 16 February 2015 / Accepted: 20 April 2015 / Published online: 1 May 2015
© Springer Science+Business Media New York 2015

Abstract Sketch-based Image Retrieval (SBIR) is an important branch of Content-based Image Retrieval (CBIR) in which the query is a simple edge or contour image. SBIR is more difficult than CBIR because sketches carry little visual information, which makes the Bag-of-Words (BoW) codebook hard to construct. In this paper, we propose a novel SBIR framework based on Product Quantization (PQ) with sparse coding (SC) to construct an optimized codebook. Using state-of-the-art local descriptors, we transform sketch images into features and then build the optimized codebook using PQ-based SC. In the retrieval stage, we obtain a better representation of the query sketch and the testing images by combining the optimized codebook with coded quantization residuals, which reduces the information loss during feature encoding; similarity is computed by comparing the feature histograms of the query sketch and the testing data to produce the final results. We demonstrate the superiority and effectiveness of the proposed SBIR framework by comparing it with several state-of-the-art methods on three public sketch datasets.

Keywords Sketch-based image retrieval · Product quantization · Sparse coding · Residual

* Yahong Han
[email protected]

Qiang Li
[email protected]

Jianwu Dang
[email protected]

1 School of Computer Science and Technology, Tianjin University, Tianjin, China
2 Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, China
3 School of Information Science, Japan Advanced Institute of Science and Technology, Ishikawa, Japan


1 Introduction

An image retrieval system is a computer system for browsing and searching for digital images in a large database. In general, there are two main approaches to defining the query in an image retrieval system: the first is text-based, while the second is image-based, where the query image usually contains rich scene information and various features such as color, texture and shape. Many image-based retrieval methods rely on the bag-of-words (BoW) model, which represents an object as a histogram of occurrences of local features. Content-Based Image Retrieval (CBIR) [21, 23] is one important application of BoW, which aims at finding similar images in a database according to a given query. Impressive advances have been achieved in terms of the number of images indexed in the database [18, 20, 28]. The basic idea of BoW includes four steps: first, extracting the image features using certain methods (like local or global feature descriptors); second, quantizing the features into a codebook (also known as a visual vocabulary or visual words); third, representing the images using the predetermined codebook; and fourth, computing the similarity between the query and the dataset. Generally, the discriminative ability of the codebook determines the algorithm performance.

Consider a user-guided image retrieval system in which users cannot provide a specific exemplar image; instead, only a raw contour or sketch is available. This is known as Sketch-Based Image Retrieval (SBIR). Generally, SBIR shares the same framework with CBIR but is more difficult to implement because of the lack of visual information; traditional SBIR methods fail to construct an effective discriminative codebook, so it is important to design a method that achieves SBIR with good performance.

Sparse Coding (SC) [12, 13], which represents a vector as a sparse linear combination of the elements in a codebook, is widely used in computer vision and machine learning. However, its computation cost is high when the codebook is large. Also, owing to the curse of dimensionality [21], high-dimensional features usually make the quantization distortion hard to control, so that building the codebook and assigning features to visual words become a bottleneck in quantization. To tackle this issue, Product Quantization (PQ) [16] was proposed and has been shown to be a promising paradigm. Differing from other quantization methods, PQ decomposes the high-dimensional feature space into a Cartesian product of low-dimensional subspaces and quantizes each of them separately. As the dimensionality of each subspace is small, a small codebook is sufficient to obtain good performance with minor quantization distortion.

Inspired by SC and PQ, we present a novel SBIR method, which shares the merits of sparse coding and constructs the codebook as a Cartesian product of smaller subcodebooks. Specifically, we first transform normal images to sketch-like images using Canny edge detection and extract local features using a state-of-the-art sketch feature extraction method; second, we learn an optimized codebook via a PQ-based sparse coding technique, which solves a regression problem to obtain the codebook on the basis of an initialized one; and third, we encode the images using the codebook, further encoding the quantization residual to reduce the information loss for better image representation. Our method produces a codebook with good discriminative ability for image retrieval; using a PQ-based SC quantization method that integrates the residual information helps us to obtain a more accurate image representation, thus effectively facilitating SBIR.

The proposed framework makes three contributions. First, we encode the sketch lines by employing a state-of-the-art transformation on local features. This method transforms sketch


information into local features and is robust against different angles. Second, we encode the local features into the optimized codebook by PQ-based SC, and then encode sketch features with the quantization residual to improve the representation ability. Third, the proposed method can be computed efficiently and achieves good performance compared to several popular SBIR methods.

The rest of the paper is organized as follows. Section 2 introduces the related work. In Section 3, we present the details of the proposed SBIR method. The experimental results and analysis are discussed in Section 4. Finally, conclusions are presented in Section 5.
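The four BoW steps listed in the introduction (extract features, quantize them against a codebook, represent the image as a histogram, compare histograms) can be illustrated with a minimal sketch. It uses hard nearest-codeword assignment rather than the PQ-based sparse coding proposed in this paper, and the codebook, features, and function names are hypothetical toy values chosen only for illustration.

```python
from math import dist  # Python 3.8+

def nearest_word(feature, codebook):
    """Index of the codeword closest to the feature (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda j: dist(feature, codebook[j]))

def bow_histogram(features, codebook):
    """Count nearest-codeword assignments, then L1-normalize the counts."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest_word(f, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Hypothetical 2-D local features and a 3-word codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.7, 0.2]]
hist = bow_histogram(features, codebook)
```

Each image (or sketch) is thus reduced to one fixed-length histogram, which is what makes the final similarity step a simple vector comparison.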

2 Related work

Product Quantization (PQ) is a technique proposed by Jegou et al. [16] for dealing with the quantization distortion that Vector Quantization (VQ) suffers in high dimensions; it decomposes the high-dimensional feature space into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately with VQ. However, although the computational cost is effectively reduced, PQ may fail, with high probability, to retrieve the exact nearest neighbor of a query because of the quantization distortion. Ge et al. [10] follow the same framework for PQ and propose an optimized PQ that rotates the original data to minimize the quantization distortion; two solutions, a Non-Parametric Solution and a Parametric Solution, are proposed to minimize quantization distortions with respect to the space decomposition and the quantization codebooks. More recently, a locally optimized product quantization method [17] has been used to index data by inverted lists, with the residuals between data points and centroids being PQ-encoded. It locally optimizes an individual product quantizer per cell. Under no assumptions on the distribution, practically all centroids are supported by the data, contributing to a lower distortion.

There are a limited number of Sketch-Based Image Retrieval (SBIR) works in the literature. SBIR refers to query-by-shape-sketch; the user simply inputs a rough outline of an object and is able to retrieve images containing similar content. To the best of our knowledge, the first SBIR approach dates back to 1979 [5]. Most retrieval algorithms focus on feature descriptors that explore essential properties of the sketch lines. One of the most representative works [1] is a histogram representation of the sample points. It maps interest points into a log-polar space based on the relative positions of the points. Chalechale et al.
[4] designed a special descriptor for edge-based search, which employs an angular radial partitioning (ARP) of the images and a histogram of the number of edge pixels falling into angular bins. The final feature vector is taken to be the magnitude of the Fourier transformation of that histogram to achieve invariance to rotations. Hurtut et al. [15] use curvature motion flows to analyze pictorial content in line drawings in small databases. Wang et al. [25] presented a shape descriptor for matching and recognizing 2D shapes; the contour of an object is represented by a fixed number of sample points; a height function for each sample point is defined as its descriptor, and the height function provides excellent discriminative power for shape similarity. Cao et al. [3] proposed a new descriptor, Symmetric-aware Flip Invariant Sketch Histogram (SYMFISH), to refine the shape context feature; the new descriptor supplements the original shape context by encoding the symmetric information, which is a pervasive characteristic of natural scene and objects. Saavedra et al. [22] proposed a novel SBIR method based on keyshapes, which allows users to obtain structural representations of images, leading


to an improvement of retrieval effectiveness. Additionally, some feature descriptors proposed in the MPEG-7 standard [19, 27] can be applied to SBIR; these descriptors compute orientation distributions of different edges from local regions of the image and concatenate the distributions to compose the final descriptor.

Another kind of SBIR technique that has been used recently is the BoW-based approach, which uses local descriptors to obtain an optimized codebook. Hu et al. [14] proposed a Gradient Field Histogram of Oriented Gradients (GF-HOG) method, which is an adapted form of the HOG descriptor suitable for SBIR; they incorporate GF-HOG into a Bag-of-Visual-Words (BoVW) retrieval framework and demonstrate how this combination may be harnessed both for robust SBIR and for localizing sketched objects within an image. However, even though the method shows good results, its major disadvantage is that both the computation of the GF and the construction of a large number of codewords are time consuming. Eitz et al. [7] explore standard HOG within a BoW framework, computing HOG at random points in the image and, in another experiment, over pixels of a Canny edge map (the latter is dubbed S-HOG). It has been shown that the BoW technique outperforms the known methods [7]. The problem with BoW is that it does not support partial matching because localization information is lost.

Another line of related work is feature/knowledge shifting among multiple modalities [29, 31, 32]. This literature captures the knowledge from one media type by building a spatial pyramid, and the learned classifiers can be used for another media type.
In [32], the authors proposed a novel automatic tagging algorithm by an Informative and Correlative Tag (ICTag) set, which can provide a precise description of the multimedia object by exploring both the information capability of individual tags and the tag-to-set correlation; moreover, a heuristic method is introduced to reduce the computational complexity. In [31, 29], the authors exploit the relations between images and videos in an optimal kernel space to tackle the domain-shift problem, and abundant well-tagged web images are helpful in learning reliable video classifiers. However, these methods explore the visual features in order to facilitate retrieval of other media types. In contrast, our method aims at directly constructing weak visual feature descriptions to image retrieval, which is different from [29, 31, 32].

3 Method

3.1 Preprocessing and extracting features

A sketch differs from a normal image in that it is a set of rough contour lines; because the two belong to different modalities, it is difficult to achieve SBIR by simply comparing sketches to normal images. Hence, we need to convert normal images to a near-sketch representation through edge detection to produce line-drawing contours. In our work, we prefer the Canny operator for two reasons: it gives more accurate results than other edge detection methods, and it has a low computation cost, which makes it suitable for large-scale image processing. Next, we encode the edge lines into local features using the state-of-the-art line descriptor GALIF [8]. Compared to normal images, it is difficult to extract sketch features because of the lack of visual information. Local image descriptors usually encode image content within a small local region, so the original data can be well represented by a sparse set of basic elements. Curvelet basis is


the maximally sparse representation for exploiting the properties of sketches such as partial description and continuous or discontinuous lines on a solid background. The GALIF descriptor is designed based on the Curvelet transformation, so GALIF is optimal for sketch data. In the context of feature representation, we relax the requirement of the transformation being a basis. We thus approximate ideas from the Curvelet transformation using filters that respond only to image elements with a given frequency and orientation. Other local feature descriptors, such as SIFT, often capitalize on the distribution of the value of interest; however, there are fewer sample points for sketches. In this sense, it is appropriate to use GALIF, which extracts more useful feature information. It should be noted that the GALIF method takes angles into account; in our application, we choose proper angle parameters for encoding the sketch lines.
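To illustrate the preprocessing idea of turning a normal image into a line-like edge map, the following is a deliberately simplified stand-in: a thresholded Sobel gradient magnitude. It is not the Canny operator used in the paper (there is no Gaussian smoothing, non-maximum suppression, or hysteresis thresholding), and the function name, threshold, and toy image are assumptions made only for this sketch.

```python
def sobel_edges(img, thresh):
    """Binary edge map: 1 where the Sobel gradient magnitude exceeds thresh.

    img is a list of rows of gray values; border pixels are left at 0.
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # Horizontal and vertical Sobel responses.
            gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r][c - 1] - img[r + 1][c - 1])
            gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r - 1][c] - img[r - 1][c + 1])
            if (gx * gx + gy * gy) ** 0.5 > thresh:
                edges[r][c] = 1
    return edges

# Toy 6x6 image: dark left half, bright right half, giving one vertical edge.
img = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
edges = sobel_edges(img, thresh=100)
```

In practice one would call an existing Canny implementation; the point here is only that the database images are reduced to contours before any descriptor is computed.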

3.2 Building codebook with PQ-based SC

Suppose a sketch feature $X \in \mathbb{R}^{D}$ is to be encoded, $C$ is a codebook, and $Y \in \mathbb{R}^{K}$ is the code for $X$. Following SC, we can formulate the coding process as:

$$\arg\min_{Y} \|X - CY\|_2^2 + \lambda \|Y\|_1 \quad \text{s.t.}\ Y \ge 0 \tag{1}$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the L1 and L2 norm, respectively. The first term gives the reconstruction error; $\lambda$ in the second term is a regularization parameter; $Y \ge 0$ means all entries in the code are non-negative; and setting $\lambda$ to a large value leads to a smaller number of nonzero regression coefficients. According to PQ, we equally partition each $X$ into $M$ subvectors $X = [x_1, x_2, \ldots, x_M]$ with $x_i \in \mathbb{R}^{D/M}$; the codebook for $X$ is the Cartesian product $C = C_1 \times C_2 \times \cdots \times C_M$; thus, we have:

$$\arg\min_{Y} \|X - CY\|_2^2 + \lambda \|Y\|_1 \quad \text{s.t.}\ Y \ge 0,\ C = C_1 \times C_2 \times \cdots \times C_M \tag{2}$$

where $C_i$, $i = 1, 2, \ldots, M$, is a $\frac{D}{M} \times k$ subcodebook, $C$ is the Cartesian product of size $D \times K$, and $K = k^{M}$; thus, it is clear that the size of each subcodebook is much smaller than that of the codebook. We further decompose Eq. (2) into $M$ subproblems, known as Product Sparse Coding (PSC) [11]:

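$$\begin{aligned}
&\arg\min_{y_1} \|x_1 - C_1 y_1\|_2^2 + \lambda \|y_1\|_1 \quad \text{s.t.}\ y_1 \ge 0 \\
&\qquad\vdots \\
&\arg\min_{y_M} \|x_M - C_M y_M\|_2^2 + \lambda \|y_M\|_1 \quad \text{s.t.}\ y_M \ge 0
\end{aligned} \tag{3}$$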
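The subproblems in Eq. (3) are independent non-negative L1-regularized least squares problems, one per subvector. The sketch below solves each with a plain projected proximal gradient iteration; this solver, the step-size rule, and the names `nn_sparse_code` and `product_sparse_code` are assumptions for illustration and are not claimed to be the optimizer used by the authors or by [11].

```python
def nn_sparse_code(x, C, lam, iters=500):
    """Minimize ||x - C y||_2^2 + lam * ||y||_1 subject to y >= 0.

    C is given as d rows with k entries each (one column per codeword).
    """
    d, k = len(C), len(C[0])
    # Step size 1/L, with L = 2 * ||C||_F^2 bounding the gradient's Lipschitz constant.
    L = 2.0 * sum(v * v for row in C for v in row)
    y = [0.0] * k
    for _ in range(iters):
        # Residual C y - x and gradient 2 C^T (C y - x) of the quadratic term.
        r = [sum(C[i][j] * y[j] for j in range(k)) - x[i] for i in range(d)]
        grad = [2.0 * sum(C[i][j] * r[i] for i in range(d)) for j in range(k)]
        # Proximal step for lam*||y||_1 combined with projection onto y >= 0.
        y = [max(0.0, y[j] - (grad[j] + lam) / L) for j in range(k)]
    return y

def product_sparse_code(X, subcodebooks, lam):
    """Split X into M equal subvectors and code each against its own subcodebook."""
    M = len(subcodebooks)
    step = len(X) // M
    return [nn_sparse_code(X[m * step:(m + 1) * step], subcodebooks[m], lam)
            for m in range(M)]

# Toy example: D = 4, M = 2, identity subcodebooks (hypothetical values).
I2 = [[1.0, 0.0], [0.0, 1.0]]
codes = product_sparse_code([1.0, 0.0, 0.0, 2.0], [I2, I2], lam=0.2)
```

With identity subcodebooks the minimizer has a closed form (each entry is shrunk by λ/2 and clipped at zero), which makes the toy case easy to check by hand.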

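Each subvector $x_i$ is encoded by solving the PSC problem with the codebook $C$ fixed. On the other hand, when fixing $Y$, the problem reduces to a least squares problem with quadratic constraints:

$$\begin{aligned}
&\arg\min_{C_1} \|x_1 - C_1 y_1\|_F^2 + \lambda \|y_1\|_1 \quad \text{s.t.}\ \|C_1\|_1 \le 1 \\
&\qquad\vdots \\
&\arg\min_{C_M} \|x_M - C_M y_M\|_F^2 + \lambda \|y_M\|_1 \quad \text{s.t.}\ \|C_M\|_1 \le 1
\end{aligned} \tag{4}$$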

The optimization in Eq. (4) can be done efficiently by the Lagrange dual. The conventional method for solving C and Y is to solve them iteratively by alternately optimizing one while fixing the other [30]. Compared with traditional quantization methods,


PQ-based SC can produce an optimized codebook with a smaller size, which has benefits for data representation ability and quality.
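The size advantage of the product structure follows directly from $K = k^{M}$. The short calculation below uses the settings reported in Section 4 ($M = 4$, $k = 32$); the feature dimension `D = 128` is a hypothetical value chosen only to make the numbers concrete, since the paper does not state it here.

```python
def pq_sizes(D, M, k):
    """Compare the stored subcodebooks with an explicit product codebook."""
    K = k ** M                   # effective number of codewords in C = C1 x ... x CM
    stored = M * (D // M) * k    # entries actually kept: M subcodebooks of (D/M) x k
    explicit = D * K             # entries an explicit D x K codebook would need
    return K, stored, explicit

# M and k follow Section 4; D = 128 is an assumed dimension.
K, stored, explicit = pq_sizes(D=128, M=4, k=32)
```

Under these assumptions about one million effective codewords are represented by a few thousand stored numbers, which is why the large codebook never has to be materialized.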

3.3 Encoding residuals

Following the above subsection, we obtain an optimized codebook $C$ in the training stage, which is used to encode the query and testing data. While we can obtain good representations of the original features using the codebook, there is still information loss during quantization because of the product constraint. In order to further improve the representation, inspired by [24], we propose a simple but effective information supplementation method that adds residual information. The information lost in the quantization coding process is known as the residual; for each subvector $x_i$, the residual can be formulated as:

$$e_i(x_i) = x_i - C_i y_i \tag{5}$$

where $e_i(x_i) \in \mathbb{R}^{D/M}$. To better represent $x_i$, we use the code of $x_i$ under $C_i$ together with its quantization residual by stacking the two vectors to form the final local feature vector:

$$\hat{y}_i = \begin{bmatrix} y_i \\ e_i \end{bmatrix} \tag{6}$$

where the extended code satisfies $\hat{y}_i \in \mathbb{R}^{k + D/M}$. From Subsec. 3.2 and Subsec. 3.3, we know that once we have learned the codebook, we can efficiently encode the query and testing set using Eq. (4); sketch images can be even better represented by features with residuals.
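Equations (5) and (6) amount to one matrix-vector product, one subtraction, and one concatenation per subvector. A minimal sketch follows, with hypothetical toy values for the subcodebook, subvector, and code; the function name is also an assumption.

```python
def encode_with_residual(x_i, C_i, y_i):
    """Stack the code y_i with the residual e_i = x_i - C_i y_i (Eqs. 5 and 6)."""
    d, k = len(C_i), len(C_i[0])
    recon = [sum(C_i[r][j] * y_i[j] for j in range(k)) for r in range(d)]
    e_i = [x_i[r] - recon[r] for r in range(d)]
    return list(y_i) + e_i       # length k + D/M

# Toy subcodebook with D/M = 2 rows and k = 3 codewords.
C_i = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
x_i = [1.0, 0.5]
y_i = [0.5, 0.0, 0.25]
feat = encode_with_residual(x_i, C_i, y_i)
```

The extended vector keeps exactly the part of $x_i$ that the code could not reconstruct, so no information is discarded at this stage.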

3.4 Similarity computing

The query and testing data are represented by histograms of features; we employ a vector-space based method to compute the similarity between two feature histograms [26]. Given a histogram $h$ from a query sketch and a histogram $h'$ from the testing data, the similarity between $h$ and $h'$ is defined as:

$$s(h, h') = \frac{\langle h, h' \rangle}{\|h\|\,\|h'\|} \tag{7}$$

Equation (7) computes the similarity between the histograms of two sketches. The retrieval result is a set of best-matching sketches in the order of their similarities, and we finally return the set of normal images corresponding to the sketch set.

In summary, searching for the target images turns into comparing a single query sketch with the edge representations of the normal images in the database. This approach contains three steps: a) for an object, we construct the set of edge representations; b) for the edge representations, we generate the set of clusters from the database, forming a 'codebook' that yields a representation for each projection in the database by a 'distribution of visual words'; c) in the query stage, we compute a similar distribution for the input sketch and compare it with the elements of the database.
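Equation (7) is the cosine similarity between two histograms, and retrieval is a sort by that score. A minimal sketch, with hypothetical toy histograms and function names:

```python
from math import sqrt

def cosine_similarity(h, h2):
    """s(h, h') = <h, h'> / (||h|| * ||h'||); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(h, h2))
    norm = sqrt(sum(a * a for a in h)) * sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

def rank_database(query_hist, db_hists):
    """Indices of database histograms, most similar to the query first."""
    scores = [(cosine_similarity(query_hist, h), idx) for idx, h in enumerate(db_hists)]
    return [idx for _, idx in sorted(scores, key=lambda t: -t[0])]

# Toy query and three database histograms (hypothetical values).
query = [1.0, 0.0, 1.0]
database = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
order = rank_database(query, database)
```

Because the score is normalized by both vector lengths, a histogram and any positive multiple of it rank identically, so images with many or few local features are treated alike.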


We summarize our proposed SBIR framework in Algorithm 1. Here, the "testing dataset" is the repository of normal images that contains the target images; we search this repository using query sketches.

Algorithm 1 Pseudo code of our SBIR
INPUT: Training dataset [X1,X2,…,XN], testing dataset [Y1,Y2,…,YN], number of subvectors M, subcodeword length k, regularization parameter λ, and query sketch.
OUTPUT: Retrieved normal images.
1. Initialize codebook C;
2. Encode [X1,X2,…,XN] using GALIF;
3. Repeat
4.   Update Y using Eq. (3);
5.   Update C using Eq. (4);
6. Until convergence
7. Preprocess [Y1,Y2,…,YN] according to Subsec. 3.1;
8. Encode the query sketch and testing data using C and Eq. (6);
9. Compute the similarity using Eq. (7);
Output target images.

3.5 Quantization distortion and complexity analysis

For the quantization distortion, suppose each dimension of $X \in \mathbb{R}^{D}$ follows an independent zero-mean Gaussian distribution; according to rate distortion theory [6], the relation between the codeword length $b$ and the distortion $E$ follows Eq. (8):

$$b(E) \ge \sum_{d=1}^{D} \frac{1}{2} \log_2 \frac{D \sigma_d^2}{E} \tag{8}$$

where $\sigma_d^2$ is the variance of dimension $d$. As described in [31], if $\sigma_d^2$ is large enough, the distortion $E$ satisfies:

$$E \ge k^{-\frac{2}{D}}\, D\, |\Sigma|^{\frac{1}{D}} \tag{9}$$

where $k = 2^{b}$, the matrix $\Sigma$ is the covariance of $X$, and $|\Sigma|$ is its determinant. Equation (9) is the distortion lower bound with $k$ subcodewords in each subcodebook. Table 1 shows the values of the theoretical and empirical distortion (we set $\sigma_d^2 \in [0.5, 1]$). It can be seen that the gaps between the theoretical and empirical distortions remain small under different codebook sizes; encoding the residual can make the gaps even smaller.

The complexity of the proposed method is twofold: time complexity and space complexity. Consider a product codebook $A$ of size $d$-by-$K$: the time complexity of our coding method is $O(\sqrt[n]{K})$, while that of the traditional SC method is $O(K)$. For subcodebooks $A_1, A_2, \ldots, A_n$ of size $d$-by-$K/n$, the time complexity is $O(k)$ for each of the smaller SC subproblems. For memory complexity, in the training stage the large codebook (matrix) $A$ is not computed or stored; instead, the smaller subcodebooks (matrices) $A_1, A_2, \ldots, A_n$ of size $d$-by-$K/n$ are used in the encoding process. So the memory complexity is dramatically reduced from $O(K^2)$ for pre-computing the $K$-by-$K$ matrix for the large codebook to $O(nK)$ for pre-computing $n$ smaller $k/n$-by-$k/n$ matrices ($A_1 \times A_1$, $A_2 \times A_2$, …, $A_n \times A_n$). Clearly, the memory consumption is much smaller.
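The bound in Eq. (9) is what Eq. (8) gives when it is solved for $E$ with $k = 2^{b}$. The sketch below evaluates the bound for a diagonal covariance (independent dimensions, so the determinant is the product of the variances) and checks numerically that substituting it back into the right-hand side of Eq. (8) returns $b$. The variance values and function names are hypothetical and chosen only for illustration.

```python
from math import log2, prod  # math.prod requires Python 3.8+

def distortion_lower_bound(variances, b):
    """Eq. (9) for a diagonal covariance: E >= k^(-2/D) * D * |Sigma|^(1/D), k = 2^b."""
    D = len(variances)
    k = 2 ** b
    det = prod(variances)        # determinant of a diagonal covariance matrix
    return k ** (-2.0 / D) * D * det ** (1.0 / D)

def bits_for_distortion(variances, E):
    """Right-hand side of Eq. (8): sum over d of 0.5 * log2(D * sigma_d^2 / E)."""
    D = len(variances)
    return sum(0.5 * log2(D * v / E) for v in variances)

# Hypothetical per-dimension variances inside the range [0.5, 1] used in the text.
variances = [0.5, 0.75, 1.0, 0.6]
E = distortion_lower_bound(variances, b=8)
b_back = bits_for_distortion(variances, E)
```

The round trip recovers the bit budget, which confirms that the two equations are algebraically consistent under the stated Gaussian assumption.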


Fig. 1 Example images for the Eitz benchmark dataset. The first column shows the sketch query, and the next six columns show the corresponding real-world images

4 Experiments

The performance measurement of image retrieval is based on the Top-N retrieved images because users are interested in the top-returned results; in this work, we take N=20. We divide each feature vector into M=4 subvectors and set the subcodeword length to k=32. We compare our method with three other state-of-the-art SBIR methods: Cao et al. [2], Eitz et al. [7], and Saavedra et al. [22], the latter two of which are BoW-based SBIR methods. Next, we will give the results and discussions on different datasets.

4.1 Eitz benchmark dataset

Eitz et al. [7] released a sketch benchmark¹ for SBIR evaluation, in which the user's subjective opinions about the relevance of images with respect to an input sketch are taken into account. The benchmark dataset contains 31 benchmark sketches as well as 40 corresponding images for each sketch. Five sketch examples and their associated images are shown in Fig. 1. Eitz et al. also proposed the use of Kendall's correlation τ to measure the retrieval performance, as in Eq. (10):

$$\tau = \frac{\#\text{concordant pairs} - \#\text{discordant pairs}}{\frac{1}{2}\, n(n-1)} \tag{10}$$

¹ http://cybertron.cg.tu-berlin.de/eitz/tvcg_benchmark/index.html


Fig. 2 Overall correlation value comparisons for different SBIR methods on the Eitz benchmark dataset

where #concordant pairs counts the pairs that are ordered consistently in the two lists, #discordant pairs counts the pairs that are ordered inconsistently, and n is the length of the rank lists. The correlation coefficient τ ranges from −1 to 1, where −1 means that one ranking is the reverse of the other, and 1 means that the two rankings have the same order. Thus, a higher value is associated with greater consistency. The overall correlation values on the Eitz benchmark dataset using different SBIR approaches are shown in Fig. 2. It can be seen that we achieve an average correlation value of 0.33. Considering that our method is based on a state-of-the-art sketch feature representation, combined with our proposed coding scheme by which information loss during quantization is greatly reduced, it is to be expected that our method improves retrieval performance. The codebook size used in our work is K=32^4 (that is, k^M with k=32 and M=4), which is much larger than the 1000 used in [7, 22]; this means that the original features can be more precisely quantized by our codebook. In order to show the superiority of our method, we replace the GALIF descriptor by the

Fig. 3 Correlation values for all 31 categories of the Eitz benchmark dataset


Fig. 4 Precision-recall curve on Flickr15k dataset using different approaches

ShapeContext [1]; the result is shown as the second bar in Fig. 2 (ShapeContext+Ours). We can see that our proposed method evidently improves the retrieval performance: a correlation value of 0.22 was achieved, representing an increase of nearly 31 % compared to the ShapeContext descriptor alone.
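Kendall's τ from Eq. (10) can be computed directly by counting pairs. The sketch below implements the tie-free form of the formula exactly as written in Eq. (10); it does not apply any tie correction, and the rankings and function name are hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Eq. (10): (#concordant - #discordant) / (n(n-1)/2); assumes no tied ranks."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both lists order items i and j the same way.
        s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical user ranking versus system ranking of four images.
user_rank = [1, 2, 3, 4]
system_rank = [1, 3, 2, 4]
tau = kendall_tau(user_rank, system_rank)
```

Here one of the six pairs is swapped, giving five concordant and one discordant pair, so τ = 4/6.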

Fig. 5 Examples of the retrieval results of five query sketches using the proposed method on the Flickr15k dataset. Each row represents one category. The first column is the query sketch, while the remaining five columns show the top positive retrieval images; the rightmost position of each row shows one false retrieval example


Table 1 Distortion lower bounds under different codebook sizes

b                              8      16     32     64
Theoretical distortion bound   3.9    8.3    17.9   35.2
Empirical distortion           4.1    8.8    18.6   36.5

We further present the correlation values for each sketch, as shown in Fig. 3. From the results, we achieve encouraging improvements for queries Q2, Q10 and Q22 with respect to the correlation achieved by the optimized codebook approach with residual information considered. In the case of Q2, our approach achieves a correlation value of 0.41 which is the best among the compared approaches. In the case of Q25, our method also achieves an exciting result of 0.4373 which is again the highest value. In some other examples, like Q8 and Q16, the improvements are likewise impressive. Although the improvement of our approach for most queries and the overall correlation values is significant, we still cannot obtain satisfactory results in every instance; for example, in Q1, Q6 and Q23, the correlation values are not the best. One reason for this is that these queries are complex combinations of special curves and lines (Q1 and Q6), and cannot be well represented by the GALIF descriptor; further, the local feature-based codebook is not discriminative. Another reason is that, although the sketch may be appropriately represented by a set of arcs or straight lines, at the same time the images are badly cluttered, which may also justify the poor performance obtained by the BoW-based approach. Relevant problems are also reported in Ref. [22]. Therefore, representing cluttered images is a challenging disadvantage in current SBIR approaches.

Fig. 6 Precision-recall curve on the ETHZ dataset using different approaches

              Apple logos   Bottle   Giraffe   Mug    Swan
Apple logos   0.71          0.05     0.14      0.10   0.05
Bottle        0.06          0.68     0.16      0.11   0.04
Giraffe       0.08          0.13     0.54      0.03   0.03
Mug           0.06          0.12     0.13      0.47   0.15
Swan          0.20          0.09     0.19      0.16   0.53

Fig. 7 Confusion rate for each shape class on the ETHZ dataset

4.2 Flickr15k dataset

Flickr15k² is a newly released sketch image dataset with 14,660 manually selected images, each containing a reasonably sized meaningful object area. The images are labeled into 33 categories, including heart shape, Big Ben, horse, sunset, and so on. Unlike the Eitz benchmark dataset, there is no standard ranking sequence for Flickr15k; we therefore choose two other metrics, precision and recall [9], to measure the performance. The result is shown in Fig. 4, and examples of the retrieval results of five query sketches are shown in Fig. 5.

Examining the precision-recall curves in Fig. 4, we can see that our proposed method achieves better performance on the top-returned results; the precision does not drop quickly, and the precision rates are the best in most cases. A possible explanation is that our proposed SBIR framework is more robust than the compared retrieval approaches; the PQ-based SC method makes the codebook more discriminative. Moreover, the fusion of residual information enables a better representation of the original features. Our findings also confirm the results in [7], with better performance versus simple feature descriptors [30], and indicate that the use of codebooks or BoWs brings a significant benefit.

Figure 5 presents five query examples on Flickr15k. We can see that, in most cases, the returned real-world images are visually and semantically consistent with the sketch queries; although there are some falsely retrieved results in the list (in the rightmost position of each row), these false results have shapes similar to the queries. In total, the GALIF descriptor is able to capture the spatial information prior to coding, the codebook can quantize the spatial information among features, and the residuals can be harnessed to improve the performance. From the results, we can conclude that our SBIR method has good discriminative power and shows encouraging retrieval performance, even on a complex, cluttered image dataset.
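For a labeled dataset such as Flickr15k, precision and recall at a cutoff N can be computed from the category labels of the ranked results. The following is a minimal sketch with hypothetical labels and counts; it is not the evaluation script used for Fig. 4.

```python
def precision_recall_at_n(ranked_labels, query_label, n, num_relevant):
    """Precision and recall over the top-n results for one query.

    ranked_labels: category labels of the retrieved images, best match first.
    num_relevant:  number of database images sharing the query's category.
    """
    hits = sum(1 for lab in ranked_labels[:n] if lab == query_label)
    return hits / n, hits / num_relevant

# Hypothetical ranked list for a "horse" query; 5 horse images exist in total.
ranked = ["horse", "horse", "sunset", "horse", "heart", "horse"]
p, r = precision_recall_at_n(ranked, "horse", n=4, num_relevant=5)
```

Sweeping n from 1 to the database size and plotting the resulting pairs yields a precision-recall curve of the kind shown in Fig. 4.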

² http://personal.ee.surrey.ac.uk/Personal/R.Hu/SBIR.html


Table 2 Percentage of true positives within first 20 retrieved images using each of the 700 sketches of the novel dataset (average and best results) and the hand-drawn prototype models (right column)

Sketch        Best    Average
Apple logos   97 %    76 %
Bottle        93 %    71 %
Giraffe       94 %    75 %
Mug           86 %    66 %
Swan          78 %    70 %

4.3 ETHZ shape dataset

The ETHZ shape dataset³ contains five sketch classes, including bottles, swans, mugs, giraffes and Apple logos, with a total of 255 images collected from the web. The target objects appear over a wide range of scales. The ETHZ dataset provides one sketch image for each class, and the five sketches can be used as the input queries in the experiment; all the real-life images in the dataset are used as repository data. Following the Flickr15k experiment, we report the precision-recall curves on ETHZ in Fig. 6; we also give the confusion rate for each class of the ETHZ dataset in Fig. 7. The results show that different categories can be well distinguished when using the five query sketches. The average percentage of true positives within the Top-20 retrieved images from all user sketches is as follows: Apple logo 71 %, Bottle 68 %, Giraffe 54 %, Mug 47 %, Swan 53 %. We achieve good performance on all the classes compared to previous work; the performance for Mug is relatively poor, which is reasonable, as mugs may be confused with bottles because their local shapes have similar patterns (e.g., straight vertical lines). We also give the average percentage of true positives within the top 20 retrieved images using all the sample sketches in Table 2; the average of all sketches scored 68.5 % and the best class scored 91 %. Overall, we can achieve a retrieval rate of 77 %, which means that we can retrieve most of the target images through our SBIR method using a simple hand-drawn sketch of an object.

5 Conclusion

Traditional Sketch-based Image Retrieval (SBIR) methods are unable to achieve good performance because of the lack of visual information and the information loss in the quantization process. In this work, we proposed a novel SBIR framework that constructs an optimized codebook using a PQ-based sparse coding method and quantizes feature data together with encoded residual information, which makes the representation of features much better than in previous methods. We implement our SBIR method on three public sketch datasets and compare it with three state-of-the-art SBIR algorithms. The experimental results demonstrate the superior performance of our method. Furthermore, with the benefits of PQ, our SBIR method can be computed quickly, which makes it well suited to image retrieval on portable devices such as smart phones.

³ http://groups.inf.ed.ac.uk/calvin/datasets.html


Acknowledgments This work is partly supported by National Program on Key Basic Research Project (973 Program, under Grant 2013CB329301), the Major Project of National Social Science Fund (under Grant 14ZDB153), the NSFC (under Grant 61202166 and 61472276), and Doctoral Fund of Ministry of Education of China (under Grant 20120032120042).


Qiang Li is pursuing his PhD in the School of Computer Science and Technology, Tianjin University, China. His research interests include machine learning based image and video processing, and data mining.

Yahong Han received the Ph.D. degree from Zhejiang University, Hangzhou, China. He is currently an Associate Professor with the School of Computer Science and Technology, Tianjin University, Tianjin, China. His current research interests include multimedia analysis, retrieval, and machine learning.


Jianwu Dang graduated from Tsinghua University, China, in 1982, and received his M.S. from the same university in 1984. He worked at Tianjin University as a lecturer from 1984 to 1988. He was awarded a PhD from Shizuoka University, Japan, in 1992. He worked at ATR Human Information Processing Labs., Japan, as a senior researcher from 1992 to 2001. He joined the University of Waterloo, Canada, as a visiting scholar for 1 year from 1998. In 2001, he moved to the Japan Advanced Institute of Science and Technology (JAIST). He joined the Institute of Communication Parlee (ICP), Center of National Research Scientific, France. He has been a "One Thousand Plan" distinguished expert in China since 2010 and is a National "973 Project" chair scientist. His research interests are in all the fields of signal processing.