Visual Aesthetic Quality Assessment with Multi-task Deep Learning

arXiv:1604.04970v1 [cs.CV] 18 Apr 2016


Yueying Kao, Ran He, Kaiqi Huang
CRIPAC & NLPR, Institute of Automation, Chinese Academy of Sciences

Abstract. This paper considers the problem of assessing visual aesthetic quality with semantic information. We cast the assessment problem as the main task of a multi-task deep model, and argue that semantic recognition offers the key to addressing it. Based on convolutional neural networks, we propose a general multi-task framework with four different structures. In each structure, the aesthetic quality assessment task and the semantic recognition task are learned jointly, and different features are explored to improve the quality assessment. Moreover, an effective strategy of keeping the effect of the semantic and aesthetic tasks balanced is developed to optimize the parameters of our framework. The correlation analysis among the tasks validates the importance of semantic recognition in aesthetic quality assessment. Extensive experiments verify the effectiveness of the proposed multi-task framework and further corroborate the above proposition.

Keywords: Visual aesthetic quality assessment, multi-task learning, semantic information

1 Introduction

Aesthetic image analysis has recently attracted increasing attention in the computer vision community [6, 9, 18, 25]. It relates to the high-level perception of visual aesthetics and is useful in many applications, e.g., image retrieval, photo management, and photography [5, 11]. The main challenge in automatically assessing the aesthetic quality of images is that visual aesthetics is a subjective attribute. Many efforts have been made to address this issue. Data-driven approaches [4, 7, 17, 19, 25–28, 31] are often used to learn from labeled images, where the aesthetic quality of the images is labeled by humans. For aesthetic feature extraction, most of these approaches treat visual aesthetic quality assessment as a single, standalone classification task. Handcrafted features were early attempts; they are based on intuitions about how people perceive the aesthetic quality of images or on photographic rules. Such features include color [4, 11, 23], the rule of thirds [4], simplicity [17, 25], and composition [7]. Later, generic image descriptors such as bag-of-visual-words (BOV) [3] and Fisher vectors (FV) [8] were used to assess aesthetic quality and were shown to outperform the traditional handcrafted features [19, 20, 22]. Recently, deep convolutional neural networks (CNNs) [12, 30] have been applied to aesthetic quality assessment [10, 15].

Fig. 1: Example images with aesthetic and semantic labels (e.g., High / Portraiture; High / Sky, Architecture; Low / Food and Drink; Low / Still Life, Nature).

Nevertheless, these computational approaches tend to provide results that are either accurate or interpretable, but rarely both [18]. For human beings, aesthetic quality assessment is always coupled with the identification of the semantic content of images [14, 21]. It is difficult for humans to treat aesthetic quality assessment as an isolated and independent task. When humans assess the aesthetic quality of an image, they first understand what they are assessing; that is, they already know the semantic information of the image. As seen in Fig. 1, we can recognize the semantic content of these images at a glance and assess their aesthetic quality quickly. Hence it is reasonable to assume that aesthetic quality assessment and semantic recognition are correlated tasks for machine learning, and that the semantic recognition task is potentially helpful for improving the automatic assessment of visual aesthetic quality.

This paper employs a multi-task convolutional neural network (MTCNN) to address the problem of aesthetic quality assessment. Multi-task learning can learn multiple related tasks in parallel with shared knowledge, and it has been demonstrated that this can boost some or all of the tasks [2]. Our goal is to utilize semantic recognition in the joint objective function to improve aesthetic quality assessment, so multi-task learning is well suited to our problem. Facing the different learning difficulties of the two tasks, we present a strategy that keeps the effect of both tasks balanced in the joint objective function, whereas existing works often treat all tasks equally or rely on early stopping [2, 29, 32]. To investigate how best to take advantage of semantic information and how semantic information influences the aesthetic task, we present four MTCNNs with different structures. In addition, the correlation between the tasks is analyzed to explain the factors in aesthetic quality assessment and make our results more interpretable.

Our contributions are summarized as follows:

– Instead of taking visual aesthetic quality assessment as an isolated task, we propose to exploit semantic recognition to assess aesthetic quality jointly with a multi-task convolutional neural network. It is a novel attempt to learn aesthetic features with the help of a related task, i.e., semantic recognition.

– Four MTCNNs, including three basic MTCNNs with different structures and an enhanced MTCNN, are developed to explore different features under the supervision of aesthetic and semantic labels. The correlation between aesthetic quality assessment and semantic recognition is analyzed from our MTCNNs, which explains the factors in aesthetic quality assessment and makes our results more interpretable.

– The proposed method significantly outperforms the state-of-the-art methods on the challenging AVA dataset.

The rest of this paper is organized as follows: we summarize related work in Sec. 2, describe our method in detail in Sec. 3, present the experiments in Sec. 4, and conclude the paper in Sec. 5.

2 Related Work

Since our work is related to aesthetic quality assessment and multi-task learning, we mainly review work on these two topics in this section.

Aesthetic quality assessment: Most previous works [4, 7, 11, 16, 19, 24] on aesthetic quality assessment focus on the challenging problem of designing appropriate features. Typically, handcrafted features are proposed based on intuitions about human perception of the aesthetic quality of images or on photographic rules. For example, Datta et al. [4] design visual features such as colorfulness, the rule of thirds, and low depth of field indicators to discriminate between aesthetically pleasing and displeasing images. Dhar et al. [7] extract high-level attributes, including compositional, content, and sky-illumination attributes, which humans characteristically use to describe images. Luo et al. [16] and Tang et al. [25] consider that different types of images may call for different aesthetic criteria, and design visual features in different ways according to the variety of photo content. In [19], generic image descriptors are used to assess aesthetic quality and are shown to outperform the traditional handcrafted features. Despite the success of handcrafted features and generic image descriptors, CNNs have been applied to aesthetic quality assessment [10, 15] and obtain new state-of-the-art performance. CNNs learn aesthetic features automatically; however, they extract features by treating aesthetic quality assessment as an independent problem. The best network in [15], RDCNN, attempts to leverage the idea of multi-task learning with style attributes to help determine the aesthetic quality of images. Unfortunately, due to many missing labels for the style attributes, it cannot jointly perform aesthetics categorization and style classification in one neural network, and instead concatenates the aesthetic and style features using transfer learning. Our work is also related to CNNs for aesthetics classification. In contrast, firstly, we exploit semantic information to assist in learning the aesthetic representation within a multi-task learning framework. Secondly, we jointly learn aesthetics categorization and semantic recognition in a single multi-task network, which is different from RDCNN [15]. Finally, in the real world images are labeled with semantic information much more easily than with style attributes, because only professional photographers and photography amateurs are familiar with all the style attributes.

Multi-task learning: Multi-task learning aims to boost generalization performance by learning multiple related tasks simultaneously [1, 2, 13, 32]. Deep neural networks can learn features jointly under multiple objectives, which makes them well suited to multi-task learning, and multi-task learning based on deep neural networks has been applied to many computer vision problems [29, 32]. However, there are many strategies for sharing knowledge and organizing the learning process. For example, Yim et al. [29] treat all tasks as equally important. In contrast, an early stopping strategy is used for some related tasks [32], due to the different learning difficulties and convergence rates of different tasks. In our problem, because the semantic recognition task is much easier than aesthetic quality assessment, common features of the two tasks are learned simultaneously and an effective strategy of keeping the effect of all tasks balanced in the joint objective function is used.

3 Method

In this section, we propose to exploit semantic information to help identify the aesthetic quality of images, assuming that the two are related attributes [14, 21]. Our problem is formulated as an MTCNN model, whose framework is illustrated in Fig. 2. To explore the effect of semantic recognition on aesthetic quality assessment, three basic MTCNNs and an enhanced MTCNN are presented to optimize the model.

Fig. 2: The framework of our multi-task learning (parameters learned on a large-scale dataset are transferred to a small dataset) and the illustration of the architecture of our MTCNN #1: a 227×227 crop of a 256×256 input image passes through shared convolutional and fully-connected layers (layers 1–7, with 4096-unit fully-connected layers) before splitting into Task 1, aesthetics (parameters W_a), and Task 2, semantics (parameters W_s).

3.1 Problem Formulation

Our problem can be interpreted with a probabilistic model. Using the probabilistic formulation, various deep networks can solve our problem by optimizing the model parameters that maximize the posterior probability. Bayesian analysis is then leveraged to predict the most likely aesthetic quality and semantic attributes of given images. Assume we have a training dataset with a total of N samples, associated with C aesthetic classes and M semantic attributes. Considering that, in the real world, each image has only one aesthetic class but multiple semantic attributes, each image is represented as (x_n, y_n, z_n), n = 1, 2, ..., N. Here x_n denotes the n-th image sample, y_n = c, c = 0, ..., C − 1 is its aesthetic label, and z_n = [z_n^1, ..., z_n^m, ..., z_n^M]^T is its semantic label. If the n-th image sample has the m-th semantic attribute, the m-th semantic label is set as z_n^m = 1, otherwise z_n^m = 0. Therefore a given dataset is denoted as (X, Y, Z) = {(x_n, y_n, z_n), n ∈ {1, 2, ..., N}}. For our MTCNNs (MTCNN #1 is shown in Fig. 2), Θ denotes the common parameters in the bottom layers that learn features for all tasks, and W = [W_a, W_s] indicates the task-specific parameters, where W_a and W_s are the parameters for aesthetic quality assessment and semantic recognition respectively. Each column in W_a or W_s corresponds to a subtask. The goal is to find the optimal or suboptimal parameters Θ, W, λ by maximizing the following posterior probability:

$$ \hat{\Theta}, \hat{W}, \hat{\lambda} = \operatorname*{argmax}_{\Theta, W, \lambda}\; p(\Theta, W, \lambda \mid X, Y, Z), \qquad (1) $$

where λ is the weight coefficient of the semantic recognition task in the joint learning process. Based on Bayes' theorem, we have

$$ p(\Theta, W, \lambda \mid X, Y, Z) = \frac{p(X, Y, Z \mid \Theta, W, \lambda)\, p(\Theta, W, \lambda)}{p(X, Y, Z)} \propto p(X, Y, Z \mid \Theta, W, \lambda)\, p(\Theta, W, \lambda), \qquad (2) $$

where p(X, Y, Z | Θ, W, λ) is the conditional probability and p(Θ, W, λ) is the prior probability. Then Eqn. (1) takes the form

$$ \hat{\Theta}, \hat{W}, \hat{\lambda} \propto \operatorname*{argmax}_{\Theta, W, \lambda}\; p(Y \mid X, \Theta, W_a)\, p(Z \mid X, \Theta, W_s, \lambda)\, p(\Theta)\, p(W)\, p(\lambda). \qquad (3) $$
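Since the objective in Eqn. (7) below is obtained by taking the negative logarithm of Eqn. (3), it may help to state that intermediate step explicitly; the line below is a restatement under the factorization above, not an additional assumption:

$$ \hat{\Theta}, \hat{W}, \hat{\lambda} = \operatorname*{argmin}_{\Theta, W, \lambda}\; \big[ -\log p(Y \mid X, \Theta, W_a) - \log p(Z \mid X, \Theta, W_s, \lambda) - \log p(\Theta) - \log p(W) - \log p(\lambda) \big]. $$

With the Gaussian priors defined below, the last three terms reduce, up to constants and scale factors, to the quadratic penalties Θ^TΘ, W^TW and (λ − µ)^2 that appear in Eqn. (7).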

Each term in Eqn. (3) is defined as follows.

1) The conditional probability p(Y | X, Θ, W_a) corresponds to the task of aesthetic quality assessment. Here, assessing aesthetic quality is interpreted as a classification problem and modeled as a multinomial logistic regression, as in traditional classification problems [12]. The conditional probability p(Y | X, Θ, W_a) can be formulated as

$$ p(Y \mid X, \Theta, W_a) = \prod_{n=1}^{N} \sum_{c=1}^{C} 1\{y_n = c\}\, p(y_n = c \mid x_n, \Theta, W_a), \qquad (4) $$

where 1{·} is the indicator function, with 1{a true statement} = 1 and 1{a false statement} = 0. p(y_n = c | x_n, Θ, W_a) is calculated by the softmax function

$$ p(y_n = c \mid x_n, \Theta, W_a) = \frac{\exp\!\big(W_a^{c\,T}(\Theta^T x_n)\big)}{\sum_{l=1}^{C} \exp\!\big(W_a^{l\,T}(\Theta^T x_n)\big)}. \qquad (5) $$

2) The conditional probability p(Z | X, Θ, W_s, λ) corresponds to semantic recognition. Since each element of the semantic label of a given image is binary, z_n^m ∈ {0, 1}, each semantic attribute recognition can be interpreted as a logistic regression. Hence the conditional probability p(Z | X, Θ, W_s, λ) can be written as

$$ p(Z \mid X, \Theta, W_s, \lambda) = \prod_{n=1}^{N} \prod_{m=1}^{M} \Big( p(z_n^m = 1 \mid x_n, \Theta, W_s^m)^{z_n^m}\, \big(1 - p(z_n^m = 1 \mid x_n, \Theta, W_s^m)\big)^{1 - z_n^m} \Big)^{\lambda}, \qquad (6) $$

where p(z_n^m = 1 | x_n, Θ, W_s^m) is calculated by the sigmoid function σ(x) = 1/(1 + exp(−x)).

3) The prior probability p(Θ) corresponds to the network parameters for the common features. The parameters Θ can be initialized from a standard normal distribution, as in previous networks [12]: p(Θ) = \prod_{k=1}^{K} p(θ_k) = \prod_{k=1}^{K} N(0, I), where 0 is a zero matrix and I is an identity matrix.

4) Similar to Θ, the parameters W for the specific tasks can also be initialized from a standard normal distribution. Thus the prior probability can be written as p(W) = p(W_a)p(W_s) = N_a(0, I) N_s(0, I).

5) λ controls the influence of the semantic recognition task in the final objective function. The prior probability p(λ) is implemented by letting λ obey a normal distribution, p(λ) = N(µ, σ²).

Substituting Eqns. (4), (5) and (6) into Eqn. (3), taking the negative log of Eqn. (3), and omitting the constant terms, the objective function becomes

$$ \operatorname*{argmin}_{\Theta, W, \lambda} \Big\{ -\sum_{n=1}^{N}\sum_{c=1}^{C} 1\{y_n = c\} \log \frac{\exp\!\big(W_a^{c\,T}(\Theta^T x_n)\big)}{\sum_{l=1}^{C} \exp\!\big(W_a^{l\,T}(\Theta^T x_n)\big)} - \lambda \sum_{n=1}^{N}\sum_{m=1}^{M} \Big( z_n^m \log \sigma\!\big(W_s^{m\,T}(\Theta^T x_n)\big) + (1 - z_n^m) \log\!\big(1 - \sigma(W_s^{m\,T}(\Theta^T x_n))\big) \Big) + \Theta^T\Theta + W^T W + (\lambda - \mu)^2 \Big\}. \qquad (7) $$

3.2 Optimization Procedure

The multi-task objective function in Eqn. (7) can be optimized by a network through stochastic gradient descent (SGD) [12]. Here an MTCNN is applied to search for optima of the parameters Θ, W, λ. One architecture of our MTCNNs is shown in Fig. 2. First, all tasks share knowledge in the bottom layers. Then, specific features are learned for each task in the top layers. Finally, the combination of the softmax loss for aesthetic quality prediction (the first term in Eqn. (7)) and the cross-entropy loss for semantic recognition (the second term in Eqn. (7)) is employed to update the parameters of the network jointly.

Traditionally, multiple tasks are treated as equally important in the back propagation of multi-task learning [2, 29], assuming that they reach their best performance at roughly the same time. However, different tasks may have different learning difficulties and convergence rates. Caruana [2] proposes to control the effect of different tasks by adjusting the learning weight on each output task, and also puts forward other strategies such as early stopping. The early stopping strategy has been applied in some works [32] with good performance. Nevertheless, this strategy is not suited to our problem, because the extra task, semantic recognition, is much easier and converges more rapidly than the main task, aesthetic quality assessment. Our experimental results (detailed in Sec. 4) show that when the converged semantic recognition task is stopped early, the training error of the aesthetic task continues to converge very slowly and does not drop noticeably. We believe this is mainly because aesthetics is subjective and needs the help of the semantic task throughout the entire training process. Hence, we present a simple strategy to keep the effect of all tasks balanced in back propagation: since the softmax loss only considers the value corresponding to the ground-truth label of each example while the semantic loss sums over M attributes, λ = 1/M is fixed in the objective function for the entire training process.
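As an illustration of this balancing strategy, a minimal sketch follows, written as a hypothetical PyTorch loss function rather than the authors' implementation (the function name, tensor shapes, and normalization choices are assumptions): the aesthetic softmax cross-entropy is combined with the per-attribute sigmoid cross-entropy weighted by λ = 1/M.

import torch
import torch.nn.functional as F

def joint_loss(aesthetic_logits, semantic_logits, aesthetic_labels, semantic_labels, M=29):
    """Joint objective of Eqn. (7); the quadratic penalty terms are left to the optimizer's weight decay.

    aesthetic_logits: (batch, C) scores for the aesthetic classes.
    semantic_logits:  (batch, M) scores for the M binary semantic attributes.
    aesthetic_labels: (batch,) integer class indices.
    semantic_labels:  (batch, M) multi-hot 0/1 vectors.
    """
    lam = 1.0 / M  # keep the two tasks balanced in back propagation
    # Softmax cross-entropy for the main aesthetic task (first term in Eqn. (7)).
    aesthetic_loss = F.cross_entropy(aesthetic_logits, aesthetic_labels)
    # Per-attribute sigmoid cross-entropy for semantic recognition (second term),
    # summed over the M attributes and averaged over the batch.
    semantic_loss = F.binary_cross_entropy_with_logits(
        semantic_logits, semantic_labels.float(), reduction="sum") / aesthetic_logits.size(0)
    return aesthetic_loss + lam * semantic_loss

In a training loop one would compute loss = joint_loss(a_logits, s_logits, y, z) after a forward pass and call loss.backward() before the SGD step; with λ = 1/M the semantic term contributes roughly the average per-attribute loss, comparable in scale to the aesthetic term.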


Fig. 3: Explored MTCNNs with different architectures. The details of MTCNN #1 are illustrated in Fig. 2. Color code used: purple = convolutional layer + max pooling, grey = convolutional layer, yellow = fully-connected layer.

3.3 MTCNN Implementation

To implement the multi-task model, we investigate several multi-task network architectures that utilize semantic information for visual aesthetic quality assessment. These networks are illustrated in Fig. 3. The supervision of aesthetic and semantic labels can be applied at the same or at different layers of the network. Here we propose three basic network architectures and an enhanced network. For all networks, the input is a 227 × 227 × 3 patch randomly extracted from a resized 256 × 256 × 3 image, as in previous work [15].

MTCNN #1: Since our goal is to discover effective features for aesthetic assessment with the help of semantic information, a simple idea is to learn all parameters for the aesthetic representation under both aesthetic and semantic supervision in one network. MTCNN #1 implements this idea. The architecture of MTCNN #1 (in Fig. 3) is detailed in Fig. 2. The network contains four convolutional layers and two fully-connected layers with parameters Θ for common feature learning. The parameters W = [W_a, W_s] from layer 6 to layer 7 are learned separately for each task. The softmax loss function is adopted for aesthetic quality prediction and the cross-entropy loss function for semantic recognition, and the combination of the two loss functions is employed to jointly update the parameters of the network.

MTCNN #2: To explore different structures for aesthetic feature learning, we introduce MTCNN #2 (shown in Fig. 3), which allows some top layers to learn aesthetic representations independently, without semantic supervision. Similar to MTCNN #1, network #2 contains four convolutional layers with parameters Θ for common feature learning. Different from architecture #1, layers 5, 6 and 7 in network #2 learn parameters W = [W_a, W_s] separately for the two tasks. The loss functions are the same as in architecture #1.

MTCNN #3: Since CNNs learn hierarchical features, MTCNN #3 (shown in Fig. 3) uses the low-level features of the network for our main task. In this network, four convolutional layers and three fully-connected layers are designed for semantic recognition, while two convolutional layers and two fully-connected layers are used for aesthetic quality assessment. The two tasks share knowledge Θ in the first two convolutional layers; the other layers learn specific parameters W = [W_a, W_s] for each task. The loss functions are the same as in architecture #1.

Enhanced MTCNN: To further explore effective aesthetic features, we propose an enhanced MTCNN that combines MTCNN #1 and MTCNN #3; that is, we add extra aesthetic supervision after the first two layers of MTCNN #1. As shown in Fig. 3, the common parameters Θ1 in the first and second convolutional layers are learned for three tasks, the common parameters Θ2 in the other two convolutional layers and two fully-connected layers are learned for two tasks, and the specific parameters W = [W_a, W_a', W_s] are learned separately in the top layers. Our goal is to enhance the supervision of aesthetic labels in the first and second convolutional layers while preserving the influence of semantic information throughout the network. Here we denote Θ = [Θ1, Θ2]. The objective function in Eqn. (7) is transformed to

$$ \operatorname*{argmin}_{\Theta, W, \lambda} \Big\{ -\sum_{n=1}^{N}\sum_{c=1}^{C} 1\{y_n = c\} \log \frac{\exp\!\big(W_a^{c\,T}(\Theta^T x_n)\big)}{\sum_{l=1}^{C} \exp\!\big(W_a^{l\,T}(\Theta^T x_n)\big)} - \sum_{n=1}^{N}\sum_{c=1}^{C} 1\{y_n = c\} \log \frac{\exp\!\big(W_a'^{\,c\,T}(\Theta_1^T x_n)\big)}{\sum_{l=1}^{C} \exp\!\big(W_a'^{\,l\,T}(\Theta_1^T x_n)\big)} - \lambda \sum_{n=1}^{N}\sum_{m=1}^{M} \Big( z_n^m \log \sigma\!\big(W_s^{m\,T}(\Theta^T x_n)\big) + (1 - z_n^m) \log\!\big(1 - \sigma(W_s^{m\,T}(\Theta^T x_n))\big) \Big) + \Theta^T\Theta + W^T W + (\lambda - \mu)^2 \Big\}, \qquad (8) $$

where the first term in Eqn. (8) is our main task, and the second term is the added task. We fix λ = 2/M based on our strategy for the enhanced MTCNN.
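To make this structure concrete, here is a minimal, hypothetical PyTorch sketch of an enhanced-MTCNN-style model; the class name, layer sizes, and the exact placement of pooling are assumptions loosely following Fig. 2, not the authors' released configuration. A shared low-level trunk (Θ1) feeds an early aesthetic head (W_a'), and a further shared trunk (Θ2) feeds the main aesthetic head (W_a) and the semantic head (W_s).

import torch
import torch.nn as nn

class EnhancedMTCNN(nn.Module):
    """Sketch: Θ1 is shared by three tasks, Θ2 by two, with heads W_a', W_a and W_s."""
    def __init__(self, num_classes=2, num_attrs=29):
        super().__init__()
        self.theta1 = nn.Sequential(             # first two conv layers (Θ1)
            nn.Conv2d(3, 96, 11, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 128, 5), nn.ReLU(), nn.MaxPool2d(3, 2),
        )
        self.theta2 = nn.Sequential(             # remaining shared layers (Θ2)
            nn.Conv2d(128, 128, 3), nn.ReLU(),
            nn.Conv2d(128, 256, 3), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Flatten(), nn.LazyLinear(4096), nn.ReLU(), nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.early_aesthetic = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_classes))  # W_a'
        self.aesthetic = nn.Linear(4096, num_classes)   # W_a
        self.semantic = nn.Linear(4096, num_attrs)      # W_s

    def forward(self, x):
        low = self.theta1(x)                     # features supervised by all three tasks
        shared = self.theta2(low)
        return self.aesthetic(shared), self.early_aesthetic(low), self.semantic(shared)

The three outputs would feed, respectively, the first, second and third terms of Eqn. (8), with the semantic term weighted by λ = 2/M; dropping the early head recovers the MTCNN #1 structure.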

4 Experiments

In this section, we evaluate the proposed method on the challenging large-scale AVA dataset. Experimental results show the benefits of semantic information and the effectiveness of our proposed method.

4.1 AVA Dataset

The AVA dataset [22] is one of the largest and most challenging datasets for visual aesthetic quality assessment. It contains more than 255,000 images gathered from www.dpchallenge.com. Each image is scored by about 200 voters, each assigning an aesthetic score from one to ten. In addition, each image carries 0, 1 or 2 semantic tags (attributes). We select the 185,751 images used in this paper according to two rules: 1) more than 3,000 images are available for each tag; and 2) each image contains at least one tag. Eventually 29 semantic tags are chosen. From the 185,751 images, 20,000 images are randomly selected as the testing set, similar to [15], and the remaining 165,751 images form the training set. For aesthetic labels, we follow the experimental setup of [15, 22]: the training set is divided into two classes, high-quality and low-quality images. We designate images with an average score larger than 5 + δ as high-quality images and those with an average score smaller than 5 − δ as low-quality images; images with an average score between 5 − δ and 5 + δ are discarded. We set δ to 0 and 1 respectively for the training set to obtain the ground-truth labels, yielding 165,751 training images when δ = 0 and 38,994 training images when δ = 1. We set δ to 0 for the testing set regardless of the value of δ used for training. For semantic labels, each image is labeled with a 29-dimensional binary vector.
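As an illustration of this labeling scheme, the sketch below shows how the binary aesthetic label with threshold δ and the 29-dimensional multi-hot semantic vector could be built; it is a hypothetical preprocessing helper, and the truncated tag list and the way the raw AVA metadata is passed in (an average score plus a list of tag strings) are assumptions, not the authors' pipeline.

from typing import List, Optional

TAGS = ["Landscape", "Nature", "Still Life", "Black and White"]  # ...29 tags in total

def aesthetic_label(avg_score: float, delta: float) -> Optional[int]:
    """1 = high quality, 0 = low quality, None = discarded (score inside the δ band)."""
    if avg_score > 5 + delta:
        return 1
    if avg_score < 5 - delta:
        return 0
    return None

def semantic_vector(tags: List[str]) -> List[int]:
    """Multi-hot vector; entry m is 1 iff the image carries the m-th semantic tag."""
    return [1 if t in tags else 0 for t in TAGS]

# Example: an image with average score 5.8 and tags ["Nature", "Landscape"] gets
# aesthetic label 1 when δ = 0 and a semantic vector starting [1, 1, 0, 0, ...].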

Table 1: Accuracy (%) of our MTCNN #1 with different λ on the AVA dataset.

δ   λ = 0   λ = 1/29   λ = 2/29   λ = 1   with early stopping
0   72.19   76.15      75.76      73.54   73.43
1   75.13   75.90      75.82      73.12   74.28

4.2 Evaluating the Effectiveness of the Keeping-Balance Strategy

In the objective function, λ controls the contribution from semantic information. To validate our strategy of keeping the influence of the two tasks balanced, we implement our MTCNN #1 with our strategy λ = 1/M (here λ = 1/29) and compare its results with MTCNN #1 using λ = 0, λ = 2/29, λ = 1, and the early stopping strategy (shown in Table 1). Comparing the results with and without the supervision of semantic labels, MTCNN #1 with λ ≠ 0 performs better than with λ = 0, which indicates that the supervision is effective. Moreover, the results in Table 1 demonstrate that our strategy λ = 1/29 performs best for both values of δ. When λ = 1/29, the aesthetic and semantic tasks have the same effect on back propagation, which verifies the effectiveness of our strategy. To further demonstrate the effectiveness of our MTCNN with this strategy, we also analyze the accuracy on each semantic tag using MTCNN #1 with different settings of λ in Fig. 4. As shown, MTCNN #1 with λ = 1/29 performs best on the overall images and on most semantic tags. We also observe that the same method achieves different results on different semantic tags, and that the improvements brought by the MTCNNs also differ across semantic tags. For example, the semantic tags "Family" and "Snapshot" obtain a great improvement with the different methods.
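The per-tag breakdown behind Fig. 4 can be computed with a simple restriction of the test-set accuracy to images carrying each tag; the sketch below is illustrative only, and the array names pred, truth and tag_matrix are assumptions about how the predictions and the 29-dimensional tag vectors might be stored.

import numpy as np

def per_tag_accuracy(pred: np.ndarray, truth: np.ndarray, tag_matrix: np.ndarray) -> np.ndarray:
    """Accuracy restricted to images that carry each semantic tag.

    pred, truth: (N,) arrays of predicted and ground-truth 0/1 aesthetic labels.
    tag_matrix:  (N, M) multi-hot matrix of semantic tags.
    Returns an (M,) array whose m-th entry is the accuracy over images with tag m.
    """
    correct = (pred == truth).astype(float)
    counts = tag_matrix.sum(axis=0)
    return (tag_matrix * correct[:, None]).sum(axis=0) / np.maximum(counts, 1)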

Fig. 4: Accuracy on each semantic tag using MTCNN #1 with different λ (λ = 0, 1/29, 2/29, 1) when (a) δ = 0 and (b) δ = 1.

4.3 Evaluating the Benefits of Semantic Information

To evaluate our MTCNNs, which use semantic information to help aesthetic classification, we compare our results (λ fixed to 1/29 for the three basic MTCNNs and 2/29 for the enhanced MTCNN) with those of our single-task CNN (STCNN, i.e., MTCNN #1 with λ = 0) on the AVA dataset for both values of δ. As shown in Table 2, all four MTCNNs perform better than our STCNN, especially when δ = 0. Aesthetic quality classification with δ = 0 is more challenging than with δ = 1 [22]. These results demonstrate the effectiveness of semantic information.

Table 2: Accuracy (%) of different methods on the AVA dataset.

δ   Our STCNN   MTCNN #1   MTCNN #2   MTCNN #3   Enhanced MTCNN   [22]   SCNN [15]   DCNN [15]   RDCNN [15]
0   72.19       76.15      75.91      75.92      76.58            66.7   71.20       73.25       74.46
1   75.13       75.90      75.81      75.37      76.04            67.0   68.63       73.05       73.70

In addition, we analyze the results of the four MTCNNs to investigate how best to take advantage of semantic information and how semantic information influences the aesthetic task. We can see that the more supervision the semantic labels exert on aesthetic feature learning, the better the performance our MTCNN achieves. The results also reveal that the low-level features of MTCNN #3 still perform well. Therefore, while ensuring the effect of semantic information throughout the whole network, we enhance the aesthetic supervision in the two bottom layers. Experimental results show that our enhanced MTCNN performs best on the main task.

Fig. 5: Learned filters in the first convolutional layer with STCNN for the aesthetic task only and MTCNN #1 for the two tasks, for both δ = 0 and δ = 1. Panels: (a) δ = 0, STCNN (MTCNN, λ = 0); (b) δ = 0, MTCNN, λ = 2/29; (c) δ = 1, STCNN (MTCNN, λ = 0); (d) δ = 1, MTCNN, λ = 2/29.


To qualitatively demonstrate the benefits of our MTCNN with semantic information, we show in Fig. 5 the filters learned in the first convolutional layer by an STCNN for the aesthetic task only and by our MTCNN #1, for both δ = 0 and δ = 1. Compared to the filters learned without semantic information, the filters learned with semantic information are smoother, cleaner and more interpretable. The proposed MTCNN learns more color and high-frequency edge information than the STCNN. These differences can also be observed in the examples of test images correctly classified by the MTCNN but misclassified by the STCNN in Fig. 6. The high-quality images often have more vivid colors and clearer edges than the low-quality images, and most of the low-quality images in Fig. 6 are blurred and dull. This indicates that the supervision of semantic labels is very beneficial for aesthetic feature learning, and that the aesthetic and semantic tasks are related to some extent.

Fig. 6: Example test images correctly classified by the MTCNN but incorrectly classified by the STCNN. The images in the first and second rows are labeled high aesthetic quality; the images in the third and fourth rows are labeled low aesthetic quality.

4.4 Comparison with Other State-of-the-art Methods

To further validate our MTCNNs with semantic information for aesthetic classification, we compare our results with those of the state-of-the-art methods in [15, 22] on the AVA dataset. As shown in Table 2, all four MTCNNs perform better than the method in [22], SCNN [15], DCNN [15] and RDCNN [15] for both values of δ. The method in [22], the baseline of the AVA dataset, is implemented by extracting SIFT information [19] and training an SVM classifier. SCNN is a single-column CNN, DCNN is a double-column CNN with two inputs consisting of a global view and a local view, and RDCNN is a double-column CNN with an aesthetic column and a style column. The results in Table 2 thus illustrate the effectiveness of our MTCNNs with the semantic recognition task.

Furthermore, we also train a separate model for each semantic label to assess aesthetic quality. Because different semantic labels have different numbers of images, we only train four CNNs separately, for "Landscape", "Nature", "Still Life" and "Black and White", the four labels with the largest numbers of images among the 29 labels. We call the CNNs trained separately for these four semantic labels "respective CNNs"; for example, the respective CNN for "Landscape" is trained only on "Landscape" images for aesthetic categorization. Figure 7 shows the results of the different methods for aesthetic classification on "Landscape", "Nature", "Still Life" and "Black and White" separately, for both values of δ. As shown in Fig. 7, all the MTCNNs outperform the respective CNN on each semantic label, which again demonstrates the effectiveness of the proposed MTCNNs. Moreover, the MTCNNs do not need to know the semantic labels of the test images, while the respective CNNs do.

Fig. 7: Accuracy of the different methods (the respective CNN, MTCNN #1, MTCNN #2, MTCNN #3, and the enhanced MTCNN) for aesthetic classification on "Landscape", "Nature", "Still Life" and "Black and White" separately, with (a) δ = 0 and (b) δ = 1.

4.5 Inter-Task Correlation Analysis

To demonstrate again the effectiveness of MTCNNs with semantic information and to investigate how semantic information influences the aesthetic task, we analyze the correlation between the two tasks. Since each column vector of the task-specific matrix W = [W_a, W_s] in the network corresponds to the parameters of a subtask, we calculate the correlation coefficient between any two column vectors of the weight matrix W as the correlation between the corresponding subtasks. As shown in layer 7 of Fig. 2, in our problem the aesthetic classification task has two subtasks, high aesthetic and low aesthetic, and the semantic recognition task has 29 subtasks. Figure 8 presents the correlation between any two subtasks learned by MTCNN #1 with δ = 0, which again verifies that semantic information is beneficial for aesthetic estimation. As seen from Fig. 8, the low-aesthetic subtask has a strong negative correlation with the high-aesthetic subtask. We can also see that the aesthetic subtasks have high correlations with certain semantic attributes. For instance, recognition of the semantic tags "Snapshot" and "Candid" has a high positive correlation with the low-aesthetic subtask; in the real world, most "Snapshot" and "Candid" images are usually regarded as low aesthetic quality. In contrast, recognition of "Advertisement" and "Seascapes" has a positive correlation with the high-aesthetic subtask, which accords with the knowledge that most "Seascapes" and "Advertisement" images are usually taken as high aesthetic quality images. In addition, Fig. 8 also visualizes the correlations among the different semantic tag recognitions.
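A minimal sketch of this analysis follows, assuming the task-specific weights of layer 7 have been exported as NumPy arrays W_a of shape (d, 2) and W_s of shape (d, 29); the variable and function names are illustrative, not from the authors' code.

import numpy as np

def subtask_correlations(W_a: np.ndarray, W_s: np.ndarray) -> np.ndarray:
    """Correlation matrix between all subtasks (2 aesthetic + M semantic).

    Each column of W_a and W_s is treated as the parameter vector of one subtask;
    np.corrcoef computes the pairwise Pearson correlation coefficients between
    these column vectors.
    """
    W = np.concatenate([W_a, W_s], axis=1)      # shape (d, 2 + M)
    return np.corrcoef(W, rowvar=False)         # shape (2 + M, 2 + M)

# Example with random weights: 4096-dim features, 2 aesthetic and 29 semantic subtasks.
corr = subtask_correlations(np.random.randn(4096, 2), np.random.randn(4096, 29))
print(corr.shape)  # (31, 31)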


Fig. 8: Correlation between any two subtasks of aesthetic quality classification and semantic recognition, learned by MTCNN #1 with δ = 0.

5 Conclusion

In this paper, we have employed semantic information to help discover representations for aesthetic quality assessment by formulating a multi-task deep learning framework, so that aesthetic quality assessment is no longer treated as an isolated problem. To make full use of semantic information and to investigate how semantic information influences the aesthetic task, four MTCNNs have been developed to learn the aesthetic representation jointly under the supervision of aesthetic and semantic labels. At the same time, a strategy of keeping the effect of the two tasks balanced is presented to optimize the parameters of our MTCNNs. In addition, the correlations between the two tasks have been analyzed to investigate the role of semantic recognition in aesthetic quality assessment. Experimental results show that our MTCNNs outperform the state-of-the-art methods, demonstrating that semantic information is beneficial to aesthetic feature learning and that the high-level features in the network play an important role in aesthetic quality assessment. In the future, we will explore other useful factors to further improve aesthetic quality assessment.

References

1. Abdulnabi, A.H., Wang, G., Lu, J., Jia, K.: Multi-task CNN model for attribute prediction. IEEE TMM 17(11), 1949–1959 (2015)
2. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
3. Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV. vol. 1, pp. 1–2. Prague (2004)
4. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a computational approach. In: ECCV. pp. 288–301 (2006)
5. Datta, R., Li, J., Wang, J.Z.: Learning the consensus on visual quality for next-generation image management. In: ACM MM. pp. 533–536 (2007)
6. Datta, R., Li, J., Wang, J.Z.: Algorithmic inferencing of aesthetics and emotion in natural images: An exposition. In: ICIP. pp. 105–108 (2008)
7. Dhar, S., Ordonez, V., Berg, T.L.: High level describable attributes for predicting aesthetics and interestingness. In: CVPR. pp. 1657–1664 (2011)
8. Jaakkola, T.S., Haussler, D., et al.: Exploiting generative models in discriminative classifiers. In: NIPS. pp. 487–493 (1999)
9. Joshi, D., Datta, R., Fedorovskaya, E., Luong, Q.T., Wang, J.Z., Li, J., Luo, J.: Aesthetics and emotions in images. IEEE Signal Processing Magazine 28(5), 94–115 (2011)
10. Kao, Y., Wang, C., Huang, K.: Visual aesthetic quality assessment with a regression model. In: ICIP. pp. 1583–1587 (2015)
11. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In: CVPR. pp. 419–426 (2006)
12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. pp. 1097–1105 (2012)
13. Liu, W., Mei, T., Zhang, Y., Che, C., Luo, J.: Multi-task deep visual-semantic embedding for video thumbnail selection. In: CVPR. pp. 3707–3715 (2015)
14. Locher, P.J.: The aesthetic experience with visual art at first glance. In: Investigations Into the Phenomenology and the Ontology of the Work of Art, pp. 75–88. Springer (2015)
15. Lu, X., Lin, Z., Jin, H., Yang, J., Wang, J.Z.: RAPID: Rating pictorial aesthetics using deep learning. In: ACM MM. pp. 457–466 (2014)
16. Luo, W., Wang, X., Tang, X.: Content-based photo quality assessment. In: ICCV. pp. 2206–2213 (2011)
17. Luo, Y., Tang, X.: Photo and video quality evaluation: Focusing on the subject. In: ECCV. pp. 386–399 (2008)
18. Marchesotti, L., Murray, N., Perronnin, F.: Discovering beautiful attributes for aesthetic image analysis. IJCV 113(3), 246–266 (2015)
19. Marchesotti, L., Perronnin, F., Larlus, D., Csurka, G.: Assessing the aesthetic quality of photographs using generic image descriptors. In: ICCV. pp. 1784–1791 (2011)
20. Marchesotti, L., Perronnin, F., Meylan, F.: Learning beautiful (and ugly) attributes. In: BMVC. vol. 7, pp. 1–11 (2013)
21. Mullin, C., Hayn-Leichsenring, G., Wagemans, J.: There is beauty in gist: An investigation of aesthetic perception in rapidly presented scenes. Journal of Vision 15(12), 123–123 (2015)
22. Murray, N., Marchesotti, L., Perronnin, F.: AVA: A large-scale database for aesthetic visual analysis. In: CVPR. pp. 2408–2415 (2012)
23. Nishiyama, M., Okabe, T., Sato, I., Sato, Y.: Aesthetic quality classification of photographs based on color harmony. In: CVPR. pp. 33–40 (2011)
24. Niu, Y., Liu, F.: What makes a professional video? A computational aesthetics approach. IEEE Transactions on Circuits and Systems for Video Technology 22(7), 1037–1049 (2012)
25. Tang, X., Luo, W., Wang, X.: Content-based photo quality assessment. IEEE TMM 15(8), 1930–1943 (2013)
26. Wang, Y., Dai, Q., Feng, R., Jiang, Y.G.: Beauty is here: Evaluating aesthetics in videos using multimodal features and free training data. In: ACM MM. pp. 369–372 (2013)
27. Wu, O., Hu, W., Gao, J.: Learning to predict the perceived visual quality of photos. In: ICCV. pp. 225–232 (2011)
28. Yeh, H.H., Yang, C.Y., Lee, M.S., Chen, C.S.: Video aesthetic quality assessment by temporal integration of photo- and motion-based features. IEEE TMM 15(8), 1944–1957 (2013)
29. Yim, J., Jung, H., Yoo, B., Choi, C., Park, D., Kim, J.: Rotating your face using multi-task deep neural network. In: CVPR. pp. 676–684 (2015)
30. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. pp. 818–833 (2014)
31. Zhang, L., Gao, Y., Zimmermann, R., Tian, Q., Li, X.: Fusion of multichannel local and global structural cues for photo aesthetics evaluation. IEEE TIP 23(3), 1419–1429 (2014)
32. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: ECCV. pp. 94–108 (2014)