

Multimed Tools Appl https://doi.org/10.1007/s11042-018-5968-7

A one-to-many conditional generative adversarial network framework for multiple image-to-image translations Chunlei Chai 1 & Jing Liao 1 & Ning Zou 1 & Lingyun Sun 1

Received: 27 September 2017 / Revised: 16 March 2018 / Accepted: 3 April 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract Image-to-image translation has been proposed as a general form of many image learning problems. While generative adversarial networks have been applied successfully to many image-to-image translations, many models are limited to specific translation tasks and struggle to satisfy practical needs. In this work, we introduce a One-to-Many conditional generative adversarial network, which can learn from heterogeneous sources of images. This is achieved by training multiple generators against a single discriminator in a synthesized learning scheme. The framework lets the generative models produce images for each source, so that output images follow the corresponding target patterns. Two implementations of the synthesized adversarial training scheme, hybrid fake and cascading learning, are proposed and tested on two benchmark datasets, UTZap50K and MVOD5K, as well as a new high-quality dataset, BehTex7K. We consider five challenging image-to-image translation tasks: edges-to-photo and edges-to-similar-photo translation on UTZap50K, cross-view translation on MVOD5K, and grey-to-color and grey-to-Oil-Paint translation on BehTex7K. We show that both implementations faithfully translate one image into another in the edges-to-photo, edges-to-similar-photo, grey-to-color, and grey-to-Oil-Paint tasks. The quality of output images in cross-view translation needs further improvement.

Keywords Image-to-image translation · Generative adversarial network · One-to-many conditional generative adversarial network · Deep learning

* Ning Zou, Chunlei Chai, Jing Liao, Lingyun Sun


Laboratory of CAD&CG, Zhejiang University, Hangzhou, China


1 Introduction

Many image learning problems in image processing, computer graphics, and computer vision can be viewed as image-to-image translation [9, 37]. Image-to-image translation is defined as the problem of translating one possible representation of a scene into another, given sufficient training data [9]. Traditionally, input images are preprocessed with feature extraction techniques at the initial stage of translation. Deep Neural Networks (DNN) eliminate the need for a feature extraction procedure and support end-to-end image learning, where the model learns from raw inputs and outputs the desired images. The Generative Adversarial Network (GAN) is a generative framework based on adversarial training between a generative DNN (the Generator, G), which captures the data distribution, and a discriminative DNN (the Discriminator, D), which evaluates whether an instance comes from the real distribution rather than from G [8].

However, several challenges arise. First, a GAN cannot take advantage of conditioned data in observations (Section 3.1). Previous studies explored multi-GAN models for learning multiple data distributions, but these models have not been combined with conditional models. Second, most GAN frameworks are built to capture a single mode of the image distribution, yet many interesting problems are naturally one-to-many mappings [7, 18].

We use the conditional GAN (cGAN) to address the first challenge. The cGAN is useful for learning conditional probability distributions, and shows strength in edges-to-photo translation (edges-to-shoes [9], edges-to-handbags [9]), weather transfer (de-raining [34], de-hazing [1]), image editing [20], style transfer [9, 15], and multimodal representation learning [23]. However, cGAN has not been tested on complicated tasks, e.g. grey-to-ArtTexture and cross-view translations. For the second challenge, multiple GANs could be used [7, 13], but the computational and spatial costs are high. Alternatively, multiple images could be concatenated into one large representation as input [3], but complicated pre- and post-fusion techniques may be required. In addition, learning from the hybrid representation poses further challenges.

To solve these problems, we propose a One-to-Many framework to capture multiple distributions of images. In the One-to-Many framework, multiple generators are trained individually on different distributions to fool a single discriminator, which is enforced by a synthesized adversarial training mechanism. We provide two ways of implementing the synthesized adversarial training: using a hybrid fake instance, or cascading the learning of each generator. The generators and the discriminator learn the conditioned distribution of the data through adversarial training. We evaluate the One-to-Many cGAN on three image datasets: UTZap50K, MVOD5K, and BehTex7K. The results show that our framework is applicable to large-scale multiple image-to-image translations. Our work is also related to multi-domain image distribution learning and cross-domain learning, when an appropriate network is adopted as the meta-model in the framework.

The rest of this paper is organized as follows. Section 2 reviews related research on GAN and image learning for multiple translations. Section 3 provides essential background on GAN and its extensions to help explain the effectiveness of the One-to-Many cGAN model, which is introduced in Section 4. Section 5 presents the experimental datasets. Section 6 describes the experiment settings, and Section 7 presents the experimental results. In Section 8, we discuss some findings in detail. Finally, conclusions and future work are given in Sections 9 and 10.


2 Related work

2.1 Image-to-image translation with deep neural networks

Many problems in graphics and vision are defined as image-to-image translations [9, 37]. Broadly, image-to-image translations fall into six clusters: colorization, photo-to-edges and edges-to-photo, photo-to-segments and labels-to-photo, scene-to-scene, style transfer, and other atypical translations. Since image-to-image translation is a general process, this classification can hardly cover every example. It does, however, examine pattern changes at the pixel level during translation, and provides a preliminary classification by grouping translations with similar patterns into one class (Table 1).

(1) Colorization

The term colorization is used generically to describe a computer-aided process of adding color to grayscale images or videos [10]. Unlike style transfer, colorization does not need to trade off original content against style, so fewer restrictions need to be satisfied during data distribution learning.

(2) Scene-to-scene

In image processing, computer vision, and computer graphics, a scene is described as a representation of an image, which may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. [9]. Translating from one viewpoint to another is also a scene-to-scene translation (e.g. from ground level to in-air). Scene-to-scene translation is related to geographic remote sensing, e.g. Day-to-Night [9] and Aerial-to-Map [9].

(3) Style transfer

Transferring the style of one image onto another is a long-standing problem, whose goal is to render a reference style from a source image onto another image while preserving the semantics and structure of the target image [6, 17]. Gatys et al. discovered that deep convolutional neural networks can create deep image representations that explicitly represent semantic information and separate image content from style [6]. Yet local distortions may arise in the transferred

Table 1 Examples of six categories of image-to-image translations

Category | Examples
(1) Colorization | adding color to greyscale images
(2) Scene-to-scene | map to aerial photo and aerial to map, multi-view
(3) Style transfer | day to night, rendering artistic style, weather-degraded image to normal scene (e.g. de-raining, defogging)
(4) Photo-to-edges and edges-to-photo | edges to handbag, edges to shoe, edges to cats, 'auto-painter'
(5) Photo-to-labels and labels-to-photo | semantic labels to photo, photo to semantic labels, architectural labels to photo
(6) Others | future state prediction, cross-domain translation


results. Luan et al. constrained the transformation to be locally affine in color space [17], which successfully suppressed distortion and allowed styles to be transferred accurately in a broad variety of scenarios, including time of day [9], weather [1, 26, 34], season, and artistic edits [17].

(4) Photo-to-edges and edges-to-photo (object generation)

Object generation may be the most challenging task among image-to-image translations (it demands a huge amount of training data and yields many of the worst failures), and also the most inspiring one: a network trained on contour images yielded decent generalized results on human sketches [9]. This was first achieved by training a cGAN on large-scale aligned 'edges-groundTruth' image pairs, including commercial products such as shoes and handbags [9, 11, 26], animals such as cats [9], and artistic works [15, 17]. An interesting finding is that these methods generalize well to human sketches [9], even when the original shape differs greatly from the sketch.

(5) Photo-to-labels and labels-to-photo

Photo-to-labels or photo-to-segments translation can be viewed as a semantic segmentation task. Semantic segmentation is the process of classifying each pixel with a semantic label. For instance, 'color all cars blue and all pedestrians red' is a classic semantic segmentation task in autonomous driving systems. Recovering the original content from a labeled photo is close to the content generation in (4), which maps groups of labeled pixels onto realistic distributions. One example is generating realistic images from RGBD representations (e.g. blue: X; green: Y; red: Z) [24].

(6) Others

In some image-to-image translations, including future state prediction (mentioned in [9]) and cross-domain translation [11], inputs and ground truths do not differ in the description of the subject (e.g. the color of shoes), but may differ in the subject being described (e.g. the input is a handbag while the output is a shoe). In addition, some image-to-image translations do not aim to produce a fine imitation of the target images, but rather to create images that can be perceived as plausible representatives of them. For example, the Creative Adversarial Network exhibited a kind of 'human creativity' in its output artistic images [4]. The perceptual study showed that human subjects could not distinguish these artificial artworks from art created by contemporary artists.

Before image-to-image translation was defined as a general problem, generative adversarial nets (GANs) had been widely studied in colorization [10, 32] and style transfer [6]. Conditional GAN was then developed as a general solution to various image-to-image translation tasks, including labels-to-street-scene, labels-to-facade, BW-to-color, aerial-to-map, day-to-night, and edges-to-photo [9]. However, GAN and cGAN require paired training images, and the translation is one-directional. Take style transfer as an example (A denotes a raw style, B denotes a target style): given sufficient A-B pairs of images, a well-trained model can only transfer from A to B, and the B-to-A translation requires training an additional model. To overcome these limitations, unpaired image learning methods have been proposed, e.g. CycleGAN [37] and DualGAN [29], which use two GANs, each learning a mapping from one set of images to another (e.g. edges to photos). Fundamentally, however, these models are limited to one-to-one mappings. The One-to-Many cGAN extends translation to multiple distributions. Moreover, our model can readily incorporate GAN variants by replacing the cGAN.


2.2 Conditioned and multiple generative adversarial network

One limitation of GAN concerns learning from conditioned data, where the condition can be any kind of auxiliary information describing the data [12]. In image-to-image translations, the conditioned variable is defined as the observed images [9]. Existing methods for conditional distribution learning modify the inputs of the generators and discriminators in GAN, and are called conditional GAN or conditioned GAN [9, 12, 18].

Another limitation of GANs concerns multi-dimensional or joint distribution learning, where multiple targets have to be satisfied simultaneously, as in many multiview and multimodal learning tasks [16]. Several improvements have been proposed recently. Among them, using multiple learning models, each trained on an individual distribution, is a popular approach. It is based on the observation that combining several networks trained from different initializations improves results significantly [25]. One example is the coupled GAN (CoGAN) [13], which aims to learn a joint distribution of images in two domains by independently training unconditional GAN1 and GAN2 on diverse domains (Fig. 1). However, different data distributions are normally highly related to each other in real scenarios, e.g. the left view and the right view of an object are highly similar. So G1 and G2 can plausibly be evaluated by a single but powerful discriminator that supervises the performance of both. Reducing the number of discriminators also reduces network capacity.

The Multi-Agent Diverse GAN (MAD-GAN) [7] explicitly uses multiple generators to capture each mode individually and one discriminator for adversarial training. The aim of the MAD-GAN is to avoid the tricky 'mode collapse' problem [7] when generating instances, while capturing diverse modes of the true data distribution. A similarity-based competing mechanism is developed to ensure that diverse modes of data are generated separately (Fig. 2). If the instances from one generator are not sufficiently different from those of the others, a constraint is activated which requires the discriminator score for each generator to be higher than that of all other generators by a margin proportional to the similarity score.

Both the CoGAN and the MAD-GAN extend GAN to multi-distribution learning, but they cannot learn auxiliary or latent information in the observed data. In addition, the CoGAN and MAD-GAN sacrifice instance diversity for the capture of true data modes. Yang et al. [28] and Liu et al. [14] described a tradeoff between similarity and diversity of instances from a distribution, obtained by sharing information among multiple learning tasks from the perspective of multimedia analysis. Chen et al. introduced the multi-view BiGANs (MV-BiGAN), extending the GAN framework to multi-view data learning [3]. This is

Fig. 1 General architecture of CoGAN, MAD-GAN, MV-BiGAN, and One-to-Many-cGAN


Fig. 2 Mechanism of CoGAN, MAD-GAN, MV-BiGAN and One-to-Many-cGAN learning diverse distribution of data. The Similarity Based Competing Objective(SBCO) enforces MAD-GAN generating dissimilar instances where dissimilarity is defined by a task-specific function

achieved by aggregating different views of the inputs into an aggregation space, yielding a multi-view model that can deal with any subset of views. It can handle missing views and can update its prediction when additional views are provided. The MV-BiGAN can also learn from a conditioned distribution, by introducing an additional encoder and training two encoders that map the conditioned probability distribution.

3 Preliminaries

To support a theoretical analysis of how the One-to-Many cGAN framework works, we briefly review GAN and its extensions. We explain the differences between the models by analyzing their optimization goals and corresponding objective functions. The architecture of these models is shown in Fig. 3 and the training mechanism is shown in Fig. 4. To allow a consistent comparison between different models, we modify the notation of the original studies as follows: Pdata represents the true data distribution, Pz the distribution of noise, and Pg the distribution learned by generator G. G(x) follows the distribution of x learned by G,

Fig. 3 One-to-many cGAN architecture


Fig. 4 Training framework of hybrid fake (left) and cascading learning (right)

and D(x) defines the probability that x comes from Pdata rather than Pg (The samples produced by G implicitly form Pg [8], in other words, the distribution of G(x) is Pg).

3.1 Conditional generative adversarial networks

According to [8], the goal of GAN is:

$$\min_G \max_D \mathcal{L}_{GAN}(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$



Beginning with random noise z, GANs can produce a fine estimate of the true data distribution, i.e. G(z) = x. If prior knowledge of the data distribution is available as a condition, the GAN defined in (1) cannot exploit it, because its optimization objective is independent of the condition variable x. In a conditional GAN (cGAN), the objective of generator G is to fit a conditional probability distribution [18]. The objective of cGAN is similar to (1), except that cGAN also observes the condition [18]:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, y \sim P_{data}(x, y)}[\log D(x, y)] + \mathbb{E}_{x \sim P_{data}(x),\, z \sim p_z(z)}[\log(1 - D(x, G(x, z)))] \tag{2}$$

where x denotes the condition variable and y plays the role of x in (1).
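The cGAN objective in (2) can be estimated from a batch of discriminator outputs. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `cgan_value`, the clipping constant `eps`, and the toy score values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def cgan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of L_cGAN(G, D) from discriminator outputs.

    d_real: D(x, y) evaluated on real pairs, values in (0, 1)
    d_fake: D(x, G(x, z)) evaluated on generated pairs, values in (0, 1)
    eps:    clipping constant that keeps the logarithms finite (assumption)
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives a value of -2 * log(2).
undecided = cgan_value([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes the value towards 0.
confident = cgan_value([0.99, 0.98], [0.01, 0.02])
```

In this reading, D is trained to increase the value while G is trained to decrease it, which matches the min-max structure of (1).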

3.2 CoGAN, MAD-GAN, and MV-BiGAN

CoGAN

As described in Section 2.2, the CoGAN consists of two GANs that are trained independently on two marginal distributions pdata(x1) and pdata(x2). The objective of CoGAN is defined in [13], and can be reformulated as:

$$\min_{G_1, G_2} \max_{D_1, D_2} \; \mathbb{E}_{x_1 \sim P_{data}(x_1)}[\log D_1(x_1)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D_1(G_1(z)))] + \mathbb{E}_{x_2 \sim P_{data}(x_2)}[\log D_2(x_2)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D_2(G_2(z)))] \tag{3}$$

This goal forces the generated images to resemble, individually, the images in the corresponding domains. The CoGAN can be easily generalized to multiple image domains by increasing the number of GANs it contains.

MAD-GAN

The MAD-GAN is also derived from GAN and can learn multiple data distributions. Unlike CoGAN, which pairs multiple generators with multiple discriminators, the MAD-GAN consists of only one discriminator against multiple generators. Substituting Gi(x) for Pgi, the objective of MAD-GAN in [7] can be written as:

$$\min_{G_1, G_2, \ldots, G_k} \max_D \; \mathbb{E}_{x \sim P_{data}(x)}[\log D_{k+1}(x)] + \sum_{i=1}^{k} \mathbb{E}_{x \sim P_{data}(x),\, z \sim p_z(z)}[\log(1 - D_{k+1}(G_i(x, z)))] \tag{4}$$

This term is derived from the standard GAN objective function, by assuming $P_{g_{k+1}} := P_{data}$ to avoid notational clutter [7].

MV-BiGAN

The Multi-View BiGAN (MV-BiGAN) is derived from the Bidirectional Generative Adversarial Network (BiGAN), which contains an encoder E that captures a bidirectional mapping between the input space and a latent representation space, a generator G that maps any point in the latent space to a possible object in the output space, and a discriminator D. The BiGAN removes the limitation that a GAN cannot retrieve the latent representation in order to exploit the learned manifold. The objective of the BiGAN framework can be represented as:

$$\min_{G, E} \max_D \; \mathbb{E}_{x \sim P_{data}(x),\, z \sim p_E(z|x)}[\log D(x, z)] + \mathbb{E}_{z \sim p(z),\, x \sim p_G(x|z)}[\log(1 - D(x, z))] \tag{5}$$

where PE denotes the distribution learned by the encoder and PG the distribution learned by the generator, i.e. Pg. This goal enforces optimization between PE(x, z) and PG(x, z), so that the BiGAN can model the joint distribution of (x, z) pairs. However, the BiGAN cannot be applied to conditional distributions, which makes learning a joint distribution difficult. To solve this problem, the MV-BiGAN extends a BiGAN (comprising G, E, and D1) with an additional encoder H for capturing the conditional distribution, and an additional discriminator D2 for adversarial training with E and H. Merging the objective functions of the two adversarial problems (G, E against D1, and E, H against D2) and discarding the complex KL regularization terms, the objective of the MV-BiGAN can be simplified as:

$$\min_{G, E, H} \max_{D_1, D_2} \; \mathbb{E}_{x, y \sim p_{data}(x, y),\, z \sim p_E(z|x)}[\log D_1(x, z)] + \mathbb{E}_{z \sim p(z),\, x \sim p_G(x|z)}[\log(1 - D_1(x, z))] + \mathbb{E}_{x, y \sim p_{data}(x, y),\, z \sim p_E(z|x)}[\log D_2(y, z)] + \mathbb{E}_{x, y \sim p_{data}(x, y),\, z \sim p_H(z|y)}[\log(1 - D_2(y, z))] \tag{6}$$

(We are interested in how the BiGAN is aggregated, so the regularization techniques for uncertainty reduction are not discussed here. The complete objective function is available in [3].) Since our main objective is not to learn latent representations of images, the MV-BiGAN is not suitable. The cGAN, which can learn conditioned data [9, 11, 15, 26, 34] and contains only a G and a D, is therefore of interest. Accordingly, we build a multi-distribution learning framework on cGAN.

4 One-to-many cGAN framework

To construct a model for learning multiple image-to-image translations, two branches of work need to be accomplished: (1) a versatile model for image-to-image translation, and (2) an integration model for multiple generations.


Branch (1) is primarily based on the cGAN in [9] as the meta-model for mapping the pixels of one image to another, because it performed prominently in various image-to-image translations [9, 11, 26]. In addition, cGAN achieved the best precision on multimodal representation [23]. Branch (2) is inspired by the idea of maintaining a set of generators in GAN for learning diverse modes of data [7] and by multi-view discriminative and structured dictionary learning with group sparsity [5]. We also refer to the collaborative learning in [33]: even an extremely sparse and inter-related image-label matrix can be learned efficiently by multiple classifiers working collaboratively, by leveraging the structure of the original collaborative learning formulation. We introduce a group of generators against a single discriminator as the structure of multiple GANs. Each generator is trained adversarially against the discriminator, forming a cGAN, so that multiple estimates of the data distribution can be produced. All estimates are evaluated and synthesized according to their respective losses. Accordingly, the full objective is re-derived in (9) and reformulated in (10).

4.1 General architecture

In general, x denotes the input image, y the output image, ŷ the ground truth image, and z a random noise vector [9, 18]. In image-to-image translations, y is usually referred to as a fake instance (fake), while ŷ is referred to as a real instance (real). The generator learns a mapping from the observed image x and the random noise vector z to ŷ, i.e. G: {x, z} → ŷ [9, 18]. The discriminator strives to distinguish y from ŷ, labeling all y as fake and all ŷ as real (marked as 0 and 1 respectively).

4.2 Multiple representations generation

There are two approaches to learning multiple modes of data. In the first, all modes of data are aggregated into one complex representation on which a traditional GAN is trained, as in MV-BiGAN. In the second, multiple networks are used as a group to learn diverse representations, with one network trained for each mode, as in CoGAN and MAD-GAN. The latter may resemble ensemble learning, in which the members observe a single representation and vote to produce a fine estimate. However, CoGAN and MAD-GAN use multiple networks to learn multiple modes, and each network generates an estimate for its own mode. We adopt the second approach because it eliminates complicated aggregation techniques. It is also more generalizable, in that it can be combined with other existing GAN variants to produce diverse instances [7]. To further reduce complexity, only one discriminator is kept against multiple generators in our framework. We introduce a synthesized adversarial training mechanism to accommodate the new architecture.

4.3 General objectives

In One-to-Many cGANs, the objective of each generator is to capture a true distribution of the data, so that the fakes and the reals are hardly distinguishable. The discriminator makes a judgement on a real and the corresponding fakes from all generators; its objective is to discriminate all fakes from the very similar real. Let X denote a set of input images {x1, x2, …}, Ym the m-th set of output images {y1, y2, …}, Ŷm the m-th set of target images {ŷ1, ŷ2, …}, and Gm the m-th generator. Each Gm in the group learns a mapping from observed images Xi ⊆ X to targets Ŷi ⊆ Ŷm and produces fakes Yi ⊆ Ym, where ⋃Xi = X, ⋃Yi = Ym, and ⋃Ŷi = Ŷm for all i, m ∈ {1, 2, …, M}. A detailed explanation of how an equilibrium of the adversarial objectives is reached is given in the Appendix.

The training goal of One-to-Many cGANs is derived from the traditional cGAN [18], taking into account the reduction of blur in images from [9] and the use of a group of generators to capture multiple distributions from [7]. The work in [9] explored the benefit of reducing blur by mixing the conditional GAN objective with a traditional L1 loss, and constructed a standard objective for image-to-image translation:

$$G_{pix2pix} = \min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G) \tag{7}$$



where LcGAN is the loss of the adversarial network {G, D} defined in (2), and λ controls the degree to which the L1 term enforces low-frequency correctness [9]. The L1 term penalizes the distance between the fake and the real image [9]:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x, y \sim P_{data}(x, y),\, z \sim p_z(z)}[\lVert y - G(x, z) \rVert_1] \tag{8}$$


Optimizing over multiple generators can be understood as achieving a union of the goals of each pair {Gm, D} in (8):

$$\bigcup_{m=1}^{M} G_{pix2pix} \tag{9}$$



The union goal can be mathematically represented as the sum of each component regarding {Gm, D} [7]. Thus, the objective of the One-to-Many cGANs, achieving optimization between a group of generators {G1, G2, …, GM} and a discriminator, can be represented as:

$$G_{One2Many\text{-}cGAN} = \min_{G_1, G_2, \ldots, G_M} \max_D \sum_{m=1}^{M} \mathcal{L}_{cGAN}(G_m, D) + \lambda \mathcal{L}_{L1}(G_m) \tag{10}$$
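The summed objective in (10) can be illustrated numerically. This is a minimal sketch under stated assumptions rather than the paper's code: the helper names, the dictionary keys, the mean-absolute-error normalization of the L1 term, and the value `lam=10.0` are all hypothetical choices made for the example.

```python
import numpy as np

def cgan_term(d_real, d_fake, eps=1e-12):
    """Estimate L_cGAN(G_m, D) from discriminator outputs in (0, 1)."""
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def l1_term(target, fake):
    """Estimate L_L1(G_m) as the mean absolute pixel difference (assumed normalization)."""
    diff = np.asarray(target, dtype=float) - np.asarray(fake, dtype=float)
    return float(np.mean(np.abs(diff)))

def one_to_many_objective(per_generator, lam):
    """Sum L_cGAN(G_m, D) + lam * L_L1(G_m) over all generators.

    per_generator: one dict per generator G_m with the hypothetical keys
        'd_real', 'd_fake', 'target', 'fake'.
    """
    total = 0.0
    for g in per_generator:
        total += cgan_term(g["d_real"], g["d_fake"])
        total += lam * l1_term(g["target"], g["fake"])
    return total

# Toy example with M = 2 generators and 2x2 single-channel "images".
target = np.ones((2, 2))
batch = [
    {"d_real": [0.9], "d_fake": [0.1], "target": target, "fake": target},        # perfect fake
    {"d_real": [0.9], "d_fake": [0.1], "target": target, "fake": target - 0.5},  # off by 0.5
]
value = one_to_many_objective(batch, lam=10.0)
```

Each generator contributes its own adversarial and L1 terms, while the same discriminator scores every fake, which is the structural difference from training M separate pix2pix models.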



4.4 Synthesized adversarial training

We organize the generators in an ensemble-like fashion, in which each generator is assigned an equal weight for reaching the optimal solution of the network [7]. An ensemble of GANs can model the true data distribution better than a single GAN: it is more accurate, robust, and flexible [25]. Evenly weighted generators ensure that the optimal generator learns the true data distribution as a mixture of multiple distributions in which each distribution carries the same weight [7]. In addition, a single discriminator against an ensemble-like group of generators greatly reduces the computational cost compared with using multiple discriminators against many generators individually.

To clarify how synthesized adversarial training works, we define a general example of the One-to-Many GAN, which consists of a discriminator D and a group of generators {Gm, m = 1, 2, …, M}. The objective of Gm is to minimize LcGAN of {Gm, D}, while the objective of D is to maximize the mean of all LcGAN. Optimizing this framework raises two problems: (1) how Gm is evaluated and adjusted so that each generator provides a fine estimate PGm of the corresponding true data distribution Pm, i.e. Gm(xm) = ym := ŷm; and (2) how D evaluates and learns from the M fakes generated by the generators, i.e. D(ym) := PGm and D(ŷm) := Pm.


For the first problem, we aggregate all fakes into a hybrid fake, which can then be treated as a normal fake for training. Alternatively, D evaluates each fake, yielding M losses, which can be aggregated into the final loss (e.g. by a weighted sum). For the second problem, we correspondingly update D based on the hybrid fake, or update it successively from the M fakes. The general scheme of synthesized adversarial training is shown in Algorithm 1.

4.4.1 Hybrid fake

A characteristic of image-to-image translations is that pixels are mapped along the color space or gray space, while the position of a specific pixel domain does not change. In other words, the object in the input image may change its color, but it is not translated or rotated during translation. Even in edge-to-image translation, where the contents of the object are filled in during translation, the contour of the object remains unchanged. Thus, we synthesize fakes exclusively along the color space, so that the spatial relationship of pixels is preserved. A weighted average color, representing the diverse colors at each pixel, is used to construct the hybrid fake. The hybrid fake can be constructed as follows:

$$I = \sum_{m=1}^{M} \omega_m * I_m \tag{11}$$



where ωm is the weighting factor with which the fake instance Im contributes to the hybrid fake. ωm is set to 1/M, because each Im is produced from the distribution learned by Gm. According to [7], the objective function for training the generators attains its global optimum if every distribution component is assigned an equal weight of 1/k, where k denotes the number of generators.

Training One-to-Many cGANs with the hybrid fake is similar to training a traditional GAN. We first update D towards minimizing the loss of the GAN and L1 terms (8), using the hybrid fake. Then we update each generator in the opposite direction. The hybrid fake is used only in the learning of D, because we expect each generator to be trained independently on its own distribution of data. Therefore, the individual fake, rather than the hybrid fake constructed from fakes of heterogeneous distributions, is used to update each generator. Note that the output images currently used to evaluate the generators and the discriminator are not the same fake: for any fixed generator, the output in LcGAN is the individual fake G(xm). For clarity, the general objective in (10) is expressed separately for the two sides. The objective function of the group of generators is to minimize:

$$\min_{G_1, G_2, \ldots, G_M} \; \mathbb{E}_{x, y \sim p_{data}(x, y)}[\log D_{M+1}(x, y)] + \sum_{m=1}^{M} \left( \mathbb{E}_{x \sim p_{G_m}(x)}[\log(1 - D_{M+1}(x, y))] + \lambda \mathbb{E}_{x, y \sim p_{G_m}(x, y)}[\lVert y - G_m(x) \rVert_1] \right) \tag{12}$$

As for the objective of the discriminator, the hybrid fake I is used to maximize:

$$\max_D \; \mathbb{E}_{y \sim p_{data}(y)}[\log D_{M+1}(I, y)] + \sum_{m=1}^{M} \left( \mathbb{E}_{x, y \sim p_{G_m}(x, y)}[\log D_m(x, y)] + \lambda \mathbb{E}_{x, y \sim p_{G_m}(x, y)}[\lVert y - G_m(x) \rVert_1] \right) \tag{13}$$

where Gm is the m-th generator in the group, PGm is the distribution learned by the m-th generator, and Dm represents the m-th index of the distribution learned by the discriminator D(x; θd), with $\sum_{m=1}^{M+1} D_m = 1$.
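The construction of the hybrid fake in (11) amounts to a per-pixel weighted average over the generators' outputs. The sketch below is illustrative only: the function name `hybrid_fake`, the (height, width, channel) array layout, and the toy pixel values are assumptions, while the default weight of 1/M follows the text.

```python
import numpy as np

def hybrid_fake(fakes, weights=None):
    """Blend M fake images into one hybrid fake along the color values only.

    fakes:   sequence of M arrays with identical shape, e.g. (H, W, C)
    weights: optional sequence of M weights; defaults to 1/M each
    """
    stacked = np.stack([np.asarray(f, dtype=float) for f in fakes])  # (M, H, W, C)
    m = stacked.shape[0]
    if weights is None:
        weights = np.full(m, 1.0 / m)
    weights = np.asarray(weights, dtype=float)
    # Weighted sum over the generator axis; pixel positions are left untouched,
    # so the spatial layout of the image is preserved.
    return np.tensordot(weights, stacked, axes=1)

# Two toy 2x2 RGB fakes with constant color values 0.2 and 0.6.
fake_1 = np.full((2, 2, 3), 0.2)
fake_2 = np.full((2, 2, 3), 0.6)
hybrid = hybrid_fake([fake_1, fake_2])  # every pixel becomes the average color
```

Because the averaging acts on each pixel independently, contours shared by all fakes survive in the hybrid image, in line with the observation that translation changes color but not position.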


4.4.2 Cascading learning

In cascading learning, each Gm is trained against D as in the hybrid fake scheme. However, D is not trained through a synthesized instance, but is adjusted successively against {G1, G2, …, GM} during a training epoch. Without a hybrid fake, training One-to-Many cGANs with cascading learning is similar to training a CoGAN, where a tuple of GANs {G1, D1}, {G2, D2}, … is trained on diverse image domains. There are two distinctions between CoGAN and One-to-Many cGANs: (1) we use cGAN instead of GAN to learn the conditioned distribution; (2) we use a single discriminator rather than {D1, D2, …}. We first update D towards minimizing (8) using a generator Gm. Then we update Gm in the opposite direction. The m-th fake is used for the optimization of both Gm and D. We repeat this process on D and the next generator Gm+1, until every pair {Gm, D} has been updated. Accordingly, Eq. (10) is expanded using (2) and (8) into the training goal of One-to-Many cGANs with cascading learning:

$$\min_{G_1, G_2, \ldots, G_M} \max_D \; \mathbb{E}_{x, y \sim p_{data}(x, y)}[\log D_{M+1}(x, y)] + \sum_{m=1}^{M} \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}[\log(1 - D_{M+1}(x, G_m(x, z))) + \lambda(\lVert y - G_m(x, z) \rVert_1)] \tag{14}$$

(The proofs of the Eqs. (12), (13), and (14) are provided in the appendix) The algorithm of hybrid fake training is shown in Algorithm 2 and Algorithm 3.
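The difference between the two training schemes lies in the order of updates within one iteration. The sketch below only lists that order as described in Sections 4.4.1 and 4.4.2; it performs no actual training, and the function names and step labels are hypothetical.

```python
def hybrid_fake_schedule(num_generators):
    """One iteration: D is updated once on the hybrid fake, then each G_m on its own fake."""
    steps = [("D", "hybrid fake")]
    for m in range(1, num_generators + 1):
        steps.append((f"G{m}", f"fake {m}"))
    return steps

def cascading_schedule(num_generators):
    """One iteration: D and G_m are updated as a pair on the m-th fake, for each m in turn."""
    steps = []
    for m in range(1, num_generators + 1):
        steps.append(("D", f"fake {m}"))
        steps.append((f"G{m}", f"fake {m}"))
    return steps

# With M = 2 generators, as used in the experiments.
hybrid_steps = hybrid_fake_schedule(2)
cascade_steps = cascading_schedule(2)
```

Under this reading, the hybrid fake scheme updates D once per iteration, whereas cascading learning updates D M times, once for each generator.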

5 Datasets

5.1 UT-Zappos50K

UT-Zappos50K (UTZap50K) [30] is a benchmark shoe dataset [7, 9, 26, 30], consisting of 50,025 catalog images collected from Zappos.com. The dataset was created in the context of an online shopping task, and was first used for edge-to-image translation in [9]. The dataset has four major categories: shoes, sandals, slippers, and boots, with each image annotated with functional types and individual brands. GIST and LAB color features are also provided [30]. In edge-to-photo translation, only the image data are required. Edges of images are extracted using Holistically-Nested Edge Detection [27] and the edge simplification processing in [9], and these edges serve as inputs in the experiments.

5.2 MVOD5K

MVOD5K is a benchmark dataset of multi-view objects [2]. Previous edge-to-photo translation focused on a single view of an object [9, 13, 15, 26]. Although multi-generation solutions [3, 7, 13] have been proposed, they were used to generate images in different styles but from an identical view. Multi-view learning is a significant but challenging multi-generation application. Thus, we explored edge-to-photo translation on MVOD5K. The dataset comprises 5000 images divided into 45 product categories (shoes, backpacks, eyeglasses, cameras, printers, guitars, pianos, coffee machines, vacuum cleaners, irons, etc.). There are 1827 different object instances in total (from the 45 categories), and each object has at least two images taken from different views. We use shoe images in the experiments because the shoes category comprises the most instances (237 subjects with 702 images). Details of the input-outputs lists are shown in Table 2.

5.3 BehTex7K

BehTex7K is a new dataset composed of 7064 high-quality images. Previous edge-to-image tasks were only tested on simple objects (shoes, handbags, etc.); complicated images, e.g. patterns, were under-explored. To collect up-to-date, high-quality pattern design works, we crawled Textile Design projects from the largest online design community, Behance.net. The URLs of 1080 projects were crawled, and 7025 images from 1079 projects were retrieved (the resource of the 629th project was unavailable). The images were then divided into patterns (2418 images), objects (780 images), and scenes (3505 images) of textile design works. Images of brand information (i.e. logos) and pure text, as well as blank and repetitive images, were excluded (322 images).

6 Experiments

In the One-to-Many cGAN, M denotes the number of generators in the network, which is greater than or equal to 2. When M = 1, the One-to-Many cGAN is equivalent to the cGAN in [9]. Accordingly, M styles of target images need to be prepared in advance for training M generators. We set M = 2 in the experiments. The target styles can be grey images, original photorealistic images, or artistic images processed with a filter. To examine the performance of the One-to-Many cGANs, the cGAN in [9] is used as the benchmark. In addition, image quality assessments and perceptual studies are used as metrics.

The architecture of the generator is U-net [9], where Convolution-BatchNorm-ReLU layers are used as modules. In U-net, skip connections between each layer i and layer n − i concatenate all channels, so that low-level information can be shared to reduce costs and distortions. As all input images are resized to 256 × 256, the generator consists of eight layers. We use PatchGAN as the discriminator because a patch-based GAN allows faithful conversion even if the sample size is small [9]. The patch size is set to 70 × 70 because it forces outputs to be sharp in both the spatial and spectral (colorfulness) dimensions, and yields higher quality than a 256 × 256 patch [9]. The general training flow of the One-to-Many cGAN is shown in Fig. 3. The architectures of the generator and discriminator are introduced in the Appendix.

We follow the approach in [9] to optimize our generators and discriminator, alternately performing gradient descent on D and Gm. SGD with mini-batches of images is used, and we use the Adam algorithm for training. The learning rate is 0.0002 and the momentum parameter for Adam is 0.5. All convolutions are 4 × 4 spatial filters applied with stride length 2. The weight on the L1 term (λ) is set to 100.

Images in the experimental datasets are organized as lists. An input-outputs list is composed of an input image, an output image in one style, and an output image in another style. The data splits and the input-outputs lists for each dataset are shown in Table 2.

Table 2 Details of experimental datasets

| Datasets | Train/Test Split | Input | Output 1 / Output 2 |
|---|---|---|---|
| UTZap50K | 23,988 / 202 | edges of shoe | shoe / similar shoe |
| MVOD5K | 214 / 7 | right-side, left-front | left-back, right-back, left-side, right-front / top, front |
| BehTex7K | 2342 / 76 | grayscale | color / Oil Paint* |

*Processed with Photoshop; the filter style is 'Oil Paint', configured with the following parameters: Stylization: 2.3; Cleanliness: 2.3; Scale: 0.8; Bristle Detail: 10.0; Angle: −60; Shine: 1.2

Algorithm 1 Synthesize Adversarial Training

    input I = (I(1), I(2), …, I(M)), epoch
    for t = 1:epoch
        for each generator Gm, m=1:M
            generate estimation of the m-th output: y(m) = Gm -> createFake(I(m))
        end for
        for m=1:M
            UpdateGenerators(Gm)
        end for
        UpdateDiscriminators(D)
    end for
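The control flow of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the ordering of the three steps; the generators and the two update routines are passed in as plain callables, and all names (`synthesized_adversarial_training`, `update_generators`, `update_discriminator`) are hypothetical stand-ins for the routines detailed in Algorithms 2 and 3.

```python
def synthesized_adversarial_training(inputs, generators, update_generators,
                                     update_discriminator, epochs):
    """Outer loop of Algorithm 1 (illustrative skeleton, no real networks).

    inputs               : list of M inputs I(m), one per generator
    generators           : list of M callables, generators[m](I(m)) -> fake y(m)
    update_generators    : callable(m, fakes), stands in for UpdateGenerators(Gm)
    update_discriminator : callable(fakes), stands in for UpdateDiscriminators(D)
    """
    history = []
    for _ in range(epochs):
        # Step 1: every generator produces its fake first.
        fakes = [g(i) for g, i in zip(generators, inputs)]
        # Step 2: each generator is updated in turn.
        for m in range(len(generators)):
            update_generators(m, fakes)
        # Step 3: the single shared discriminator is updated once.
        update_discriminator(fakes)
        history.append(fakes)
    return history


# Toy run with two "generators" that are simple arithmetic functions.
calls = []
history = synthesized_adversarial_training(
    inputs=[1, 2],
    generators=[lambda v: v + 1, lambda v: v * 2],
    update_generators=lambda m, fakes: calls.append(("G", m)),
    update_discriminator=lambda fakes: calls.append(("D",)),
    epochs=2,
)
print(history)
print(calls)
```

The recorded call order shows that all M generators are updated before the shared discriminator in every epoch, which is the property the algorithm relies on.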

6.1 Two implementations of the synthesize adversarial training

In this section, we introduce the implementations of hybrid fake and cascading learning for Synthesize Adversarial Training. The algorithms are shown below.

Algorithm 2 Updating Generators

    loop for m=1:M
        if use_hybrid_fake
            Obtain SFI* using weighted sum of all y(m): fake = Weighted(y(m), m=1:M; …)
            Generate judge of the SFI*: decision(m) = D -> makeDecision(fake)
        else if use_cascade_learn
            Generate judge of individual fake instance: decision(m) = D -> makeDecision(y(m))
        Obtain error of the judge: error_fake(m) = Evaluate(decision(m), label_fake)
        Obtain error of G (e.g. L1 term is used): error_L1(m) = Evaluate(…)
        Learning from judge: Gm->Backward(x, error_fake(m)+error_L1(m))
    end loop for
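The hybrid fake step of Algorithm 2 amounts to a pixel-wise weighted sum of the M fakes. The sketch below illustrates this on tiny nested-list "images"; the function name `hybrid_fake` is hypothetical, and the uniform default weights are an assumption made for the example, since the weighting is a free parameter of the scheme.

```python
def hybrid_fake(fakes, weights=None):
    """Blend M fake images of equal height and width into one synthesized fake.

    fakes   : list of M images, each a list of rows of pixel values
    weights : one weight per fake; uniform weights are assumed if omitted
    """
    m = len(fakes)
    if weights is None:
        weights = [1.0 / m] * m  # assumption: equal contribution per generator
    if len(weights) != m:
        raise ValueError("need exactly one weight per fake")
    rows, cols = len(fakes[0]), len(fakes[0][0])
    return [[sum(w * f[r][c] for w, f in zip(weights, fakes))
             for c in range(cols)]
            for r in range(rows)]


# Two 1 x 2 fakes blended uniformly, then with explicit weights.
fake_a = [[0.0, 1.0]]
fake_b = [[1.0, 0.0]]
print(hybrid_fake([fake_a, fake_b]))
print(hybrid_fake([fake_a, fake_b], weights=[0.25, 0.75]))
```

Because the discriminator judges this single blended instance, one decision can serve all generators, which is the source of the computational saving in hybrid fake.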




* Synthesized Fake Instance is denoted as SFI; Synthesized Real Instance is denoted as SRI. The error_fake refers to the error of the Discriminator's judgement on the synthesized fake instance, rather than on the fake instance ym generated by Gm. This is consistent with the operations in updating the Discriminator, where the Discriminator makes decisions on the synthesized real and fake instances rather than on the fake instances produced by individual Generators. Additionally, it saves computing cost, since the decision can be reused in updating both Generators and Discriminators.

Algorithm 3 Updating Discriminators

    loop for m=1:M
        Obtain real instance:
            if use_hybrid_fake: real(m) = Weighted(…, m=1:M; …)
            else if use_cascade_learn: real(m) = the m-th real instance
        Generate judge of real instance: decision_real(m) = MakeDecision(real(m))
        Obtain evaluation of the judge: error_real(m) = Evaluate(decision_real(m), label_real)
        Learning from the judge: D->Backward(real(m), error_real(m))
        Obtain fake instance:
            if use_hybrid_fake: fake(m) = Weighted(y(m), m=1:M; …)
            else if use_cascade_learn: fake(m) = y(m)
        Generate judge of fake instance: decision_fake(m) = MakeDecision(fake(m))
        Obtain evaluation of the judge: error_fake(m) = Evaluate(decision_fake(m), label_fake)
        Learning from the judge: D->Backward(fake(m), error_fake(m))
        Obtain error of discriminator on the m-th instance: error_D(m) = (error_real(m) + error_fake(m)) / 2
        if use_hybrid_fake break loop for
    end loop for
    Obtain error of discriminator on all instances: error_D = mean(error_D(m), m=1:M)
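The error bookkeeping of Algorithm 3 can be illustrated as follows. The sketch assumes that Evaluate is a binary cross-entropy between a decision in (0, 1) and a 0/1 label, which is a common choice but is an assumption here; the names `bce` and `discriminator_error` are hypothetical.

```python
import math


def bce(decision, label):
    """Binary cross-entropy of one decision in (0, 1) against a 0/1 label."""
    return -(label * math.log(decision) + (1 - label) * math.log(1 - decision))


def discriminator_error(real_decisions, fake_decisions, use_hybrid_fake):
    """Mean of error_D(m) = (error_real(m) + error_fake(m)) / 2 over the loop.

    In hybrid fake mode the loop stops after the first iteration, because a
    single synthesized real/fake pair is judged.
    """
    errors = []
    for d_real, d_fake in zip(real_decisions, fake_decisions):
        error_real = bce(d_real, 1)   # real instances carry label 1
        error_fake = bce(d_fake, 0)   # fake instances carry label 0
        errors.append((error_real + error_fake) / 2)
        if use_hybrid_fake:
            break
    return sum(errors) / len(errors)


# Cascading learning: two generators, undecided discriminator.
print(discriminator_error([0.5, 0.5], [0.5, 0.5], use_hybrid_fake=False))
# Hybrid fake: only the first (synthesized) pair is evaluated.
print(discriminator_error([0.9, 0.1], [0.1, 0.9], use_hybrid_fake=True))
```

The second call shows the effect of the early break: the poorly judged second pair never contributes to the error in hybrid fake mode.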

* Synthesized Fake Instance is denoted as SFI; Synthesized Real Instance is denoted as SRI.

In hybrid fake, the hybrid fake is constructed for evaluating the loss of the Generators, while the Discriminator is updated once per batch of fakes. In cascading learning, however, the Discriminator is adjusted against every generator, on the fake that is to be evaluated for the corresponding Generator. If the batch size is set to 1 (learning one target image at a time), M fakes rather than one hybrid fake are evaluated successively. The next step is the Discriminator update, which follows the same procedure in hybrid fake and cascading learning. For each fake, the Discriminator decides whether it is a real image or a synthesized image produced by a Generator. The errors of incorrect decisions are back-propagated for self-adjustment.

Table 3 Overall performance of IQA indices over 7 datasets*

| IQA Index | SROCC | KROCC | PLCC |
|---|---|---|---|
| PSNR | 0.6874 | 0.5161 | 0.7020 |
| UQI | 0.7137 | 0.5398 | 0.7602 |
| SSIM | 0.8430 | 0.6593 | 0.8407 |
| VIF | 0.8423 | 0.6827 | 0.8728 |

*Data are from [31]. A high SROCC, KROCC, or PLCC indicates a high correlation between human judgements and algorithm decisions.
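For readers unfamiliar with the correlation indices reported in Table 3, the following sketch shows how a rank-order correlation (SROCC, i.e. Spearman) differs from a linear correlation (PLCC, i.e. Pearson) when an IQA score is compared with human ratings. The scores below are made-up toy values, not data from [31], and the helper names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson linear correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


human = [1, 2, 3, 4, 5]        # toy subjective ratings
metric = [1, 4, 9, 16, 25]     # toy metric that is monotonic but not linear
print(spearman(human, metric), pearson(human, metric))
```

A metric that orders images exactly as humans do reaches a rank correlation of 1 even when its scale is non-linear, whereas the linear correlation stays below 1.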

6.2 Experimental evaluations

Evaluation of computer-generated images is an open and difficult problem [4, 21]. It is worth mentioning that the objective of an image-to-image task is not to provide an exact raw prediction for each pixel [9, 32, 37], but to produce a plausible candidate that convinces an observer that it is natural and realistic. In other words, these computer-generated images should be subjectively perceived as created by humans rather than by computers. Thus, traditional evaluation metrics such as pixel-level MSE or AUC are not applicable, because they are unable to capture the structured losses of images [9]. Instead, qualitative comparison [7, 17] and statistics of subjective judgements [4, 9, 32, 37] are preferred. The smaller the loss of image y from the ground truth image ŷ, or the fewer the misclassifications of fake-real pairs in a perceptual test, the closer the result is to this objective [9, 32, 37].

Fig. 5 The edges-to-shoe translation using cGAN and One-to-Many cGAN. First row is examples of images generated by cGAN, including an edge image followed by output and ground truth. The last two rows are examples of images generated by One-to-Many cGAN, including the input edge images (first column), followed by output 1, shoe 1 (ground truth 1), output 2, and shoe 2 (ground truth 2)


Fig. 6 Examples of images generated by MAD-GAN

6.2.1 Qualitative comparison with cGAN

The meta-network of the One-to-Many cGAN is a cGAN, so we expect the images predicted by the complex model to be better than, or at least comparable to, those of a cGAN. In addition, we expect novel images that have recognizable styles but are dissimilar to the training images. We compared the images produced by our model with those generated by a cGAN in one-to-one fashion, in order to observe the differences between a complex model and a meta-model.

6.2.2 Image quality assessment (IQA)

Apart from subjective assessments, quantitative measures such as Peak Signal to Noise Ratio (PSNR), Universal Quality Index (UQI) [35], Structural Similarity Index (SSIM) [36], and Visual Information Fidelity (VIF) [22] are used [26] to gain insight into the differences between synthesized images and ground truth images. Quantitative measures are not meant to evaluate the loss from the algorithm's estimate to the target; rather, they function as an indicator of structural completeness and cognitive similarity to the target (this is the basis on which synthesized images can be compared with target images). According to a comprehensive evaluation of several image quality assessment algorithms [31], SSIM and VIF yield comparably higher overall performance over 7 datasets (Table 3). Considering that the structural information remains unchanged in the grey-to-color translation, we select VIF as the representative assessment measure. The source code of the VIF algorithm implementation is at http://sse.tongji.edu.cn/linzhang/IQA/Evalution_VIF/eva-VIF.htm.

Table 4 VIF comparison with Pix2pix and PAN

| Model | VIF |
|---|---|
| Pix2pix | 0.2268* |
| PAN | 0.2393* |
| One-to-Many cGAN | 0.2262/0.0865 |

*Data are from [26]

Table 5 Overall VIF on two types of translations (translation 1/translation 2)

| Datasets | Training scheme |  |  |  | Overall |
|---|---|---|---|---|---|
| MVOD5K | Hybrid Fake | 0.0320/0.0122 | 0.0334/0.0127 | 0.0336/0.0128 | 0.0330/0.0126 |
| MVOD5K | Cascading learning | 0.0476/0.0210 | 0.0478/0.0209 | 0.0462/0.0215 | 0.0472/0.0211 |
| BehTex7K | Hybrid Fake | 0.4263/0.1657 | 0.5176/0.1861 | 0.4080/0.1555 | 0.4506/0.1691 |
| BehTex7K | Cascading learning | 0.3934/0.1714 | 0.4883/0.1893 | 0.3764/0.1607 | 0.4194/0.1738 |
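Of the quantitative measures named in Sect. 6.2.2, PSNR is the simplest to state explicitly. The sketch below computes it from the mean squared error for small grayscale images held as nested lists; it is only a reference illustration under the assumption of 8-bit pixel values (peak 255), and it is not the VIF implementation used for the reported results.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak Signal to Noise Ratio in dB between two equally sized images.

    reference, estimate : lists of rows of pixel values
    max_value           : peak pixel value (255 assumed for 8-bit images)
    """
    squared_diffs = [(r - e) ** 2
                     for row_r, row_e in zip(reference, estimate)
                     for r, e in zip(row_r, row_e)]
    mse = sum(squared_diffs) / len(squared_diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth = [[10, 20], [30, 40]]
output = [[12, 18], [30, 40]]
print(psnr(ground_truth, output))
print(psnr(ground_truth, ground_truth))
```

PSNR depends only on pixel-wise error, which is why it serves here as a complement to, rather than a replacement for, the structural and perceptual measures.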

6.2.3 Analysis between the MAD-GAN and one-to-many cGAN

The MAD-GAN is also constructed for capturing diverse data distributions, by adding a diversity-enforcing term that encourages differences between instances. Therefore, instances produced by different generators are sufficiently different. However, this diversity enforcement may lead to poor generalization, because each agent is forced to focus on particular instances, for example green, blue, or black ones. Our method can not only capture the target distribution but also produce more diversified results.

7 Results

7.1 UT-Zappos50K

The aim of the experiment on UT-Zappos50K is to examine edges-to-shoe translation [9, 26]. The primary task follows the edges-to-photo translation in [9]. To further challenge our model, we introduce an edges-to-similar-photo translation as a secondary task. The results are shown in Figs. 5 and 6. We compare our results with pix2pix [9] and PAN [26]; the comparison is shown in Table 4.

Fig. 7 Examples of images generated by One-to-Many cGAN with the cascading learning training scheme on the UTZap50K dataset. The rows of images are edges of shoe (input), dark style of shoe (output 1), and color style of shoe (output 2)


Fig. 8 Examples of images generated by One-to-Many cGAN with the cascading learning training scheme on the BehTex7K dataset. The rows of images are grayscale (input), color (output 1), and oil paint style (output 2) images

7.2 MVOD5K

We adopt the MVOD5K dataset to examine the view-estimation ability of our framework in the first translation, and its view-generalization ability in the second translation. As shown in Fig. 9 (hybrid fake) and Fig. 10 (cascading learning), both implementations learn logical views of objects. Color-space information is preserved during translation, but some local color distributions are lost, e.g. logos and shoelaces. In addition, the hybrid fake implementation provides a fine estimation of the cross-view contour, even when training data is deficient (only 216 input-outputs training pairs). As shown in Fig. 10, the cascading learning implementation can generate views whose outlines are close to the corresponding target view, but distortion may occur. Common failures are incomplete contours. For hybrid fake, the overall VIF is 0.0330 in the first translation and 0.0126 in the second translation. For cascading learning, the VIF is 0.0472 and 0.0211, respectively. The results are far worse than those of the colorization and style transfer tasks (on BehTex7K) but acceptable. Notice that the VIF is a strict measure of the output image: the VIF of Semantic labels to Cityscapes images is 0.06581, and that of Aerial photos to Maps images is 0.1617 [26].
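The overall VIF values quoted above can be related to the individual MVOD5K entries of Table 5 with a short calculation. The sketch assumes that the overall value is the arithmetic mean of the three preceding entries in each row; this reading is an inference from the numbers, and the helper name `overall_vif` is hypothetical.

```python
# MVOD5K entries of Table 5 as (translation 1, translation 2) pairs.
mvod5k_vif = {
    "hybrid fake": [(0.0320, 0.0122), (0.0334, 0.0127), (0.0336, 0.0128)],
    "cascading learning": [(0.0476, 0.0210), (0.0478, 0.0209), (0.0462, 0.0215)],
}


def overall_vif(pairs):
    """Mean VIF per translation, rounded to four decimals as in Table 5."""
    n = len(pairs)
    return tuple(round(sum(p[i] for p in pairs) / n, 4) for i in range(2))


for scheme, pairs in mvod5k_vif.items():
    print(scheme, overall_vif(pairs))
```

Under this assumption the means reproduce the quoted overall values, 0.0330/0.0126 for hybrid fake and 0.0472/0.0211 for cascading learning.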

Fig. 9 Examples of images generated by One-to-Many cGAN with hybrid fake training scheme on MVOD5K dataset. Input and ground truths are diverse views of same object


Fig. 10 Examples of images generated by One-to-Many cGAN with cascading learning training scheme on MVOD5K dataset. Input and ground truths are diverse views of same object

7.3 BehTex7K

The aim of constructing the BehTex7K dataset is to examine the pixel-level prediction ability of our framework. In the first translation, we explore color prediction ability; in the second translation, both color prediction and texture transfer abilities are tested. Similarly, our framework produces a reasonable colorization of a grey image. As Figs. 11 and 12 show, both implementations of our model are able to generate local color distributions. Besides, a different colorization may be produced: new colorizations follow the original distribution scheme but may render with diversified colors. As for the second translation, the results show that both implementations of our framework can learn colorization and texture transfer simultaneously. The VIF on BehTex7K is approximately 50% (Table 5). The results imply that our framework is able to perform grey-to-color and grey-to-Oil-Paint translations. A greater diversity of outputs is observed in the cascading learning implementation (Fig. 12) than in the hybrid fake implementation (Fig. 11). This is because we keep the generators independent of each other during cascading learning training. This diversity appears as distinct outputs, as well as scattered colorization and extensive rendering in each output.

Fig. 11 Examples of images generated by One-to-Many cGAN with hybrid fake training scheme on BehTex7K dataset. The input is grayscale images (first column), followed by output 1, color images (ground truth 1), output 2, and oil paint style (ground truth 2)

Fig. 12 Examples of images generated by One-to-Many cGAN with cascading learning training scheme on BehTex7K dataset. The input is grayscale images (first column), followed by output 1, color images (ground truth 1), output 2, and oil paint style (ground truth 2)

8 Discussion

We provide two implementations of synthesized adversarial training. Both hybrid fake and cascading learning can capture the target distribution of images. The advantage of cascading learning is that the individual distribution of the data is preserved, since fake instances are learned and evaluated separately. The results show that our model can generate two distinct styles of images. Compared with MAD-GAN (Fig. 6), our method can generate much more diversified instances within the corresponding distribution (Figs. 7 and 8).

Besides, most research has considered learning a one-to-many mapping as a problem of prediction or reasoning. In some cases, however, the mapping is a work of analogy or creation. One example is image labeling, where multiple labels are eligible to describe an image. Automatic tagging works by constructing a corpus of text, called a dictionary, that is assumed to include all possible words that may be used as tags [18]. In crowdsourced labeling, however, novel words or expressions may be created outside the scope of the pre-constructed dictionary. Our model is, to some extent, able to learn such human creativity, generating brand-new but recognizable subjects.

However, the results on MVOD5K are not as good as those on the other two datasets. One reason may be that cross-view translation is intrinsically challenging. Similarly poor results were found in photo-to-labels (best VIF = 0.1638) and labels-to-photo translation (best VIF = 0.06581) [26]. Another possible cause is that the VIF metric measures visual information such as pixel color rather than structural information. Third, only 221 pairs are available for edges-to-shoe translation in MVOD5K. In comparison, UTZap50K has 24,190 image lists and BehTex7K has 2418 lists.

9 Conclusions

We presented a framework, One-to-Many cGAN, for generating multiple translated images using multiple generators and a synthesized adversarial training scheme. We proposed two implementations of synthesized adversarial learning: hybrid fake and cascading learning. We tested our framework on edges-to-photo (original photo and similar photo), cross-view, and grey-to-color (photorealistic and Oil Paint style) translation tasks. The results show that in edges-to-photo and grey-to-color translations, both the hybrid fake and cascading learning schemes can generate high-quality images in multiple target styles. A deficiency of training instances contributes substantially to the decay of framework performance. In cross-view translation, blur and distortion may appear in the outputs, particularly when using cascading learning. Common failures of cross-view translation are incomplete contours when translating the 2D projection of an object (i.e. its shape) from one viewpoint to another. Although capturing multiple data distributions is challenging, the Visual Information Fidelity (VIF) of the output images generated by the One-to-Many cGAN on the UT-Zappos50K and BehTex7K datasets is comparable to that of a One-to-One translation model (e.g. pix2pix with cGAN). Besides, in comparison with the MAD-GAN, the One-to-Many cGAN is able to generate more diversified instances within the target distribution.

10 Future work

In the future, one direction is to challenge the One-to-Many GAN(s) framework on other 'smart' tasks, i.e. design tasks. Besides, the one-to-many characteristics of our model are inherently compatible with multimodal and multi-view representations. Although the results of the One-to-Many cGAN learning from one view of an object to another are not as good as we expected, the possibility of our framework learning cross-view images cannot be excluded. Thus, another direction is to experiment with other generative models as the meta-model of the One-to-Many GAN(s) framework, for instance Disco-GAN, which can learn cross-domain relations [11]. The cascading learning approach may also support the use of heterogeneous generators in the framework, so that the One-to-Many framework may be applicable to additional multimedia information, e.g. text [19].

Acknowledgements This paper is supported by the National Natural Science Foundation of China (61303137), the National Science and Technology Support Program (2015BAH21F01) and the Art Project for National Social-Science Foundation (15BG084). We thank Dr. Preben Hansen from Stockholm University, Department of Computer Science, for assistance in proofreading and technical editing of the manuscript.

Appendix

Derivations and Proofs

Derivation of the objective of a set of generators

To optimize over multiple generators, Ghosh et al. [7] modified the objective function of the discriminator so that, along with finding the fakes, the discriminator has to find the generator that produced a given fake:

$$
\max_{D} \; \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D_{k+1}(x) \right] + \sum_{i=1}^{k} \mathbb{E}_{x_i \sim p_{g_i}(x)} \left[ \log D_i(x_i) \right] \tag{15}
$$

For a fixed generator, the objective is to minimize:

$$
\mathbb{E}_{x \sim p_d} \log D_{k+1}(x) + \sum_{i=1}^{k} \mathbb{E}_{x \sim p_{g_i}} \log \left( 1 - D_{k+1}(x) \right) \tag{16}
$$

For a set of generators, the objective is:

$$
\min_{G_1, G_2, \ldots, G_k} \; \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D_{k+1}(x) \right] + \sum_{i=1}^{k} \mathbb{E}_{x_i \sim p_{g_i}(x)} \left[ \log \left( 1 - D_{k+1}(x) \right) \right] \tag{17}
$$

where k is the number of generators (denoted as M in this context), $p_d$ is the true data distribution (denoted as $p_{data}$), and $p_{g_i}$ is the distribution learned by the i-th generator (denoted as $p_{G_m}$). Introducing the conditioned variable into (17), and replacing the notation with that used in this paper, the objective of a set of conditional generators is:

$$
\min_{G_1, G_2, \ldots, G_M} \; \mathbb{E}_{x, y \sim p_{data}(x, y)} \log D_{M+1}(x, y) + \sum_{m=1}^{M} \mathbb{E}_{x, y \sim p_{G_m}(x, y)} \log \left( 1 - D_{M+1}(x, y) \right) \tag{18}
$$

Likewise, the objective of the conditional discriminator is:

$$
\max_{D} \; \mathbb{E}_{x, y \sim p_{data}(x, y)} \log D_{M+1}(x, y) + \sum_{m=1}^{M} \mathbb{E}_{x, y \sim p_{G_m}(x, y)} \log D_m(x, y) \tag{19}
$$

We add an L1 regularization term to reduce blur in images [9], which yields the final objectives (12), (13) and (14). Note that in the hybrid fake implementation, a hybrid instance I is used by the discriminator rather than an individual instance x.

Proofs

Ghosh et al. [7] provided detailed propositions and theorems about the objective of training a set of generators and a discriminator for an unconditional GAN. The proofs for the One-to-Many cGAN are inspired by these propositions and theorems. We introduce the conditioned variable into the optimal distribution learned by the unconditional discriminator [7], and propose a general form of the optimal distribution learned by a conditional discriminator:

$$
D_m(x, y) = \frac{p_{G_m}(x, y)}{p_{data}(x, y) + \sum_{m'=1}^{M} p_{G_{m'}}(x, y)}, \quad \forall m \in \{1, 2, \ldots, M+1\} \tag{20}
$$

Note that we set $p_{G_{M+1}} := p_{data}$ to avoid clutter [7]. Then, replacing $D_m$ and $D_{M+1}$ in (18) using (20) yields

$$
\mathbb{E}_{x, y \sim p_{data}(x, y)} \log \left[ \frac{p_{data}(x, y)}{p_{data}(x, y) + \sum_{m'=1}^{M} p_{G_{m'}}(x, y)} \right] + \sum_{m=1}^{M} \mathbb{E}_{x, y \sim p_{G_m}(x, y)} \left[ \log \left( 1 - \frac{p_{G_{M+1}}(x, y)}{p_{data}(x, y) + \sum_{m'=1}^{M} p_{G_{m'}}(x, y)} \right) \right] \tag{21}
$$

With $\sum_{m=1}^{M+1} D_m = 1$,

$$
\begin{aligned}
& \mathbb{E}_{x, y \sim p_{data}(x, y)} \log \left[ \frac{p_{data}(x, y)}{p_{data}(x, y) + \sum_{m'=1}^{M} p_{G_{m'}}(x, y)} \right] + \sum_{m=1}^{M} \mathbb{E}_{x, y \sim p_{G_m}(x, y)} \left[ \log \left( \frac{\sum_{m'=1}^{M} p_{G_{m'}}(x, y)}{p_{data}(x, y) + \sum_{m'=1}^{M} p_{G_{m'}}(x, y)} \right) \right] \\
& := \mathbb{E}_{x, y \sim p_{data}(x, y)} \left[ \log \frac{p_{data}(x, y)}{p_{avg}(x, y)} \right] + M \, \mathbb{E}_{x, y \sim p_G(x, y)} \left[ \log \frac{p_G(x, y)}{p_{avg}(x, y)} \right] - (M+1) \log (M+1) + M \log M
\end{aligned} \tag{22}
$$

where $p_G = \frac{\sum_{m=1}^{M} p_{G_m}(x, y)}{M}$, $p_{avg}(x, y) = \frac{p_{data}(x, y) + \sum_{m=1}^{M} p_{G_m}(x, y)}{M+1}$, and $\sup_D (p_G) = \bigcup_{m=1}^{M} \sup_D p_{G_m}$.

The final term (22) obtains its minimum $-(M+1) \log (M+1) + M \log M$ when $p_{data} = \frac{\sum_{m=1}^{M} p_{G_m}(x, y)}{M}$ [7]. When the number of generators M is equal to 1, the One-to-Many cGAN obtains the minimum value of −log 4 of the Jensen-Shannon divergence based objective function in the original GAN [8].
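The stated minimum can be checked numerically on a small discrete example. The sketch below evaluates expression (21) with the optimal discriminator plugged in, on a finite support where the joint distributions are given as probability lists; the function names and the toy distributions are hypothetical and serve only as a sanity check of the closed form.

```python
import math


def objective_at_optimal_d(p_data, p_gens):
    """Value of (21) on a finite support, given p_data and the M lists p_Gm."""
    total = 0.0
    for i, pd in enumerate(p_data):
        gen_sum = sum(pg[i] for pg in p_gens)
        mix = pd + gen_sum
        if pd > 0:
            # Real term: E_{p_data} log( p_data / (p_data + sum_m p_Gm) )
            total += pd * math.log(pd / mix)
        for pg in p_gens:
            if pg[i] > 0:
                # Generator terms: E_{p_Gm} log( sum_m p_Gm / (p_data + sum_m p_Gm) )
                total += pg[i] * math.log(gen_sum / mix)
    return total


def stated_minimum(m):
    """Closed form -(M + 1) log(M + 1) + M log M."""
    return -(m + 1) * math.log(m + 1) + m * math.log(m)


p_data = [0.2, 0.3, 0.5]
matched = objective_at_optimal_d(p_data, [p_data, p_data])
mismatched = objective_at_optimal_d(p_data, [[0.5, 0.3, 0.2], [0.5, 0.3, 0.2]])
print(matched, stated_minimum(2), mismatched)
print(stated_minimum(1), -math.log(4))
```

When the average of the generator distributions equals the data distribution, the value coincides with the closed form; a mismatched pair of generators gives a strictly larger value, and for M = 1 the closed form equals −log 4.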


The convergence of $p_{G_m}$ can be shown by computing the gradient descent update at the optimal D given the corresponding $G_m$. Each $\sup_D (p_{G_m}, D)$ is convex in $p_{G_m}$ with a unique global optimal value, as proven in [7]. Therefore, with sufficiently small updates of $p_{G_m}$, $p_{G_m}$ converges to the corresponding $p_{data}(x_m)$.

Architecture of generator and discriminator

We denote by C(k) a Convolution-BatchNorm-ReLU layer with k filters, and by CD(k) a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 50%. All ReLUs in the discriminator and in the encoder of the generator are leaky, with slope 0.2. The ReLUs in the decoder are not leaky. The generator is a modified encoder-decoder architecture called U-Net [9]:

Encoder: C(64)-C(128)-C(256)-C(512)-C(512)-C(512)-C(512)-C(512)

Decoder: CD(512)-CD(1024)-CD(1024)-CD(1024)-CD(1024)-C(512)-C(256)-C(128)

The discriminator is a 70 × 70 Markovian discriminator (PatchGAN) [9]: C(64)-C(128)-C(256)-C(512). BatchNorm is not applied to the first layer C(64).
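Two numbers in this architecture, the eight encoder layers for a 256 × 256 input and the 70 × 70 patch size, follow from simple arithmetic, sketched below. The sketch assumes that each stride-2 convolution halves the spatial size exactly. It also assumes that the PatchGAN applies stride 2 in its first three convolutions and stride 1 in C(512) and in a final one-channel output convolution, as in the public pix2pix design of [9]; this stride pattern is an assumption and is not stated in this paper.

```python
def encoder_sizes(input_size=256, layers=8, stride=2):
    """Spatial size after each stride-2 encoder layer (exact halving assumed)."""
    sizes = [input_size]
    for _ in range(layers):
        sizes.append(sizes[-1] // stride)
    return sizes


def receptive_field(layers):
    """Receptive field of stacked convolutions.

    layers : list of (kernel, stride) pairs from the first to the last layer
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed PatchGAN layout: C(64), C(128), C(256) with stride 2, then C(512)
# and a one-channel output convolution with stride 1, all with 4 x 4 kernels.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

print(encoder_sizes())
print(receptive_field(PATCHGAN_LAYERS))
```

Eight halvings reduce 256 to 1, which is why the encoder has eight layers, and under the assumed stride pattern each discriminator output looks at a 70 × 70 region of the input.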

References

1. Cai B, Xu X, Jia K, Qing C, Tao D (2016) DehazeNet: an end-to-end system for single image haze removal. IEEE Trans Image Process 25(11):5187–5198
2. Çalışır F, Baştan M, Ulusoy Ö, Güdükbay U (2017) Mobile multi-view object image search. Multimedia Tools & Applications 76(10):12433–12456
3. Chen M, Denoyer L (2016) Multi-view Generative Adversarial Networks. arXiv eprint arXiv:1611.02019
4. Elgammal A, Liu B, Elhoseiny M, Mazzone M (2017) CAN: Creative Adversarial Networks, Generating "Art" by Learning About Styles and Deviating from Style Norms. arXiv eprint arXiv:1706.07068
5. Gao Z, Zhang H, Xu GP, Xue YB, Hauptmann AG (2015) Multi-view discriminative and structured dictionary learning with group sparsity for human action recognition. Signal Process 112:83–97
6. Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2414–2423
7. Ghosh A, Kulharia V, Namboodiri V, Torr PHS, Dokania PK (2017) Multi-Agent Diverse Generative Adversarial Networks. arXiv eprint arXiv:1606.07536
8. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: International Conference on Neural Information Processing Systems, pp 2672–2680
9. Isola P, Zhu JY, Zhou TH, Efros AA (2016) Image-to-Image Translation with Conditional Adversarial Networks. arXiv eprint arXiv:1611.07004
10. Jacob VG, Gupta S (2009) Colorization of grayscale images and videos using a semiautomatic approach. In: 2009 16th IEEE International Conference on Image Processing, pp 1653–1656. doi:10.1109/ICIP.2009.5413392
11. Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. arXiv eprint arXiv:1703.05192
12. Kwak H, Zhang BT (2016) Ways of Conditioning Generative Adversarial Networks. arXiv eprint arXiv:1611.01455
13. Liu MY, Tuzel O (2016) Coupled generative adversarial networks. arXiv preprint arXiv:
14. Liu A-A, Su Y-T, Jia P-P, Gao Z, Hao T, Yang Z-X (2015) Multipe/single-view human action recognition via part-induced multitask structural learning. IEEE Transactions on Cybernetics 45(6):1194–1208
15. Liu Y, Qin Z, Luo Z, Wang H (2017) Auto-painter: Cartoon Image Generation from Sketch by Using Conditional Generative Adversarial Networks. arXiv eprint arXiv:1705.01908
16. Liu Z et al (2017) Multiview and multimodal pervasive indoor localization. ACM on Multimedia Conference ACM: 109–117
17. Luan F, Paris S, Bala K (2017) Deep Photo Style Transfer. arXiv eprint arXiv:1703.07511
18. Mirza M, Osindero S (2014) Conditional generative adversarial nets. Computer Science 2672–2680
19. Nie L, Wang M, Zha Z, et al (2011) Multimedia answering: enriching text QA with media information: 695–704
20. Perarnau G, Weijer JVD, Raducanu B, Álvarez JM (2016) Invertible Conditional GANs for image editing. In: Conference and Workshop on Neural Information Processing Systems 2016. arXiv eprint arXiv:1611.06355
21. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved Techniques for Training GANs. arXiv eprint arXiv:1606.03498
22. Sheikh HR, Bovik AC (2006) Image information and visual quality. IEEE Trans Image Process 15(2):430–444. https://doi.org/10.1109/TIP.2005.859378
23. Vedran V, Raymond C, Gravier G (2017) Generative adversarial networks for multimodal representation learning in video hyperlinking. In: ACM on International Conference on Multimedia Retrieval, pp 416–419
24. Wang X, Gupta A (2016) Generative Image Modeling Using Style and Structure Adversarial Networks. arXiv eprint arXiv:1603.05631
25. Wang Y, Zhang L, Weijer JVD (2016) Ensembles of Generative Adversarial Networks. arXiv eprint arXiv:1612.00991
26. Wang C, Xu C, Tao D (2017) Perceptual Adversarial Networks for Image-to-Image Transformation. arXiv eprint arXiv:1706.09138
27. Xie S, Tu Z (2017) Holistically-nested edge detection. Int J Comput Vis 125:3–18
28. Yang Y, Ma Z, Hauptmann AG, Sebe N (2013) Feature selection for multimedia analysis by sharing information among multiple tasks. IEEE Transactions on Multimedia 15(3):661–669
29. Yi Z, Zhang H, Tan P, Gong M (2017) DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. arXiv eprint arXiv:1704.02510
30. Yu A, Grauman K (2014) Fine-grained visual comparisons with local learning. In: Computer Vision and Pattern Recognition, pp 192–199
31. Zhang L, Zhang L, Mou X, Zhang D (2012) A comprehensive evaluation of full reference image quality assessment algorithms. In: 2012 19th IEEE International Conference on Image Processing, pp 1477–1480. doi:10.1109/ICIP.2012.6467150
32. Zhang R, Isola P, Efros AA (2016) Colorful Image Colorization. arXiv eprint arXiv:1603.08511
33. Zhang H et al (2016) Online collaborative learning for open-vocabulary visual classifiers. IEEE Computer Vision and Pattern Recognition: 2809–2817
34. Zhang H, Sindagi V, Patel VM (2017) Image De-raining Using a Conditional Generative Adversarial Network. arXiv eprint arXiv:1701.05957
35. Zhou W, Bovik AC (2002) A universal image quality index. IEEE Signal Processing Letters 9(3):81–84. https://doi.org/10.1109/97.995823
36. Zhou W, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612. https://doi.org/10.1109/TIP.2003.819861
37. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv eprint arXiv:1703.10593


Chunlei Chai is an associate professor at computer science and technology college, Zhejiang University. He is interested in human-machine interaction and intelligent design.

Jing Liao is a Ph.D. candidate at computer science and technology college, Zhejiang University. She is interested in machine learning and human-system interaction.

Ning Zou is a lecturer at computer science and technology college, Zhejiang University. His research focus is on production design, ergonomics design and human-system interaction.


Lingyun Sun is a professor at Zhejiang University. He is the Deputy Director of the International Design Institute, Ng Teng Fong Chaired Professor and the director of the ZJU-SUTD Innovation, Design and Entrepreneurship Alliance, and a deputy director of the ZJU-Beidou Joint Innovation Design Engineering Center. His research interests include Design Intelligence, Innovation and Design, and Information and Interaction Design. He is a PI on research grants funded by the National Natural Science Foundation and the National Basic Research Program of China. He has more than 30 publications in peer-reviewed journals and conferences, including Design Studies and Science China. He is also the inventor of more than 20 granted patents as well as the winner of 5 iF and Red Dot awards.