Spectral Normalization for Generative Adversarial Networks


Takeru Miyato 1 2 Toshiki Kataoka 1 Masanori Koyama 3 Yuichi Yoshida 4

Abstract

Instability in training is a major challenge in the study of generative adversarial networks. This paper proposes a novel weight normalization called spectral normalization to stabilize the training of the discriminator. Our method is computationally light and easy to incorporate into existing implementations. We experimentally confirmed that our method is capable of generating images with better quality than many previous training stabilization techniques.

1. Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a framework for training a generative model that mimics a given target distribution. A GAN consists of a generator that produces the model distribution and a discriminator that distinguishes the model distribution from the target. With the goal of reducing the difference between the model distribution and the target distribution, as measured by the best possible discriminator at each step of the training, GANs train the generator and the discriminator sequentially in turn. A persisting challenge in the training of GANs is the performance control of the discriminator. In high dimensional spaces, the density ratio estimation by the discriminator is often inaccurate and unstable during training, and generator networks fail to learn the multi-modal structure of the target distribution. Even worse, when the support of the model distribution and the support of the target distribution are disjoint, the training may arrive at a discriminator that can perfectly distinguish the model distribution from the target (Arjovsky & Bottou, 2017), bringing the training of the generator to a complete halt. This calls for some form of restriction on the choice of the discriminator.

1 Preferred Networks, Inc., Japan  2 ATR Cognitive Mechanism Laboratories, Japan  3 Mathematics, Ritsumeikan University  4 National Institute of Informatics. Correspondence to: Takeru Miyato.

ICML 2017 Workshop on Implicit Models. Copyright 2017 by the author(s).

In this paper we propose a novel weight normalization method called spectral normalization that can stabilize the training of discriminator networks. Our normalization is not only hyperparameter free, but also easy to incorporate into existing training implementations at low computational cost. In the following sections, we give a brief explanation of the reasoning behind the effectiveness of spectral normalization for GANs and compare its performance against other regularization techniques, such as weight normalization (Salimans & Kingma, 2016) and gradient penalty (Gulrajani et al., 2017). We also show that, in the absence of complementary regularization techniques (e.g., batch normalization, weight decay and feature matching on the discriminator), spectral normalization can improve the quality of the generated images more than weight normalization and gradient penalty.

2. Method

Let us consider a simple discriminator network of the following form:

h_l := a_l(W^l h_{l−1} + b^l),  l ∈ {1, ..., L},   (1)

where W^l ∈ R^{d_l × d_{l−1}}, b^l ∈ R^{d_l}, h_l ∈ R^{d_l}, h_0(x) = x, and a_l is an element-wise non-linear activation function. We will denote the final layer by f(x) = h_L(x) in this paper, so that the final output of the discriminator is given by D(x) = A(f(x)), where A is an activation function corresponding to the divergence or distance measure of the user's choice. The standard objective of GANs is to find min_G max_D V(G, D), where the min and max over G and D are taken over the sets of generator and discriminator functions, respectively. The conventional form of V(G, D) (Goodfellow et al., 2014) is given by

E_{x∼q_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))],   (2)

where q_data is the data distribution, z ∈ R^{d_z} is a latent variable, p(z) is the standard normal distribution N(0, I), and G : R^{d_z} → R^{d_0} is a deterministic generator function to be learned through the adversarial min-max optimization. The machine learning community has recently been pointing out that the function space from which the discriminators are selected crucially affects the performance of GANs.


A number of works (Uehara et al., 2016; Qi, 2017; Gulrajani et al., 2017) advocate the importance of introducing a Lipschitz constraint to assure the boundedness of the density ratio estimated by the discriminator. Particularly successful endeavors in this direction are Qi (2017) and Gulrajani et al. (2017), which proposed methods to control the Lipschitz constant of the discriminator by adding regularization terms that depend on x. We follow in their footsteps and search for the discriminator D in the set of K-Lipschitz continuous functions. We use ∥f∥_Lip to denote the Lipschitz constant of f, i.e., the smallest value M such that ∥f(x) − f(x′)∥_2 / ∥x − x′∥_2 ≤ M for any x, x′.

2.1. Spectral Normalization

Our spectral normalization controls the Lipschitz constant of the discriminator function f by literally constraining the spectral norm of each layer g : R^{d_in} → R^{d_out}. Let us denote by σ(A) the L2 operator norm of the matrix A, given by

σ(A) := max_{h: h ≠ 0} ∥Ah∥_2 / ∥h∥_2 = max_{∥h∥_2 ≤ 1} ∥Ah∥_2,   (3)

which is equivalent to the largest singular value of A. With this notation, we may write ∥g∥_Lip = sup_h σ((∇g)(h)). If the Lipschitz norm of each activation function ∥a_l∥_Lip is equal to 1,¹ we can appeal to the inequality ∥g_1 ∘ g_2∥_Lip ≤ ∥g_1∥_Lip · ∥g_2∥_Lip to claim that ∥f∥_Lip is bounded by

∥f∥_Lip ≤ ∏_{l=1}^{L} ∥(h_{l−1} ↦ W^l h_{l−1} + b^l)∥_Lip = ∏_{l=1}^{L} σ(W^l).   (4)

Our spectral normalization normalizes the spectral norm of each weight matrix W^l by applying the transformation

W̄_SN(W) := W / σ(W),   (5)

so that the Lipschitz constraint σ(W̄_SN) = 1 is satisfied at each layer. It is clear that this normalization makes ∥f∥_Lip ≤ 1.

¹ For example, ReLU (Jarrett et al., 2009; Nair & Hinton, 2010; Glorot et al., 2011) and leaky ReLU (Maas et al., 2013) satisfy this condition, and many popular activation functions satisfy a K-Lipschitz constraint for some predefined K as well.

2.2. Gradient Analysis of the Cost Function with Spectrally Normalized Weights

The gradient of W̄_SN(W) with respect to W_ij is

∂W̄_SN(W) / ∂W_ij = (1/σ(W)) (E_ij − [u_1 v_1^T]_ij W̄_SN),   (6)

where E_ij is the matrix whose (i, j)-th entry is 1 and zero everywhere else, and u_1 and v_1 are respectively the first left and right singular vectors of W. If h is the hidden node in the network to be transformed by W̄_SN, the derivative of V(G, D) calculated over the mini-batch with respect to W of the discriminator D is given by

∂V(G, D) / ∂W = (1/σ(W)) ( Ê[δ h^T] − λ u_1 v_1^T ),   (7)

where Ê[·] represents the empirical expectation over the mini-batch, δ := (∂V(G, D) / ∂(W̄_SN h))^T, and λ := Ê[δ^T (W̄_SN h)].

We would like to comment on the implication of (7). The first term Ê[δ h^T] is the same as the derivative of the weights without normalization. In this light, the second term can be seen as a penalty on the first singular components with the adaptive regularization coefficient λ. λ is positive when δ and W̄_SN h point in similar directions, and this prevents the columns of W from concentrating in one direction in the course of the training. In other words, spectral normalization prevents the transformation of each layer from becoming overly sensitive in one direction.

2.3. Reparametrization Motivated by Spectral Normalization

Originally, regularization techniques of this form were invented as part of a reparametrization technique (Salimans & Kingma, 2016). For example, one may benefit from the advantage of the spectral normalization we saw above in the reparametrization

W̃ := γ W̄_SN,   (8)

where γ is a scalar variable to be learned. This reparametrization compromises the 1-Lipschitz constraint on W, but gives more freedom to the model while keeping it from becoming degenerate. For this reparametrization, we need to control the Lipschitz condition by other means, such as the gradient penalty (Gulrajani et al., 2017). The family of reparametrizations of this type uses an otherwise-normalized W in place of W̄_SN, such as the weight-normalized W (Salimans & Kingma, 2016).

2.4. Fast Approximation of the Spectral Norm

We would do great harm to the computational runtime if we naively applied singular value decomposition at every round of the algorithm in order to compute σ(W). Instead, we can use power iteration to estimate σ(W) (Yoshida & Miyato, 2017). See the appendix for details.
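For illustration (a minimal NumPy sketch of our own, not part of the original paper), the following code computes σ(W) exactly via SVD — the naive approach that Section 2.4 avoids during training — applies the normalization (5) to each layer, and checks that the product bound (4) collapses to 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_norm(W):
    # sigma(W): the largest singular value of W, i.e. its L2 operator norm (Eq. (3))
    return np.linalg.svd(W, compute_uv=False)[0]

# a toy three-layer linear "discriminator"; with 1-Lipschitz activations such as
# ReLU in between, Eq. (4) bounds its Lipschitz constant by the product below
weights = [rng.normal(size=(256, 128)),
           rng.normal(size=(64, 256)),
           rng.normal(size=(1, 64))]

lipschitz_bound = np.prod([spectral_norm(W) for W in weights])   # Eq. (4)

# Eq. (5): dividing each W by sigma(W) pins every layer's spectral norm to 1,
# so the same product bound becomes 1
normalized = [W / spectral_norm(W) for W in weights]
assert all(np.isclose(spectral_norm(Wn), 1.0) for Wn in normalized)
```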

3. Difference from Related Approaches

The major advantage of spectral normalization is the sheer size of the search space it leaves for the parameter matrix W.


In order to control the regularity of the discriminator, well-known contemporary methods normalize the sup norm or the L2 norm of the row vectors, or the Frobenius norm of W. Mathematically, these approaches place a constraint on the sum of the squared singular values of W, which inadvertently imposes a trade-off between the maximum singular value and the rank. This can force the trained discriminator to depend largely on a select few features, so that the algorithm can match the model distribution to the target distribution only on a very low dimensional subspace. Spectral normalization, on the other hand, is free of such a drawback, because one can allow multiple feature dimensions to have high singular values without compromising the maximum singular value. Please see the appendix for a detailed comparison of these regularization techniques in this light.

4. Experiments: Image Generation

We conducted image generation experiments with different discriminator regularization techniques to assess the performance of the methods. The tested techniques include spectral normalization, weight normalization (Salimans & Kingma, 2016), WGAN-GP (Gulrajani et al., 2017), as well as batch normalization (Ioffe & Szegedy, 2015) and layer normalization (Ba et al., 2016). In order to prevent the methods from overtly violating the Lipschitz condition, we excluded the multiplier parameter γ from all models. In fact, when we tested the methods with and without the multiplier parameter γ, the methods implemented without γ yielded better results. As for the architectures of the discriminator and generator, we used convolutional neural networks. For the evaluation of the spectral norm of a convolutional weight W ∈ R^{d_out × d_in × h × w}, we treated the operator as a two-dimensional matrix of dimension d_out × (d_in · h · w).² We trained the parameters of the generator with batch normalization (Ioffe & Szegedy, 2015). We refer the reader to Table 4 in the appendix for more details of the architectures.

For the training of GANs, we used the benchmark datasets CIFAR-10 (Torralba et al., 2008) and STL-10 (Coates et al., 2011). For all methods other than WGAN-GP, we used the standard objective function for the adversarial loss defined in (2). We set d_z to 128 for all of our experiments. For the updates of G in all methods other than WGAN-GP, we used the alternate cost proposed by Goodfellow et al. (2014), −E_{z∼p(z)}[log(D(G(z)))], as used in Goodfellow et al. (2014); Warde-Farley & Bengio (2017). For the updates of D, we used the original cost defined in (2).

² Note that, since we are conducting the convolution discretely, the spectral norm will depend on the size of the stride and padding. However, the answer will only differ by some predefined constant K.
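As an illustration of this reshaping (a NumPy sketch of our own; it ignores the stride and padding effects noted in the footnote), the spectral norm of a convolutional kernel can be estimated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# convolutional kernel with d_out = 128, d_in = 64, h = w = 3
W_conv = rng.normal(size=(128, 64, 3, 3))

# flatten the kernel into a d_out x (d_in * h * w) matrix, as described above,
# and use the largest singular value of that matrix as sigma(W)
W_mat = W_conv.reshape(W_conv.shape[0], -1)          # shape (128, 576)
sigma = np.linalg.svd(W_mat, compute_uv=False)[0]
W_conv_sn = W_conv / sigma                           # spectrally normalized kernel
```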

For Wasserstein GANs with gradient penalty (WGAN-GP) (Gulrajani et al., 2017), we used the following objective function:

E_{x∼q_data}[D(x)] − E_{z∼p(z)}[D(G(z))] − λ E_{x̂∼p_x̂}[(∥∇_x̂ D(x̂)∥_2 − 1)²],   (9)

with the last term functioning as a penalty on the gradient. In order to evaluate the stand-alone efficacy of the gradient penalty, we also applied the penalty term of (9) to the standard adversarial loss of GANs (2). We refer to this method as 'GAN-GP'. For each method with gradient penalty, we set λ to 10, as suggested in Gulrajani et al. (2017).

For quantitative assessment of the generated examples, we used the inception score introduced by Salimans et al. (2016):

I({x_n}_{n=1}^N) := exp(E[D_KL[p(y|x) ∥ p(y)]]),

where p(y) is approximated by (1/N) Σ_{n=1}^N p(y|x_n). In their work, Salimans et al. (2016) reported that this score is strongly correlated with subjective human judgment of image quality. Following the procedure in Salimans et al. (2016); Warde-Farley & Bengio (2017), we generated a sample set of size 5000 ten times from the trained generator and report the average and the standard deviation of the inception scores. We also tested the methods with multiple hyper-parameter settings, including the Adam learning parameters and the number of discriminator updates per generator update (Table 3 in the appendix).
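For reference, the inception score above can be computed as follows (a NumPy sketch of our own; the function name and the numerical epsilon are our choices, and the class posteriors are assumed to come from a pretrained classifier such as the Inception network):

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    # p_yx: array of shape (N, num_classes) holding the class posteriors p(y|x_n)
    # produced by a pretrained classifier for N generated images
    p_y = p_yx.mean(axis=0, keepdims=True)            # p(y) ~ (1/N) sum_n p(y|x_n)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                   # exp(E_x[ KL(p(y|x) || p(y)) ])
```

In the protocol described above, this function would be applied to ten independently generated sets of 5000 samples, and the mean and standard deviation of the resulting scores reported.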

5. Results

In Table 1 we show the inception scores of the three methods (WGAN-GP, weight normalization, and spectral normalization) with optimal settings against other contemporary methods on the CIFAR-10 and STL-10 datasets. Note that our spectral normalization is superior to DCGAN (Radford et al., 2016) and LR-GAN (Yang et al., 2017) on CIFAR-10, and loses only by a slight margin to the model of Warde-Farley & Bengio (2017), which uses sophisticated feature matching with an auxiliary denoising autoencoder. Our method achieves competitive performance with Warde-Farley & Bengio (2017) on STL-10. In Figures 4 and 6 in the appendix, we show the images produced by the generators trained with WGAN-GP, weight normalization, and spectral normalization. The images generated with GAN-GP, batch normalization and layer normalization are shown in Figure 5 in the appendix. The set of images generated by spectral normalization was clearer and more diverse than the images produced by weight normalization. We can also see that WGAN-GP failed to train good GANs with high learning rates and high momentums (D, E and F).

On the contrary, our method was able to achieve high inception scores on both datasets (Figures 3a and 3b in the appendix). Our method is also competitive in terms of computation time, because the computational cost of power iteration is negligible relative to the cost of the forward and backward propagations (Figure 7 in the appendix).

Table 1: Inception scores with unsupervised image generation on CIFAR-10 and STL-10. † (Warde-Farley & Bengio, 2017), ‡ (Yang et al., 2017), ⋆ (Radford et al., 2016) (experimented by Yang et al. (2017)), ∗ (Dumoulin et al., 2017).

Method                 CIFAR-10     STL-10
(Real data)            11.24±.12    26.08±.26
Warde-Farley et al.†    7.72±.13     8.51±.13
LR-GAN‡                 7.17±.07     -
DCGAN⋆                  6.64±.14     -
ALI∗                    5.34±.05     -
Spectral Norm.          7.42±.08     8.69±.09
Weight Norm.            6.84±.07     7.16±.10
WGAN-GP                 6.68±.06     8.42±.13

Figure 1: Ordered squared singular values of the weight matrices trained with different methods, based on the parameter set that yielded the best inception score: Weight Normalization (WN) and Spectral Normalization (SN). Panels (a) and (b) show convolutional layers 1-7 of the discriminators trained on CIFAR-10 and STL-10, respectively; each plot shows the squared singular values s² against their index. We scaled the singular values so that the largest singular value is equal to 1. For WN and SN, we calculated the singular values of the normalized weight matrices.

5.1. Singular Values Analysis on the Weights of the Trained Discriminator D

In Figure 1, we show the ordered squared singular values of the weight matrix of each layer in the trained discriminator D. When trained with weight normalization, the singular values of the weight matrices in the lower layers tend to concentrate on a few components. On the other hand, a great many singular values of W in these layers take high values when trained with spectral normalization. Outputs of the lower layers have gone through only a few sets of rectified linear transformations. Prematurely marginalizing out many features of the input distribution at these layers can result in an oversimplified discriminator, and hence a generator with less diverse output.

5.2. Comparison of Reparametrization with Different Normalization Methods

We also compared the performance of our normalization against the contemporary methods as part of the reparametrization technique of the form (8). For the architectures used in this set of experiments, please see the appendix. Table 2 summarizes the results on CIFAR-10. We see that the implicit regularization of our method that prevents oversensitivity significantly improves the inception score over the baseline on the regular CNN, and slightly improves the score on the ResNet-based CNN. We can also confirm that our method improves the generalization performance of the discriminator (Figure 8 in the appendix).

Table 2: Inception scores on CIFAR-10 with different reparametrization methods. († experimented by Gulrajani et al. (2017))

Method               Inception score
Standard CNN:
  WGAN-GP with SN    7.20±.08
  WGAN-GP with WN    6.36±.04
  WGAN-GP            6.68±.06
ResNet:
  WGAN-GP†           7.86±.08
  WGAN-GP with SN    7.83±.06
  WGAN-GP            7.71±.12

6. Conclusion

This paper proposed spectral normalization, which stabilizes the training of GANs. When we apply our method to GANs on image generation tasks, the generated examples are more diverse than with conventional weight normalization, and our method achieves better or comparable inception scores relative to previous studies. We shall also note that our method imposes a global regularization on the discriminator, as opposed to the local regularization introduced by WGAN-GP, and the combination of these philosophically complementary methods in fact proved to be effective. In future work, we would like to further investigate our method on a more theoretical basis, and to test our algorithm on larger and more complex datasets, as well as on implicit models (Mohamed & Lakshminarayanan, 2017; Tran et al., 2017).


Acknowledgments We would like to thank the members of Preferred Networks, Inc., particularly Shin-ichi Maeda, Eiichi Matsumoto, Masaki Watanabe and Keisuke Yahata for insightful comments and discussions.

References

Arjovsky, Martin and Bottou, Léon. Towards principled methods for training generative adversarial networks. In ICLR, 2017.

Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. In ICML, 2017.

Arpit, Devansh, Zhou, Yingbo, Kota, Bhargava U, and Govindaraju, Venu. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431, 2016.

Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Coates, Adam, Lee, Honglak, and Ng, Andrew Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.

Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Lamb, Alex, Arjovsky, Martin, Mastropietro, Olivier, and Courville, Aaron. Adversarially learned inference. In ICLR, 2017.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In AISTATS, 2011.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.

Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, and Courville, Aaron. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In CVPR, 2016.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In ICCV, 2009.

Jia, Kui. Improving training of deep neural networks via singular value bounding. arXiv preprint arXiv:1611.06013, 2016.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.

Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew Y. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.

Mohamed, Shakir and Lakshminarayanan, Balaji. Learning in implicit generative models. Adversarial Training Workshop on NIPS, 2017.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Qi, Guo-Jun. Loss-sensitive generative adversarial networks on Lipschitz densities. arXiv preprint arXiv:1701.06264, 2017.

Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.

Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In NIPS, 2016.

Torralba, Antonio, Fergus, Rob, and Freeman, William T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958-1970, 2008.

Tran, Dustin, Ranganath, Rajesh, and Blei, David M. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.

Uehara, Masatoshi, Sato, Issei, Suzuki, Masahiro, Nakayama, Kotaro, and Matsuo, Yutaka. Generative adversarial nets from a density ratio estimation perspective. Adversarial Training Workshop on NIPS, 2016.

Warde-Farley, David and Bengio, Yoshua. Improving generative adversarial networks with denoising feature matching. In ICLR, 2017.

Xiang, Sitao and Li, Hao. On the effect of batch normalization and weight normalization in generative adversarial networks. arXiv preprint arXiv:1704.03971, 2017.

Yang, Jianwei, Kannan, Anitha, Batra, Dhruv, and Parikh, Devi. LR-GAN: Layered recursive generative adversarial networks for image generation. arXiv preprint arXiv:1703.01560, 2017.

Yoshida, Yuichi and Miyato, Takeru. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.


A. Fast Approximation of the Spectral Norm σ(W)

As we mentioned above, the spectral norm σ(W) by which we regularize each layer of the discriminator is the largest singular value of W. As such, we would do great harm to the computational runtime if we naively applied singular value decomposition at every round of the algorithm. Instead, we can use the power iteration method to estimate σ(W) (Yoshida & Miyato, 2017).

Let us describe this shortcut in more detail. We begin with vectors ũ and ṽ that are randomly initialized. If there is no multiplicity in the dominant singular values and if ũ and ṽ are not orthogonal to the first singular vectors, then ũ and ṽ converge to the first left and right singular vectors u and v through the following update rule:

ṽ ← W^T ũ / ∥W^T ũ∥_2,  ũ ← W ṽ / ∥W ṽ∥_2.   (10)

We can then approximate the spectral norm of W with the pair of so-approximated singular vectors:

σ(W) ≈ ũ^T W ṽ.   (11)

If we use SGD for updating W, the change in W at each update is small, and hence so is the change in its largest singular value. In our implementation, we took advantage of this fact and reused the ũ computed at each step of the algorithm as the initial vector in the subsequent step. In fact, with this 'recycling' procedure, one round of power iteration was sufficient in our experiments to achieve satisfactory performance. Algorithm 1 summarizes the computation of the spectrally normalized weight matrix W̄ with this approximation. Note that this procedure is computationally very cheap even in comparison to the calculation of the forward and backward propagations of the neural networks. Please see Figure 7 in the experiment section for actual computational time with and without spectral normalization.

Algorithm 1: SGD with spectral normalization

• Initialize ũ_l ∈ R^{d_l} for l = 1, ..., L with random vectors (sampled from an isotropic distribution).
• For each update and each layer l:
  1. Apply the power iteration method to the unnormalized weight W^l:
       ṽ_l ← (W^l)^T ũ_l / ∥(W^l)^T ũ_l∥_2,
       ũ_l ← W^l ṽ_l / ∥W^l ṽ_l∥_2.
  2. Calculate W̄_SN^l with the spectral norm:
       W̄_SN^l(W^l) = W^l / σ(W^l), where σ(W^l) = ũ_l^T W^l ṽ_l.
  3. Update W^l with SGD on the mini-batch dataset D_M with learning rate α:
       W^l ← W^l − α ∇_{W^l} ℓ(W̄_SN^l(W^l), D_M).
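For illustration, the following is a minimal PyTorch sketch of the same procedure (our own, not the authors' original implementation); the class name SNLinear and the weight initialization scale are our choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SNLinear(nn.Module):
    """Linear layer whose weight is divided by a power-iteration estimate of its
    spectral norm; one iteration per forward pass, reusing the previous u-tilde."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # persistent estimate of the first left singular vector (Algorithm 1, step 1)
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x):
        w = self.weight
        with torch.no_grad():                        # power iteration is not backpropagated through
            v = F.normalize(w.t() @ self.u, dim=0)   # v <- W^T u / ||W^T u||_2
            u = F.normalize(w @ v, dim=0)            # u <- W v / ||W v||_2
            self.u.copy_(u)                          # 'recycle' u for the next update
        sigma = torch.dot(u, w @ v)                  # sigma(W) ~ u^T W v  (Eq. (11))
        return F.linear(x, w / sigma, self.bias)     # step 2: use W / sigma(W) in the layer
```

Step 3 of Algorithm 1 then reduces to an ordinary optimizer step on a loss computed with such layers; PyTorch's built-in torch.nn.utils.spectral_norm applies essentially the same power-iteration normalization.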

B. Related Approaches

This section is dedicated to a comparative study of spectral normalization and other regularization methods for discriminators. In particular, we will show that contemporary regularizations, including weight normalization and weight clipping, implicitly impose constraints on the weight matrices that place an unnecessary restriction on the search space of the discriminator. More specifically, we will show that weight normalization and weight clipping unwittingly favor low-rank weight matrices. This can force the trained discriminator to depend largely on a select few features, so that the algorithm can match the model distribution with the target distribution only on a very low dimensional feature space.


B.1. Weight Normalization and Frobenius Normalization

The weight normalization introduced by Salimans & Kingma (2016) is a method that normalizes the ℓ2 norm of each row vector of the weight matrix:³

W̄_WN := [w̄_1^T, w̄_2^T, ..., w̄_{d_o}^T]^T, where w̄_i(w_i) := w_i / ∥w_i∥_2,   (12)

where w̄_i and w_i are the i-th row vectors of W̄_WN and W, respectively.

Still another technique to regularize the weight matrix is to use the Frobenius norm:

W̄_FN := W / ∥W∥_F,   (13)

where ∥W∥_F := √(tr(W^T W)) = √(Σ_{i,j} w_{ij}²).

Originally, these regularization techniques were invented with the goal of improving the generalization performance of supervised training (Salimans & Kingma, 2016; Arpit et al., 2016). However, recent works in the field of GANs (Salimans et al., 2016; Xiang & Li, 2017) found another raison d'être for them as regularizers of discriminators, and succeeded in improving the performance of the original methods. These methods can in fact render the trained discriminator D K-Lipschitz for some prescribed K and achieve the desired effect to a certain extent.

However, weight normalization (12) imposes the following implicit restriction on the choice of W̄_WN:

σ_1(W̄_WN)² + σ_2(W̄_WN)² + ··· + σ_T(W̄_WN)² = d_o, where T = min(d_i, d_o),   (14)

where σ_t(A) is the t-th singular value of the matrix A. The above equation holds because Σ_{t=1}^{min(d_i,d_o)} σ_t(W̄_WN)² = tr(W̄_WN W̄_WN^T) = Σ_{i=1}^{d_o} (w_i w_i^T)/(∥w_i∥_2 ∥w_i∥_2) = d_o.

Under this restriction, the norm ∥W̄_WN h∥_2 for a fixed unit vector h is maximized at ∥W̄_WN h∥_2 = √d_o when σ_1(W̄_WN) = √d_o and σ_t(W̄_WN) = 0 for t = 2, ..., T, which means that W̄_WN is of rank one. Using such a W corresponds to using only one feature to discriminate the model probability distribution from the target.

Similarly, Frobenius normalization requires σ_1(W̄_FN)² + σ_2(W̄_FN)² + ··· + σ_T(W̄_FN)² = 1, and the same argument as above follows.

Here, we see a critical problem in these two regularization methods. In order to retain as much of the norm of the input as possible, and hence to make the discriminator more sensitive, one would hope to make the norm of W̄_WN h large. For weight normalization, however, this comes at the cost of reducing the rank, and hence the number of features to be used by the discriminator. Thus, there is a conflict of interest between weight normalization and our desire to use as many features as possible to distinguish the generator distribution from the target distribution. The former interest often reigns over the latter, inadvertently diminishing the number of features used by the discriminator. Consequently, the algorithm would produce a rather arbitrary model distribution that matches the target distribution only at a select few features.

Our spectral normalization, on the other hand, does not suffer from such a conflict of interest. Note that the Lipschitz constant of a linear operator is determined only by its maximum singular value; that is, the spectral norm is independent of the rank. Thus, unlike weight normalization, spectral normalization allows the parameter matrix to use as many features as possible while satisfying the local 1-Lipschitz constraint. Spectral normalization leaves intact the freedom to choose the number of singular components (features) to feed to the next layer of the discriminator. To see this more visually, we refer the reader to Figure 2; note that spectral normalization allows for a wider range of choices than weight normalization.

³ In the original literature, weight normalization was introduced as a method of reparametrization of the form (8), given by W̄_WN := [γ_1 w̄_1^T, γ_2 w̄_2^T, ..., γ_{d_o} w̄_{d_o}^T]^T, where γ_i ∈ R is to be learned in the course of the training. In this work, we deal with the case γ_i = 1 so that we can assess the methods under the Lipschitz constraint.
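The following NumPy snippet (our illustration) makes this contrast numerical: row normalization pins the sum of the squared singular values to d_o, so a large σ_1 must starve the remaining components, whereas spectral normalization pins only the largest singular value:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))

# weight normalization: each row rescaled to unit l2 norm, so the squared
# singular values of W_WN must sum to d_o = 64 (Eq. (14))
W_wn = W / np.linalg.norm(W, axis=1, keepdims=True)
s_wn = np.linalg.svd(W_wn, compute_uv=False)
print(np.sum(s_wn ** 2))           # ~ 64.0: a large sigma_1 comes at the cost of the rest

# spectral normalization: only the largest singular value is pinned to 1,
# while the rest of the spectrum is left untouched
s_sn = np.linalg.svd(W / np.linalg.svd(W, compute_uv=False)[0], compute_uv=False)
print(s_sn[0], s_sn.min())         # sigma_1 = 1, smaller singular values remain free
```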


Figure 2: The possible sets of fifty singular values plotted in increasing order for weight normalization (blue) and for spectral normalization (red). For the sets of singular values permitted under the spectral normalization condition, we scaled W̄_WN by 1/√d_o so that its spectral norm is exactly 1. By the definition of weight normalization, the areas under the blue curves are all bound to be 1. Note that the range of choice for weight normalization is small.

In summary, weight normalization and Frobenius normalization favor skewed distributions of singular values, making the column spaces of the weight matrices lie in (approximately) low dimensional vector spaces. On the other hand, our spectral normalization does not compromise the number of feature dimensions used by the discriminator. In fact, we will experimentally show that GANs trained with our spectral normalization can generate a synthetic dataset with a wider variety and higher inception score than GANs trained with the other two regularization methods.

B.2. Weight Clipping

Still another regularization technique is weight clipping, introduced by Arjovsky et al. (2017) in their training of Wasserstein GANs. Weight clipping simply truncates each element of the weight matrices so that its absolute value is bounded above by a prescribed constant c ∈ R_+. Unfortunately, weight clipping suffers from the same problem as weight normalization and Frobenius normalization. With weight clipping with truncation value c, the value ∥Wx∥_2 for a fixed unit vector x is maximized when the rank of W is again one, and the training will again favor discriminators that use only a select few features. Gulrajani et al. (2017) refer to this problem as the capacity underuse problem. They also reported that the training of WGAN with weight clipping is slower than that of the original DCGAN (Radford et al., 2016).

B.3. Singular Value Clipping and Singular Value Constraint

One direct and straightforward way of controlling the spectral norm is to clip (bound) the singular values (Jia, 2016). This approach, however, is computationally heavy because one needs to implement singular value decomposition in order to compute all the singular values. A similar but less obvious approach is to parametrize W ∈ R^{d_o × d_i} as follows from the get-go and train the discriminator with this constrained parametrization:

W := U S V^T,   (15)
subject to U^T U = I, V^T V = I, max_i S_ii = K,   (16)

where U ∈ R^{d_o × P}, V ∈ R^{d_i × P}, and S ∈ R^{P × P} is a diagonal matrix. However, it is not a simple task to train this model while remaining absolutely faithful to this parametrization constraint. Our spectral normalization, on the other hand, can carry out the updates with relatively low computational cost without compromising the normalization constraint.

B.4. WGAN with Gradient Penalty (WGAN-GP)

Recently, Gulrajani et al. (2017) introduced a technique to enhance the stability of the training of Wasserstein GANs (Arjovsky et al., 2017). In their work, they endeavored to place a K-Lipschitz constraint on the discriminator by augmenting the adversarial loss function with the following regularizer:

λ E_{x̂∼p_x̂}[(∥∇_x̂ D(x̂)∥_2 − 1)²],   (17)

where λ > 0 is a balancing coefficient and x̂ is

x̂ := ϵx + (1 − ϵ)x̃,   (18)
where ϵ ∼ U[0, 1], x ∼ p_data, x̃ = G(z), z ∼ p_z.   (19)

Using this augmented objective function, Gulrajani et al. (2017) succeeded in training a GAN based on ResNet (He et al., 2016) with impressive performance. The advantage of their method in comparison to spectral normalization is that they can impose a local 1-Lipschitz constraint directly on the discriminator function without a rather roundabout layer-wise normalization. This suggests that their method is less likely to underuse the capacity of the network structure. At the same time, this type of method, which penalizes the gradients at sample points x̂, suffers from the obvious problem of not being able to regularize the function at points outside the support of the current generative distribution. In fact, the generative distribution and its support gradually change in the course of the training, and this can destabilize the effect of the regularization itself. On the contrary, our spectral normalization regularizes the function itself, and the effect of the regularization is more stable with respect to the choice of the batch. In fact, we observed in the experiments that a high learning rate can destabilize the performance of WGAN-GP, whereas training with our spectral normalization does not falter with aggressive learning rates. Moreover, WGAN-GP requires more computational cost than our spectral normalization with single-step power iteration, because the computation of ∥∇_x D∥_2 requires one whole round of forward and backward propagation. In the experiment section (Figure 7), we compare the computational cost of the two methods for the same number of updates. Having said that, one should not rule out the possibility that the gradient penalty can complement spectral normalization and vice versa. Because these two methods regularize the discriminator by completely different means, one may achieve better results by combining them.
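For concreteness, the interpolation (18)-(19) and the penalty (17) can be implemented as follows (a PyTorch sketch of our own; the function name and batch handling are illustrative choices, and the default λ = 10 follows the setting used in our experiments):

```python
import torch

def gradient_penalty(discriminator, x_real, x_fake, lam=10.0):
    # x_hat: random interpolates between real and generated samples (Eqs. (18)-(19))
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1.0 - eps) * x_fake).requires_grad_(True)
    d_hat = discriminator(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True)[0]
    # (||grad_x_hat D(x_hat)||_2 - 1)^2, averaged over the batch (Eq. (17))
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```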

C. Details of Experiments and Results For optimization, we used the Adam optimizer (Kingma & Ba, 2015) in all of our experiments. We tested with 6 settings for (1) the number of discriminator updates per one generator update (ndis (2) the hyper-parameters on Adam (the learning rate α, first-order momentum β1 , and second-order momentum β2 ). We list the details of these settings in Table 3. Out of these 6 settings, A, B, and C are the settings used in previous representative works. The purpose of the settings D, E, and F is the evaluation of the algorithms implemented with more aggressive learning rates. We train GANs with 100K generator updates for all experiments without Table 3: Hyper-parameter settings we tested in our experiments. †, ‡ and ⋆ are the hyperparameter settings following (Gulrajani et al., 2017), (Warde-Farley & Bengio, 2017) and (Radford et al., 2016), respectively. α

β1

β2

ndis

0.0001 0.0001 0.0002 0.001 0.001 0.001

0.5 0.5 0.5 0.5 0.5 0.9

0.9 0.999 0.999 0.9 0.999 0.999

5 1 1 5 5 5

Setting A† B‡ C⋆ D E F

C.1. Quality Assessment on Generated Examples

In Figures 3a and 3b, we show the inception scores of each method with the settings A-F. We can see that spectral normalization is relatively robust to aggressive learning rates and momentum parameters. WGAN-GP fails to train good GANs at high learning rates and high momentum parameters on both CIFAR-10 and STL-10. Weight normalization is more robust than WGAN-GP on CIFAR-10 in this aspect. The optimal performance of weight normalization was inferior to both WGAN-GP and spectral normalization on STL-10, which consists of more diverse examples than CIFAR-10. The best scores of spectral normalization are better than those of all other methods on both CIFAR-10 and STL-10.

Figure 3: Inception scores on CIFAR-10 and STL-10 with different methods and hyperparameters. (a) CIFAR-10; (b) STL-10. Each panel compares GAN-GP, WGAN-GP, batch normalization (BN), layer normalization (LN), weight normalization (WN), and spectral normalization (SN) under the settings A-F.

C.2. Generated Images With Different Methods

Figure 4: CIFAR-10 generated images with different methods (WGAN-GP, weight normalization, and spectral normalization), shown alongside dataset samples, under the hyper-parameter settings A-F.

Spectral Normalization for Generative Adversarial Networks

Figure 5: CIFAR-10 generated images with different methods (GAN-GP, layer normalization, and batch normalization), shown alongside dataset samples, under the hyper-parameter settings A-F.

Spectral Normalization for Generative Adversarial Networks

Figure 6: Images generated by models trained on STL-10 with different methods (WGAN-GP, weight normalization, and spectral normalization), shown alongside dataset samples, under the hyper-parameter settings A-F.

C.3. Training Time

In Figure 7, we show the computational time in seconds for 100 updates of the generator. As mentioned in Section B.4, WGAN-GP is slower than the other methods because it needs to calculate the gradient of the gradient norm ∥∇_x D∥_2. On CIFAR-10, spectral normalization is slightly slower than weight normalization, but significantly faster than WGAN-GP. For STL-10, the computational time of spectral normalization is almost the same as that of vanilla GANs, because the relative computational cost of the power iteration (10) is negligible compared to the cost of forward and backward propagation, which is larger for STL-10 owing to its larger image size (48 × 48).

Figure 7: Computational time (seconds for 100 generator updates) with n_dis = 5, comparing WGAN-GP, WN, SN, and vanilla GANs. (a) CIFAR-10 (image size: 32 × 32 × 3); (b) STL-10 (image size: 48 × 48 × 3).

C.4. Learning Curves with Different Reparametrization Methods

Figure 8: Learning curves of (a) critic loss and (b) inception score for different reparametrization methods on CIFAR-10: weight normalization (WGAN-GP w/ WN), spectral normalization (WGAN-GP w/ SN), and parametrization-free (WGAN-GP). The horizontal axis is the number of generator updates (up to 100,000); panel (a) shows the training and validation critic losses for each method.


D. Network Architectures

Table 4: CNN models for SVHN, CIFAR-10 and STL-10 used in our experiments on image generation. The slopes of all lReLU functions in the networks are set to 0.1.

(a) Generator, M_g = 4 for SVHN and CIFAR-10, and M_g = 6 for STL-10:
  z ∈ R^128 ∼ N(0, I)
  dense → M_g × M_g × 512
  4×4, stride=2 deconv. BN 256 ReLU
  4×4, stride=2 deconv. BN 128 ReLU
  4×4, stride=2 deconv. BN 64 ReLU
  3×3, stride=1 deconv. 3 Tanh

(b) Discriminator, M = 32 for SVHN and CIFAR-10, and M = 48 for STL-10:
  Input: RGB image x ∈ R^{M×M×3}
  3×3, stride=1 conv. 64 lReLU
  4×4, stride=2 conv. 64 lReLU
  3×3, stride=1 conv. 128 lReLU
  4×4, stride=2 conv. 128 lReLU
  3×3, stride=1 conv. 256 lReLU
  4×4, stride=2 conv. 256 lReLU
  3×3, stride=1 conv. 512 lReLU
  dense → 1
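As a final illustration, the discriminator of Table 4(b) with spectral normalization on every layer can be sketched in PyTorch as follows (our own sketch, not the authors' original implementation; torch.nn.utils.spectral_norm performs the power-iteration normalization of Appendix A, and the padding sizes are our assumption since the table does not list them):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def make_discriminator(M=32):
    # Discriminator of Table 4(b) with spectral normalization on every layer.
    # Padding is assumed: padding=1 keeps the spatial size for 3x3/stride-1 convs
    # and halves it for 4x4/stride-2 convs, so the final feature map is M/8 x M/8.
    def sn_conv(c_in, c_out, k, s):
        return spectral_norm(nn.Conv2d(c_in, c_out, k, stride=s, padding=1))

    channels = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256), (256, 512)]
    kernels  = [(3, 1), (4, 2), (3, 1), (4, 2), (3, 1), (4, 2), (3, 1)]
    layers = []
    for (c_in, c_out), (k, s) in zip(channels, kernels):
        layers += [sn_conv(c_in, c_out, k, s), nn.LeakyReLU(0.1)]
    layers += [nn.Flatten(), spectral_norm(nn.Linear((M // 8) ** 2 * 512, 1))]
    return nn.Sequential(*layers)
```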