Generative Adversarial Networks: Technology Review

By Tarun Bhatia (tbhatia30 AT gatech.edu)

Acknowledgement
I heartily thank Prof. Dr. Ling Liu for giving me the opportunity to write on this topic. News about GANs has always fascinated me, but I never had the time to explore the topic in detail. Writing about GANs for my technology review was a fantastic experience, and I hope readers will enjoy it just as much.

Introduction

There is a lot of hype around GANs being the next big thing in deep learning, so let's see what they are about. I have tried to keep this article simple and intuitive so that readers can easily understand it.

What is a GAN?

Generative adversarial networks (GANs) are a class of neural networks used in unsupervised machine learning [1]. As the name suggests, a GAN is composed of two neural networks: 1) a Generative network (which produces answers) and 2) an Adversarial network, also called the Discriminator (which distinguishes between real and generated answers). These networks are trained competitively: the task of the adversarial network is to penalise the generator network, so that the generative network learns to produce answers so close to real ones that the adversarial network cannot distinguish real from synthetic solutions, even with unlimited time and sufficient resources.

Why GAN?

Suppose you want to generate new cat images and you have a training set of cat images. The problem is to come up with a suitable loss function.

[Figure: a random input feeds the Generator, which produces a sample; the sample and the training data are compared by a loss function, which an optimizer uses to update the Generator.]

Generating a good loss function, so that the generated image is close to the training data, is a hard problem. If you design a loss that rewards an image for being close to the mean of your training dataset and penalises it for being far away, you end up with blurry images. So instead we use an adversarial network that is trained to predict whether an image comes from the training distribution or not. The Generator is optimised to maximize the mistakes the discriminator makes (its classification loss); it optimizes itself to produce answers that look so real that the discriminator is fooled into believing they are real. The Discriminator, in turn, is optimised to tell generated samples apart from training data. This is alternating optimization.

This is simple, right? If you have an optimal discriminator, one can show that the GAN objective minimizes the Jensen-Shannon divergence rather than the Kullback-Leibler divergence; had it optimized the KL divergence, we would have generated samples concentrated around the mean of the training distribution, with little variance (a short derivation sketch is given below). Another advantage of adversarial networks is that they can represent very sharp and even degenerate distributions, while methods based on Markov chains require the distribution to be somewhat blurry so that the chains can mix between modes. GANs were introduced in December 2014 by Ian Goodfellow [1] and his colleagues at the University of Montreal; the samples generated were blurry, but they were still good at the time.
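To make the divergence claim concrete, here is a minimal derivation sketch following the original paper [1]; the notation (p_data for the data distribution, p_g for the generator's distribution, C(G) for the generator's objective against an optimal discriminator) is standard and assumed here rather than defined in the article.

```latex
% For a fixed generator, the discriminator that maximizes the GAN
% objective is
\[
D^{*}(x) \;=\; \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{g}(x)} .
\]
% Plugging D^* back into the generator's objective C(G) gives
\[
C(G) \;=\; -\log 4 \;+\; 2\,\mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_{g}\right),
\]
% so with an optimal discriminator the generator is effectively
% minimizing the Jensen-Shannon divergence, which is minimized
% exactly when p_g = p_data.
```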

Disadvantages

Mode Collapse
GANs are hard to train. They are not stable, because of the alternating optimization and the Generator and Discriminator fighting against each other. The Generator can be over-optimised relative to the discriminator, leading to generator collapse: the Generator outputs only a handful of results, or even a single result, that is able to fool the discriminator. This is a failure, since we originally wanted to learn the underlying distribution of the data rather than output a few successfully synthesized results. This is known as mode collapse.

Predicting Pixels Based on Context

GANs are trained to predict all pixels of an image at once, so conditioning on one pixel and predicting its neighbouring pixels is difficult.

Difficulty in Reaching Convergence
The generator and discriminator losses keep oscillating and the network does not converge to an optimal solution.

Relative Strength of the Two Networks
When either of the two networks becomes extremely strong relative to the other, the network does not learn beyond a point.

Simple Mathematical Understanding of GANs
A neural network G(z, v₁) is used to model the Generator. It maps the input noise variables z to the desired data space x. A second neural network D(x, v₂) models the Discriminator and outputs the probability that the data came from the real dataset, in the range (0, 1); here vᵢ denotes the weights that define each network. The Discriminator is trained to correctly classify the input data as real or fake: its weights are updated to maximize the probability that any real input is classified as belonging to the real dataset, while minimizing the probability that any fake image is classified as real. In more technical terms, its loss maximizes D(x) and at the same time minimizes D(G(z)). The Generator, in turn, is trained to fool the Discriminator by generating data that is as realistic as possible, so its weights are optimized to maximize the probability that a fake image is classified as belonging to the real dataset; formally, its loss maximizes D(G(z)). In practice, the logarithm of the probability is used in the loss functions, since a log loss heavily penalises classifiers that are confident about an incorrect classification. Because the Discriminator and Generator optimize opposite objectives during training, they can be thought of as two agents playing a minimax game with value function V(G, D): the generator tries to maximize the probability that its outputs are recognized as real, while the discriminator tries to minimize this same value. The value function is written out below.
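Written out, the minimax game from the original formulation [1] is the following; the expectation notation (p_data for the real data distribution, p_z for the noise prior) follows that paper and is assumed here.

```latex
% GAN minimax objective: D maximizes V, G minimizes it.
\[
\min_{G} \max_{D} \; V(G, D)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
\]
```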

Training GAN

The fundamental steps to train a GAN can be described as follows (a minimal training-loop sketch in code follows the list):

1. Sample a noise data set and a real data set, each of size m.
2. Train the Discriminator on this data.
3. Sample a different noise data set of size m.
4. Train the Generator on this data.
5. Repeat from Step 1.
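The following is a minimal PyTorch-style sketch of that loop, not the exact code from any of the cited papers; the tiny MLP networks, the random stand-in "real" data, and all sizes are placeholders you would replace with a proper dataset and architecture.

```python
import torch
import torch.nn as nn

noise_dim, data_dim, batch_size = 16, 64, 32

# Placeholder generator and discriminator (tiny MLPs).
G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(),
                  nn.Linear(128, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    # Steps 1-2: sample noise and real data, train the Discriminator.
    real = torch.rand(batch_size, data_dim) * 2 - 1     # stand-in for real samples in [-1, 1]
    z = torch.randn(batch_size, noise_dim)
    fake = G(z).detach()                                # detach: only D is updated here
    d_loss = bce(D(real), torch.ones(batch_size, 1)) + \
             bce(D(fake), torch.zeros(batch_size, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Steps 3-4: sample fresh noise, train the Generator to fool D
    # (non-saturating loss: maximize log D(G(z))).
    z = torch.randn(batch_size, noise_dim)
    g_loss = bce(D(G(z)), torch.ones(batch_size, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    # Step 5: repeat.
```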

GAN practical hacks

Here are some practical tricks when working with GANs.

Normalizing images
The standard practice of normalizing images by subtracting the mean and scaling by the standard deviation applies here. Subtracting the dataset mean serves to center the data; ideally you also divide by the standard deviation of each feature or pixel, so that every feature value becomes a z-score. The reason we do both is that during backpropagation we would like each feature to have a similar range, so that the gradients don't go out of control. Another way to think about it: deep networks share a lot of parameters, and if the inputs are not scaled to similarly ranged feature values (e.g., by subtracting the dataset mean), this sharing doesn't work well, because a given weight would be too large for one part of the image and too small for another. Make sure that images are normalized to values between -1 and +1, and that the output of the generator is in the same range, which is why a tanh layer is added as the generator's last layer [4]. A small sketch of this preprocessing is given below.
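A minimal sketch of that preprocessing and output convention, assuming uint8 images and a NumPy/PyTorch setting; the function names are illustrative, not from any particular library.

```python
import numpy as np
import torch
import torch.nn as nn

def to_gan_range(images_uint8: np.ndarray) -> torch.Tensor:
    """Map uint8 images in [0, 255] to float tensors in [-1, 1]."""
    x = images_uint8.astype(np.float32) / 127.5 - 1.0
    return torch.from_numpy(x)

# The generator's last layer uses tanh so its outputs also live in [-1, 1].
generator_head = nn.Sequential(nn.Linear(128, 784), nn.Tanh())

def from_gan_range(x: torch.Tensor) -> np.ndarray:
    """Map generator outputs in [-1, 1] back to uint8 images."""
    return ((x.clamp(-1, 1) + 1.0) * 127.5).to(torch.uint8).numpy()
```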

A modified loss function
In the original GAN paper, the generator's objective is to minimize log(1 − D(G(z))), but in practice people maximize log D(G(z)) instead, because the first formulation has vanishing gradients early in training [1]. A sketch of the two variants is given below.
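A minimal comparison of the two generator losses, assuming the discriminator outputs probabilities via a sigmoid; this is illustrative code, not taken from the paper.

```python
import torch

def saturating_g_loss(d_on_fake: torch.Tensor) -> torch.Tensor:
    """Original minimax objective: minimize log(1 - D(G(z))).
    Gradients vanish early on, when D(G(z)) is close to 0."""
    return torch.log(1.0 - d_on_fake + 1e-8).mean()

def non_saturating_g_loss(d_on_fake: torch.Tensor) -> torch.Tensor:
    """Practical alternative: maximize log D(G(z)), i.e. minimize
    -log D(G(z)). Gives stronger gradients when D is winning."""
    return -torch.log(d_on_fake + 1e-8).mean()

d_on_fake = torch.tensor([0.01, 0.05, 0.10])  # D confidently rejects the fakes
print(saturating_g_loss(d_on_fake), non_saturating_g_loss(d_on_fake))
```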

Flip Labels

The label of the real images is flipped to fake (say, class 0) and that of the generated images to real (class 1). While training the generator, make its labels noisy; this is motivated by G wanting to fool D. Also, occasionally flip the labels while training the Discriminator, to prevent the discriminator from becoming too strong. A small sketch of both tricks is given below.
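A minimal illustration of noisy and occasionally flipped labels; the smoothing range and the 5% flip probability are arbitrary choices for this sketch, not values prescribed by the sources.

```python
import torch

def noisy_real_labels(batch_size: int) -> torch.Tensor:
    """Soft 'real' labels drawn from [0.9, 1.0] instead of a hard 1.0."""
    return 0.9 + 0.1 * torch.rand(batch_size, 1)

def maybe_flip(labels: torch.Tensor, flip_prob: float = 0.05) -> torch.Tensor:
    """Occasionally flip labels fed to the discriminator so it
    does not become too strong too quickly."""
    flip = (torch.rand_like(labels) < flip_prob).float()
    return (1.0 - labels) * flip + labels * (1.0 - flip)

labels = noisy_real_labels(8)
print(maybe_flip(labels))
```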






Use Batch Normalization
It is recommended to use batch normalization for both the discriminator and the generator, but what is usually skipped is how to use it: always use different batches for real and fake data, and never mix the two. Batches help the discriminator and generator become stronger [4].

Avoid Sparse Gradients
The stability of GANs suffers a lot when sparse gradients such as max-pooling or ReLU are used. Use soft gradients like LeakyReLU instead. When downsampling, use average pooling or strided convolution; when upsampling, use transposed convolution with stride or pixel shuffle.

DCGAN / Hybrid Models
Use a DCGAN when you can. If you can't use a DCGAN and no model is stable, use a hybrid model such as VAE + GAN [4]. A small sketch of a DCGAN-style discriminator following these tips is given below.

Stability Tricks from Reinforcement Learning
Experience replay can help with stability [4].

Optimizer
The Adam optimizer works well [4]. Two warning signs: if the discriminator loss quickly goes down to 0, it is not a good sign; and if the loss of the generator steadily decreases, it is likely fooling D with garbage.
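A minimal sketch of a discriminator that follows the tips above (LeakyReLU, strided convolutions instead of pooling, batch normalization, Adam); the layer sizes are arbitrary and the code is illustrative, not the DCGAN reference implementation.

```python
import torch
import torch.nn as nn

# Discriminator for 64x64 RGB images in [-1, 1]: strided convs for
# downsampling (no max-pooling), LeakyReLU (no plain ReLU), batch norm.
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),    # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 32x32 -> 16x16
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 16x16 -> 8x8
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Conv2d(256, 1, kernel_size=8),                        # 8x8 -> 1x1 score
    nn.Sigmoid(),
)

opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
print(discriminator(torch.randn(4, 3, 64, 64)).shape)  # torch.Size([4, 1, 1, 1])
```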

GAN Applications

Image generation

High resolution Image Generation

Grammatical Abstract Reasoning

Image in-painting

Semantic Segmentation

Text to image synthesis

Video Generation

Age Progression/Regression

Arithmetic on Faces

Pose Estimation

Style Transfer

DiscoGAN (transform a shoe into its equivalent bag)

Extensions

GANs have been extended in various ways to solve these issues and make them more useful.

Class-Conditional GANs
Apart from the noise vector, a one-hot class vector is added at training time. For each iteration, the generator takes as input not only z but also a one-hot encoded vector indicating the digit (or class). The discriminator's input then consists of not only the real or fake sample but also the same label vector. Proceeding the same way as before but with this slight change of inputs, the Conditional GAN (CGAN) learns to generate samples conditioned on the label it takes as input [15]. A minimal sketch of this conditioning is given below.
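A minimal sketch of the input concatenation that makes a GAN conditional; the dimensions and layer sizes are placeholders, not taken from [15].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

noise_dim, num_classes, data_dim = 16, 10, 784

# Generator sees [z ; one-hot(label)], discriminator sees [x ; one-hot(label)].
G = nn.Sequential(nn.Linear(noise_dim + num_classes, 128), nn.ReLU(),
                  nn.Linear(128, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim + num_classes, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())

z = torch.randn(4, noise_dim)
labels = F.one_hot(torch.tensor([0, 3, 3, 7]), num_classes).float()

fake = G(torch.cat([z, labels], dim=1))        # sample conditioned on the label
score = D(torch.cat([fake, labels], dim=1))    # D also receives the label
print(fake.shape, score.shape)                 # (4, 784), (4, 1)
```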

The CGAN is thus a supervised algorithm. The model can also be used to learn a multi-modal mapping, and the authors provide preliminary examples of an application to image tagging.

SRGAN
Ledig et al. [5] proposed a new feed-forward network as the generating function, trained with a perceptual loss that is a weighted combination of several components. In essence, the paper features a deep residual network capable of large upscaling factors with photo-realistic reconstructions of low-resolution images. These results were regarded as the state of the art for image super-resolution.

Several methods are used to introduce auxiliary, conditional information into a super-resolution model to produce images more tuned to the human eye.

Wasserstein GAN
The Wasserstein GAN (WGAN) is one of the most popular GANs. It consists of a change of objective which results in training stability, interpretability (the loss correlates with sample quality), and the ability to generate categorical data. The key goal is to approximate the true data distribution, and for that the choice of distance measure between distributions matters, since that is the objective to be minimized. The WGAN chooses the Wasserstein or Earth-Mover (EM) distance, because informally it can be interpreted as moving piles of dirt that follow one probability distribution, at minimum cost, so that they follow the other distribution; the cost is quantified by the amount of dirt moved times the moving distance. It can be shown that it converges for families of distributions for which the Kullback-Leibler and Jensen-Shannon divergences do not.

Wasserstein GAN on categorical data
The authors of the WGAN paper [6] show that a GAN trained this way exhibits training stability and interpretability, but only later was it shown that using the Wasserstein distance also gives the GAN the ability to generate categorical data (i.e., not continuous-valued data like images, and not even integer-coded data like 1 for Sunday, 2 for Monday and so on). Whereas an original GAN trained on this kind of data would see the discriminator's loss remain low throughout the iterations while the generator's would not stop increasing, training a WGAN on categorical data works the same way as on continuous-valued data. For reference, the distance being optimized is written out below.
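Here is the Earth-Mover distance together with the dual form that the WGAN critic approximates [6]; the notation (p_r for the real distribution, p_g for the generator's, f for a 1-Lipschitz critic) is standard and assumed here.

```latex
% Primal (earth-moving) definition and the Kantorovich-Rubinstein dual
% actually used by the WGAN, with f restricted to 1-Lipschitz functions
% (enforced by weight clipping in [6]).
\[
W(p_r, p_g)
\;=\; \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\!\left[\lVert x - y \rVert\right]
\;=\; \sup_{\lVert f \rVert_{L} \le 1} \; \mathbb{E}_{x \sim p_r}\!\left[f(x)\right] - \mathbb{E}_{x \sim p_g}\!\left[f(x)\right]
\]
```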

Sadly, the Wasserstein GAN is not perfect. It still suffers from unstable training, slow convergence after weight clipping (when the clipping window is too large), and vanishing gradients (when the clipping window is too small). A further improvement on the Wasserstein GAN is the Cramér GAN, discussed later.

Bidirectional GAN
The Bidirectional GAN (BiGAN) [11] is an attempt at solving an issue the WGAN shares with standard GANs: it does not give access to latent-space representations of the data. Finding these representations is useful not only for controlling what data to generate, by moving continuously in the latent space, but also for feature extraction. The BiGAN has an Encoder alongside the generator (decoder) which maps data to latent representations, and its discriminator discriminates not only in data space but also in latent space. It may not be obvious, but the BiGAN encoder E learns to invert the Generator, even though the two never communicate directly with each other.

InfoGAN
We saw before that CGANs can help us generate samples according to classes or labels; can we also infer these codes at the discriminator? This matters because it helps the machine understand the underlying factors in the input, which is also known as information disentanglement.

Such a GAN is known as the InfoGAN. Intuitively, the InfoGAN tries to maximize the mutual information between the generator's input code space and an inference network's output. The inference network can simply be an extra output layer on the discriminator network, sharing all the other parameters, which makes it essentially computationally free. Once trained, the InfoGAN's inference output layer can be used for feature extraction or, if the code space contains label information, for classification. A minimal sketch of this shared-trunk design is given below.
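A minimal sketch of a discriminator trunk with two heads, one real/fake head and one inference (Q) head over the code; the sizes are placeholders and this is illustrative, not the InfoGAN reference architecture.

```python
import torch
import torch.nn as nn

data_dim, code_dim = 784, 10

# Shared trunk: almost all parameters are reused by both heads.
trunk = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2))
d_head = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())  # real/fake probability
q_head = nn.Linear(256, code_dim)                        # logits over the latent code

x = torch.randn(4, data_dim)
h = trunk(x)
real_fake, code_logits = d_head(h), q_head(h)
print(real_fake.shape, code_logits.shape)                # (4, 1), (4, 10)
```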

Adversarial Autoencoder
The Adversarial Autoencoder (AAE) is where autoencoders meet GANs. First, what is an autoencoder? An autoencoder is a neural network trained to produce an output that is very similar to its input (it basically attempts to copy its input to its output); since it doesn't need any targets (labels), it can be trained in an unsupervised manner.

In the Adversarial Autoencoder model, two objectives are optimized: 1) the minimization of the reconstruction error of the data x through the encoder and decoder networks, P and Q respectively, and 2) the enforcement of a prior distribution on the code, via adversarial training. So while the encoder and decoder are optimized to minimize the distance between x and Q(z), where z = P(x) is the code-space vector of the autoencoder, the encoder and a Discriminator are optimized as a GAN to force the code distribution P(x) to match a pre-defined structure. This can be seen as a regularization on the autoencoder, forcing it to learn a meaningful, structured and cohesive code [12] that allows for effective feature extraction or dimensionality reduction. Adversarial autoencoders can also be used to disentangle the style and content of images, as shown in [14]. A minimal sketch of the two objectives is given below.
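A minimal sketch of the two losses, keeping the article's naming (P for the encoder, Q for the decoder); the Gaussian prior and layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

data_dim, code_dim = 784, 8

P = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))  # encoder
Q = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))  # decoder
D = nn.Sequential(nn.Linear(code_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())
bce, mse = nn.BCELoss(), nn.MSELoss()

x = torch.randn(32, data_dim)
z = P(x)

# Objective 1: reconstruction error through encoder and decoder.
recon_loss = mse(Q(z), x)

# Objective 2: adversarial game on the code space, pushing P(x)
# towards a chosen prior (here a standard Gaussian).
z_prior = torch.randn(32, code_dim)
d_loss = bce(D(z_prior), torch.ones(32, 1)) + bce(D(z.detach()), torch.zeros(32, 1))
g_loss = bce(D(z), torch.ones(32, 1))  # encoder tries to make its codes look like the prior
print(recon_loss.item(), d_loss.item(), g_loss.item())
```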

The figure shows the architecture of an adversarial autoencoder: in the top row, a standard autoencoder reconstructs an image x from a latent code z; the bottom row shows a second network trained to discriminatively predict whether a sample comes from the hidden code of the autoencoder or from a distribution specified by the user. Another thing we can do is train the Adversarial Autoencoder with labels, forcing the disentanglement of label and digit-style information. This way, by fixing the desired label, variations in the imposed continuous latent space result in different styles of the same digit, for example different handwritten styles of the digit eight.

Time series data
Often, real-world structured data consists of time series, data in which each sample has some dependence on the previous one. For this type of data, Recurrent Neural Network (RNN)-based models are often chosen for their intrinsic ability to model such dependencies. Leveraging these networks in our GAN models could, in principle, result in higher-quality samples and features.

Recurrent GAN
We replace the Multi-Layer Perceptrons (MLPs) we used before in our GANs with Recurrent Neural Networks (RNNs), in particular Long Short-Term Memory (LSTM) units. This has had widespread applications, such as generating speech, music, medical data, and abstract-reasoning diagrams. A minimal sketch of an LSTM-based generator is given below.
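A minimal sketch of an LSTM generator that turns a sequence of noise vectors into a generated sequence; the dimensions are placeholders and the code is illustrative.

```python
import torch
import torch.nn as nn

class LSTMGenerator(nn.Module):
    """Maps a sequence of noise vectors to a generated real-valued sequence."""
    def __init__(self, noise_dim: int = 8, hidden_dim: int = 32, out_dim: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(noise_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_dim, out_dim), nn.Tanh())

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(z)   # (batch, seq_len, hidden_dim)
        return self.head(h)   # (batch, seq_len, out_dim)

g = LSTMGenerator()
z = torch.randn(4, 24, 8)      # batch of 4 noise sequences, 24 steps each
print(g(z).shape)              # torch.Size([4, 24, 1])
```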

SeqGAN
While using RNNs in GANs works for real-valued sequential data generation, it still does not work for discrete sequences. A major reason is that the discrete outputs of the generative model make it difficult to pass the gradient update from the discriminative model back to the generative model. In addition, the discriminative model can only assess a complete sequence; for a partially generated sequence, it is non-trivial to balance its current score against the future score once the entire sequence has been generated. The sequence-generation framework SeqGAN was proposed to solve these problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing a policy-gradient update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines. A very small sketch of the policy-gradient idea is given below.
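A deliberately small sketch of the policy-gradient idea only (a REINFORCE-style, reward-weighted log-likelihood where the reward would come from a discriminator score); it omits the Monte Carlo rollouts and everything else in SeqGAN, and all names and sizes are placeholders.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim, seq_len, batch = 20, 32, 10, 4

embed = nn.Embedding(vocab_size, hidden_dim)
lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
out = nn.Linear(hidden_dim, vocab_size)

# Sample discrete tokens from the generator policy, keeping their log-probs.
tokens, log_probs, h = [], [], None
x = torch.zeros(batch, 1, dtype=torch.long)          # start token
for _ in range(seq_len):
    o, h = lstm(embed(x), h)
    dist = torch.distributions.Categorical(logits=out(o[:, -1]))
    x = dist.sample().unsqueeze(1)
    tokens.append(x)
    log_probs.append(dist.log_prob(x.squeeze(1)))

sequence = torch.cat(tokens, dim=1)                   # (batch, seq_len) discrete tokens

# Reward on the complete sequence (stand-in: random scores;
# in SeqGAN this would be the discriminator's judgement D(sequence)).
reward = torch.rand(batch)

# REINFORCE: push up log-probs of sequences that earned high reward.
policy_loss = -(torch.stack(log_probs, dim=1).sum(dim=1) * reward).mean()
policy_loss.backward()
```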

Cramér GAN
A further improvement on the Wasserstein GAN is the Cramér GAN, which aims at providing even better-quality samples and improved training stability. The authors claim that the Cramér distance possesses all three natural properties of probability divergences: 1) sum invariance, 2) scale sensitivity, and 3) unbiased sample gradients. The Wasserstein metric possesses the first two properties but not the third [9]. Professor Arthur Gretton [8] notes that it is a good idea overall, but that the paper uses a problematic approximation.

My thoughts on Future Research in GANs
GANs have been very promising for generating new data and have many applications, though so far mostly in the field of computer vision (images and videos). Recently, researchers at Insilico Medicine proposed an approach to artificially intelligent drug discovery using GANs [13]. The goal is to train the generator to sample drug candidates for a given disease that are as close as possible to existing drugs in a drug database; after training, it would be possible to generate a drug for a previously incurable disease, using the discriminator to determine whether a sampled drug actually treats the given disease. GANs have also been applied to reinforcement learning as Generative Adversarial Imitation Learning, which makes it possible for an agent to learn policies from expert demonstrations, without rewards, in hard AI environments. The push GANs have given to the concept of disentanglement is quite interesting and can have widespread applications; I am particularly interested in seeing it applied to financial data, since if we can disentangle the various factors we may be able to come up with a Black-Scholes-type strategy. Still, there is a long way to go, as we need to improve the stability of GANs. There is a lot of experimentation on how to improve stability, be it in how gradients are computed or in switching from batch normalization to weight normalization. It is also unclear whether GANs properly learn the underlying distribution; they do perform some distribution learning, but formalizing what they learn is still an open problem. However, I feel the deeper promise of GANs is that, in the process of training generative models, we will endow computers with an understanding of the world and what it is made of, and make them more creative.

References

[1] I. Goodfellow et al., "Generative Adversarial Nets," NIPS, 2014.
[2] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved Techniques for Training GANs," Advances in Neural Information Processing Systems.
[3] M. Arjovsky and L. Bottou, "Towards Principled Methods for Training Generative Adversarial Networks," arXiv:1701.04862, 2017.
[4] https://github.com/soumith/ganhacks
[5] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network."
[6] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN."
[7] https://www.quora.com/Why-isnt-the-Jensen-Shannon-divergence-used-more-often-than-the-Kullback-Leibler-since-JS-is-symmetric-thus-possibly-a-better-indicator-of-distance
[8] https://towardsdatascience.com/notes-on-the-cramer-gan-752abd505c00
[9] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos, "The Cramer Distance as a Solution to Biased Wasserstein Gradients."
[10] A. P. Majtey, P. W. Lamberti, and D. P. Prato, "Jensen-Shannon Divergence as a Measure of Distinguishability Between Mixed Quantum States."
[11] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial Feature Learning."
[12] https://www.cs.toronto.edu/~hinton/csc2535/notes/lec11new.pdf
[13] M. Benhenda, "ChemGAN Challenge for Drug Discovery: Can AI Reproduce Natural Chemical Diversity?"
[14] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, "Adversarial Autoencoders."
[15] https://medium.com/jungle-book/towards-data-set-augmentation-with-gans-9dd64e9628e6