Generative Adversarial Networks

Generative Adversarial Networks (GANs) reframe generative modelling as a non-cooperative game between two neural networks, efficiently turning random noise into realistic data across domains such as images, text, and audio.

Generative Modeling
PyTorch
Published

March 20, 2026

Introduction

Imagine an art forger and a museum curator locked in an endless game: the forger produces fake paintings, the curator learns to spot them, so the forger improves, and so on. Round after round, each player forces the other to get better and the forgeries become nearly perfect. This race is the core idea behind Generative Adversarial Networks (GANs): a powerful framework to train so-called generative models. This post walks through the key ideas so you can build a rigorous intuition for what GANs are, why they work, and what their strengths and weaknesses are.

What is Generative Modelling?

Generative modelling is an unsupervised learning task that consists of learning an estimate, \(p_{\text{model}}\), of an unknown probability distribution, \(p_{\text{data}}\), given a set of samples drawn from it, \(\mathcal{D} = \{\vb{x}^{(i)}\}_{i=1}^m\). In some cases, the generative model estimates \(p_{\text{model}}\) explicitly; in others, it does so implicitly, learning to generate samples \(\tilde{\vb{x}}\) from \(p_{\text{model}}\) without ever representing the density directly. The goal is not just to memorise the training data but to capture the underlying structure of the data distribution, enabling the model to synthesise new, plausible examples that share the same statistical characteristics as the training set.

Data can be anything that can be represented numerically: images, text, audio, molecular structures, or even quantum states. In the case of images, for example, this means that the generative model learns to estimate the probability density \(p_{\text{data}}(\vb{x})\) of each possible combination \(\vb{x}\) of the pixel values, sometimes by learning an explicit density function, and sometimes by learning to generate samples that match the distribution without ever representing it directly.

Often, the different components \(x_i\) of the visible or observed variable \(\vb{x}=(x_1,\ldots, x_n)\) are highly dependent on each other. In the context of deep learning, the most commonly used approach to model these dependencies is to introduce several latent or hidden variables, \(\vb{z}\). The model thus tries to learn the dependencies between any pair of components \(x_i\) and \(x_j\) indirectly, through the direct dependencies between \(x_i\) and \(\vb{z}\), and the direct dependencies between \(x_j\) and \(\vb{z}\). During training, the model learns to give meaning to the latent space, and \(\vb{z}\) becomes a compressed representation of the data.

GANs: Two Networks, One Game

Generative adversarial networks, commonly known as GANs, were first presented in (Goodfellow et al. 2014) as a strategy for training a generative model. What defines this technique is that it reframes an unsupervised learning task as a supervised one by introducing two competing submodels:

  • The generator \(G\): the generative model we ultimately want. It takes a random latent vector \(\vb{z}\) as a seed and produces a new sample \(\tilde{\vb{x}}\). After training, points in the latent space form a compressed representation of the data distribution, and the generator acts as a decoder.
  • The discriminator \(D\): an auxiliary binary classifier introduced to supply a supervised training signal. It attempts to distinguish real samples (from the training set) from generated ones. How well the generator fools the discriminator at each training step provides the feedback used to update \(G\). Once training is complete, \(D\) is typically discarded.
Figure 1: GANs framework.

As shown in Figure 1, the two models are trained together in what, in game theory, is called a non-cooperative game: \(D\) tries to detect whether a sample is generated or real, while \(G\) tries to maximize the probability that \(D\) is wrong. In the limit, and if both models have sufficient capacity, the discriminator can do no better than random guessing, assigning probability 1/2 to every sample, which means that the generator has learned to create realistic samples, that is, it has correctly estimated the data distribution \(p_{\text{data}}\). Because it is natural to analyse GANs with the tools of non-cooperative game theory, this architecture is called “adversarial”. However, it also admits a cooperative interpretation: the discriminator, rather than an adversary, acts more like an instructor teaching the generator how to better estimate the data distribution.

GANs estimate the data distribution implicitly by generating samples from \(p_{\text{model}}\), so there is no need to predefine a parametric family of density functions. The term “network” reflects the fact that both components are typically implemented as deep neural networks — powerful function approximators capable of modelling complex, high-dimensional distributions. More formally, the generator \(G\) is a differentiable function with parameters \(\vb*{\theta}^{(G)}\) that maps a random vector \(\vb{z} \sim p_{\vb{z}}\) (usually a Gaussian) to a sample \(G(\vb{z};\vb*{\theta}^{(G)})\) drawn from a distribution \(p_G\) that approximates \(p_{\text{data}}\). The discriminator \(D\) is a differentiable scalar function with parameters \(\vb*{\theta}^{(D)}\) whose output \(D(\vb{x}; \vb*{\theta}^{(D)})\) estimates the probability \(P_D(Y=1\mid \vb{x})\) that a given sample \(\vb{x}\) comes from the real distribution \(p_{\text{data}}\) (\(y=1\)) rather than from \(p_G\) (\(y=0\)).
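To make these definitions concrete, here is a minimal PyTorch sketch of the two networks. The architectures, layer sizes, and the latent and data dimensions are illustrative choices (a tiny MLP pair for flattened 28×28 images), not the networks from the original paper:

```python
import torch
import torch.nn as nn

LATENT_DIM = 64   # dimension of z (illustrative choice)
DATA_DIM = 784    # e.g. flattened 28x28 images

# Generator G: maps a latent vector z to a sample x_tilde in data space.
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, DATA_DIM),
    nn.Tanh(),  # outputs in [-1, 1], matching data rescaled to that range
)

# Discriminator D: maps a sample x to D(x) = P_D(Y=1 | x), the estimated
# probability that x comes from p_data rather than from p_G.
D = nn.Sequential(
    nn.Linear(DATA_DIM, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)

z = torch.randn(16, LATENT_DIM)   # a batch of latent seeds z ~ p_z (Gaussian)
x_tilde = G(z)                    # generated samples, drawn from p_G
p_real = D(x_tilde)               # discriminator's estimate, in (0, 1)
```

The sigmoid output layer is what lets \(D\) be read as a probability, and it is also what makes the binary cross-entropy cost below the natural training objective.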

The cost functions \(J^{(G)}\) and \(J^{(D)}\) depend on the parameters of both models: the discriminator seeks to minimise \(J^{(D)}(\vb*{\theta}^{(G)}, \vb*{\theta}^{(D)})\) by controlling only \(\vb*{\theta}^{(D)}\), while the generator seeks to minimise \(J^{(G)}(\vb*{\theta}^{(G)}, \vb*{\theta}^{(D)})\) by controlling only \(\vb*{\theta}^{(G)}\). Since each player’s cost function depends not only on its own parameters but also on those of the other player, which it cannot control, it is more natural to describe this scenario as a (non-cooperative) game than as a classical optimisation problem. While the solution to a classical optimisation problem is a (local) minimum in parameter space, the solution to a non-cooperative game is a Nash equilibrium — a tuple \((\vb*{\theta}^{(G)}, \vb*{\theta}^{(D)})\) that is simultaneously a local minimum of \(J^{(D)}\) with respect to \(\vb*{\theta}^{(D)}\) and a local minimum of \(J^{(G)}\) with respect to \(\vb*{\theta}^{(G)}\).

The cost function for the discriminator was originally defined as:

\[ J^{(D)}(\vb*{\theta}^{(G)}, \vb*{\theta}^{(D)}) = -\mathbb{E}_{\vb{x}\sim p_{\text{data}}}[\log \underbrace{(D(\vb{x}))}_{P_D(Y=1\mid \vb{x})}] -\mathbb{E}_{\vb{z}\sim p_{\vb{z}}}[\log\underbrace{(1-D(G(\vb{z})))}_{P_D(Y=0\mid G(\vb{z}))}], \]

where \(\mathbb{E}_{\vb{x}\sim p}[f(\vb{x})]\) denotes the expected value of \(f (\vb{x})\) when \(\vb{x}\) is taken from the distribution \(p\). This is the cost function proposed in (Goodfellow et al. 2014) and is simply the binary cross-entropy loss, commonly used to train binary classifiers with a sigmoid output.
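Since this is exactly the binary cross-entropy, \(J^{(D)}\) can be computed either directly from the formula or with PyTorch's built-in loss. The sketch below checks that the two agree on a batch of illustrative discriminator outputs (the numeric values are made up for the example):

```python
import torch
import torch.nn.functional as F

# Discriminator outputs for a batch of real and generated samples
# (illustrative values, not from a trained model)
d_real = torch.tensor([0.9, 0.8, 0.7])   # D(x) for real x
d_fake = torch.tensor([0.2, 0.3, 0.1])   # D(G(z)) for generated samples

# J^(D) written out directly from the formula above,
# with expectations replaced by batch means
j_d_manual = -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())

# The same quantity via the standard binary cross-entropy loss:
# real samples get target y=1, generated samples get target y=0.
j_d_bce = F.binary_cross_entropy(d_real, torch.ones(3)) \
        + F.binary_cross_entropy(d_fake, torch.zeros(3))

assert torch.allclose(j_d_manual, j_d_bce)
```

In practice this is why GAN implementations simply reuse the stock binary-classification machinery for the discriminator update.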

The simplest version of the game is what game theory calls a zero-sum game:

\[ J^{(G)} = -J^{(D)}. \]

In a zero-sum game, the sum of both players’ costs is zero, so any gain for one implies an equal loss for the other.

This relationship allows the entire game to be summarised by a single value function, defined as \[ V(\vb*{\theta}^{(G)}, \vb*{\theta}^{(D)}) = J^{(G)} = -J^{(D)} = \mathbb{E}_{\vb{x}\sim p_{\text{data}}}\bqty{\log\pqty{D(\vb{x})}}+\mathbb{E}_{\vb{z}\sim p_{\vb{z}}}\bqty{\log\pqty{1-D(G(\vb{z}))}}, \]

so the parameters of the optimal generator are

\[ \vb*{\theta}^{(G)*} = \underset{\vb*{\theta}^{(G)}}{\arg\min}\bqty{\max_{\vb*{\theta}^{(D)}} V(\vb*{\theta}^{(G)}, \vb*{\theta}^{(D)})}. \]

Concretely, to learn to correctly classify examples as real or generated, \(D\) is trained to maximize the logarithm of \(P_D(Y=1\mid \vb{x}) = D(\vb{x})\) for the real samples and the logarithm of \(P_D(Y=0\mid \vb{x}) = 1-D(\vb{x})\) for the generated ones (\(\vb{x} = G(\vb{z})\)), so that, ideally, the estimated probability for the real samples is 1 and for the generated ones, 0. Conversely, \(G\) tries to achieve just the opposite objective, that is, that the probability predicted by the discriminator for the generated samples is close to 1, making the discriminator “believe” that the generated samples are real. Because their solution involves a minimization process in an outer loop and a maximization process in an inner loop, zero-sum games are also called minimax games.

Using this formulation, (Goodfellow et al. 2014) shows that:

  • Learning in this game amounts to minimizing the so-called Jensen-Shannon divergence between the distributions \(p_{\text{data}}\) and \(p_G\).
  • The game has a global optimum in which \(G(\vb{z})\) follows the same distribution as the training data, that is, \(p_G = p_{\text{data}}\), and \(D(\vb {x}) = 1/2\) for all \(\vb{x}\).
  • Algorithm 1 ensures the convergence of \(p_G\) to \(p_{\text{data}}\) if \(G\) and \(D\) can be directly updated in the function space.¹
\begin{algorithm}
\caption{GANs Training via Mini-batch Gradient Descent}
\begin{algorithmic}
\For{$n$ training iterations}
    \For{$k$ steps}
        \State Sample $m$ noise vectors $\{\mathbf{z}^{(i)}\}_{i=1}^m$ from $p_{\mathbf{z}}.$
        \State Sample $m$ real examples $\{\mathbf{x}^{(i)}\}_{i=1}^m$ from $p_{\text{data}}.$
        \State Update $\boldsymbol{\theta}^{(D)}$ by descending: \\
        $\displaystyle\nabla_{\boldsymbol{\theta}^{(D)}} J^{(D)} = -\nabla_{\boldsymbol{\theta}^{(D)}} \frac{1}{m}\sum_{i=1}^{m}\left[\log D(\mathbf{x}^{(i)}) + \log\left(1 - D(G(\mathbf{z}^{(i)}))\right)\right].$
    \EndFor
    \State Sample $m$ noise vectors $\{\mathbf{z}^{(i)}\}_{i=1}^m$ from $p_{\mathbf{z}}.$
    \State Update $\boldsymbol{\theta}^{(G)}$ by descending: \\
    $\displaystyle\nabla_{\boldsymbol{\theta}^{(G)}} J^{(G)} = \nabla_{\boldsymbol{\theta}^{(G)}} \frac{1}{m}\sum_{i=1}^{m}\log\left(1 - D(G(\mathbf{z}^{(i)}))\right).$
\EndFor
\end{algorithmic}
\end{algorithm}
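The training procedure above can be sketched in PyTorch on a toy one-dimensional problem. Everything here is an illustrative choice (the target distribution, the tiny MLP architectures, learning rates, batch size, and iteration count), not a recipe from the paper; the loop structure, however, mirrors Algorithm 1, including the minimax generator loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy target: learn a 1-D Gaussian N(3.0, 0.5^2) from samples (illustrative).
def sample_data(m):
    return 3.0 + 0.5 * torch.randn(m, 1)

LATENT_DIM = 8
G = nn.Sequential(nn.Linear(LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

m, k = 64, 1   # mini-batch size and discriminator steps per iteration
eps = 1e-8     # numerical safety inside the logarithms
for _ in range(2000):
    for _ in range(k):
        # Update theta^(D) by descending the gradient of J^(D)
        x = sample_data(m)
        z = torch.randn(m, LATENT_DIM)
        loss_D = -(torch.log(D(x) + eps).mean()
                   + torch.log(1 - D(G(z)) + eps).mean())
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()
    # Update theta^(G) by descending the gradient of J^(G) (minimax form)
    z = torch.randn(m, LATENT_DIM)
    loss_G = torch.log(1 - D(G(z)) + eps).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

samples = G(torch.randn(5000, LATENT_DIM)).detach()
print(samples.mean().item(), samples.std().item())  # compare with 3.0 and 0.5
```

Note that only \(\vb*{\theta}^{(D)}\) is stepped when descending \(J^{(D)}\) and only \(\vb*{\theta}^{(G)}\) when descending \(J^{(G)}\), exactly as the game formulation requires: each player controls only its own parameters.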

Strengths and Weaknesses of GANs

GANs have several strengths that have contributed to their popularity and success in generative modelling. First, they are capable of producing high-quality, realistic samples that are difficult to distinguish from real data. In this regard, GANs showed superior performance compared to previous frameworks such as Variational Autoencoders (VAEs) (Kingma and Welling 2022). Second, generation requires only a single forward pass through the generator, making inference fast. This is a considerable advantage over so-called diffusion models (Ho et al. 2020), which require many sequential denoising steps. Third, the GAN framework is architecture-agnostic: the generator and discriminator can be implemented with any differentiable model, such as convolutional networks or recurrent networks. This flexibility makes GANs applicable to many data types: images, audio, video, and beyond. Finally, the basic framework extends naturally to conditional generation by feeding auxiliary information (class labels, text, a paired image) into both networks, as shown in Figure 2. This has yielded conditional GANs (Mirza and Osindero 2014), AC-GANs (Odena et al. 2017), InfoGANs (Chen et al. 2016), and image-to-image translation models such as Pix2Pix (Isola et al. 2017) and CycleGAN (Zhu et al. 2017), enabling fine-grained control over the generated output. These extensions have made GANs a versatile tool for many tasks: super-resolution, style transfer, data augmentation, representation learning, and more.

Figure 2: Example of conditional GANs framework. In this case, AC-GANs.
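The conditioning mechanism itself is simple: the auxiliary information is appended to the inputs of the networks. The sketch below shows a generator conditioned on a class label via a learned embedding concatenated to \(\vb{z}\), in the spirit of (Mirza and Osindero 2014); all sizes and the embedding dimension are illustrative choices:

```python
import torch
import torch.nn as nn

NUM_CLASSES, LATENT_DIM, DATA_DIM = 10, 64, 784  # illustrative sizes

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, 16)  # learned label embedding
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 16, 256), nn.ReLU(),
            nn.Linear(256, DATA_DIM), nn.Tanh(),
        )

    def forward(self, z, y):
        # Condition generation by concatenating the label embedding to z
        return self.net(torch.cat([z, self.embed(y)], dim=1))

G = ConditionalGenerator()
z = torch.randn(4, LATENT_DIM)
y = torch.tensor([0, 3, 3, 7])   # desired class for each sample
x_tilde = G(z, y)                # one sample per requested label
```

The discriminator is conditioned the same way, so that it judges not just realism but consistency between a sample and its label.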

However, GANs also present several weaknesses that have limited their applicability and reliability. Generative modelling is an inherently hard problem: unlike discriminative modelling, a generative model must capture the full richness and diversity of the data distribution rather than just predicting labels. In practice, generative model designers face a persistent trilemma: it is difficult to simultaneously achieve high sample quality, broad diversity (mode coverage), and fast sampling. Most approaches sacrifice at least one of these properties. GANs occupy a distinctive corner of this trilemma: they achieve high sample quality and fast inference, but often at the cost of limited diversity and training instability.

The first problem, limited diversity, is known as mode collapse: the generator learns to produce a narrow subset of plausible outputs rather than covering the full data distribution. It may discover a handful of modes of the data distribution that are easy to replicate and consistently fool the discriminator, then keep generating from those modes while ignoring the rest of the distribution. The root cause is that the discriminator evaluates individual samples in isolation, with no access to diversity information, so there is no direct penalty for repetition. Partial mitigations have been proposed (Salimans et al. 2016), but mode collapse remains a theoretically unresolved failure mode and a clear disadvantage relative to other frameworks such as VAEs and diffusion models.
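One of the partial mitigations proposed in (Salimans et al. 2016) is feature matching: instead of directly maximising the discriminator's confusion on individual samples, the generator is trained to match the batch statistics of an intermediate discriminator layer between real and generated data, which gives it a batch-level rather than per-sample signal. A minimal sketch, where the random feature tensors stand in for the activations of some intermediate discriminator layer (shapes are illustrative):

```python
import torch

def feature_matching_loss(feat_real, feat_fake):
    # Squared distance between the batch-mean discriminator features of
    # real and generated samples (feature matching, Salimans et al. 2016)
    return ((feat_real.mean(dim=0) - feat_fake.mean(dim=0)) ** 2).mean()

# Illustrative stand-ins for intermediate discriminator activations
feat_real = torch.randn(32, 128)
feat_fake = torch.randn(32, 128)
loss = feature_matching_loss(feat_real, feat_fake)
```

This does not solve mode collapse in general, but it removes some of the incentive to replay a single convincing sample.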

The second problem of GANs is training instability. The adversarial objective is a game, not a standard loss minimisation, and the dynamics of two coupled optimisers are far less well understood than gradient descent on a single objective. If the discriminator becomes too strong too quickly, the generator’s gradients vanish; if the generator dominates, the discriminator loses its ability to provide a useful signal. Sensitivity to hyperparameters is high, and training can oscillate or diverge without warning. Various best practices have been developed to mitigate these issues (Salimans et al. 2016; Arjovsky et al. 2017; Gulrajani et al. 2017), but they do not eliminate them, and GAN training is still widely regarded as a dark art compared to the stable objectives of other methods such as VAEs or diffusion models.
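The vanishing-gradient half of this problem can be seen directly in the losses. When the discriminator confidently rejects a generated sample, \(D(G(\vb{z}))\) is near 0 and the minimax generator loss \(\log(1-D(G(\vb{z})))\) saturates; the non-saturating alternative \(-\log D(G(\vb{z}))\), already suggested in (Goodfellow et al. 2014), keeps the gradient large in exactly that regime. The sketch below compares the two gradients with respect to the discriminator output; the value 0.01 for \(D(G(\vb{z}))\) is an illustrative worst case:

```python
import torch

# D(G(z)) when the discriminator is near-certain the sample is fake
d_out = torch.tensor(0.01, requires_grad=True)

loss_minimax = torch.log(1 - d_out)    # J^(G) in the zero-sum game
grad_minimax, = torch.autograd.grad(loss_minimax, d_out)

loss_nonsat = -torch.log(d_out)        # non-saturating alternative
grad_nonsat, = torch.autograd.grad(loss_nonsat, d_out)

# |d loss / d D| is about 1 for the minimax loss but about 100 for the
# non-saturating one, so the generator still gets a strong early signal.
print(grad_minimax.item(), grad_nonsat.item())
```

Both losses push \(D(G(\vb{z}))\) toward 1, but only the non-saturating one does so with usable gradients when the discriminator is winning, which is the typical situation at the start of training.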

Conclusion

GANs represent a landmark idea in machine learning: by framing generative modelling as a two-player game, (Goodfellow et al. 2014) turned an intractable unsupervised problem into a tractable adversarial one. The elegance and versatility of the framework, its theoretical grounding in game theory, and its remarkable empirical results have made GANs one of the most influential architectures of the past decade. Yet, as we have seen, this power comes with a price: mode collapse and training instability are not mere engineering nuisances but fundamental consequences of the adversarial objective, and they remain only partially solved to this day. In the broader landscape of generative modelling, GANs occupy a well-defined niche: fast, high-fidelity, and highly controllable. However, they have ceded ground to diffusion models in tasks where diversity and stability matter most. Understanding GANs deeply, both their strengths and their failure modes, is therefore not just historically valuable: it sharpens the intuition needed to navigate the full spectrum of modern generative models and to choose the right tool for the task at hand.

References

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. “Wasserstein Generative Adversarial Networks.” Proceedings of the 34th International Conference on Machine Learning, July, 214–23. https://proceedings.mlr.press/v70/arjovsky17a.html.
Chen, Xi, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets.” Advances in Neural Information Processing Systems 29. https://proceedings.neurips.cc/paper_files/paper/2016/hash/7c9d0b1f96aebd7b5eca8c3edaa19ebb-Abstract.html.
Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, et al. 2014. “Generative Adversarial Nets.” Advances in Neural Information Processing Systems 27. https://proceedings.neurips.cc/paper_files/paper/2014/hash/f033ed80deb0234979a61f95710dbe25-Abstract.html.
Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. “Improved Training of Wasserstein GANs.” Advances in Neural Information Processing Systems 30. https://papers.nips.cc/paper_files/paper/2017/hash/892c3b1c6dccd52936e27cbd0ff683d6-Abstract.html.
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems 33: 6840–51. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html.
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. “Image-to-Image Translation with Conditional Adversarial Networks.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July, 1125–34. https://doi.org/10.1109/CVPR.2017.632.
Kingma, Diederik P., and Max Welling. 2022. Auto-Encoding Variational Bayes. arXiv. https://doi.org/10.48550/arXiv.1312.6114.
Mirza, Mehdi, and Simon Osindero. 2014. Conditional Generative Adversarial Nets. arXiv. https://doi.org/10.48550/arXiv.1411.1784.
Odena, Augustus, Christopher Olah, and Jonathon Shlens. 2017. Conditional Image Synthesis With Auxiliary Classifier GANs. arXiv. https://doi.org/10.48550/arXiv.1610.09585.
Salimans, Tim, Ian Goodfellow, Wojciech Zaremba, et al. 2016. “Improved Techniques for Training GANs.” Advances in Neural Information Processing Systems 29. https://papers.nips.cc/paper_files/paper/2016/hash/8a3363abe792db2d8761d6403605aeb7-Abstract.html.
Zhu, Jun-Yan, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks.” 2017 IEEE International Conference on Computer Vision (ICCV) (Venice), October, 2242–51. https://doi.org/10.1109/ICCV.2017.244.

Footnotes

  1. In practice, \(G\) and \(D\) are represented by deep neural networks and updated in parameter space, so this result does not strictly apply, although it remains a useful guide.↩︎