Let’s derive the evidence lower bound used in variational inference using Jensen’s inequality and the Kullback-Leibler divergence.
In Bayesian inference we are often tasked with solving analytically intractable integrals. Approximate techniques such as Monte Carlo and quadrature work well when the variable being integrated over is low-dimensional. However, for high-dimensional integrals, practitioners often resort to Markov chain Monte Carlo or variational inference (VI). In this article, we will cover the latter, starting from basic results encountered in high school or an undergraduate science degree.
Background
In this section we’ll briefly highlight the preliminary results required for understanding VI. The aim here is to establish notation and to refresh readers on a handful of results. If these results are unfamiliar, then an introduction to Bayesian statistics, such as the Bayesian Data Analysis book, may be a more suitable starting point.
Bayes’ Theorem
The cornerstone of Bayesian statistics, Bayes’ theorem quantifies the probability of an event, conditional on some observed data. The event under consideration here is arbitrary and we simply denote its value by $\theta$. We state our prior belief about the value of $\theta$ through a prior distribution $p_{0}(\theta)$, and our observed data is denoted by $y$. Bayes’ theorem then combines these quantities as
$$\begin{align}
\label{equn:BayesTheorem}
p(\theta \mid y) = \frac{p(y \mid \theta)p_{0}(\theta)}{p(y)} \ .
\end{align}$$
We'll expand more on this when we get to the topic of VI but, for completeness, we call $p(y\mid\theta)$ the likelihood and $p(y)$ the marginal likelihood.
Expectation
In $\eqref{equn:BayesTheorem}$, $\theta$ is a random variable with associated density $\pi(\theta)$. Whilst we place a prior belief $p_{0}(\theta)$ about the density of $\theta$, this, in practice, is almost certainly not the true underlying density of $\theta$. Assuming $\theta$ is a continuous random variable, the expectation operator is defined as
$$\begin{align}
\label{equn:ExpectationDef}
\mathbb{E}[\theta] = \int \theta \pi(\theta)\mathrm{d}\theta \ .
\end{align}$$
The expectation operator is also linear. This means that if we have a second random variable $\zeta$, then the following identity is true
$$\begin{align}
\label{equn:expectation_linear}
\mathbb{E}[\theta + \zeta] = \mathbb{E}[\theta] + \mathbb{E}[\zeta]\ .
\end{align}$$
Jensen’s Inequality
For a convex function $f$, Jensen’s inequality states that
$$\begin{align}
\label{equn:Jensens}
f(\mathbb{E}[\theta]) \leq \mathbb{E}[f(\theta)] \ .
\end{align}$$
Informally, a convex function $f$ is a u-shaped function such that if we pick any two values $x_1, x_2$ and draw a straight line between $f(x_1)$ and $f(x_2)$, then the line lies above the function's curve between those two points. For a concave function, such as $\log$, the inequality is reversed: $f(\mathbb{E}[\theta]) \geq \mathbb{E}[f(\theta)]$. ![](/imgs/vi_from_scratch/convex_fn.png)
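As a quick sanity check, here is a small numerical sketch of Jensen's inequality, assuming the convex function $f(x) = \exp(x)$ and standard Gaussian samples for $\theta$; both choices are made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

lhs = np.exp(theta.mean())   # f(E[theta]) ~ exp(0) = 1
rhs = np.exp(theta).mean()   # E[f(theta)] ~ exp(0.5) ~ 1.65
assert lhs <= rhs            # Jensen's inequality for convex f
print(lhs, rhs)
```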
Motivation
With the necessary mathematical tools in place, I’ll now motivate the use of VI through three cases where an intractable integral commonly occurs.
Normalising Constants
Bayes’ theorem in $\eqref{equn:BayesTheorem}$ can be written as
$$\begin{align}
\label{equn:NormalisingConstant}
p(\theta \mid y) = \frac{p(y \mid \theta)p_{0}(\theta)}{\int p(y \mid \theta)p_{0}(\theta)\mathrm{d}\theta}\ .
\end{align}$$
The denominator here is simply the integral of the numerator with respect to our model parameters $\theta$ and its purpose is to ensure that the posterior distribution $p(\theta \mid y)$ is a valid probability density function. However, $\theta$ can often be a high-dimensional parameter and the integrand may have no nice analytical form. As such, evaluating the integral can consume a large portion of a Bayesian's time.
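To make this concrete, here is a one-dimensional sketch where the normalising constant is computed by quadrature. The Gaussian likelihood and Student-t prior are assumptions chosen so that the product has no convenient conjugate form; in one dimension quadrature is easy, but this is exactly the step that becomes infeasible as the dimension of $\theta$ grows.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

y = np.array([0.8, 1.2, 0.5, 1.7])  # toy observations

def likelihood(theta):
    # p(y | theta): independent Gaussian observations with unit variance
    return np.prod(stats.norm.pdf(y, loc=theta, scale=1.0))

def prior(theta):
    # p0(theta): a Student-t prior, deliberately non-conjugate
    return stats.t.pdf(theta, df=3.0)

# p(y) = integral of p(y | theta) p0(theta) dtheta; the integrand's mass lies
# comfortably inside (-10, 10) for this toy example.
p_y, _ = quad(lambda t: likelihood(t) * prior(t), -10, 10)
print(p_y)
```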
Marginalisation
If we introduce an additional parameter $z \in Z$ that we’d like to obtain a posterior distribution for, then we now have the joint posterior distribution $p(\theta, z \mid y)$. Assuming we can calculate this quantity, we often wish to marginalise $z$ out of the joint posterior, which requires evaluating the integral
$$\begin{align}
\label{equn:Marginalisation}
p(\theta\mid y) = \int_{Z}p(\theta, z \mid y)\mathrm{d}z \ .
\end{align}
$$
Again, in the absence of an analytical solution, the evaluation of this integral will be challenging for high-dimensional $Z$.
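As an illustration, the sketch below takes the joint density over $(\theta, z)$ to be a correlated bivariate Gaussian (an assumption made purely so the answer is known), integrates $z$ out by quadrature, and compares the result with the known Gaussian marginal of $\theta$.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Assumed joint density over (theta, z): a correlated bivariate Gaussian whose
# theta-marginal is N(0, 1).
joint = stats.multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.6], [0.6, 2.0]])

def marginal_theta(theta):
    # Integrate z out of the joint; its mass sits well inside (-15, 15).
    return quad(lambda z: joint.pdf([theta, z]), -15, 15)[0]

for theta in [-1.0, 0.0, 2.0]:
    print(marginal_theta(theta), stats.norm.pdf(theta, loc=0.0, scale=1.0))
```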
Expectations
If we have our posterior distribution $p(\theta\mid y)$, then we may wish to evaluate the expectation of a function of $\theta$ e.g.,
$$\begin{equation}
\label{equn:Expectations}
\mathbb{E}_{p(\theta \mid y)}[f(\theta)]=\int_{\Theta} f(\theta)p(\theta \mid y) \mathrm{d}\theta \ .
\end{equation}
$$
Examples of such functions are the conditional mean, where $f(\theta)=\theta$, and the conditional covariance, where $f(\theta) = \theta\theta' - \mathbb{E}[\theta]\mathbb{E}[\theta]'$.
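Reusing the toy one-dimensional model from the normalising-constant sketch above (Gaussian likelihood, Student-t prior; both assumptions), posterior expectations such as the conditional mean and variance can be computed by a further quadrature once $p(y)$ is known.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

y = np.array([0.8, 1.2, 0.5, 1.7])  # same toy observations as before

def unnorm_posterior(theta):
    # p(y | theta) p0(theta), i.e. the numerator of Bayes' theorem
    return np.prod(stats.norm.pdf(y, loc=theta, scale=1.0)) * stats.t.pdf(theta, df=3.0)

p_y = quad(unnorm_posterior, -10, 10)[0]                  # normalising constant
posterior = lambda theta: unnorm_posterior(theta) / p_y   # p(theta | y)

post_mean = quad(lambda t: t * posterior(t), -10, 10)[0]  # f(theta) = theta
post_sq = quad(lambda t: t**2 * posterior(t), -10, 10)[0]
print(post_mean, post_sq - post_mean**2)  # conditional mean and variance
```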
Variational Inference
Hopefully by this point the need for techniques for evaluating high-dimensional and challenging integrals is clear. In this derivation we’ll be considering the case of an intractable normalising constant, which makes accessing our posterior distribution intractable. VI seeks to resolve this issue by identifying the distribution $q_{\lambda}^{\star}$ from a family of distributions $\mathcal{Q}$ that best approximates the true posterior $p(\theta\mid y)$. Here $\lambda$ denotes the parameters of $q$, which are known as the variational parameters.
Typically, $\mathcal{Q}$ is the family of multivariate Gaussian distributions, but this is by no means a requirement. We use the Kullback-Leibler (KL) divergence to quantify how well a candidate distribution $q\in\mathcal{Q}$ approximates $p(\theta\mid y)$ and we write this as
$$\begin{align}
\label{equn:KLDiv}
\operatorname{KL}(q_{\lambda}(\theta) \Vert p(\theta\mid y)) := \int q(\theta)\log\frac{q(\theta)}{p(\theta\mid y)}\mathrm{d}\theta \ .
\end{align}$$
There are two things to note here: 1) the KL divergence is asymmetric (to see this, simply rewrite $\eqref{equn:KLDiv}$ with $q$ and $p$ reversed), and 2) we have dropped the dependence of $q$ on $\lambda$ for notational brevity.
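For intuition, here is a sketch of the KL divergence between two univariate Gaussians, estimated by Monte Carlo as $\mathbb{E}_q[\log q(\theta) - \log p(\theta)]$ and compared against the known closed form; the particular means and scales are arbitrary assumptions, and swapping the arguments illustrates the asymmetry.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
q = stats.norm(loc=0.0, scale=1.0)
p = stats.norm(loc=1.0, scale=2.0)

# Monte Carlo estimate of KL(q || p) = E_q[log q(theta) - log p(theta)]
theta = q.rvs(size=500_000, random_state=rng)
kl_mc = np.mean(q.logpdf(theta) - p.logpdf(theta))

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KL(N(m1, s1^2) || N(m2, s2^2))
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

print(kl_mc, kl_gauss(0.0, 1.0, 1.0, 2.0))  # ~0.44, close agreement
print(kl_gauss(1.0, 2.0, 0.0, 1.0))         # ~1.31: reversing q and p differs
```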
We define our optimal distribution $q^{\star}\in\mathcal{Q}$ as the distribution that satisfies
$$\begin{align}
\label{equn:OptimalQ}
q^{\star} = \underset{q\in\mathcal{Q}}{\operatorname{argmin}}\operatorname{KL}(q(\theta)\,\Vert\,p(\theta\mid y))\ .
\end{align}$$
Clearly this objective is intractable, as simply evaluating the KL divergence term in $\eqref{equn:OptimalQ}$ would require evaluating $p(y)$, the very reason we have had to resort to VI in the first place. To circumvent this we introduce the evidence lower bound (ELBO). The ELBO, as the name might suggest, provides a lower bound on $\log p(y)$ that we can tractably optimise; because the KL divergence is always greater than or equal to zero, maximising the ELBO is equivalent to minimising $\eqref{equn:OptimalQ}$. When $q(\theta) = p(\theta\mid y)$, the ELBO is exactly equal to the marginal log-likelihood.
We will now proceed to derive the ELBO in two ways: first using the definition of the KL divergence from $\eqref{equn:KLDiv}$ and secondly by applying Jensen’s inequality from $\eqref{equn:Jensens}$ to $\log p(y)$.
KL Divergence Derivation
From $\eqref{equn:KLDiv}$ we can rewrite the KL divergence in terms of expectations
$$\begin{align}
\label{equn:KLDivTwo}
\operatorname{KL}(q_{\lambda}(\theta)\Vert p(\theta\mid y)) &:= \int q(\theta)\log\frac{q(\theta)}{p(\theta\mid y)}\mathrm{d}\theta \\
& =\mathbb{E}_{q}[\log q(\theta) - \log p(\theta\mid y)] \ .
\end{align}$$
By the linearity of the expectation, we now have the following
$$\begin{align}
\operatorname{KL}(q_{\lambda}(\theta)\Vert p(\theta\mid y)) & = \mathbb{E}_{q}[\log q(\theta) - \log p(\theta\mid y)] \nonumber \\
& = \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(\theta \mid y)] \nonumber \\
& = \mathbb{E}_q[\log q(\theta)] \underbrace{- \mathbb{E}_q[\log p(\theta, y)] + \mathbb{E}_q[\log p(y)]}_{p(\theta \mid y) = p(\theta, y)/p(y)} \nonumber \\
& = \mathbb{E}_q[\log q(\theta)] -\mathbb{E}_q[\log p(\theta, y)] + \log p(y) \label{equn:ELBOKLOne}\ .
\end{align}$$
The final step relies on the fact that $p(y)$ does not depend on $\theta$, so its expectation under $q$ is simply a constant, i.e., $\mathbb{E}_q[\log p(y)] = \log p(y)\,\mathbb{E}_q[1] = \log p(y)$.
We now introduce the ELBO of $q$, denoted $\operatorname{ELBO}(q)$ as the following
$$\begin{align}
\operatorname{ELBO}(q) & = \underbrace{\mathbb{E}_q[\log p(\theta, y)] - \mathbb{E}_q[\log q(\theta)]}_{\text{Rearrangement of \ref{equn:ELBOKLOne}}}\label{equn:ELBOKLDerivPrecursor}\\
& = \underbrace{\mathbb{E}_q[\log p(y \mid \theta) + \log p_{0}(\theta)]}_{p(\theta, y) = p(y \mid \theta) p_{0}(\theta)} - \mathbb{E}_q[\log q(\theta)] \nonumber\\
& = \mathbb{E}_q[\underbrace{\log p(y \mid \theta)}_{\text{Data log-likelihood}}] + \mathbb{E}_q[\underbrace{\log p_{0}(\theta)}_{\text{Prior on } \theta}] - \mathbb{E}_q[\underbrace{\log q(\theta)}_{\text{Variational dist.}}] \nonumber\\
& = \mathbb{E}_q\left[{\log \frac{p_{0}(\theta)}{q(\theta)}}\right] + \mathbb{E}_q[\log p(y \mid \theta)] \nonumber\\
& = \mathbb{E}_q[\log p(y \mid \theta)] - \operatorname{KL}(q(\theta) \Vert p_{0}(\theta)) \label{equn:ELBOKLDeriv} \ .
\end{align}$$
Intuitively, we can read the ELBO as the expected log-likelihood of the data under the variational distribution, minus the KL divergence from our variational distribution to our prior. When optimising this with respect to the variational parameters, the first term in $\eqref{equn:ELBOKLDeriv}$ encourages $q$ to best explain the data, whilst the second term regularises the optimisation by penalising $q$ for straying too far from the prior.
Combining $\eqref{equn:ELBOKLOne}$ with the definition of the ELBO in $\eqref{equn:ELBOKLDerivPrecursor}$, we can see that
$$\begin{align}
\log p(y) = \operatorname{ELBO}(q) + \operatorname{KL}(q(\theta)\Vert p(\theta \mid y))\ .
\end{align}$$
Observing that the KL divergence is always greater than or equal to zero, the ELBO therefore lower bounds the log marginal likelihood,
$$\begin{align*}
\log p(y) \geq \operatorname{ELBO}(q) \ ,
\end{align*}$$
with equality if, and only if, $q(\theta) = p(\theta\mid y)$.
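These identities can be checked numerically. The sketch below assumes a conjugate model (a Gaussian prior and a known-variance Gaussian likelihood, chosen so that the posterior and $\log p(y)$ are exact) together with a deliberately imperfect Gaussian $q$, and verifies that a Monte Carlo estimate of $\operatorname{ELBO}(q)$ plus the closed-form KL to the posterior recovers $\log p(y)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sigma, tau = 1.0, 2.0                               # likelihood and prior scales
y = rng.normal(loc=1.5, scale=sigma, size=20)
n = len(y)

# Exact posterior N(post_mean, post_var) for this conjugate model.
post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
post_mean = post_var * n * y.mean() / sigma**2

# Exact log p(y): marginally, y is multivariate Gaussian with covariance
# sigma^2 I + tau^2 11^T.
cov_y = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
log_py = stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov_y)

# A deliberately imperfect variational distribution q(theta) = N(1.0, 0.5^2).
q_mean, q_std = 1.0, 0.5
theta = rng.normal(q_mean, q_std, size=100_000)

# Monte Carlo ELBO(q) = E_q[log p(y | theta) + log p0(theta) - log q(theta)].
log_lik = stats.norm.logpdf(y[None, :], loc=theta[:, None], scale=sigma).sum(axis=1)
elbo = np.mean(log_lik
               + stats.norm.logpdf(theta, 0.0, tau)
               - stats.norm.logpdf(theta, q_mean, q_std))

# Closed-form KL(q(theta) || p(theta | y)) between two Gaussians.
kl = (np.log(np.sqrt(post_var) / q_std)
      + (q_std**2 + (q_mean - post_mean)**2) / (2 * post_var) - 0.5)

print(log_py, elbo + kl)  # agree up to Monte Carlo error
print(elbo <= log_py)     # the ELBO lower-bounds the log marginal likelihood
```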
Jensen’s Inequality Derivation
The ELBO in $\eqref{equn:ELBOKLDeriv}$ can also be derived using Jensen’s inequality. We’ll proceed in a similar fashion to above by giving the derivation in full with underbraced hints.
$$\begin{align}
\log p(y) & = \underbrace{\log\int p(y, \theta)\mathrm{d}\theta}_{\text{By \ref{equn:NormalisingConstant}}} \nonumber \\
& = \log\int p(y, \theta)\underbrace{\frac{q(\theta)}{q(\theta)}}_{=1} \mathrm{d}\theta\nonumber \\
& = \log\int q(\theta)\frac{p(y, \theta)}{q(\theta)} \mathrm{d}\theta\nonumber \\
& = \log\underbrace{\left(\mathbb{E}_{q}\left[\frac{p(y, \theta)}{q(\theta)}\right]\right)}_{\text{By \ref{equn:ExpectationDef}}}\nonumber \\
& \underbrace{\geq \mathbb{E}_{q}\left[\log\frac{p(y, \theta)}{q(\theta)}\right]}_{\text{By Jensen's inequality \ref{equn:Jensens}, as $\log$ is concave}} \label{equn:ELBOJensenDeriv} \ .
\end{align}$$
At this point we are done in that we have applied Jensen's inequality to get a bound on the marginal log-likelihood. However, unlike the bound in $\eqref{equn:ELBOKLDeriv}$, it is not immediately obvious how the result in $\eqref{equn:ELBOJensenDeriv}$ can be practically used. To illuminate this, we'll rearrange the inequality a little more.
$$\begin{align}
\mathbb{E}_{q}\left[\log\frac{p(y, \theta)}{q(\theta)}\right]&=\mathbb{E}_q[\log p(y, \theta) - \log q(\theta)] \nonumber \\
& = \mathbb{E}_q[\log p(y\mid \theta) + \log p_{0}(\theta) - \log q(\theta)]\nonumber \\
& = \mathbb{E}_q[\log p(y\mid \theta)] + \mathbb{E}_q\left[\log \frac{p_{0}(\theta)}{q(\theta)} \right]\nonumber \\
& = \mathbb{E}_q[\log p(y \mid \theta)] - \operatorname{KL}(q(\theta) \Vert p_{0}(\theta)) \nonumber \\
& = \underbrace{\operatorname{ELBO}(q)}_{\text{By \ref{equn:ELBOKLDeriv}}} \ . \nonumber
\end{align}$$
With a little rearranging, we can now close the loop and connect the ELBO derivation using Jensen's inequality with the earlier derivation in terms of the KL divergence definition.
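Finally, to show how the bound is used in practice, here is a minimal sketch of VI on the same conjugate model as in the earlier numerical check: $\mathcal{Q}$ is taken to be univariate Gaussians, a Monte Carlo estimate of the ELBO is maximised over $\lambda = (m, \log s)$ using a reparameterised sample and scipy's Nelder-Mead optimiser, and the result is compared with the exact posterior. The model, variational family, and optimiser are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(4)
sigma, tau = 1.0, 2.0
y = rng.normal(loc=1.5, scale=sigma, size=20)

# Fixed base samples: theta = m + s * eps gives a Monte Carlo estimate of the
# ELBO that is deterministic in lambda, which keeps the optimiser well behaved.
eps = rng.normal(size=5_000)

def negative_elbo(lam):
    m, log_s = lam
    s = np.exp(log_s)
    theta = m + s * eps                                   # theta ~ q_lambda
    log_lik = stats.norm.logpdf(y[None, :], loc=theta[:, None], scale=sigma).sum(axis=1)
    log_prior = stats.norm.logpdf(theta, 0.0, tau)
    log_q = stats.norm.logpdf(theta, m, s)
    return -np.mean(log_lik + log_prior - log_q)          # negative ELBO(q)

res = optimize.minimize(negative_elbo, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
m_opt, s_opt = res.x[0], np.exp(res.x[1])

# Exact posterior for comparison: the optimum should land very close to it.
post_var = 1.0 / (len(y) / sigma**2 + 1.0 / tau**2)
post_mean = post_var * len(y) * y.mean() / sigma**2
print(m_opt, s_opt)
print(post_mean, np.sqrt(post_var))
```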
Further Reading
For a much deeper look at VI, see the excellent review paper by Blei et al. (2017).