Tempering

Bottlenecks to mixing

A situation that arises often is when you’re trying to sample from a distribution $\pi$ that has a few islands of high mass, separated by a large region (ocean) with very low mass. A simple model example is the Gibbs measure $\pi_\beta \propto e^{-\beta f}$ for a function $f$ that has multiple local minima. To see why MCMC algorithms get stuck near local minima, let’s choose a one-dimensional function $f$ that has two separated local minima.

Example function with multiple local minima.

Choose $\Lambda$ to be a discrete grid of points on $[-1, 1]$. At every point on this grid we propose one of its neighbors (uniformly at random), and then accept/reject according to the Metropolis–Hastings rule to sample from the (un-normalized) probability distribution $e^{-\beta f}$. We start the chain at the local minimum $X_0 = 0.5$. After 10,000 iterations with $\beta = 10$, the figure below shows that the chain hasn’t left the neighborhood of the local minimum, and returns samples that are not yet close to our desired target distribution.


Metropolis–Hastings chain sampling from $e^{-\beta f}$, with $X_0 = 0.5$.
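This experiment is easy to reproduce in a few lines of Python. The text does not specify $f$, so the double well below (with minima at $\pm 0.5$ and a barrier of height $1$ at $0$) is an illustrative stand-in; any function with well-separated minima behaves similarly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative double well (f is not specified in the text):
# minima at +/- 0.5, barrier of height 1 at x = 0.
def f(x):
    return (4 * x**2 - 1)**2

beta = 10.0
h = 0.01          # grid spacing of the lattice Lambda on [-1, 1]
x = 0.5           # start at the local minimum X_0 = 0.5

for _ in range(10_000):
    # propose a uniformly random nearest neighbor; reject proposals off the grid
    y = x + h * rng.choice([-1, 1])
    if abs(y) > 1:
        continue
    # Metropolis--Hastings acceptance for pi proportional to exp(-beta f)
    if rng.random() < min(1.0, np.exp(-beta * (f(y) - f(x)))):
        x = y

print(x)  # typically still close to 0.5: the chain has not crossed the barrier
```

Rerunning with a smaller $\beta$ (say $\beta = 1$) lets the chain cross the barrier and visit both wells.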

Before discussing strategies to improve performance, let us briefly understand why this happens. Let $x_0$ be a local minimum and $h$ be the grid spacing of our lattice $\Lambda$. Suppose now we fix $\beta$ large, and run the Metropolis–Hastings chain that proposes nearest neighbors and accepts/rejects accordingly to sample from $\pi_u \propto e^{-\beta f}$. If the chain is at $x_0$, then \begin{equation} P(x_0, x_0 \pm h) = \frac{1}{2} e^{-\beta( f(x_0 \pm h) - f(x_0) )} \approx \frac{1}{2} e^{-\beta h^2 f^{\prime\prime}(x_0) / 2} \,, \end{equation} using $f'(x_0) = 0$ in the Taylor expansion. When $\beta$ is large, the chain will be slow to leave $x_0$.

Moreover, once it leaves $x_0$, it gets pushed back towards $x_0$. Indeed, for $x$ in a small neighborhood to the left of $x_0$ we have $f'(x) < 0$, and so \begin{equation} P(x, x + h) = \frac{1}{2} \,, \quad\text{and}\quad P(x, x - h) = \frac{e^{-\beta(f(x-h)-f(x))}}{2} \approx \frac{e^{\beta h f'(x)}}{2} < \frac{1}{2} \,. \end{equation} Thus if the chain is near $x_0$, it is extremely unlikely to move left towards $0$, and more likely to move back towards $x_0$. This makes it very unlikely for the chain to leave the neighborhood of $x_0$ and find the other global minimum located at $-x_0$. In this case one can show that the mixing time satisfies \begin{align} \tmix(\epsilon) &\leq \paren[\Big]{ \frac{2\pi}{\beta f^{\prime\prime}(-x_0)} }^{1/2} \paren[\Big]{\frac{e^{\beta(f(0) - f(-x_0))}}{h}} \paren[\Big]{ \frac{1}{2}\abs[\Big]{\ln \paren[\Big]{ \frac{2 \pi \epsilon^2}{\beta f^{\prime\prime}(-x_0) h^2} }} + \beta( f(0) - f(-x_0) ) } \\ &\leq \frac{e^{C\beta(f(0) - f(-x_0))}}{h} \abs{\ln (h \epsilon)} \,. \end{align}
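These transition probabilities can be checked numerically. The double well below is again an illustrative choice (the text does not fix $f$); for a point $x$ to the left of the local minimum $x_0 = 0.5$, a leftward move is uphill and its acceptance probability sits visibly below $1/2$.

```python
import numpy as np

# Illustrative double well with local minimum x0 = 0.5 (f is not fixed in the text)
def f(x):
    return (4 * x**2 - 1)**2

def fprime(x):
    return 16 * x * (4 * x**2 - 1)

beta, h = 10.0, 0.01
x = 0.3   # a point in the well to the left of x0 = 0.5, where f'(x) < 0

# exact probability of a leftward (uphill) move ...
p_left = np.exp(-beta * (f(x - h) - f(x))) / 2
# ... and its first-order Taylor approximation, using f(x-h) - f(x) ~ -h f'(x)
p_left_approx = np.exp(beta * h * fprime(x)) / 2

print(p_left, p_left_approx)  # both about 0.37, strictly below 1/2
```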

The above shows that when $\beta$ is large (or the temperature is small), the Metropolis–Hastings chain may get stuck at local minima for exponentially long periods of time.

Simulated Tempering

The idea behind tempering strategies is to leverage the fact that when $\beta$ is small we have access to Markov chains that mix fast. Say we desire samples from the distribution $\pi_{\beta} \propto e^{-\beta f}$ for some $\beta > 0$. Choose a sequence $\beta_1$, …, $\beta_K$ such that $\beta_K = \beta$, and $\beta_1$ is small enough so that the mixing time of the chain used to sample from $\pi_{\beta_1}$ is small. Now consider the new measure $\pi^*$ on the product space $\set{1, \dots, K} \times \mathcal X$ defined by \begin{equation} \pi^*(k, x) = \frac{1}{K} \pi_{\beta_k}(x) \,. \end{equation}

We sample from this measure using the Metropolis–Hastings algorithm.

Algorithm 1 (Simulated Tempering).

Let $R$ be a proposal kernel (transition matrix) on the set of indices $\set{1, \dots, K}$, and $Q$ be a proposal kernel on $\mathcal X$.

  1. Start with $k_0 \in \set{1, \dots, K}$ and $X_0 \in \mathcal X$ arbitrarily.

  2. Suppose at time $n$ we have $(k_n, X_n) = (k, x)$.

  3. Propose changing $k$ to $k'$ with probability $R(k, k')$. Accept it with probability \begin{equation} A(k, k') = \min\set[\Big]{ 1, \frac{\pi_{\beta_{k'}}(x) R(k', k)}{\pi_{\beta_k}(x) R(k, k')} } \,, \end{equation} and let $\ell$ be the new temperature index (i.e. $\ell = k'$ if you accept, and $\ell = k$ otherwise).

  4. Propose changing the state to $y$ with probability $Q(x, y)$. Accept it with probability \begin{equation} A(x, y) = \min\set[\Big]{ 1, \frac{\pi_{\beta_\ell}(y) Q(y, x)}{\pi_{\beta_{\ell}}(x) Q(x, y)} } \,, \end{equation} and set $(k_{n+1}, X_{n+1}) = (\ell, y)$ if you accept, and $(\ell, x)$ otherwise.

This Markov chain will sample from the distribution $\pi^*$. After $N$ iterations, it produces samples $\set{ (k_n, X_n) }_{n \leq N}$. Extracting all the points $X_n$ for which $k_n = K$ gives samples from $\pi_{\beta_K} = \pi_\beta$.

Remark 2. When proposing a state change, you only need unnormalized densities, as the normalization constant will cancel. When proposing a temperature index change, you need the normalized densities, as you’re evaluating the ratio $\pi_{\beta_{k'}} / \pi_{\beta_k}$ with $k \neq k'$. The normalization constants of $\pi_{\beta_k}$ and $\pi_{\beta_{k'}}$ are different, so they do not cancel in this ratio, and must be known (or estimated) in advance.

Remark 3. In practice, you can choose $R$ to be the transition matrix of the nearest neighbor random walk on $\set{1, \dots, K}$.

When running simulated tempering, the inverse temperature changes randomly; when it becomes small the chain explores the state space fast, and can move from island to island efficiently. When the inverse temperature increases back to $\beta_K$, we get samples from our desired distribution involving points from all islands, leading to faster mixing.
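Here is a minimal sketch of Algorithm 1 on the discrete grid from the earlier example, with a hypothetical double well for $f$ and an illustrative ladder of five inverse temperatures. On a finite grid the normalization constants from Remark 2 can be computed by direct summation; in continuous or high-dimensional settings they would have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):  # hypothetical double well (not specified in the text)
    return (4 * x**2 - 1)**2

h = 0.01
grid = np.arange(-1.0, 1.0 + h / 2, h)        # the lattice Lambda
betas = np.array([0.5, 1.0, 2.0, 5.0, 10.0])  # beta_1 < ... < beta_K = beta
K = len(betas)

# Normalization constants Z_k, summed exactly over the finite grid (Remark 2)
Z = np.array([np.exp(-b * f(grid)).sum() for b in betas])

def pi(k, x):  # normalized density pi_{beta_k}(x)
    return np.exp(-betas[k] * f(x)) / Z[k]

k, x = K - 1, 0.5            # arbitrary start: target temperature, X_0 = 0.5
samples = []
for _ in range(200_000):
    # temperature move: nearest-neighbor proposal on the index set (Remark 3)
    kp = min(max(k + rng.choice([-1, 1]), 0), K - 1)
    if rng.random() < min(1.0, pi(kp, x) / pi(k, x)):
        k = kp
    # state move: nearest-neighbor proposal on the grid
    y = x + h * rng.choice([-1, 1])
    if abs(y) <= 1 and rng.random() < min(1.0, pi(k, y) / pi(k, x)):
        x = y
    if k == K - 1:           # keep only samples at the target temperature
        samples.append(x)

samples = np.array(samples)
print(samples.min(), samples.max())  # both wells visited: min < 0 < max
```

The index proposal is the nearest-neighbor walk of Remark 3, clipped at the endpoints; since it is symmetric, the ratio $R(k', k)/R(k, k')$ in the acceptance probability cancels and is omitted.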

Proposition 4. The distribution of the samples returned by simulated tempering after $N$ iterations converges to $\pi_{\beta}$ as $N \to \infty$.

Proof: Follows directly from properties of the Metropolis–Hastings algorithm, and the convergence theorem.

Parallel Tempering

One can avoid the use of normalization constants by running $K$ chains in parallel. We now consider the new measure $\pi^*$ on the product space $\mathcal X^K$ defined by \begin{equation} \pi^*(x^1, \dots, x^K) = \pi_{\beta_1}(x^1) \cdots \pi_{\beta_K}(x^K) \,. \end{equation} (Here we used super-scripts to denote coordinates so we can reserve the subscript for the time index.) One can sample from this distribution by simply running $K$ independent chains, each designed to sample from $\pi_{\beta_1}$, …, $\pi_{\beta_K}$. Doing this is $K$ times the work, and by itself nullifies any advantage of the chains mixing faster at higher temperatures. The idea behind parallel tempering is to run these chains, but also swap states between them.

Algorithm 5 (Parallel Tempering).

Let $R$ be a proposal kernel (transition matrix) on the set of indices $\set{1, \dots, K}$, and $Q$ be a proposal kernel on $\mathcal X$.

  1. Start with $X_0 = (X_0^1, \dots, X_0^K) \in \mathcal X^K$ arbitrarily.

  2. Suppose at time $n$, $X_n = x = (x^1, \dots, x^K) \in \mathcal X^K$.

  3. For each $k$, run one iteration of a Markov chain (e.g. Metropolis–Hastings) whose stationary distribution is $\pi_{\beta_k}$ starting from $x^k$ to obtain a new point $y^k$. Let $y = (y^1, \dots, y^K) \in \mathcal X^K$.

  4. Choose $(i, j)$ from $\set{1, \dots, K}$ (without replacement) uniformly at random. Define $\tilde y$ by \begin{equation} \tilde y^k = \begin{cases} y^j & \text{ if } k = i\,,\\ y^i & \text{ if } k = j\,,\\ y^k & \text{ otherwise} \end{cases} \end{equation}

  5. Propose $X_{n+1} = \tilde y$ as the new state and accept it with probability \begin{equation} A(y, \tilde y) = \min\set[\Big]{ 1, \frac{\pi_{\beta_{i}}(y^j) \pi_{\beta_j}(y^i)}{\pi_{\beta_{i}}(y^i) \pi_{\beta_j}(y^j)} } \,. \end{equation} Otherwise set $X_{n+1} = y$ (keeping the coordinate-wise moves from step 3, but not the swap).

This Markov chain will sample from the distribution $\pi^*$. After $N$ iterations, it produces samples $\set{ X_N \in \mathcal X^K }$. For any $k$, extracting the $k$-th coordinate gives samples from $\pi_{\beta_k}$. In particular, $\set{X_N^K}$ gives samples from $\pi_{\beta_K} = \pi_\beta$.
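Algorithm 5 can be sketched in a few lines, again with a hypothetical double well and an illustrative temperature ladder. Note that only unnormalized (log-)densities appear: the normalization constants cancel both in the per-replica moves and in the swap ratio.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):  # hypothetical double well (not specified in the text)
    return (4 * x**2 - 1)**2

betas = np.array([0.5, 1.0, 2.0, 5.0, 10.0])  # beta_1 < ... < beta_K = beta
K = len(betas)
h = 0.01

def log_pi(k, x):  # unnormalized log-density of pi_{beta_k}
    return -betas[k] * f(x)

X = np.full(K, 0.5)          # start every replica at the local minimum
samples = []
for _ in range(50_000):
    # step 3: one Metropolis--Hastings move per replica, targeting pi_{beta_k}
    for k in range(K):
        y = X[k] + h * rng.choice([-1, 1])
        if abs(y) <= 1 and rng.random() < np.exp(log_pi(k, y) - log_pi(k, X[k])):
            X[k] = y
    # steps 4-5: propose swapping two replicas chosen uniformly without replacement
    i, j = rng.choice(K, size=2, replace=False)
    log_A = (log_pi(i, X[j]) + log_pi(j, X[i])
             - log_pi(i, X[i]) - log_pi(j, X[j]))
    if rng.random() < np.exp(log_A):
        X[i], X[j] = X[j], X[i]
    samples.append(X[K - 1])  # the K-th coordinate samples pi_{beta_K} = pi_beta

samples = np.array(samples)
```

In practice swaps are often proposed only between adjacent temperatures, which raises the swap acceptance rate.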

Proposition 6. Let $\set{ X_N \in \mathcal X^K }$ be the samples obtained by parallel tempering after $N$ iterations. For every $k \in \set{1, \dots, K}$, $\dist(X_N^k) \to \pi_{\beta_k}$ in total variation as $N \to \infty$.

Proof: Follows directly from properties of the Metropolis–Hastings algorithm, and the convergence theorem.

Numerical example 7.

To illustrate the performance of parallel tempering in a simple setting, look at this notebook. It codes up parallel tempering and compares it to a vanilla Metropolis–Hastings to sample from the bimodal distribution shown in this figure.

You can download the notebook here. The functions implementing the Metropolis–Hastings and parallel tempering algorithms from this notebook have been stripped; if you implement them, the rest should run and show you a comparison between a vanilla Metropolis–Hastings and parallel tempering.

This is just the tip of the iceberg. There are a huge number of algorithms that focus on speeding up convergence of sampling algorithms, and if this interests you, browse through the references listed in the main website.