Markov Chain Monte Carlo

The philosophy behind Markov Chain Monte Carlo (MCMC for short) is to sample from a target distribution by constructing a Markov chain which has the target distribution as its stationary distribution. Simulating the chain for many iterations then produces samples from a distribution that is close to the target distribution. Two questions that come to mind are:

  1. How do we construct the chain?

  2. How fast does the chain converge to the stationary distribution?

One popular construction of the chain uses the Metropolis–Hastings algorithm, which we describe below. The rate of convergence to the stationary distribution is a harder question, which we will return to later.

The Metropolis–Hastings Algorithm

Let $\mathcal X$ be a (possibly very large) state space, and let $\pi_u$ be an unnormalized probability distribution. That is, $\pi_u \colon \mathcal X \to (0, \infty)$ is some function (for instance $\pi_u(x) = e^{-\beta H(x)}$, as we had in the Ising model or the team ranking examples). We normalize $\pi_u$ to obtain a (normalized) probability distribution $\pi$ by defining $Z = \sum_x \pi_u(x)$ and $\pi = \pi_u / Z$. In practice, $Z$ is very hard to compute, and it is desirable to be able to sample from $\pi$ without knowledge of $Z$.

The Metropolis–Hastings algorithm produces a Markov chain with stationary distribution $\pi$, without needing to know the normalization constant $Z$. In order to use the Metropolis–Hastings algorithm, we need a way to sample “nearby” states. This is often called a proposal mechanism. In the language of Markov chains, a proposal mechanism is simply a Markov chain that is easy to simulate. Say this Markov chain has transition matrix $Q$; of course, this transition matrix need not have $\pi$ as its stationary distribution. The Metropolis–Hastings algorithm corrects this chain by introducing the right amount of laziness to make the stationary distribution exactly $\pi$.

For the examples we have seen so far, natural proposal mechanisms are (a code sketch is given after this list):

  1. For the Ising model: Given a spin configuration $\sigma \colon \Lambda \to \set{\pm 1}$, define $Q(\sigma, \tau) = 1/\abs{\Lambda}$ if there exists exactly one $i \in \Lambda$ such that $\sigma(i) \neq \tau(i)$, and $Q(\sigma, \tau) = 0$ otherwise. Given $\sigma$, to randomly choose $\tau$ with probability $Q(\sigma, \tau)$ we pick $i \in \Lambda$ uniformly at random, and set $\tau(j) = \sigma(j)$ if $j \neq i$ and $\tau(i) = - \sigma(i)$.

  2. For the hard disks model, we describe a valid state by $x = (x_1, \dots, x_N)$ with each $x_i \in [\epsilon, 1 - \epsilon]^2$ representing the center of a disk. A valid state is one where $\abs{x_i - x_j} \geq \epsilon$ for all $i \neq j$. In this case the state space is continuous, and so $Q(x, y)$ is a probability density function (and not a probability distribution). Fix $\rho > 0$. Given $x$, let $Q(x, y)$ be the uniform density on all states $y$ such that there exists exactly one $i$ such that $x_i \neq y_i$, and for this $i$ we also have $\abs{x_i - y_i} < \rho$.
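
To make these concrete, here is a minimal sketch of both proposal mechanisms in Python (assuming the Ising configuration is stored as an $L \times L$ NumPy array of $\pm 1$ spins and the disk centers as an $N \times 2$ array; the function names are ours):

```python
import numpy as np

rng = np.random.default_rng()

def propose_spin_flip(sigma):
    """Ising proposal (item 1): flip the spin at a uniformly chosen site, so that
    Q(sigma, tau) = 1/|Lambda| whenever tau differs from sigma at exactly one site."""
    L = sigma.shape[0]
    i, j = rng.integers(0, L, size=2)   # site chosen uniformly from Lambda
    tau = sigma.copy()
    tau[i, j] *= -1                     # flip the chosen spin
    return tau

def propose_disk_move(x, rho):
    """Hard-disk proposal (item 2): pick one disk uniformly at random and displace
    its center by a vector chosen uniformly from the ball of radius rho.
    (The sketch does not restrict the proposal to valid configurations.)"""
    y = x.copy()                        # x is an (N, 2) array of disk centers
    i = rng.integers(x.shape[0])
    while True:                         # rejection-sample a uniform point in the ball
        dz = rng.uniform(-rho, rho, size=2)
        if np.hypot(dz[0], dz[1]) < rho:
            break
    y[i] += dz
    return y
```

Both proposals are symmetric, i.e. $Q(x, y) = Q(y, x)$, which will simplify the acceptance ratio below.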

Now the Metropolis–Hastings algorithm defines a Markov chain as follows (a code sketch is given after these steps):

  1. Given $X_n = x$, propose a new state $y \in \mathcal X$ with probability $Q(x, y)$.

  2. Define the acceptance probability $A(x, y)$ by

    \begin{equation}\label{e:acceptanceRatio}\tag{A} A(x, y) \defeq \min\set[\Big]{ 1, \frac{\pi_u(y) Q(y, x)}{\pi_u(x) Q(x, y)} } \end{equation}

  3. Choose $U_{n+1} \sim \Unif([0, 1])$ independently and define

    \begin{equation} X_{n+1} = \begin{cases} y & \text{ if } U_{n+1} \leq A(x, y) \,, \\ X_n & \text{ otherwise} \,. \end{cases} \end{equation}
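
Here is a minimal sketch of one step of this chain in Python (the names `pi_u`, `propose`, and `q` are ours: `propose(x)` draws from $Q(x, \cdot)$ and `q(x, y)` evaluates $Q(x, y)$; in practice the ratio should be simplified by hand, as discussed in the remarks below):

```python
import numpy as np

rng = np.random.default_rng()

def metropolis_hastings_step(x, pi_u, propose, q):
    """One Metropolis-Hastings update.

    pi_u(x)    - unnormalized target (the constant Z never appears)
    propose(x) - draws a proposal y with probability / density Q(x, y)
    q(x, y)    - evaluates Q(x, y)
    """
    y = propose(x)                                      # step 1: propose y ~ Q(x, .)
    ratio = (pi_u(y) * q(y, x)) / (pi_u(x) * q(x, y))
    A = min(1.0, ratio)                                 # step 2: acceptance probability (A)
    if rng.uniform() <= A:                              # step 3: accept with probability A
        return y
    return x                                            # otherwise stay put
```

For a symmetric proposal (such as the single spin flip above) the factors $Q(y, x)$ and $Q(x, y)$ cancel, and the ratio reduces to $\pi_u(y) / \pi_u(x)$.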

Proposition 1.

The Metropolis chain defined above is a reversible chain with stationary distribution $\pi$.

We will prove Proposition 1 shortly. Before we do, here are a few remarks:

  1. When implementing the Metropolis–Hastings algorithm, computing $A$ directly from \eqref{e:acceptanceRatio} usually leads to catastrophic floating point errors (overflow or underflow). It is almost always better to simplify the expression with pencil and paper first. If $y$ is obtained by making a small change to $x$, then there will typically be a lot of cancellation in $\pi_u(y) / \pi_u(x)$. You may also want to work with the logarithm of $A$ instead of $A$ (see the Ising sketch following this list).

  2. Even though $\pi_u$ appears in the denominator, the algorithm can still be used if $\pi_u$ is allowed to take on the value $0$. In this case we need to ensure that we start the chain at a “valid state” (i.e. a state where $\pi_u > 0$). Now the chain will never move to “invalid states” (i.e. where $\pi_u = 0$), as proposals to these states will keep getting rejected.

  3. Even though $Q(x, y)$ appears in the denominator, the algorithm can be used even if $Q(x, y)$ is allowed to take on the value $0$ for some values of $y$. Indeed, if $Q(x, y) = 0$ then the state $y$ is never proposed, and you never need to compute the acceptance ratio $A(x, y)$.

  4. Irreducibility and aperiodicity can usually be arranged easily; though many times the Metropolis–Hastings algorithm is used without checking the required conditions (and sometimes even when they don’t hold).

  5. In the continuous state space setting, one needs additional conditions to ensure convergence. Many treatments don’t even state these conditions, and checking them is usually hard.

  6. The most important practical consideration is the rate of convergence. Even though $\dist(X_n) \to \pi$ as $n \to \infty$, the convergence rate may be extremely slow. You see this in practice if your chain gets stuck – new proposals keep getting rejected.
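
To illustrate the first remark, for the Ising model with the single spin flip proposal the ratio $\pi_u(\tau) / \pi_u(\sigma) = e^{-\beta (H(\tau) - H(\sigma))}$ depends only on the spins neighbouring the flipped site, so one never evaluates $H$ (let alone $e^{-\beta H}$) on the whole configuration. A sketch, assuming nearest-neighbour interactions with $H(\sigma) = -\sum_{i \sim j} \sigma(i) \sigma(j)$ and periodic boundary conditions:

```python
import math
import numpy as np

rng = np.random.default_rng()

def ising_metropolis_step(sigma, beta):
    """One single-spin-flip Metropolis update for the Ising model.

    Rather than forming pi_u(tau)/pi_u(sigma) = exp(-beta*(H(tau) - H(sigma)))
    naively, compute the local energy difference dH directly: it only involves
    the 4 neighbours of the flipped site, and avoids overflow/underflow."""
    L = sigma.shape[0]
    i, j = rng.integers(0, L, size=2)
    nb = (sigma[(i + 1) % L, j] + sigma[(i - 1) % L, j]
          + sigma[i, (j + 1) % L] + sigma[i, (j - 1) % L])
    dH = 2 * sigma[i, j] * nb        # H(tau) - H(sigma) when flipping site (i, j)
    # Q is symmetric, so A = min(1, exp(-beta*dH)); accept without exponentiating when dH <= 0
    if dH <= 0 or rng.uniform() < math.exp(-beta * dH):
        sigma[i, j] *= -1
    return sigma
```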

Remark 2. (Proposal choice) In practice, one wants to choose $Q$ to maximize the rate of convergence. Proposing nearby states has the advantage that proposals get accepted more often (this is called “exploitation”). On the other hand, proposing nearby states has the disadvantage that the chain explores the space very slowly. There is usually a tradeoff between exploration and exploitation, and one optimizes $Q$ in a problem-specific manner.

Remark 3. Historically, the Metropolis algorithm was introduced to uniformly sample configurations of non-overlapping disks in a box as described here. The algorithm is taken from the original article by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller, which is where the Metropolis–Hastings algorithm gets its name from. Hastings (in 1970) adjusted it to be able to sample from non-uniform distributions (with non-symmetric proposals) as we stated above.

Reversible chains

Definition 4. A Markov chain is said to be reversible if its transition matrix $P$ and stationary distribution $\pi$ satisfy the detailed balance condition \begin{equation}\label{e:detailedBalance}\tag{DB} \pi(x) P(x, y) = \pi(y) P(y, x) \quad \text{for all } x, y \in \mathcal X\,. \end{equation}

The name reversible stems from the fact that (when $\pi > 0$) a chain is reversible if and only if the transition matrix of the time reversal (defined below) is the same as the transition matrix of the chain itself.

Definition 5 (Time reversal). Let $(X_n)$ be a Markov chain with transition matrix $P$ and stationary distribution $\pi >0$. The time reversal is a Markov chain $\hat X$ such that for every $x_0, x_1 \in \mathcal X$ we have \begin{equation} \P^\pi( X_0 = x_0, X_1 = x_1) = \P^\pi( \hat X_0 = x_1, \hat X_1 = x_0 ) \,. \end{equation}

Proposition 6. The transition matrix of the time reversal, denoted by $\hat P$, is given by \begin{equation} \hat P(x_1, x_0) = \frac{\pi(x_0) P(x_0, x_1)}{\pi(x_1)} \,. \end{equation}

Remark 7. Notice $\hat P$ is a valid transition matrix. Indeed, using the stationarity of $\pi$ in the last step, \begin{equation} \sum_{x \in \mathcal X} \hat P(y, x) = \sum_{x \in \mathcal X} \frac{\pi(x) P(x, y)}{\pi(y)} = \frac{\pi(y)}{\pi(y)} = 1 \,. \end{equation}

Remark 8. When $\pi > 0$, the transition matrix $P$ is reversible if and only if it coincides with the transition matrix of the time reversal (i.e. $P = \hat P$).
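
Proposition 6 and Remark 8 are easy to check numerically. A small sketch (the helper name is ours), using a two-state chain that happens to be reversible:

```python
import numpy as np

def time_reversal(P, pi):
    """Time reversal of Proposition 6: P_hat(y, x) = pi(x) P(x, y) / pi(y)."""
    return (P.T * pi) / pi[:, None]

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])                    # stationary distribution of P

print(np.allclose(pi @ P, pi))                   # True: pi is stationary
print(np.allclose(time_reversal(P, pi), P))      # True: P = P_hat, so the chain is reversible
```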

Proposition 9. Suppose $\mu$ is any distribution that satisfies the detailed balance condition \begin{equation} \mu(x) P(x, y) = \mu(y) P(y, x)\,. \end{equation} Then $\mu$ must be a stationary distribution.

Proof. By detailed balance, \begin{equation} \sum_{x \in \mathcal X} \mu(x) P(x, y) = \sum_{x \in \mathcal X} \mu(y) P(y, x) = \mu(y) \sum_{x \in \mathcal X} P(y, x) = \mu(y) \,, \end{equation} which says exactly that $\mu P = \mu$, i.e. $\mu$ is stationary.

Remark 10. The converse is false. If a Markov chain has stationary distribution $\pi$, then it need not satisfy the detailed balance condition \eqref{e:detailedBalance}. One example of this is a cycle on $3$ states where every point moves one step clockwise with probability $1$: the uniform distribution is stationary, but $\pi(x) P(x, y) = 1/3 \neq 0 = \pi(y) P(y, x)$ whenever $y$ is the clockwise neighbour of $x$.
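
This example is also a quick sanity check in code: with the clockwise rotation matrix below, the uniform distribution is stationary, but the matrix with entries $\pi(x) P(x, y)$ is not symmetric, so detailed balance fails.

```python
import numpy as np

# clockwise rotation on 3 states: from state i move to state i+1 (mod 3) with probability 1
P = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
pi = np.ones(3) / 3                              # uniform distribution

M = pi[:, None] * P                              # M[x, y] = pi(x) P(x, y)
print(np.allclose(pi @ P, pi))                   # True:  uniform is stationary
print(np.allclose(M, M.T))                       # False: detailed balance fails
```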

Remark 11. If a Markov chain satisfies the detailed balance condition with the uniform distribution, then the transition matrix is symmetric and hence doubly stochastic (i.e. both row sums and column sums are $1$). There are, of course, many doubly stochastic matrices that are not symmetric.

Reversibility of the Metropolis–Hastings Chain

Proof of Proposition 1. The transition matrix of the Metropolis chain is given by \begin{equation} P(x, y) = \begin{cases} Q(x, y) A(x, y) & y \neq x \,, \\ 1 - \sum_{y' \neq x} Q(x, y') A(x, y') & y = x\,. \end{cases} \end{equation} Pick $x, y \in \mathcal X$ and suppose first that $y \neq x$ and $\pi(x) Q(x, y) \geq \pi(y) Q(y, x)$. In this case \begin{equation} A(x, y) = \frac{\pi(y) Q(y, x) }{\pi(x) Q(x, y)} \qquad\text{and}\qquad A(y, x) = 1\,, \end{equation} and so \begin{align*} \pi(x) P(x, y) &= \pi(x) Q(x, y) A(x, y) = \pi(y) Q(y, x) \\ &= \pi(y) Q(y, x) A(y, x) = \pi(y) P(y, x)\,, \end{align*} and so the detailed balance condition \eqref{e:detailedBalance} holds in this case. By symmetry, when \begin{equation} \pi(x) Q(x, y) \leq \pi(y) Q(y, x) \end{equation} we also have \eqref{e:detailedBalance}. When $y = x$, the detailed balance condition \eqref{e:detailedBalance} is trivially true. Thus the chain is reversible, and by Proposition 9, $\pi$ is a stationary distribution for the Metropolis–Hastings chain.

Remark 12. There are other choices of the acceptance ratio for which the stationary distribution is $\pi$. One choice (Barker ‘65) is \begin{equation} A(x, y) = \frac{1}{1 + \frac{\pi(x) Q(x, y)}{\pi(y) Q(y, x)} } = \frac{1}{1 + \frac{\pi_u(x) Q(x, y)}{\pi_u(y) Q(y, x)} } \end{equation} The advantage of the traditional choice \eqref{e:acceptanceRatio} is that it maximizes $A(x, y) + A(y, x)$, which in turn minimizes the asymptotic variance (Peskun ’73).
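
For comparison, both acceptance rules can be written as functions of the same ratio $r = \pi_u(y) Q(y, x) / (\pi_u(x) Q(x, y))$; a small sketch (the function names are ours):

```python
def metropolis_acceptance(r):
    """Traditional choice (A): A(x, y) = min(1, r)."""
    return min(1.0, r)

def barker_acceptance(r):
    """Barker's rule: A(x, y) = r / (1 + r), strictly smaller than min(1, r) for r > 0."""
    return r / (1.0 + r)
```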

Remark 13. When applying the Metropolis–Hastings algorithm, you want to choose a proposal mechanism that ensures the chain is irreducible and aperiodic. If these two conditions are satisfied, then Proposition 1 and the convergence theorem imply that $\dist(X_n) \to \pi$ as $n \to \infty$. This means that if you run the chain long enough, you’ve generated points that are (approximately) sampled according to the desired target distribution $\pi$.

The practical issue is that the convergence may happen very slowly, and you may have to wait for a very long time before $\dist(X_n)$ is close enough to the stationary distribution. In general you want to choose your proposal mechanism in a manner that improves the rate of convergence of $\dist(X_n)$ to the stationary distribution. There’s no silver bullet. Coming up with these mechanisms is usually problem specific, and estimating the rate of convergence for a given proposal mechanism is not easy.

Glauber Dynamics (Gibbs Sampling)

A situation that arises often is when the state space $\mathcal X$ is of the form $\mathcal X = \set{ x \colon V \to S}$ for some finite sets $S, V$. (A typical example is the Ising model, where $S = \set{\pm 1}$, and $V$ is the set of all vertices in $\Lambda$.) Suppose now we have an unnormalized probability distribution $\pi_u$ (which we recall is simply a function from $\mathcal X$ to $[0, \infty)$). Usually $\pi_u$ is easy to compute, but converting it to a probability distribution requires computing the normalization constant $Z = \sum_{x \in \mathcal X} \pi_u(x)$, which is computationally infeasible.

The Glauber dynamics defines a Markov chain that can be used to sample from $\pi$ (without knowing $Z$), provided the conditional distribution of $\pi$ at a single vertex, given the values at all the other vertices, is easy to sample from.

Explicitly, let $x \in \mathcal X$, $v \in V$ and define $\mathcal X(x, v) = \set{ y \in \mathcal X \st y(w) = x(w) ~\forall w \neq v}$. The Glauber dynamics is a Markov chain with the following transition matrix: \begin{equation} P(x, y) = \begin{cases} \displaystyle \frac{\pi_u(y)}{\abs{V} \pi_u(\mathcal X(x, v))} & \text{if there exists } v \in V \text{ such that } y \in \mathcal X(x, v) \,, \\ 0 & \text{otherwise} \,. \end{cases} \end{equation} Here $\pi_u( \mathcal X(x, v) ) = \sum_{z \in \mathcal X(x, v)} \pi_u(z)$. You can simulate the Glauber dynamics by picking $v \in V$ uniformly at random, and then changing the value of $x(v)$ to a value in $S$ chosen according to the distribution $\pi$ conditioned on $\mathcal X(x, v)$.

Note this doesn’t require knowledge of the normalization constant $Z$. If $S$ is a small set (as in the Ising model), or the distribution $\pi$ conditioned on $\mathcal X(x, v)$ is otherwise easy to sample from, then the Glauber dynamics is easy to simulate.
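
For the Ising model, the distribution of a single spin conditioned on all the others has an explicit formula, so one Glauber update is just a biased coin flip. A sketch, assuming as before that $H(\sigma) = -\sum_{i \sim j} \sigma(i) \sigma(j)$ with periodic boundary conditions:

```python
import numpy as np

rng = np.random.default_rng()

def glauber_step(sigma, beta):
    """One Glauber (heat-bath) update for the Ising model.

    Pick a site uniformly at random and resample its spin from the conditional
    distribution of pi given the other spins:
        P(spin = +1 | rest) = 1 / (1 + exp(-2 * beta * S)),
    where S is the sum of the neighbouring spins.  The constant Z never appears."""
    L = sigma.shape[0]
    i, j = rng.integers(0, L, size=2)
    S = (sigma[(i + 1) % L, j] + sigma[(i - 1) % L, j]
         + sigma[i, (j + 1) % L] + sigma[i, (j - 1) % L])
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * S))   # conditional probability of +1
    sigma[i, j] = 1 if rng.uniform() < p_plus else -1
    return sigma
```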

Proposition 14.

The Glauber dynamics defines an irreducible, aperiodic, reversible chain with stationary distribution $\pi$.

This chain is also often used in the continuous setting. Suppose $\pi$ is a probability distribution on $\R^d$, where $d$ is large (say $d \approx 100$). Suppose further that for every $i \in \set{1, \dots, d}$, the conditional distribution of the $i$-th coordinate given all the others is easy to sample from. (This happens, for instance, when $\pi = \pi_u / Z$, and there’s an easy to compute formula for $\pi_u$.) Now the Glauber dynamics (aka the Gibbs sampler) would run as follows (a code sketch is given after these steps):

  1. Suppose $X_n = x = (x_1, \dots, x_d) \in \R^d$.

  2. Choose $i \in \set{1, \dots, d}$ uniformly at random.

  3. Let $p$ be the one-dimensional distribution of the $i$-th coordinate conditioned on all the other coordinates. That is, $p$ is a probability distribution whose density is proportional to \begin{equation} p_u(t) = \pi_u(x_1, \dots, x_{i-1}, t, x_{i+1}, \dots, x_d) \,. \end{equation} Sample $t \in \R$ from the distribution $p$, and then replace the $i$-th coordinate of $x$ with $t$ (i.e. set $X_{n+1} = (x_1, \dots, x_{i-1}, t, x_{i+1}, \dots, x_d)$).
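
The steps above translate into a short generic sketch (assuming $x$ is stored as a NumPy array and assuming a user-supplied routine `sample_conditional(i, x)` that draws from the one-dimensional distribution with unnormalized density $t \mapsto \pi_u(x_1, \dots, x_{i-1}, t, x_{i+1}, \dots, x_d)$; both names are ours):

```python
import numpy as np

rng = np.random.default_rng()

def gibbs_step(x, sample_conditional):
    """One random-scan Gibbs update: resample one uniformly chosen coordinate
    from its conditional distribution given all the other coordinates."""
    x = x.copy()
    i = rng.integers(len(x))              # step 2: choose a coordinate uniformly
    x[i] = sample_conditional(i, x)       # step 3: draw t ~ p and replace x_i
    return x
```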

Example 15. Consider the hard disks in a box example where we want to randomly pack $N$ non-overlapping disks into the unit square. A valid arrangement can be represented as a vector $x \in \R^{2N}$ using the disk centers as coordinates. The Gibbs sampler samples from all valid configurations using the following Markov chain. Start with an arbitrary valid arrangement. Pick one of the disks uniformly at random, and then pick either the vertical or the horizontal coordinate of that disk center uniformly at random. If everything else is held fixed, there is a finite union of intervals $S$ such that setting the chosen coordinate to any $t \in S$ still yields a valid configuration. The set $S$ is easy (but not immediate) to compute. The Gibbs sampler is the Markov chain that chooses $t$ uniformly at random from $S$ and replaces the chosen coordinate with $t$ (a code sketch is given below).
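
A sketch of one such update, using the conventions of the hard-disk proposal above (centers in $[\epsilon, 1-\epsilon]^2$, a configuration being valid when $\abs{x_i - x_j} \geq \epsilon$ for all $i \neq j$): the forbidden values for the chosen coordinate form a union of open intervals, and here we draw from their complement $S$ by rejection, which produces the same uniform distribution on $S$.

```python
import numpy as np

rng = np.random.default_rng()

def gibbs_hard_disk_step(centers, eps):
    """One Gibbs update for the hard-disk model: pick a disk and an axis uniformly
    at random, then resample that coordinate uniformly from the set S of values
    that keep the configuration valid (all other coordinates held fixed)."""
    N = centers.shape[0]
    k = rng.integers(N)                      # which disk to move
    axis = rng.integers(2)                   # 0 = horizontal, 1 = vertical
    other = 1 - axis

    # each other disk j forbids an open interval of values for the chosen coordinate
    forbidden = []
    for j in range(N):
        if j == k:
            continue
        gap2 = eps ** 2 - (centers[k, other] - centers[j, other]) ** 2
        if gap2 > 0:
            half = np.sqrt(gap2)
            forbidden.append((centers[j, axis] - half, centers[j, axis] + half))

    # S = [eps, 1 - eps] minus the forbidden intervals; rejection sampling from the
    # box gives the uniform distribution on S (S contains the current coordinate
    # value, and generically has positive length, so the loop terminates)
    while True:
        t = rng.uniform(eps, 1 - eps)
        if all(not (a < t < b) for (a, b) in forbidden):
            centers[k, axis] = t
            return centers
```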