TV Convergence
We will now assume we are working with an irreducible and aperiodic (but not necessarily reversible) Markov chain $(X_n)$ on a finite state space $\mathcal X$, with transition matrix $P$ and stationary distribution $\pi$.
Mixing Times
Definition 1. For every $x \in \mathcal X$, let $\delta_x$ be the Dirac delta distribution at $x$ (i.e. $\delta_x(y) = 1$ if $y = x$ and $0$ otherwise). Define \begin{equation} d(n) \defeq \max_{x \in \mathcal X} \norm{\delta_x P^n - \pi }_\TV \defeq \frac{1}{2} \max_{x \in \mathcal X} \sum_{y \in \mathcal X} \abs{P^n(x, y) - \pi(y)} \,. \end{equation}
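To make the definition concrete, here is a minimal numerical sketch (not part of the notes) that evaluates $d(n)$ for a small chain; the transition matrix and the function name are illustrative assumptions.
\begin{verbatim}
import numpy as np

def worst_case_tv(P, pi, n):
    """d(n) = max_x (1/2) * sum_y |P^n(x, y) - pi(y)|."""
    Pn = np.linalg.matrix_power(P, n)
    return 0.5 * np.max(np.abs(Pn - pi).sum(axis=1))

# Illustrative 3-state chain (entries chosen arbitrarily).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: normalized left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

for n in (1, 5, 10, 20):
    print(n, worst_case_tv(P, pi, n))
\end{verbatim}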
Proposition 2. For any initial distribution of $X_0$, we have \begin{equation} \norm{\dist(X_n) - \pi}_\TV \leq d(n)\,. \end{equation}
Proof: Direct
Thus the rate of convergence to $\pi$ for an arbitrary initial distribution is controlled by $d(n)$. Clearly $d(n) \leq 1$. The convergence theorem shows that if the chain is irreducible and aperiodic $d(n) \to 0$ as $n \to \infty$. We claim that $d$ is decreasing as a function of $n$; moreover, once $d(n) < 1/2$ it decreases exponentially.
Proposition 3. The function $d$ is decreasing.
Proof: Direct
Definition 4. Let $\epsilon \in (0, 1/2)$. The mixing time, denoted by $\tmix(\epsilon)$, is defined by \begin{equation} \tmix(\epsilon) = \min \set{ n \in \N \st ~d(n) < \epsilon } \,. \end{equation}
Proposition 5. For any $\epsilon \in (0, 1/2)$, we have \begin{equation} d(n) \leq (2\epsilon)^{\floor{n / \tmix(\epsilon)}} \end{equation} Consequently, for any $n \in \N$ and any initial distribution $\mu$ we have \begin{equation} \norm{\dist(X_{n}) - \pi}_\TV \leq (2\epsilon)^{\floor{n / \tmix(\epsilon)}} \,. \end{equation}
Proof sketch: Let $N = \tmix(\epsilon)$, show \begin{equation} \norm{\mu P^N - \pi}_\TV = \norm{(\mu - \pi) P^N}_\TV \leq (2\epsilon) \norm{\mu - \pi}_\TV \,, \end{equation} and iterate.
Proposition 6. If $0 < \delta < \epsilon < 1/2$, then \begin{equation} \tmix( \delta) \leq \tmix(\epsilon) \log_{2\epsilon} \delta \,. \end{equation}
Proposition 7. Suppose there exists $N \in \N$, and $\delta > 0$ and a probability measure $\gamma$ such that for every $x, y \in \mathcal X$ we have \begin{equation} P^N(x, y) \geq \delta \gamma(y) \,. \end{equation} Then \begin{equation} \tmix(\epsilon) \leq \frac{N \ln \epsilon}{\ln( 1 - \delta )} \end{equation}
Proof: Follows from the Harris lemma.
Corollary 8. For the PageRank chain, we have \begin{equation} \tmix(\epsilon) \leq \log_\alpha \epsilon \,. \end{equation}
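As an illustration (not from the notes), here is a sketch that builds a PageRank matrix of the form $P = \alpha Q + (1 - \alpha)\gamma$ for a small graph, checks the minorization hypothesis of Proposition 7 with $N = 1$ and $\delta = 1 - \alpha$, and evaluates the resulting bound. The graph, the damping factor $\alpha = 0.85$, and the uniform teleportation distribution are assumptions made for the example.
\begin{verbatim}
import numpy as np

# Hypothetical directed graph with an outgoing edge from every vertex.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
Q = A / A.sum(axis=1, keepdims=True)     # simple random walk on the graph
alpha = 0.85                             # damping factor (assumed value)
n = A.shape[0]
gamma = np.full(n, 1.0 / n)              # uniform teleportation distribution
P = alpha * Q + (1 - alpha) * gamma      # PageRank transition matrix

# Minorization: P(x, y) >= delta * gamma(y) with delta = 1 - alpha, N = 1.
delta = 1 - alpha
assert np.all(P >= delta * gamma - 1e-12)

eps = 0.01
bound = np.log(eps) / np.log(1 - delta)  # N ln(eps) / ln(1 - delta) = log_alpha(eps)
print(f"t_mix({eps}) <= {bound:.1f}")
\end{verbatim}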
Remark 9. In general, estimating mixing times is hard and is the subject of many advanced books (e.g. LP17). One can show (see LP17, Chapter 15) that the mixing time of the Glauber dynamics for the Ising model on a graph with $N$ vertices and maximal degree $\Delta$ satisfies \begin{equation} \tmix(\epsilon) \leq \ceil[\Big]{ \frac{N \paren{\ln N + \abs{\ln \epsilon}}}{1 - \Delta \tanh(\beta)} } \,, \end{equation} provided $\Delta \tanh(\beta) < 1$ (the high-temperature regime).
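For concreteness, here is a minimal sketch of a single Glauber (heat-bath) update for the Ising model, using the standard conditional probability $\P(\sigma_v = +1 \mid \sigma_{-v}) = e^{\beta S}/(e^{\beta S} + e^{-\beta S})$, where $S$ is the sum of the neighboring spins. The graph, the value of $\beta$, and the function names are illustrative assumptions, not taken from the notes.
\begin{verbatim}
import numpy as np

def glauber_step(sigma, neighbors, beta, rng):
    """One heat-bath update: resample a uniformly chosen spin from its
    conditional distribution given the neighboring spins."""
    v = rng.integers(len(sigma))                # vertex chosen uniformly at random
    S = sum(sigma[w] for w in neighbors[v])     # local field at v
    p_plus = np.exp(beta * S) / (np.exp(beta * S) + np.exp(-beta * S))
    sigma[v] = 1 if rng.random() < p_plus else -1

# Illustrative run on a 4-cycle at inverse temperature beta = 0.3.
rng = np.random.default_rng(0)
neighbors = [[1, 3], [0, 2], [1, 3], [0, 2]]
sigma = rng.choice([-1, 1], size=4)
for _ in range(1000):
    glauber_step(sigma, neighbors, beta=0.3, rng=rng)
print(sigma)
\end{verbatim}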
Relaxation time
Theorem 10. For a reversible chain, \begin{equation} (\trel - 1) \ln \paren[\Big]{\frac{1}{2\epsilon}} \leq \tmix(\epsilon) \leq \trel \ln\paren[\Big]{\frac{1}{\epsilon \sqrt{2 \pi_\min}}} \,, \end{equation} where $\trel \defeq 1/\gamma_*$ is the relaxation time, $\gamma_*$ is the absolute spectral gap, and $\pi_\min \defeq \min_{x \in \mathcal X} \pi(x)$.
Proof sketch: The upper bound follows immediately from the TV convergence theorem. For the lower bound choose an eigenvalue $\lambda$ and an eigenfunction $\varphi$ so that $1 - \abs{\lambda} = \gamma_*$ and $P\varphi = \lambda \varphi$. Since $\lambda \neq 1$, $\varphi$ is orthogonal to the constants in $\ell^2(\pi)$, and so $\sum_y \pi(y) \varphi(y) = 0$. Hence for every $x$ we have \begin{equation} \abs{\lambda^n \varphi(x)} = \abs{P^n \varphi(x)} = \abs[\Big]{\sum_{y \in \mathcal X} \paren[\big]{P^n(x, y) - \pi(y)} \varphi(y)} \leq 2 \paren[\big]{ \max_y \abs{\varphi(y)} } \norm{P^n(x, \cdot) - \pi(\cdot)}_\TV \,. \end{equation} Choose $x$ so that $\abs{\varphi(x)} = \max_y \abs{\varphi(y)}$, set $n = \tmix(\epsilon)$, and solve.
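Numerically, both sides of the bound are easy to evaluate once the spectrum of $P$ is known. The sketch below (an illustration, not part of the notes) computes $\gamma_*$, $\trel$, and the two bounds for a small reversible chain; the symmetric matrix (hence reversible, with uniform $\pi$) is an arbitrary example.
\begin{verbatim}
import numpy as np

# A symmetric transition matrix: reversible with uniform stationary distribution.
P = np.array([[0.75, 0.25, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.25, 0.75]])
pi = np.full(3, 1/3)

# Absolute spectral gap: 1 minus the second-largest eigenvalue in absolute value.
eigs = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
gamma_star = 1.0 - eigs[1]
t_rel = 1.0 / gamma_star

eps = 0.01
lower = (t_rel - 1) * np.log(1 / (2 * eps))
upper = t_rel * np.log(1 / (eps * np.sqrt(2 * pi.min())))
print(f"{lower:.2f} <= t_mix({eps}) <= {upper:.2f}")
\end{verbatim}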
Example 11. For the lazy random walk on the cycle $\set{0, \dots, N-1}$ we know \begin{equation} \gamma_* = \frac{1}{2}\paren[\Big]{ 1 - \cos\paren[\Big]{ \frac{2\pi}{N}} } = \frac{\pi^2}{N^2} + O\paren[\Big]{\frac{1}{N^4}} \,, \end{equation} and so, up to lower order terms, \begin{equation} \tmix(\epsilon) \leq \frac{N^2}{2\pi^2} \ln \paren[\Big]{\frac{N}{2 \epsilon^2}} \,. \end{equation} It turns out this bound is a little suboptimal: one can show that in this case $\tmix \leq N^2$, which is better by a $\ln N$ factor.
Example 12. For the Glauber chain on the binary hypercube $\set{\pm 1}^d$, we know $\gamma_* = 1/d$, and hence \begin{equation} \tmix(\epsilon) \leq \frac{d}{2} \ln \paren[\Big]{\frac{2^d}{2 \epsilon^2 }} = \frac{d^2 \ln 2}{2} + \frac{d}{2} \ln \paren[\Big]{\frac{1}{2 \epsilon^2 }} \,. \end{equation} This is also suboptimal, and we will later see \begin{equation} \tmix(\epsilon) \leq d \ln \paren[\Big]{\frac{d}{\epsilon}} \,. \end{equation}
Even if the chain is not reversible, it is often useful to study mixing in $\ell^2$. For this, define \begin{equation} d^{(2)}(n) = \max_{x \in \mathcal X} \norm[\Big]{ \frac{\delta_x P^n}{\pi} - 1 }_{\ell^2(\pi)} = \max_{x \in \mathcal X} \paren[\bigg]{ \sum_{y \in \mathcal X} \paren[\Big]{\frac{P^n(x, y)}{\pi(y)} - 1 }^2 \pi(y) }^{1/2} \,. \end{equation}
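The following short sketch (illustrative, not from the notes) evaluates $d^{(2)}(n)$ directly from the definition for a small chain.
\begin{verbatim}
import numpy as np

def d2(P, pi, n):
    """max_x ( sum_y (P^n(x,y)/pi(y) - 1)^2 * pi(y) )^(1/2)."""
    Pn = np.linalg.matrix_power(P, n)
    dev = Pn / pi - 1.0                    # (x, y) entry: P^n(x,y)/pi(y) - 1
    return np.max(np.sqrt((dev**2 * pi).sum(axis=1)))

# Two-state example with pi = (1/2, 1/2); here d2(n) decays geometrically.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print([round(d2(P, pi, n), 4) for n in (1, 2, 4, 8)])
\end{verbatim}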
Proposition 13. The function $d^{(2)}$ is sub-multiplicative. That is, $d^{(2)}(m + n) \leq d^{(2)}(m) d^{(2)}(n)$.
Proof sketch: Using stationarity ($\sum_y \pi(y) P^n(y, z) = \pi(z)$) one first checks \begin{equation} \frac{P^{m+n}(x, z)}{\pi(z)} - 1 = \sum_{y \in \mathcal X} \pi(y) \paren[\bigg]{\frac{P^m(x, y)}{\pi(y)} - 1} \paren[\bigg]{\frac{P^n(y, z)}{\pi(z)} - 1} \,, \end{equation} and hence, by Jensen's inequality, \begin{equation} \abs[\bigg]{\frac{P^{m + n}(x, z)}{\pi(z)} - 1}^2 \leq \sum_{y \in \mathcal X} \paren[\bigg]{ \frac{P^m(x, y)}{\pi(y)} - 1 }^2 \paren[\bigg]{ \frac{P^n(y, z)}{\pi(z)} - 1 }^2 \pi(y) \,. \end{equation} Now multiply by $\pi(z)$ and sum over $z$.
Definition 14. Let $\epsilon > 0$. The $\ell^2$ mixing time $\tmix^{(2)}(\epsilon)$ is defined by \begin{equation} \tmix^{(2)}(\epsilon) = \min \set{n \in \N \st d^{(2)}(n) < \epsilon} \,. \end{equation}
Note $d^{(2)}(1)$ can be quite large (on the order of $1 / \sqrt{\pi_\min}$). So for small $m, n$ sub-multiplicativity of $d^{(2)}$ tells us nothing. However, once $n \geq \tmix^{(2)}(\epsilon)$ for some $\epsilon < 1$, $d^{(2)}$ must decrease exponentially, in the sense that \begin{equation} d^{(2)}(k \tmix^{(2)}(\epsilon)) \leq \epsilon^k \,. \end{equation}
Time averaging
Often we are interested in using samples from $\pi$ to compute the average of an observable (e.g. the expected magnetization in the Ising model). Let $\pi$ be a distribution we are interested in sampling from, $f \colon \mathcal X \to \R$ be a function (called an observable), and suppose we are interested in computing \begin{equation} \mathcal I_\pi(f) \defeq \sum_{x \in \mathcal X} f(x) \pi(x) \,, \end{equation} to some degree of accuracy.
One strategy that can be used is as follows. Let $(X_n)$ be a Markov chain with stationary distribution $\pi$ with $X_0 \sim \mu_0$ where $\mu_0$ is a distribution that is easy to sample from. Choose $M$ large, and run $M$ independent copies of this chain $(X^1_n)$, …, $(X^M_n)$. By the law of large numbers, for every $n \in \N$ \begin{equation} \lim_{M \to \infty} \frac{1}{M} \sum_{i = 1}^M f(X^i_n) = \sum_{x \in \mathcal X} f(x) \mu_n(x) = \mathcal I_{\mu_n}(f) \,, \end{equation} where $\mu_n(x) = \P(X^1_n = x) = \cdots = \P(X^M_n = x)$. When $n \geq \tmix(\epsilon)$, we know that $\norm{\mu_n - \pi}_\TV < \epsilon$. These two observations can be combined to obtain the following quantitative error bound.
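A minimal simulation sketch of this strategy follows; the chain, the observable, and the parameter values below are arbitrary assumptions made for illustration.
\begin{verbatim}
import numpy as np

def independent_copies_estimate(P, mu0, f, n, M, rng):
    """Average f over M independent copies of the chain, all run for n steps."""
    states = np.arange(P.shape[0])
    x = rng.choice(states, size=M, p=mu0)           # X_0^1, ..., X_0^M ~ mu0
    for _ in range(n):                               # advance every copy one step
        x = np.array([rng.choice(states, p=P[xi]) for xi in x])
    return float(np.mean(f[x]))

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
mu0 = np.array([1.0, 0.0, 0.0])   # start every copy from state 0
f = np.array([1.0, -1.0, 2.0])    # the observable, stored as a vector of values
rng = np.random.default_rng(1)
print(independent_copies_estimate(P, mu0, f, n=20, M=2000, rng=rng))
\end{verbatim}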
Proposition 15. If $n > \tmix(\epsilon)$ then \begin{equation} \E \paren[\Big]{ \frac{1}{M} \sum_{i =1}^M f(X^i_n) - \mathcal I_\pi(f) }^2 \leq \max_{x \in \mathcal X} \abs{f(x)}^2 \paren[\Big]{ \frac{2}{M} + \epsilon } \end{equation}
Remark 16. We can replace $\max \abs{f}$ above with $\max f - \min f$ (note that neither side of the resulting inequality changes when a constant is added to $f$).
Proof sketch: Using the fact that $(a + b)^2 \leq 2(a^2 + b^2)$ we see \begin{equation} \E \paren[\Big]{ \frac{1}{M} \sum_{i =1}^M f(X^i_n) - \mathcal I_\pi(f) }^2 \leq \frac{2}{M} \var_{\mu_n}(f) + 2 \paren[\big]{\mathcal I_{\mu_n}(f) - \mathcal I_\pi(f)}^2 \,, \end{equation} and bound the second term using $\norm{\mu_n - \pi}_\TV < \epsilon$.
The disadvantage of using $M$ independent copies of the chain to estimate $\mathcal I_{\pi}(f)$ is the computational cost of running $M$ copies, and the “wasted computational effort” of discarding iterations up to $\tmix(\epsilon)$. A different strategy, which we describe below, is time averaging. This can be used either in addition to, or in lieu of, simulating independent copies. The starting point of this technique is the ergodic theorem.
Theorem 17 (Ergodic theorem). Let $X_n$ be an irreducible Markov chain with stationary distribution $\pi$. Let $f \colon \mathcal X \to \R$ be any function. Then \begin{equation} \P \paren[\Big]{ \lim_{N \to \infty} \frac{1}{N} \sum_{n = 1}^N f(X_n) = \sum_{x \in \mathcal X} f(x) \pi(x) } = 1\,. \end{equation}
Remark 18. This is not something that can be deduced from the law of large numbers. Indeed, the law of large numbers only applies to independent (or at least rapidly decorrelating) trials. Here $X_1, X_2, \dots$ are far from independent, as $X_{n+1}$ is obtained directly from $X_n$ by one step of the chain.
Proof. There’s a one page proof in the appendix of LP17, which we might cover in class if time permits.
Remark 19. Define \begin{equation} \mu = \mathcal I_\pi f = \sum_{x \in \mathcal X} f(x) \pi(x) \,, \quad\text{and}\quad \hat \mu_N = \frac{1}{N} \sum_{n = 0}^{N-1} f(X_n) \,. \end{equation} This estimator is in general biased, in the sense that \begin{equation} \E \hat \mu_N \neq \mu = \mathcal I_\pi f \,. \end{equation} However, the ergodic theorem guarantees that the estimator is consistent: $\hat \mu_N \to \mu$ almost surely as $N \to \infty$; in particular it is asymptotically unbiased.
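For comparison with the independent-copies strategy, here is a sketch (again illustrative, with an arbitrary chain and observable) of the time-averaged estimator $\hat \mu_N$ computed along a single trajectory.
\begin{verbatim}
import numpy as np

def time_average(P, x0, f, N, rng):
    """hat mu_N = (1/N) * sum_{n=0}^{N-1} f(X_n) along one trajectory from x0."""
    states = np.arange(P.shape[0])
    x, total = x0, 0.0
    for _ in range(N):
        total += f[x]
        x = rng.choice(states, p=P[x])
    return total / N

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, -1.0, 2.0])
rng = np.random.default_rng(2)
print(time_average(P, x0=0, f=f, N=50_000, rng=rng))   # approaches I_pi(f) as N grows
\end{verbatim}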
Corollary 20. Let $X$ be an irreducible Markov chain with stationary distribution $\pi$. For every $x \in \mathcal X$ we have \begin{equation} \pi(x) = \lim_{N \to \infty} \frac{1}{N} \abs{ \set{ n \leq N-1 \st X_n = x } } \end{equation} almost surely. (This is Theorem 17 applied to $f(y) = \delta_x(y)$.)
Theorem 21 (Central limit theorem). Suppose $X$ is an irreducible and ergodic Markov chain on a finite state space $\mathcal X$, and let $f\colon \mathcal X \to \R$. Then \begin{equation} \sqrt{N} \paren[\Big]{ \frac{1}{N} \sum_{n=0}^{N-1} f(X_n) - \mathcal I_\pi(f) } \xrightarrow{d} \mathcal N(0, \tau^2) \,, \end{equation} where \begin{equation} \tau^2 = \lim_{N \to \infty} N \var \paren[\Big]{ \frac{1}{N} \sum_{n=0}^{N-1} f(X_n) } \,. \end{equation}
Remark 22. When the state space is not finite, you need to assume that $\mathcal I_\pi \abs{f}^{2 + \epsilon} < \infty$, and the chain is geometrically ergodic.
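In practice $\tau^2$ is usually not known in closed form and has to be estimated from the simulation itself. One common practical approach (not discussed in these notes) is the method of batch means, sketched below under the assumption that the run is long enough for the batch means to be nearly independent; the chain and observable are the same illustrative toy example as before.
\begin{verbatim}
import numpy as np

def batch_means_tau2(samples, num_batches=50):
    """Batch-means estimate of tau^2 = lim_N N * Var( (1/N) sum f(X_n) )."""
    samples = np.asarray(samples, dtype=float)
    L = len(samples) // num_batches                # batch length
    y = samples[:L * num_batches].reshape(num_batches, L)
    return L * y.mean(axis=1).var(ddof=1)          # L * sample variance of batch means

# Usage: collect f(X_0), f(X_1), ... along one long run of a chain.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, -1.0, 2.0])
rng = np.random.default_rng(3)
x, traj = 0, []
for _ in range(100_000):
    traj.append(f[x])
    x = rng.choice(3, p=P[x])
print(batch_means_tau2(traj))
\end{verbatim}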
The deficiency of the above results is that there’s no rate of convergence. For instance, we would like to know how large we need to make $N$ to ensure $\hat \mu_N$ is close to $\mu = \mathcal I_\pi f$.
Theorem 23. Suppose $(X_n)$ is an irreducible, reversible chain with transition matrix $P$. For any $\epsilon \in (0, 1)$ and $\eta > 0$, if $N > \tmix(\epsilon/2)$ and \begin{equation} M > \frac{4 \var_\pi(f)}{\eta^2 \epsilon \gamma_*} \,, \end{equation} then \begin{equation} \P\paren[\bigg]{ \abs[\Big]{\frac{1}{M} \sum_{n = 0}^{M-1} f(X_{N + n}) - \mathcal I_\pi f} > \eta } < \epsilon \,. \end{equation}
This requires that you run your chain for a long time ($\tmix(\epsilon/2)$ steps) before you start averaging, and also that you time average for a long time (at least of order $1/(\eta^2 \epsilon \gamma_*)$ steps). If you do both, then the error you make is at most $\eta$ with probability at least $1 - \epsilon$.
Remark 24 (Burn in). Practically, when doing MCMC simulations one discards the first few iterations before time averaging; this is called a burn-in. The reason for doing so is the availability of results like the one above: if you discard the first $\tmix(\epsilon/2)$ iterations, then the probability of making an error larger than $\eta$ when time averaging is small. If you do not discard the first $\tmix(\epsilon/2)$ iterations, the expected error still converges to $0$, but you do not have an error estimate in probability of the form stated above.
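To tie the last two results together, here is an illustrative sketch that chooses a burn-in length and an averaging length along the lines of Theorem 23 for a small reversible chain. The chain, the observable, and the target accuracies are assumptions made for the example, and $\tmix(\epsilon/2)$ is replaced by the relaxation-time upper bound from the previous section.
\begin{verbatim}
import numpy as np

# Symmetric (hence reversible) chain with uniform stationary distribution.
P = np.array([[0.75, 0.25, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.25, 0.75]])
pi = np.full(3, 1/3)
f = np.array([1.0, -1.0, 2.0])

gamma_star = 1.0 - np.sort(np.abs(np.linalg.eigvals(P)))[::-1][1]
var_f = float(np.sum(pi * f**2) - np.sum(pi * f)**2)

eps, eta = 0.1, 0.1
# Burn-in: bound t_mix(eps/2) using t_rel * ln(1/((eps/2) * sqrt(2 * pi_min))).
N = int(np.ceil((1 / gamma_star) * np.log(1 / ((eps / 2) * np.sqrt(2 * pi.min())))))
# Averaging length from the theorem: M > 4 Var_pi(f) / (eta^2 * eps * gamma_*).
M = int(np.ceil(4 * var_f / (eta**2 * eps * gamma_star))) + 1

rng = np.random.default_rng(4)
x = 0
for _ in range(N):                     # burn-in: discard the first N steps
    x = rng.choice(3, p=P[x])
total = 0.0
for _ in range(M):                     # then time average over the next M steps
    total += f[x]
    x = rng.choice(3, p=P[x])
print(N, M, total / M, "target:", float(np.sum(pi * f)))
\end{verbatim}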