Monte Carlo Approximation
The Monte Carlo Method
A typical use of samples is to compute averages. For instance when studying phase transitions in the Ising model, given a spin configuration $\sigma$, the magnetization is \begin{equation} \mu \defeq M(\sigma) = \frac{1}{\abs{\Lambda}} \sum_{i \in \Lambda} \sigma(i) \,. \end{equation} Our interest was in computing \begin{equation} \E \abs{M} = \sum_{\sigma \in \mathcal X} \abs{M(\sigma)} \pi_\beta(\sigma) \,, \end{equation} where $\pi_\beta$ is the Gibbs distribution. The above sum has too many terms to be computationally tractable.
If, however, we could efficiently samples from $\pi_\beta$, then we can approximate the above sum by taking $N$ i.i.d. spin configurations $\sigma_1$, …, $\sigma_N$ with distribution $\pi_\beta$, and define \begin{equation} \hat \mu_N \defeq \sum_{n = 1}^N \abs{M(\sigma_n)} \,. \end{equation} The law of large numbers says that $\hat \mu_N \to \mu$ as $N \to \infty$. Moreover, the mean square error is exactly \begin{equation} \E (\hat \mu_N - \mu)^2 = \frac{1}{N} \E( \abs{M} - \mu )^2 = \frac{\var(\abs{M})}{N} \,. \end{equation} If $\var(\abs{M})$ is not too large, this gives a (computationally tractable) way of approximating $\E \abs{M}$.
Of course, this is this is not specific to the Ising model, and true in general.
Let $F \colon \mathcal X \to \R$ be a function, and suppose $X_1$, …, $X_N$ be i.i.d. $\mathcal X$ valued random variables with distribution $\pi$ (i.e. for every $x \in \mathcal X$, $\P(X_n = x) = \pi(x)$). Define \begin{equation} \hat \mu_N = \frac{1}{N} \sum_{n = 1}^N F(X_n) \,, \quad \mu = \sum_{x \in \mathcal X} F(x) \pi(x) \,. \end{equation} Then \begin{equation} \P\paren[\Big]{ \lim_{N \to \infty} \hat \mu_N = \mu } = 1 \quad\text{and}\quad \E\paren[\big]{ \hat \mu_N - \mu }^2 = \frac{\var(F)}{N} = \frac{1}{N} \sum_{x \in \mathcal X} (F(x) - \mu)^2 \pi(x) \,. \end{equation}
This method was proposed by Stanislaw Ulam inspired by his uncles gambling habits. The name Monte Carlo is the name of a casino in Monaco that his uncle used to gamble at.
Remark 2. Note for fixed $N$, $\E \hat \mu_N = \mu$. For this reason we call $\hat \mu_N$ an unbiased estimator of $\mu$.
Importance Sampling
One measure of the relative error in the above Monte Carlo approximation is \begin{equation} \mathcal E_N \defeq \frac{1}{\abs{\mu}} \paren[\big]{ \E(\hat \mu_N - \mu)^2 }^{1/2} = \frac{\std_\pi(F)}{\abs{\mu} \sqrt{N}} \,, \end{equation} where \begin{equation} \std_\pi(F) = \paren[\Big]{ \sum_{x \in \mathcal X} (F(x) - \mu)^2 \pi(x) }^{1/2} \end{equation} is the standard deviation of $F$ under $\pi$. (Note $\mu = \E_\pi F = \sum_{\mathcal X} F(x) \pi(x)$ is the mean of $F$ under $\pi$.) Sometimes $\std_\pi(F) / \abs{\mu}$ is so large that this approximation gives poor results.
Importance sampling improves this as follows: Let $\pi^*$ be any probability distribution, and suppose $X^*_1$, …, $X^*_N$ are i.i.d. samples from $\pi^*$. Notice \begin{equation} \mu = \sum_{x \in \mathcal X} F(x) \pi(x) = \sum_{x \in \mathcal X} \frac{F(x)\pi(x)}{\pi^*(x)} \pi^*(x) = \E_{\pi^*} \frac{F \pi}{\pi^*} \,. \end{equation} So we can alternately compute $\mu$ by Monte Carlo approximating $F \pi/\pi^*$. Explicitly, set \begin{equation} \hat \mu_N^* = \frac{1}{N} \sum_{n = 1}^N \frac{F(X_n^*) \pi(X_n^*)}{\pi^*(X_n^*)} \,. \end{equation} As before $\hat \mu_N^*$ is an unbiased estimator for $\mu$ (i.e. $\hat \mu_N^* \to \mu$ as $N \to \infty$, and also $\E \mu_N^* = \mu$ for every $N$). However, the relative error is now \begin{equation} \mathcal E_N^* \defeq \frac{1}{\abs{\mu}} \paren[\big]{ \E(\hat \mu_N^* - \mu)^2 }^{1/2} = \frac{\std_{\pi^*}(F \pi / \pi^*)}{\abs{\mu} \sqrt{N}} \,. \end{equation} If we can choose $\pi$ so that $\std_{\pi^*}(F \pi / \pi^*) \ll \std_\pi(F)$, then we can reduce our relative error. This can be done in several situations of practical interest.
Monte Carlo Integration.
In the continuous setting, this is extremely useful as it gives a way to approximate integrals in high dimensions.
Let $X_n$ be $\R^d$ valued, i.i.d. random variables with common probability density function $p$. Let $f \colon \R^d \to \R$ be a function such that $\int_{\R^d} \abs{f(x)} p(x) \, dx < \infty$. Then \begin{equation}\label{e:mcIntF} \lim_{N \to \infty} \frac{1}{N} \sum_{n = 1}^N f(X_n) = \int_{\R^d} f(x) \, p(x) \, dx\,, \quad\text{almost surely}\,. \end{equation} If further $\int_{\R^d} \abs{f(x)}^2 p(x) \, dx < \infty$, then \begin{equation} \var\paren[\Big]{ \frac{1}{N} \sum_{n = 1}^N f(X_n) } = \frac{1}{N} \int_{\R^d} \abs{f(x) - \mu}^2 p(x) \, dx \,, \quad\text{where } \mu = \int_{\R^d} f(x) p(x) \, dx\,. \end{equation}
We will prove this using the law of large numbers.
Remark 4. If $X_n$ are mutually independent and uniformly distributed on $[0,1]^d$, then the above implies \begin{equation} \lim_{N \to \infty} \frac{1}{N} \sum_{n = 1}^N f(X_n) = \int_{\R^d} f(x) \, dx\,, \quad\text{almost surely}\,. \end{equation} By standard Gaussian tail bounds \begin{equation} \P \paren[\Big]{ \abs[\Big]{ \frac{1}{N} \sum_{n = 1}^N f(X_n) - \int_{[0,1]^d} f(x) \, dx } > \lambda \norm{f}_\osc } \leq 2 e^{- 2 N \lambda^2} \,, \end{equation} where $\norm{f}_\osc = \max f - \min f$. Thus if we want to attain an error of $\epsilon > 0$ with probability at least $1 - \delta$, we need to choose \begin{equation} N \geq \frac{\norm{f}_\osc^2}{2 \epsilon^2} \ln \paren[\Big]{\frac{2}{\delta}} %\frac{9}{\epsilon^2} \int_{[0,1]^d} \abs{f(x)}^2 \, dx = O\paren[\Big]{ \frac{1}{\epsilon^2} }\,. \end{equation} (The $\norm{f}_\osc$ above is a little wasteful, and can be replaced with the variance proxy of $f(X_1)$.)
Quadrature, and the curse of dimensionality
The classical method used to compute integrals is quadrature. It turns out that quadrature works well in dimensions $1$ and $2$; but in high dimensions it suffers from the curse of dimensionality. The number of points you need to approximate an integral using quadrature to a given error grows exponentially with the dimension. With Monte Carlo integration, however, the number of points required is independent of dimension.
Proposition 5. Let $f \colon [0, 1]^d \to \R$ be a $C^1$ function. Divide $[0,1]^d$ into $N$ identical cubes $Q_1$, …, $Q_N$, and let $\xi_i$ denote the center of the $i$-th cube. Then, \begin{equation} \abs[\Big]{ \int_{[0,1]^d} f(x) \, dx - \sum_{i=1}^N f(\xi_i) \vol(Q_i) } \leq \frac{\sqrt{d} \max \abs{\grad f}}{2 N^{1/d}} \end{equation}
Remark 6. In order to approximate the integral of $f$ to order $\epsilon$, you need roughly $N = O(1/\epsilon^d)$ cubes. For $\epsilon = 0.01$ and $d = 10$, this is $O(10^{20})$ cubes. If you can examine about $10^{12}$ a second (a generous overestimate for my computer), it will take you a few years to use quadrature to compute this integral to two decimal places.
Remark 7. To approximate the integral of $f$ on the cube $[0, 1]^d$ with accuracy $\epsilon$ you need to:
- Choose $N = O(1/\epsilon^d)$ using quadrature.
- Choose $N = O(1/\epsilon^2)$ using Monte Carlo.
Thus in dimensions $d \geq 2$, it is better to use a Monte Carlo approximation.
For integrals with respect to more general distributions, or on different domains, there is a cost to generating $N$ i.i.d. samples. This could be prohibitive in some case – for instance, if your domain was “weird” and your only option was to rejection sampling the cost of generating samples would be exponential in $d$, and hence the total cost of the Monte Carlo method would also be exponential in $d$. However, in in many situations the sampling cost is computationally tractable (grows polynomially in $d$), and in these cases the Monte Carlo method is a better choice.
Law of Large Numbers
Theorem 3 follows immediately from the Law of Large Numbers.
Theorem 8 (Law of large numbers). Let $X_n$ be a sequence of i.i.d. random variables with $\E \abs{X_n} < \infty$. Then \begin{equation}\label{e:lln}\tag{LLN} \lim_{N \to \infty} \frac{1}{N} \sum_{n = 1}^N X_n = \E X_1\,. \end{equation}
This is easy to prove if we assume $\E X_1^2 < \infty$, and is usually done in every introductory probability course. We will instead prove this without assuming $\E X_1^2 < \infty$ using characteristic functions, below. We were, however, somewhat imprecise when stating \eqref{e:lln}, which involves convergence of random variables. This requires measure theory to treat rigorously and goes beyond the scope of this course. Here are two more precise versions of \eqref{e:lln}.
- 
The weak law of large numbers says \eqref{e:lln} holds in probability. That is, for any $\epsilon > 0$ we have \begin{equation} \lim_{N \to \infty} \P \paren[\Big]{ \abs[\Big]{\frac{1}{N} \sum_{n = 1}^N X_n - \E X_1} > \epsilon } = 0 \end{equation} 
- 
The strong law of large numbers says \eqref{e:lln} holds almost surely. That is, for any $\epsilon > 0$ we have \begin{equation} \P \paren[\Big]{ \lim_{N \to \infty} \frac{1}{N} \sum_{n = 1}^N X_n = \E X_1 } = 1 \,. \end{equation} 
Proof of Theorem 3 . Follows immediately from Theorem 8 and the fact that the variance of independent random variables adds.
Convergence in Distribution
Definition 9. We say a sequence of random variables $X_n$ converges to a random variable $X$ in distribution if the CDF of $X_n$ converges to the CDF of $X$ at all points where the CDF of $X$ is continuous.
For $\R^d$ valued random variables, it is better to define convergence in distribution using bounded continuous test functions instead. However, we’re not going to split hairs about this as we will test convergence in distribution using L'evy’s continuity theorem.
Theorem 10 (Levy’s continuity theorem). A sequence of random variables $X_n$ converge to $X$ in distribution if and only if the characteristic functions of $X_n$ (defined below) converge pointwise to the characteristic function of $X$. That is $X_n$ converges to $X$ in distribution if and only if \begin{equation} \lim_{n \to \infty} \varphi_{X_n}(\lambda) \to \varphi_X(\lambda) \quad \text{for every } \xi \in \R^d\,. \end{equation}
Proof. The proof goes beyond the scope of this course, but is in every standard measure theory based probability book.
Definition 11. Let $X$ be a $\R^d$ valued random variable. The characteristic function of $X$ is the function $\varphi_X \colon \R^d \to \C$ defined by \begin{equation} \varphi_X(\lambda) = \E e^{i \lambda \cdot X} \,, \quad\text{where } i = \sqrt{-1}\,. \end{equation}
Proposition 12. Let $X$ be a random variable.
- $\varphi_X$ is continuous, and $\varphi_X(0) = 1$.
- If $\E \abs{X} < \infty$, then $\varphi$ is differentiable and $\grad \varphi(0) = -i \E X$.
Proof. Proving continuity and differentiability require the dominated convergence theorem, which is beyond the scope of this course. Computing $\varphi_X(0)$ and $\grad \varphi_X(0)$ is direct, and will be done on the board.
Lemma 13. If $c \in \R$ and $X$ is a random variable then $\varphi_{cX}(\lambda) = \varphi_X(c \lambda)$.
Proposition 14. Two random variables $X$ and $Y$ are independent if and only if $\varphi_{(X, Y)}(\lambda, \mu) = \varphi_X(\lambda) \varphi_Y(\mu)$.
Proof. The reverse direction requires Fourier inversion, and is beyond the scope of this course. The forward direction can be done easily.
Proof of Theorem 8 . We will show that the convergence \eqref{e:lln} holds in distribution using Levy’s continuity theorem