Simulated Annealing
Algorithm and Intuition
Suppose we want to minimize a function $f$ in a large (high dimensional) space $\mathcal X$. An elementary stochastic valley descent algorithm for this is the following (a short code sketch appears after the list):
- Start at a point $x \in \mathcal X$.
- Choose a close by point $y$ randomly.
- If $f(y) < f(x)$, move to $y$. Otherwise stay at $x$.
- Repeat until your computational budget is exhausted, or until $f$ isn’t decreasing much.
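Here is a minimal sketch of this loop in Python. The names (`valley_descent`, `neighbor`, the toy `bumpy` function) are not from any particular library; `neighbor` stands in for whatever problem-specific rule you use to pick a close-by point.

```python
import math
import random

def valley_descent(f, x0, neighbor, n_steps=10_000):
    """Minimize f: move to a random nearby point only when it decreases f."""
    x, fx = x0, f(x0)
    for _ in range(n_steps):
        y = neighbor(x)          # choose a close-by point randomly
        fy = f(y)
        if fy < fx:              # accept only downhill moves
            x, fx = y, fy
    return x

# Toy usage: a bumpy one-dimensional function with many local minima;
# valley descent typically stalls in whichever local minimum is nearest the start.
bumpy = lambda x: x * x + 3 * math.sin(5 * x) ** 2
x_min = valley_descent(bumpy, x0=4.0,
                       neighbor=lambda x: x + random.uniform(-0.1, 0.1))
```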
Remark 1. If you want to maximize $F$ instead, then you move only if $F(y) > F(x)$. In this case the algorithm is called a stochastic hill climb instead.
Clearly valley descent will get stuck at local minima forever. In practice, it also often takes too long to descend into narrow canyons. If you’re working with functions whose derivative you can compute easily, then stochastic gradient descent may provide better results.
An alternate approach is simulated annealing: this may make you move uphill at certain points, but it is better at avoiding getting stuck in local minima. Conventionally, simulated annealing is always stated as minimizing a function; replace the function by its negative if you want to maximize it instead.
Requires $f$. Returns (approximate) minima of $f$. A code sketch follows the steps below.
- Fix a sequence $\beta_n \to \infty$.
- Start with some $X_0 \in \mathcal X$.
- Generate $X_{n+1}$ from $X_n$ by using an MCMC algorithm to sample from the (un-normalized) probability distribution $\pi_{u,\beta_{n+1}} = e^{-\beta_{n+1} f}$.
- Repeat for a large number of iterations.
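As a concrete (but minimal) illustration, the loop above might look as follows in Python, using a single Metropolis step as the MCMC update at each temperature. The proposal `neighbor` is assumed symmetric, and `betas` is the cooling schedule discussed in Remark 4 below; in practice one often takes several MCMC steps per temperature rather than one.

```python
import math
import random

def simulated_annealing(f, x0, neighbor, betas):
    """Sketch of the annealing loop: one Metropolis step per inverse temperature.

    betas is an increasing sequence beta_1, ..., beta_N; the step at inverse
    temperature beta targets the un-normalized distribution exp(-beta * f).
    """
    x, fx = x0, f(x0)
    for beta in betas:
        y = neighbor(x)          # symmetric proposal of a nearby point
        fy = f(y)
        # Metropolis acceptance probability min(1, exp(-beta * (f(y) - f(x))))
        if fy <= fx or random.random() < math.exp(-beta * (fy - fx)):
            x, fx = y, fy
    return x
```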
The final $X_n$ should (theoretically) be uniformly distributed on the global minima of $f$; in practice it is typically distributed very close to a local minimum of $f$.
Remark 2. Many authors prefer stating the above algorithm in terms of the temperature $T_n = 1/\beta_n$.
Remark 3. While theoretically $X$ should be uniformly distributed on the global minima of $f$, in practice we see something different. First, we can’t ever send $T_n \to 0$ in practice, so $X$ will take values in some small neighborhood of the minima of $f$. Second, many MCMC algorithms tend to get stuck at local minima of $f$ for times comparable to $e^{\beta_n}$. As a result, the return value $X$ may take values in small neighborhoods of local minima of $f$, and not just global minima.
Remark 4. The choice of temperatures $T_n$, called the annealing schedule (or cooling schedule), is situation dependent. The above also suggests that if you cool too rapidly, you lose the advantage of simulated annealing and might as well run valley descent; on the other hand, if you cool too slowly, you use more computational power than necessary. In practice one chooses $T_0$ to be something large (so that $\beta_0 = 1/T_0$ is small and $\pi_{\beta_0}$ is easy to sample from), chooses $T_N$ to be very small (so that $\pi_{\beta_N}$ is concentrated near the minima of $f$), and lets $T_n$ decrease geometrically from $T_0$ to $T_N$.
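For instance, a geometric schedule takes one line to write down; the endpoints and the number of steps below are purely illustrative and should be tuned to the problem at hand.

```python
# Geometric cooling: T_n decreases from T_0 to T_N by a constant factor each step.
N = 10_000
T0, TN = 10.0, 1e-3                    # illustrative endpoints
Ts = [T0 * (TN / T0) ** (n / N) for n in range(N + 1)]
betas = [1 / T for T in Ts]            # inverse temperatures beta_n = 1 / T_n
```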
The basis of simulated annealing is the following elementary observation.
Lemma 5. Let $f \colon \mathcal X \to \R$ be some function, $\beta > 0$, and define an un-normalized probability distribution $\pi_{u,\beta} = e^{-\beta f}$. Let $Z_\beta = \sum_{x \in \mathcal X} \pi_{u,\beta}(x)$ be the normalization constant, and $\pi_\beta = \pi_{u,\beta} / Z_\beta$ be the normalized probability distribution. When $\beta = 0$, $\pi_\beta$ is uniformly distributed on all of $\mathcal X$. As $\beta \to \infty$, $\pi_\beta$ converges to a probability distribution that is uniformly distributed on the global minima of the function $f$.
Proof sketch. Say $x_*$ is a point where $f$ attains its global minimum. Rewrite $\pi_\beta(x)$ as \begin{equation} \pi_\beta(x) = \frac{e^{-\beta (f(x) - f(x_*))}}{\sum_{y \in \mathcal X} e^{-\beta(f(y) - f(x_*))}} \end{equation} and send $\beta \to \infty$: every term in the denominator with $f(y) > f(x_*)$ vanishes, and every term with $f(y) = f(x_*)$ contributes $1$.
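A quick numerical illustration of the lemma, on a made-up four-point state space:

```python
import math

# pi_beta is uniform at beta = 0 and concentrates on the global minima as beta grows.
f = {0: 1.0, 1: 0.0, 2: 0.0, 3: 2.5}          # global minima at x = 1 and x = 2

def pi(beta):
    weights = {x: math.exp(-beta * fx) for x, fx in f.items()}
    Z = sum(weights.values())                  # normalization constant Z_beta
    return {x: w / Z for x, w in weights.items()}

print(pi(0.0))     # uniform: each state gets mass 0.25
print(pi(50.0))    # essentially 0.5 on each of the two global minima
```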
Example: Travelling salesman.
Given $N$ points on the plane (cities), the travelling salesman problem is to find the shortest route that travels through each city exactly once and returns to the starting point. This is a classic problem which is known to be NP-hard, and you can read more about it on Wikipedia.
This has been extensively studied, and there are several well known combinatorial algorithms that yield results close to the optimal path in practical amounts of time.
The simplest algorithm for the travelling salesman problem is the greedy nearest neighbor algorithm: start anywhere, and simply travel to the nearest unvisited city (a sketch is below). There are certain scenarios where this algorithm performs badly, but in most configurations it gives you a travel distance that is within 25% of the minimum.
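Here is a sketch of the greedy algorithm; representing cities as coordinate pairs is just a choice made for illustration.

```python
import math

def nearest_neighbor_tour(cities, start=0):
    """Greedy tour: from the current city, travel to the nearest unvisited city.

    cities is a list of (x, y) coordinates; the returned tour is a list of
    indices, implicitly closed by returning to tour[0] at the end.
    """
    unvisited = set(range(len(cities))) - {start}
    tour = [start]
    while unvisited:
        here = cities[tour[-1]]
        nxt = min(unvisited, key=lambda c: math.dist(here, cities[c]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour
```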
Simulated annealing was originally invented as an algorithm to improve the results on the travelling salesman problem.
- Given a tour $\sigma$ (which is just some permutation of the $N$ cities), let $C(\sigma)$ be the total length (or cost) of the tour. We will use simulated annealing to minimize $C$ over all tours.
- To use simulated annealing, we need an MCMC algorithm to sample from $e^{-\beta C}$. This can be done with the Metropolis-Hastings algorithm, together with a simple symmetric proposal mechanism (for instance, propose a new tour by reversing the order of the cities between two randomly chosen positions).
- Choose a cooling schedule (by experimenting) and run simulated annealing to minimize $C$ with the above proposal mechanism. (A sketch in code follows this list.)
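Here is a minimal sketch of the cost function and of one Metropolis-Hastings step for tours. The segment-reversal proposal is one common symmetric choice (not necessarily the one used in the redacted notebook), and the annealing loop that drives these steps is left for you to write.

```python
import math
import random

def tour_cost(cities, tour):
    """The cost C(sigma): total length of the tour, including the return leg."""
    n = len(tour)
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % n]])
               for i in range(n))

def propose(tour):
    """Symmetric proposal: reverse the cities between two random positions."""
    i, j = sorted(random.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def metropolis_step(cities, tour, beta):
    """One Metropolis-Hastings step targeting exp(-beta * C)."""
    new = propose(tour)
    dC = tour_cost(cities, new) - tour_cost(cities, tour)
    if dC <= 0 or random.random() < math.exp(-beta * dC):
        return new
    return tour
```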
Here is what the numerical results look like, together with a comparison against direct valley descent and blind random trials. Simulated annealing usually gives a small improvement over the greedy algorithm. The notebook can be downloaded here; the part implementing the simulated annealing algorithm is redacted, and you have to implement it in order to run the notebook.
Example: Cracking substitution ciphers.
A substitution cipher is one where you create a key that is a permutation of the alphabet (e.g. $A \mapsto K$, $B \mapsto Z$, etc.). Using this key, you can encode and decode a message. At first sight this might seem uncrackable by brute force: your key is one of $28!$ possible permutations (26 letters, plus a period and a space as punctuation).
This is a needle in an enormous haystack: $28! \approx 3 \times 10^{29}$. If you could examine $10^{12}$ keys in a second (which is a generous overestimate), it would still take you on the order of ten billion years to try them all. Nevertheless, if you’re sending a sufficiently long (a few paragraphs) message of readable text, this cipher can be cracked in seconds using simulated annealing (or even just a stochastic hill climb).
To crack a substitution cipher, we need to first define a fitness function. Download a few long English books, and compute frequencies of letter sequences: that is, compute how often ‘a’ occurs, how often ‘as’ occurs, how often ‘ast’ occurs, and so on. Using these frequencies, define a fitness function $F$ that takes as input a string of symbols (e.g. a message) and outputs a real number. The closer the symbol frequencies match those of English, the higher the fitness should be. (There’s an elegant and clever way to do this, which I’m not describing here as finding it is part of the fun.)
We will now crack substitution ciphers as follows:
- Given a guess at a key $\sigma$, define $f(\sigma)$ to be the fitness of the coded message $M$ decoded with key $\sigma$. (We want to maximize $f(\sigma)$.)
- Given a (random) key $\sigma$, our proposal mechanism generates a new key $\tau$ by randomly swapping two symbols in $\sigma$.
- Run a stochastic hill climb to maximize $f(\sigma)$, or simulated annealing to minimize $-f(\sigma)$. (Both work really well! A sketch of the bookkeeping follows this list.)
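For concreteness, here is a minimal sketch of the bookkeeping: decoding with a candidate key, and the swap proposal. The 28-symbol alphabet and the function names are assumptions made for illustration, and the fitness function is deliberately left as a stub, since designing it (and the annealing loop) is the homework.

```python
import random
import string

ALPHABET = string.ascii_uppercase + ". "       # 26 letters, a period, and a space

def decode(message, key):
    """Decode a coded message, reading the key as a map from cipher symbols
    (in ALPHABET order) to plain symbols."""
    return message.translate(str.maketrans(ALPHABET, key))

def swap_two(key):
    """Proposal: generate a new key by swapping two randomly chosen symbols."""
    i, j = random.sample(range(len(key)), 2)
    k = list(key)
    k[i], k[j] = k[j], k[i]
    return "".join(k)

def fitness(text):
    """Stub: score how English-like `text` is, using the letter-sequence
    frequencies described above.  Left unimplemented on purpose; designing
    it is part of the exercise."""
    raise NotImplementedError
```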
As an example, we took a passage from Arthur Conan Doyle’s Sherlock Holmes, and coded it up with a randomly chosen key. We then downloaded five different books from Project Gutenberg, and ran simulated annealing. Here’s the decoded message after every 100 iterations. The $F=\dots$ shows the fitness of the decoded message with the current guess of the key, and the $D=\dots$ shows the number of symbols where our guessed key and actual key differ. We find the correct key in a few thousand iterations!
F=-470.33, D=27: . R.SLDBERVW.LREPDS.SLD.GS.FEXFNS. LD.XRPFZO.G.LFAD.SDEMRP.LDFBM.LGP.PDZ GRZ.LDB
F=-374.97, D=24: STFS OPUEFHJSOFEMP S OPSR SNEANB STOPSAFMNLCSRSONVPS PEIFMSOPNUISORMSMPLTRFLSOPU
F=-328.22, D=24: RE SANUCEJZ AECMNS SAN IS PCGPHS RAN GEMPOB I APXN SNCTEM ANPUT AIM MNORIEO ANU
F=-301.30, D=21: CO SANULOVF AOLMNS SAN IS ELHE.S CAN HOMERT I AEZN SNLGOM ANEUG AIM MNRCIOR ANU
F=-294.07, D=17: CO SANULOVG AOLMNS SAN IS ELHE.S CAN HOMERK I AEQN SNLDOM ANEUD AIM MNRCIOR ANU
F=-290.06, D=18: CO SANULOFG AOLMNS SAN IS ELVE.S CAN VOMERK I AEZN SNLDOM ANEUD AIM MNRCIOR ANU
F=-246.34, D=13: DO NSERLOCK SOLMEN NSE IN ALVA.N DSE VOMAUY I SAWE NELTOM SEART SIM MEUDIOU SER
F=-245.93, D=13: DO NSERLOCK SOLMEN NSE IN ALFA.N DSE FOMAUY I SAWE NELTOM SEART SIM MEUDIOU SER
F=-222.94, D=11: TO NSERLOCK SOLMEN NSE IN ALUAYN TSE UOMAF. I SAWE NELDOM SEARD SIM MEFTIOF SER
F=-140.26, D= 8: TO FHERLOCK HOLMEF FHE IF ALWAYF THE WOMAN. I HAUE FELDOM HEARD HIM MENTION HER
F=-141.88, D= 6: TO FHERLOCK HOLMEF FHE IF ALWAYF THE WOMAN. I HAJE FELDOM HEARD HIM MENTION HER
F=-112.05, D= 4: TO FHERLOCK HOLMEF FHE IF ALWAYF THE WOMAN. I HAXE FELDOM HEARD HIM MENTION HER
F= -34.58, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAZE SELDOM HEARD HIM MENTION HER
F= -26.99, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -27.95, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -27.64, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -27.64, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -28.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -27.64, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -28.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -28.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.99, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -25.54, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -25.88, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.44, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -25.88, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -26.26, D= 3: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -25.54, D= 2: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
F= -24.30, D= 0: TO SHERLOCK HOLMES SHE IS ALWAYS THE WOMAN. I HAVE SELDOM HEARD HIM MENTION HER
A redacted version of the notebook that does the above is here. The cells defining the fitness function and implementing simulated annealing have been removed, as implementing them is part of the homework. When these functions are implemented, the output of the completed notebook is here.