Matthew Thorpe

Postdoctoral Associate
Department of Mathematical Sciences
Carnegie Mellon University
7126 Wean Hall
Pittsburgh, PA 15213
Tel: (412) 268 5617
Email: mthorpe "at" andrew "dot" cmu "dot" edu

About Me

I have recently moved to the University of Cambridge as a Research Fellow in the Department of Applied Mathematics and Theoretical Physics. My new webpage can be found here.

I joined CMU in September 2015 as a postdoctoral associate to work on optimal transportation problems with Dejan Slepcev and Gustavo Rohde as part of an NSF funded project (grant number 1421502).

Before CMU I completed my PhD in the maths and statistics doctoral training centre (MASDOC) at the University of Warwick under the supervision of Florian Theil and Adam Johansen. My PhD focussed on applying variational methods to statistical inference problems. In particular I am interested in statistical inference problems with a combinatorial structure.


Teaching

I am teaching 21-470 Selected Topics in Analysis: Introduction to Mathematical Biology in Fall 2016. For course updates, lecture notes and homework please see the Blackboard course page. The syllabus can be found here.



Publications

Preprints

[10] M. Thorpe and D. Slepcev, Analysis of p-Laplacian Regularization in Semi-Supervised Learning, 2017. arXiv.

[9] M. Thorpe and D. Slepcev, Transportation Lp Distances: Properties and Extensions, 2017. Preprint available here.

To Appear

[8] M. Thorpe, S. Park, S. Kolouri, G. K. Rohde and D. Slepcev, A Transportation Lp Distance for Signal Analysis, Journal of Mathematical Imaging and Vision, 2017. Article available here, preprint available here.

[7] M. Thorpe and A. Johansen, Weak Convergence of General Smoothing Splines, Annals of the Institute of Statistical Mathematics, 2017. Article available here, preprint available here.

[6] M. Thorpe and F. Theil, Asymptotic Analysis of the Ginzburg-Landau Functional on Point Clouds, to appear in the Proceedings of the Royal Society of Edinburgh Section A: Mathematics, 2017. arXiv.


Published

[5] S. Kolouri, S. Park, M. Thorpe, D. Slepcev and G. K. Rohde, Optimal Mass Transport: Signal Processing and Machine-Learning Applications, IEEE Signal Processing Magazine, 34(4):43-59, 2017. Article available here, preprint available here.

[4] M. Thorpe and A. Johansen, Convergence and Rates for Fixed-Interval Multiple-Track Smoothing Using k-Means Type Optimization, Electronic Journal of Statistics, 10(2):3693-3722, 2016. Article available here, preprint available here.

[3] M. Thorpe, Variational Methods for Geometric Statistical Inference, 2015. PhD thesis available here.

[2] M. Thorpe, F. Theil, A. Johansen and N. Cade, Convergence of the k-Means Minimization Problem Using Gamma-Convergence, SIAM Journal on Applied Mathematics, 75(6):2444-2474, 2015. Article available here or here, preprint available here.

[1] A. Gkiokas, A. Cristea and M. Thorpe, Self-reinforced Meta Learning for Belief Generation, Research and Development in Intelligent Systems XXXI, Springer International Publishing, pp. 185-190, 2014. Article available here.


Research

My research interests lie within applied mathematics, covering analysis, statistics and probability. I am particularly interested in (1) the asymptotic behaviour of variational problems that arise within statistics and (2) developing optimal transport type distances. For (1) I am typically interested in unlabelled ill-posed inverse problems, that is, problems where one wishes to estimate multiple infinite dimensional objects from a finite, but potentially large, unlabelled data set. For (2) I develop and analyse optimal transport distances, similar to the Wasserstein or earth mover's distance, for applications in signal and image analysis. A more thorough description of my work is given below.

Variational Methods for Statistical Inference

Many problems within statistics can be posed as minimization problems, for example computing maximum likelihood or maximum a posteriori estimates. As such, one can expect variational methods to be a valuable tool. The power of variational methods is their ability to deal with high frequency oscillations. For example, the smoothing spline problem, which is the problem of fitting a smooth curve through data, can be written as finding the 'best' curve (in some suitable space) that fits the data. In the non-parametric setting this is an ill-posed minimization problem which requires regularization. A common method is to penalise the Lp norm of the kth derivative, which suggests that the natural space in which to pose the problem is the Sobolev space Wk,p. However, it is well known that convergence (with the number of data points) in this space cannot be achieved: although the minimizers are bounded, oscillations in the kth derivative prevent Wk,p convergence.
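To make the regularized fitting idea concrete, here is a minimal sketch of a discretized smoothing problem: minimise sum_i (u_i - y_i)^2 + lam * sum_i (u_{i+1} - 2u_i + u_{i-1})^2, a finite-difference stand-in for penalising the second derivative. The data, step size and iteration count are all made up for illustration; this is not the estimator from my papers.

```python
# Discretized smoothing: squared data fidelity plus a second-difference
# penalty (a stand-in for the L2 norm of the second derivative),
# minimised by plain gradient descent. All numbers are illustrative.

def energy(u, y, lam):
    """Fidelity term plus lam times the squared second differences."""
    fit = sum((ui - yi) ** 2 for ui, yi in zip(u, y))
    bend = sum((u[i + 1] - 2 * u[i] + u[i - 1]) ** 2
               for i in range(1, len(u) - 1))
    return fit + lam * bend

def grad_step(u, y, lam, step=0.05):
    """One gradient descent step on the energy above."""
    g = [2 * (ui - yi) for ui, yi in zip(u, y)]
    for i in range(1, len(u) - 1):
        d = 2 * lam * (u[i + 1] - 2 * u[i] + u[i - 1])
        g[i + 1] += d
        g[i] -= 2 * d
        g[i - 1] += d
    return [ui - step * gi for ui, gi in zip(u, g)]

y = [0.0, 1.0, 0.0, 1.0, 0.0]   # oscillatory data
u = list(y)
for _ in range(500):
    u = grad_step(u, y, lam=1.0)
# The smoothed curve trades some fidelity for far less bending,
# so its energy is below that of interpolating the data exactly.
```

The point of the sketch is the trade-off: larger lam damps the high frequency oscillations in the data at the cost of fidelity, which is exactly the tension behind the convergence question discussed above.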

In general, variational methods define an estimator to be the minimizer of an energy. Common examples of energies are the negative log posterior in a Bayesian framework or the negative log likelihood in a frequentist framework. One could also design custom energies with particular behaviours in mind.

My own research in this area has focussed on establishing the convergence of an estimator for fitting multiple curves through unlabelled data. A data association hypothesis labels the data. Given the correct data association hypothesis, the problem reduces to independently solving multiple smoothing spline problems. In general one does not know the correct data association hypothesis, which leads to a coupled data association and smoothing spline problem. Furthermore, the number of possible data association hypotheses grows combinatorially with the number of data points.

The estimator we studied was a generalisation of the k-means method. In Euclidean spaces data is labelled by the closest 'cluster centre'. The objective is to optimally place the cluster centres. One defines an energy

E(u) = ∑i minj=1,..., k |xi-uj|2

where {xi} is the data and u=(u1,...,uk) are the cluster centres. The estimator û is the minimizer of E.
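As a minimal sketch of the energy above, and of the standard Lloyd iteration one might use to (locally) minimise it, here is a toy one-dimensional implementation. The data, initial centres and iteration count are all illustrative.

```python
# Sketch of the k-means energy E(u) = sum_i min_j |x_i - u_j|^2 and a
# Lloyd-type update. One-dimensional toy data; everything illustrative.

def kmeans_energy(data, centres):
    """Sum over data points of the squared distance to the nearest centre."""
    return sum(min((x - u) ** 2 for u in centres) for x in data)

def lloyd_step(data, centres):
    """Assign each point to its nearest centre, then move each centre
    to the mean of its assigned points (kept fixed if unassigned)."""
    clusters = {j: [] for j in range(len(centres))}
    for x in data:
        j = min(range(len(centres)), key=lambda j: (x - centres[j]) ** 2)
        clusters[j].append(x)
    return [sum(c) / len(c) if c else centres[j]
            for j, c in clusters.items()]

data = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]   # two well-separated clusters
centres = [0.5, 4.0]
for _ in range(10):
    centres = lloyd_step(data, centres)
# Lloyd steps never increase the energy, so the final centres do at
# least as well as the initial guess.
```

The generalisation described next replaces the points u_j by curves and adds a regularization term, but the structure of the energy is the same.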

Our estimator in the data association - smoothing spline problem was to treat the multiple unknown curves as the cluster centres and define a cost function by c(u,x) = minj=1,..., k |uj(t)-y|2 where x=(t,y)∈ [0,1] × Rd are the data and uj:[0,1] → Rd is a function. We redefine the k-means energy as

E(u) = ∑i c(u,xi) + R(u)

where R is some regularization. Our results concerned the convergence of minimizers û of E with the number of data points [2] and the associated rates [4].

We pose our convergence theory in a more general setting that can include more complex relationships between data points and cluster centres. Our convergence theorem holds when c is lower semi-continuous and R is lower semi-continuous and coercive. For example, this would include data in time-range-angle space (e.g. data from radar).

A side project that emerged from this work was to show that smoothing splines converge weakly [7]; to our knowledge, previous convergence results used strong convergence in weaker topologies. We made use of variational methods in our proofs.

My collaborators on this project are Neil Cade (Selex ES Ltd.), Adam Johansen (Warwick) and Florian Theil (Warwick).

Graphical Modelling

Graphical modelling is one way to develop a 'data driven classification method'. Given a data set {xi} ⊂ X and a measure of dissimilarity ρ: X× X→ [0,∞ ) one can define a graph where the nodes are the data points and edges between data points are weighted using ρ. We say there is an edge between xi and xj if ρ(xi,xj)>0. A simple example would be the graph that connects all data points that fall within a certain distance of each other, i.e. ρ(x,y) = c if |x-y|≤ ε and ρ(x,y) = 0 otherwise.
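The epsilon-neighbourhood graph in the example above can be sketched in a few lines; the point set and the parameters eps and c are illustrative.

```python
# Sketch of the epsilon-neighbourhood graph: rho(x, y) = c when
# |x - y| <= eps and 0 otherwise. Points and parameters are made up.

def rho(x, y, eps=1.0, c=1.0):
    """Dissimilarity-based edge weight for one-dimensional points."""
    return c if abs(x - y) <= eps else 0.0

def build_graph(points, eps=1.0, c=1.0):
    """Weighted adjacency matrix W[i][j] = rho(x_i, x_j), no self-loops."""
    n = len(points)
    return [[0.0 if i == j else rho(points[i], points[j], eps, c)
             for j in range(n)] for i in range(n)]

pts = [0.0, 0.5, 3.0]
W = build_graph(pts, eps=1.0)
# 0.0 and 0.5 are within distance 1 and hence connected;
# 3.0 is isolated at this scale.
```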

An important question that has been driven by applications in machine learning is how to partition the data set. This is a classification problem; if there are 2 classes then the problem is to find the labels u:{xi} → {0,1} that define the classes Cj = {xi : u(xi) = j}. In some sense u should be chosen to be optimal. This requires defining an objective functional.

Restricting the labels to binary values is known as hard classification. In this case u(xi)∈ {0,1}. The soft relaxation is to allow u(xi)∈ [0,1]. The advantage of allowing soft classification is that it allows for the application of gradient based numerical methods.

My research has concerned estimating the boundaries of the classes. In this case the classification should approximate a hard classification. The approximation is meant in the sense that for 0<ε ≪ 1 the labels satisfy u(xi) ∈ [0,ε] ∪ [1-ε,1]. The implication being that whilst we allow for uncertainty (particularly on the boundaries between classes) each data point should satisfy u(xi)≈ 0 or u(xi)≈ 1.

A popular phase-transition model from materials science is the Ginzburg-Landau functional. In [6] we define an analogue of the Ginzburg-Landau functional on graphs, minimizers of which we use as estimators for the classification problem. There are two terms, one representing data fidelity and the other penalising soft classifications:

En(u) = n-2 ∑i,j Wij |u(xi)-u(xj)| + n-1 ∑i V(u(xi)),

we briefly explain these terms below.

We interpret data fidelity as being faithful to the graph geometry. Therefore we want nodes that are connected by an edge to have similar labels: if two nodes xi, xj are connected then we expect u(xi) ≈ u(xj). This is done by penalizing the difference Wij|u(xi)-u(xj)| where Wij = ρ(xi,xj) is the weight of the edge between xi and xj. When considering hard classifiers one can say that we pay the cost Wij if xi and xj are given different labels.

The second term penalises classifications that are not close to {0,1}. The motivation being that, rather than estimate the classification at each node, one really wants to estimate the class boundary. Therefore we include the penalisation term which prefers hard decisions by choosing V to be, for example, V(x) = x2(1-x)2.
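Putting the two terms together, here is a minimal sketch of the graph Ginzburg-Landau energy with the double-well potential V(t) = t^2(1-t)^2. The toy graph and labellings are illustrative, not data from the paper.

```python
# Sketch of the graph Ginzburg-Landau energy
#   E_n(u) = n^{-2} sum_{i,j} W_ij |u(x_i) - u(x_j)| + n^{-1} sum_i V(u(x_i))
# with the double-well potential V(t) = t^2 (1 - t)^2.

def V(t):
    """Double-well potential: zero exactly at the hard labels 0 and 1."""
    return t ** 2 * (1 - t) ** 2

def gl_energy(W, u):
    """Graph Ginzburg-Landau energy for weights W and soft labels u."""
    n = len(u)
    fidelity = sum(W[i][j] * abs(u[i] - u[j])
                   for i in range(n) for j in range(n)) / n ** 2
    penalty = sum(V(ui) for ui in u) / n
    return fidelity + penalty

# Toy graph with two components: nodes 0,1 connected and nodes 2,3 connected.
W = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
hard = [0.0, 0.0, 1.0, 1.0]   # hard labelling consistent with the graph
soft = [0.5, 0.5, 0.5, 0.5]   # maximally uncertain labelling
# The consistent hard labelling pays neither term; the soft labelling
# pays the double-well penalty.
```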

Our contribution [6] was to show that as the number of data points increases the estimators converge to a function defined on a continuous domain (note that the class estimators for finite data sets are defined on a graph which is a discrete domain). Furthermore the limiting classifier is a hard classifier, i.e. it takes only the values {0,1}, and minimizes a weighted total variation over all hard classifiers. Our results allowed for anisotropic weights Wij.

My collaborators on this project are Dejan Slepcev (CMU) and Florian Theil (Warwick).

Transport Based Distances

Optimal transport theory, dating back to Monge in the 18th century, is very widely used, with applications in astronomy, computer vision and graphics, fluid mechanics, machine learning, operational research and many more, as well as in traditional mathematical subjects such as partial differential equations, geometry, probability theory and functional analysis. My motivation for the subject stems from applied uses in signal and image processing. The Lagrangian nature of optimal transport distances makes them very desirable, as images are compared based on the cost of spatial rearrangement, unlike Lp distances where one makes comparisons pointwise. As such, optimal transport distances are better able to represent spatial variations in signals and images than non-Lagrangian distances.

The optimal transport problem (in the Kantorovich formulation) between two probability measures μ and ν on a domain X is defined as

OT(μ,ν) = infπ ∈ Π(μ,ν)X2 c(x,y) dπ(x,y)

where Π(μ,ν) is the set of coupling measures on X2 whose first marginal is μ and second marginal is ν, and c is a lower semi-continuous cost function. A popular choice is c(x,y) = |x-y|p, in which case one can define the metric dOT(μ,ν) = (OT(μ,ν))1/p; this is the p-Wasserstein distance, and when p=1 it is also called the earth mover's distance.
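For two uniform empirical measures on the same number of points, an optimal coupling can always be taken to be a permutation, so tiny instances of the Kantorovich problem can be solved by brute force. The following sketch does exactly that; the point sets are illustrative and real applications would of course use a proper solver.

```python
# Brute-force Kantorovich problem for uniform empirical measures on n
# points each: search over permutations (an optimal coupling may be
# taken to be a permutation matrix). Illustrative one-dimensional data.
from itertools import permutations

def ot_distance(xs, ys, p=2):
    """d_OT between uniform measures on xs and ys with cost |x - y|^p."""
    n = len(xs)
    best = min(sum(abs(x - ys[s[i]]) ** p for i, x in enumerate(xs))
               for s in permutations(range(n)))
    return (best / n) ** (1 / p)

xs = [0.0, 1.0, 2.0]
ys = [x + 0.5 for x in xs]
# Translating a point cloud by t costs exactly t in d_OT, reflecting
# the Lagrangian nature of the distance.
```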

There is substantial theory which has encouraged practitioners to adopt these distances, such as the existence of minimizers, fast computational methods and a Riemannian structure (in particular the existence of geodesics). There are also numerous extensions of optimal transport, such as partial transport, adding constraints, adapting the formulation to model networks and many more. I was involved with a review paper written for practitioners [5].

Optimal transport does not directly measure the intensity of a signal, which means that the OT distance is insensitive to high frequency perturbations. Another drawback of optimal transport is the restrictions it places on the images/signals: one can only compare non-negative images, they must be normalised (integrate to the same value) and they cannot be multi-valued. The reason for this is that images/signals are represented as measures. A project I am involved in has been to develop what we call the TLp distance, which directly measures the intensity of a signal whilst retaining many of the key features of optimal transport and avoiding the restrictions stated above.

We consider the coupling (f,μ) where f∈ Lp(μ). The function f represents the intensity of a signal. The TLp distance between (f,μ) and (g,ν) is defined as the optimal transport distance between μ and ν with cost c(x,y) = c(x,y;f,g) = |x-y|p + |f(x)-g(y)|p. Equivalently, one can define the TLp distance as the optimal transport distance between the pushforward measures Ξ = (Id× f)# μ and Λ = (Id× g)# ν with cost function c(x,y) = |x-y|p. The first definition allows for the application of numerical methods without increasing the dimension. The second definition allows for many theoretical properties such as existence and characterisation of minimizers to be carried through from optimal transport theory.
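Using the first definition above, a tiny brute-force sketch of the TLp distance for uniform empirical measures looks as follows: the only change from plain optimal transport is the augmented cost |x - y|^p + |f(x) - g(y)|^p. The grids and signals are illustrative.

```python
# Brute-force TLp distance between (f, mu) and (g, nu) for uniform
# empirical measures on n points each, using the augmented cost
# |x - y|^p + |f(x) - g(y)|^p. Illustrative data only.
from itertools import permutations

def tlp_distance(xs, f, ys, g, p=2):
    """TLp distance; f and g give signal intensities at xs and ys."""
    n = len(xs)
    best = min(sum(abs(xs[i] - ys[s[i]]) ** p + abs(f[i] - g[s[i]]) ** p
                   for i in range(n))
               for s in permutations(range(n)))
    return (best / n) ** (1 / p)

xs = [0.0, 1.0, 2.0]
f = [0.0, 1.0, 0.0]   # a signal on the grid xs
g = [0.0, 1.0, 0.0]   # an identical signal on the same grid
# Identical (signal, measure) pairs are at TLp distance zero, while
# changing the intensities alone already separates the pairs.
```

Note that, unlike plain optimal transport, the signals f and g need not be non-negative or normalised, since the mass being transported is carried by mu and nu rather than by the signals themselves.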

With collaborators I demonstrated the improved performance of TLp over distances such as Lp, optimal transport and dynamic time warping [8], with the detailed mathematical framework in a forthcoming publication.

My collaborators on this project are Soheil Kolouri (CMU), Serim Park (CMU), Gustavo Rohde (CMU) and Dejan Slepcev (CMU).