Digital Mathematical Notebook
Optimal Transport and Wasserstein distance
Optimal transport compares probability distributions by asking how much work is needed to move one pile of mass into another. The resulting Wasserstein distances keep track of geometry, not only of pointwise overlap, which is why they appear in probability, PDEs, imaging, machine learning, shape analysis, and topological data analysis.
Why this matters
Compare measures with finite moments in a way that sees spatial displacement.
Wasserstein space carries geodesics, barycenters, and a rich metric structure.
Discrete OT becomes linear programming, and Sinkhorn makes large problems practical.
Matching persistence diagrams is itself an optimal transport problem with the diagonal.
Move mass, not only values
Optimal transport starts from the earth mover picture: one distribution is a pile of mass, the other is a demand, and the cost depends on how far mass must travel.
KeyIdeaBox
Support geometry matters
Two measures can have almost no pointwise overlap and still be close if one is just a translated copy of the other. Transport sees that geometry immediately, whereas divergences based only on density ratios can become blind or singular.
ExampleBox
Why pointwise comparison can mislead
If a density bump is shifted slightly to the right, the transport cost is small because every particle moves only a little. A pointwise discrepancy can still be large because the two bumps may barely overlap.
RemarkBox
Distances versus divergences
Kullback-Leibler divergence, total variation, maximum mean discrepancy, and Wasserstein distance answer different questions. Wasserstein is the one built from moving mass in a metric space.
Monge’s map-based problem
Monge asks for a transport map: each source point chooses a single destination, and the total cost of that rearrangement is minimized.
DefinitionBox
Monge problem
Given measures \(\mu\) and \(\nu\) on spaces \(X\) and \(Y\), and a cost \(c:X\times Y\to [0,\infty)\), minimize the total cost over measurable maps \(T:X\to Y\) that push \(\mu\) forward to \(\nu\):
$$ \inf_{T_\#\mu=\nu} \int_X c(x,T(x))\,d\mu(x). $$
RemarkBox
Pushforward notation
The condition \(T_\#\mu=\nu\) means that transporting \(\mu\) through the map \(T\) produces the target measure \(\nu\). Every unit of source mass chooses exactly one destination.
ExampleBox
Why Monge can fail
If one source atom of mass \(1\) must become two target atoms of mass \(1/2\) and \(1/2\), no map can split that atom. The problem has the right intuition, but the admissible set can be too rigid.
Kantorovich’s transport plans
Kantorovich replaces maps by couplings. Instead of sending each \(x\) to one \(y\), a plan distributes the mass sitting at \(x\) across multiple destinations.
DefinitionBox
Couplings and transport plans
A coupling of \(\mu\) and \(\nu\) is a probability measure \(\gamma\) on \(X\times Y\) whose marginals are \(\mu\) and \(\nu\). The set of all such couplings is written \(\Pi(\mu,\nu)\).
FormulaBlock
Kantorovich problem
$$ \inf_{\gamma\in\Pi(\mu,\nu)} \int_{X\times Y} c(x,y)\,d\gamma(x,y). $$
This is the primal optimal transport problem. Monge’s problem sits inside it whenever a map \(T\) exists and the induced plan \((\mathrm{id},T)_\#\mu\) is admissible.
KeyIdeaBox
Why the relaxation is so powerful
The feasible set becomes convex, existence becomes much better behaved, and discrete OT turns into a finite-dimensional optimization problem over a transport matrix.
ExampleBox
Discrete matrix view
For source weights \(a_i\), target weights \(b_j\), and support points \(x_i,y_j\), a transport plan is a nonnegative matrix \(\pi=(\pi_{ij})\) with row sums \(a\) and column sums \(b\). Each entry \(\pi_{ij}\) is the amount of mass moved from \(x_i\) to \(y_j\).
Wasserstein distances \(W_p\)
Once the cost is chosen to be a power of the ground metric, optimal transport induces a genuine metric on measures with finite moments.
DefinitionBox
Finite \(p\)-moment measures and \(W_p\)
On a metric space \((X,d)\), let \(\mathcal{P}_p(X)\) be the set of Borel probability measures with finite \(p\)-th moment. For \(p\ge 1\),
$$ W_p(\mu,\nu)=\left(\inf_{\gamma\in\Pi(\mu,\nu)} \int_{X\times X} d(x,y)^p\,d\gamma(x,y)\right)^{1/p}. $$
RemarkBox
\(W_1\)
Linear cost is sensitive to total travel length. It is the Earth Mover’s Distance and has a particularly transparent dual formulation.
RemarkBox
\(W_2\)
Quadratic cost emphasizes long moves and leads to a rich Riemannian-style geometry on the space of measures.
TheoremBox
Metric property
For every \(p\ge 1\), \(W_p\) is a metric on \(\mathcal{P}_p(X)\). The finite-moment assumption matters: without it, the transport cost may be infinite.
Discrete optimal transport
In the discrete setting, optimal transport becomes a constrained optimization problem over a nonnegative matrix. This is the computational backbone of many OT applications.
FormulaBlock
Linear programming form
$$ \min_{\pi\ge 0}\ \sum_{i,j} c_{ij}\pi_{ij}\qquad\text{s.t.}\qquad \pi\mathbf{1}=a,\ \pi^\top\mathbf{1}=b. $$
RemarkBox
Geometry meets optimization
The vector \(a\) stores source masses, \(b\) stores target masses, and the matrix \(C=(c_{ij})\) stores the cost of moving between support points. The feasible set is the transportation polytope.
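The linear program above can be solved directly with off-the-shelf tools. The sketch below, assuming NumPy and SciPy are available, builds the marginal constraints for a tiny made-up problem on the line and hands them to `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Toy problem: source weights a at points x, target weights b at points y.
a = np.array([0.5, 0.5])
b = np.array([0.25, 0.25, 0.5])
x = np.array([0.0, 1.0])             # source support points
y = np.array([0.0, 0.5, 1.0])        # target support points
C = np.abs(x[:, None] - y[None, :])  # ground cost c_ij = |x_i - y_j|

m, n = C.shape
# Equality constraints on the flattened plan: row sums equal a, column sums equal b.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0  # row-sum constraint for source i
for j in range(n):
    A_eq[m + j, j::n] = 1.0           # column-sum constraint for target j
b_eq = np.concatenate([a, b])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pi = res.x.reshape(m, n)              # optimal transport plan
print("optimal cost:", res.fun)
```

One of the \(m+n\) equality constraints is redundant because both marginals sum to one; the HiGHS solver behind `linprog` copes with that without special handling.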
Why transport is especially clean in one dimension
In one dimension, order simplifies everything. Optimal couplings are monotone, and Wasserstein distances can be expressed directly through quantiles.
TheoremBox
Quantile formula in 1D
If \(\mu,\nu\in\mathcal{P}_p(\mathbb{R})\) and \(F_\mu^{-1},F_\nu^{-1}\) are generalized inverse CDFs, then
$$ W_p^p(\mu,\nu)=\int_0^1 \lvert F_\mu^{-1}(s)-F_\nu^{-1}(s)\rvert^p\,ds. $$
KeyIdeaBox
Sort and match
In one dimension, crossing transport paths are wasteful. The optimal plan therefore preserves order: lower quantiles match lower quantiles, upper quantiles match upper quantiles.
RemarkBox
Special case for \(W_1\)
For \(W_1\), the same distance can also be written as the \(L^1\) gap between CDFs: \(W_1(\mu,\nu)=\int_{\mathbb{R}} |F_\mu(x)-F_\nu(x)|\,dx\).
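For equal-size empirical measures the quantile formula reduces to sorting both samples and matching them in order. A quick numerical check against `scipy.stats.wasserstein_distance`, which computes \(W_1\) through the CDF form, with made-up Gaussian samples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, size=1000)  # empirical sample from mu
v = rng.normal(2.0, 1.0, size=1000)  # empirical sample from nu

# Quantile formula for p = 1 and equal-size samples: sort and match in order.
w1_quantile = np.mean(np.abs(np.sort(u) - np.sort(v)))

# SciPy computes the same distance through the L^1 gap between CDFs.
w1_cdf = wasserstein_distance(u, v)
print(w1_quantile, w1_cdf)
```

Both routes give the same number; for these two Gaussians the population value is the shift, \(2\).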
Primal and dual viewpoints
The primal problem chooses how mass moves. The dual problem chooses potentials that certify how expensive any movement must be.
FormulaBlock
Kantorovich duality
$$ \sup_{\varphi(x)+\psi(y)\le c(x,y)} \left(\int \varphi\,d\mu + \int \psi\,d\nu\right). $$
Feasible potentials \(\varphi,\psi\) never overestimate the cost of moving \(x\) to \(y\). The optimum recovers the same value as the primal transport problem.
TheoremBox
Kantorovich-Rubinstein duality for \(W_1\)
On a metric space, $$ W_1(\mu,\nu)=\sup_{\operatorname{Lip}(f)\le 1}\left(\int f\,d\mu-\int f\,d\nu\right). $$ A 1-Lipschitz test function acts as a critic that cannot oscillate faster than the ground metric allows.
KeyIdeaBox
Prices rather than routes
Instead of tracking every path of every particle, the dual formulation assigns a scalar score to locations. If the score difference never exceeds transport cost, it produces a lower bound on the primal problem, and at optimality that bound is exact.
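The lower-bound property is easy to see numerically: any 1-Lipschitz function, such as \(f(x)=x\), scores the two measures and can never beat the true \(W_1\). A minimal sketch on made-up empirical samples:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(0.0, 1.0, size=500)  # sample from mu
v = rng.normal(1.5, 0.5, size=500)  # sample from nu

# Exact W_1 between equal-size empirical measures: sort and match in order.
w1 = np.mean(np.abs(np.sort(u) - np.sort(v)))

# The 1-Lipschitz critic f(x) = x gives a dual lower bound via mean difference.
dual_bound = abs(np.mean(u) - np.mean(v))
print(dual_bound, "<=", w1)
```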
ProofToggle • Why \(W_1\) becomes a Lipschitz supremum
Start from the dual constraint \(\varphi(x)+\psi(y)\le d(x,y)\). For the metric cost \(c(x,y)=d(x,y)\), the \(c\)-transform lets one restrict to potential pairs of the form \((\varphi,-\varphi)\). The admissibility condition then reads \(\varphi(x)-\varphi(y)\le d(x,y)\) for all \(x,y\), which is exactly the 1-Lipschitz property.
\(W_2\), optimal maps, and displacement interpolation
The quadratic cost \(c(x,y)=\|x-y\|^2\) produces particularly rigid structure. In the right setting, optimal transport is described by a map rather than by a general plan.
TheoremBox
Brenier’s theorem
If \(\mu\) is absolutely continuous with respect to Lebesgue measure on \(\mathbb{R}^d\) and \(\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)\), then for quadratic cost there exists an optimal map \(T\), unique \(\mu\)-almost everywhere, of the form \(T=\nabla \phi\) for a convex potential \(\phi\).
FormulaBlock
Displacement interpolation
$$ \mu_t = ((1-t)\mathrm{id}+tT)_\#\mu,\qquad t\in[0,1]. $$
The path \((\mu_t)_{t\in[0,1]}\) is the canonical geodesic between \(\mu\) and \(\nu\) in Wasserstein space.
RemarkBox
Assumptions are essential
The conclusion above depends on working in \(\mathbb{R}^d\), with quadratic cost and enough regularity on \(\mu\). Without those assumptions, optimal plans need not come from a single map.
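In one dimension the Brenier map is the monotone rearrangement, so displacement interpolation can be sketched with nothing more than sorted samples; the Gaussian endpoints below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.sort(rng.normal(-2.0, 0.5, size=1000))  # order statistics of mu
nu = np.sort(rng.normal(3.0, 1.0, size=1000))   # order statistics of nu

# Monotone matching: the i-th order statistic of mu is sent to the
# i-th order statistic of nu, so mu_t moves each matched particle linearly.
def displacement(t):
    return (1 - t) * mu + t * nu

for t in (0.0, 0.5, 1.0):
    print(t, displacement(t).mean(), displacement(t).std())
```

At \(t=0.5\) the interpolant sits halfway along the transport paths, not halfway in density, which is exactly the earth mover picture.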
The Benamou-Brenier point of view
Instead of matching only the endpoints, one can describe transport as a time-dependent density \(\rho_t\) moved by a velocity field \(v_t\).
FormulaBlock
Dynamic formulation of \(W_2\)
$$ W_2^2(\mu_0,\mu_1)=\inf_{\rho_t,v_t}\int_0^1\int_{\mathbb{R}^d}\|v_t(x)\|^2\,\rho_t(x)\,dx\,dt $$
$$ \text{subject to }\partial_t\rho_t+\nabla\cdot(\rho_t v_t)=0,\qquad \rho_{t=0}=\mu_0,\ \rho_{t=1}=\mu_1. $$
KeyIdeaBox
Kinetic energy of a fluid
The integrand \(\|v_t\|^2\rho_t\) is kinetic energy density. Transport is therefore a minimum action problem: find the least energetic way to flow one density into another.
RemarkBox
Eulerian versus Lagrangian views
Brenier’s map viewpoint follows particles. Benamou-Brenier follows fields and densities. The two descriptions encode the same geometry from different angles.
Entropic regularization and Sinkhorn scaling
Exact OT can be expensive at large scale. Entropic regularization smooths the problem and turns it into matrix scaling.
FormulaBlock
Entropically regularized OT
$$ \min_{\pi\in\Pi(a,b)} \langle C,\pi\rangle + \varepsilon\,\mathrm{KL}(\pi\,\|\,a\otimes b). $$
The parameter \(\varepsilon>0\) rewards spread-out plans and removes the sharp corners of the exact transportation polytope.
TheoremBox
Sinkhorn form
The optimizer has the form $$ \pi_\varepsilon = \operatorname{diag}(u)\,K\,\operatorname{diag}(v),\qquad K_{ij}=e^{-C_{ij}/\varepsilon}. $$ Sinkhorn alternates row and column rescalings until the prescribed marginals are matched.
RemarkBox
Accuracy versus smoothness
As \(\varepsilon\to 0\), the plan approaches the exact optimal plan, but the kernel entries \(e^{-C_{ij}/\varepsilon}\) underflow and the iterations converge more slowly. Larger \(\varepsilon\) gives smoother, denser couplings and faster, more stable computation at the price of bias.
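The alternating rescalings fit in a few lines of NumPy. The sketch below is a minimal dense-matrix version with a toy grid and marginals; log-domain stabilization would be needed for very small \(\varepsilon\):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, iters=500):
    """Entropic OT plan pi = diag(u) K diag(v) via alternating scalings."""
    K = np.exp(-C / eps)    # Gibbs kernel K_ij = exp(-C_ij / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)   # rescale columns to match target marginal b
        u = a / (K @ v)     # rescale rows to match source marginal a
    return u[:, None] * K * v[None, :]

# Toy problem: uniform weights on a regular grid, quadratic ground cost.
x = np.linspace(0.0, 1.0, 5)
a = np.full(5, 0.2)
b = np.full(5, 0.2)
C = (x[:, None] - x[None, :]) ** 2

pi = sinkhorn(a, b, C)
print("regularized transport cost:", (C * pi).sum())
```

After the final row rescaling the row marginals are matched exactly and the column marginals up to the iteration tolerance.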
Wasserstein barycenters
Ordinary Euclidean averaging ignores transport geometry. Wasserstein barycenters define an intrinsic average of measures by minimizing weighted transport cost.
DefinitionBox
Barycenter problem
Given measures \(\mu_1,\dots,\mu_N\) and weights \(\lambda_i\ge 0\) with \(\sum_i \lambda_i=1\), a \(W_p\)-barycenter solves
$$ \operatorname*{argmin}_{\mu}\ \sum_{i=1}^N \lambda_i W_p^p(\mu,\mu_i). $$
ExampleBox
Average in the right geometry
If one measure is a translated copy of another, the barycenter lies halfway along the transport paths rather than halfway in pointwise intensity.
RemarkBox
1D is again explicit
In one dimension and for quadratic cost, averaging quantile functions gives the barycenter exactly. In higher dimensions barycenters are already substantial computational objects.
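The 1D recipe is short enough to show directly: average the quantile functions, which for equal-size samples means averaging sorted arrays. The three Gaussian inputs and their weights below are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
samples = [rng.normal(m, 1.0, size=1000) for m in (-3.0, 0.0, 3.0)]
weights = (0.25, 0.5, 0.25)

# 1D W_2 barycenter: the barycenter's quantile function is the weighted
# average of the input quantile functions, i.e. average the sorted samples.
bary = sum(w * np.sort(s) for w, s in zip(weights, samples))
print("barycenter mean:", bary.mean())
```

Because sorting does not change the mean, the barycenter mean is exactly the weighted average of the input means.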
Advanced extensions
Classical OT is not the end of the story. Several variants change the geometry or the constraints to solve different comparison problems.
DefinitionBox
Sliced Wasserstein
Project high-dimensional measures onto many one-dimensional directions, solve cheap 1D OT problems, and average the result. This keeps transport flavor while improving scalability.
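A Monte Carlo sketch of sliced \(W_2\) between equal-size point clouds, assuming NumPy; the sample sizes, dimension, and number of projections are arbitrary illustrative choices:

```python
import numpy as np

def sliced_w2(X, Y, n_proj=50, seed=None):
    """Monte Carlo sliced W_2 between equal-size point clouds of shape (n, d)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)  # random direction on the unit sphere
        u = np.sort(X @ theta)          # each projected problem is 1D,
        v = np.sort(Y @ theta)          # so sorting solves it exactly
        total += np.mean((u - v) ** 2)
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
Y = X + np.array([2.0, 0.0, 0.0])       # translated copy of X
print(sliced_w2(X, X, seed=0), sliced_w2(X, Y, seed=0))
```

The self-distance is zero, and for a pure translation every projected distance is the projected shift, so the result concentrates near the shift's average projection length.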
DefinitionBox
Unbalanced OT
Relax exact marginal constraints and penalize mass creation or destruction. This is useful when source and target have different total mass or when exact conservation is not meaningful.
DefinitionBox
Gromov-Wasserstein
Compare metric measure spaces without a fixed pointwise correspondence. The objective matches pairwise distance structures rather than points directly.
Applications
The right mental model is not “OT is everywhere,” but rather “OT becomes useful precisely when geometry and distributional mismatch both matter.”
ExampleBox
Image and color transfer
Pixel or color distributions can be moved from one image to another, producing transformations that respect global distributional structure rather than only local intensity differences.
ExampleBox
Domain adaptation
Source and target datasets can be aligned through transport plans, making distribution shift explicit and often interpretable.
ExampleBox
Generative modeling
Wasserstein objectives and their approximations are used to compare model and data distributions in a way that reflects geometry of samples.
ExampleBox
Shape and point-cloud analysis
OT compares empirical measures on geometric data, supports barycenters, and interacts naturally with shape descriptors and registration pipelines.
Optimal transport and persistence diagrams
Persistence diagrams are compared by matching corner points to one another or to the diagonal. That is an optimal transport problem in which the diagonal acts as a reservoir for creating and deleting points.
RemarkBox
Bottleneck as \(W_\infty\)
The bottleneck distance is the \(p=\infty\) analogue: instead of minimizing total transport effort, it controls the worst matched displacement.
Takeaways and further reading
Optimal transport is powerful because it treats measures as geometric objects. Its theory, computation, and applications all follow from that one decision.
KeyIdeaBox
Summary
OT asks for the least costly way to move mass. Wasserstein distances turn that transport cost into a metric. In one dimension the geometry becomes explicit, in \(W_2\) it becomes deeply geometric, and computationally it becomes practical through regularization and Sinkhorn scaling.
FurtherReadingBox
Where to go next
If this page felt natural, the next useful step is to study one theory text and one computational text in parallel: it keeps the geometry and the algorithms connected.
Cédric Villani, Topics in Optimal Transportation. A classical entry point with strong geometric motivation.
Filippo Santambrogio, Optimal Transport for Applied Mathematicians. Clear on Monge, Kantorovich, duality, and the Benamou-Brenier viewpoint.
Gabriel Peyré and Marco Cuturi, Computational Optimal Transport. The practical reference for Sinkhorn, barycenters, and modern numerical methods.
Look next at domain adaptation, color transfer, and persistence-diagram metrics to see how OT changes when the data type changes but the transport idea stays the same.