Digital Mathematical Notebook

Optimal Transport and Wasserstein distance

Optimal transport compares probability distributions by asking how much work is needed to move one pile of mass into another. The resulting Wasserstein distances keep track of geometry, not only of pointwise overlap, which is why they appear in probability, PDEs, imaging, machine learning, shape analysis, and topological data analysis.

The page moves from earth-mover intuition to couplings, Wasserstein geometry, duality, regularization, barycenters, and the persistence-diagram viewpoint.

FigureWithCaption

Transport plans distribute mass from source locations to target locations while accounting for the geometry of the move itself.

Why this matters

Probability

Compare measures with finite moments in a way that sees spatial displacement.

Geometry

Wasserstein space carries geodesics, barycenters, and a rich metric structure.

Computation

Discrete OT becomes linear programming, and Sinkhorn makes large problems practical.

TDA

Matching persistence diagrams is itself an optimal transport problem with the diagonal.

Section 1 • Intuition first

Move mass, not only values

Optimal transport starts from the earth mover picture: one distribution is a pile of mass, the other is a demand, and the cost depends on how far mass must travel.

KeyIdeaBox

Support geometry matters

Two measures can have almost no pointwise overlap and still be close if one is just a translated copy of the other. Transport sees that geometry immediately, whereas divergences based only on density ratios can become blind or singular.

ExampleBox

Why pointwise comparison can mislead

If a density bump is shifted slightly to the right, the transport cost is small because every particle moves only a little. A pointwise discrepancy can still be large because the two bumps may barely overlap.

RemarkBox

Distances versus divergences

Kullback-Leibler divergence, total variation, maximum mean discrepancy, and Wasserstein distance answer different questions. Wasserstein is the one built from moving mass in a metric space.

FigureWithCaption

Wasserstein distance measures how far mass must travel, so a translation produces a cost proportional to displacement rather than to pointwise overlap.

Section 2 • The original formulation

Monge’s map-based problem

Monge asks for a transport map: each source point chooses a single destination, and the total cost of that rearrangement is minimized.

DefinitionBox

Monge problem

Given measures \(\mu\) and \(\nu\) on spaces \(X\) and \(Y\), and a cost \(c:X\times Y\to [0,\infty)\), find a measurable map \(T:X\to Y\) with \(T_\#\mu=\nu\) that attains

$$ \inf_{T_\#\mu=\nu} \int_X c(x,T(x))\,d\mu(x). $$

RemarkBox

Pushforward notation

The condition \(T_\#\mu=\nu\) means that transporting \(\mu\) through the map \(T\) produces the target measure \(\nu\). Every unit of source mass chooses exactly one destination.

ExampleBox

Why Monge can fail

If one source atom of mass \(1\) must become two target atoms of mass \(1/2\) and \(1/2\), no map can split that atom. The problem has the right intuition, but the admissible set can be too rigid.

FigureWithCaption

The map viewpoint is elegant when optimal maps exist, but transport plans are needed as soon as splitting becomes unavoidable.

Section 3 • Relax from maps to plans

Kantorovich’s transport plans

Kantorovich replaces maps by couplings. Instead of sending each \(x\) to one \(y\), a plan distributes the mass sitting at \(x\) across multiple destinations.

DefinitionBox

Couplings and transport plans

A coupling of \(\mu\) and \(\nu\) is a probability measure \(\gamma\) on \(X\times Y\) whose marginals are \(\mu\) and \(\nu\). The set of all such couplings is written \(\Pi(\mu,\nu)\).

FormulaBlock

Kantorovich problem

$$ \inf_{\gamma\in\Pi(\mu,\nu)} \int_{X\times Y} c(x,y)\,d\gamma(x,y). $$

This is the primal optimal transport problem. Monge’s problem sits inside it whenever a map \(T\) exists and the induced plan \((\mathrm{id},T)_\#\mu\) is admissible.

KeyIdeaBox

Why the relaxation is so powerful

The feasible set becomes convex, existence becomes much better behaved, and discrete OT turns into a finite-dimensional optimization problem over a transport matrix.

ExampleBox

Discrete matrix view

For source weights \(a_i\), target weights \(b_j\), and support points \(x_i,y_j\), a transport plan is a nonnegative matrix \(\pi=(\pi_{ij})\) with row sums \(a\) and column sums \(b\). Each entry \(\pi_{ij}\) is the amount of mass moved from \(x_i\) to \(y_j\).
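The marginal constraints are easy to check in code. A minimal numpy sketch (the weights are made up for illustration): the independent coupling \(a\otimes b\) is always a feasible, though usually far from optimal, transport plan.

```python
import numpy as np

# Source and target weights for a small discrete problem (illustrative values).
a = np.array([0.5, 0.3, 0.2])   # row sums: mass at each source point
b = np.array([0.4, 0.4, 0.2])   # column sums: mass at each target point

# The independent coupling a ⊗ b is always feasible:
# pi_ij = a_i * b_j has the required marginals by construction.
pi = np.outer(a, b)

print(np.allclose(pi.sum(axis=1), a))  # row sums recover a
print(np.allclose(pi.sum(axis=0), b))  # column sums recover b
```

Any nonnegative matrix with these row and column sums is a candidate plan; optimization then searches this transportation polytope for the cheapest one.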

Section 4 • From transport cost to a metric

Wasserstein distances \(W_p\)

Once the cost is chosen to be a power of the ground metric, optimal transport induces a genuine metric on measures with finite moments.

DefinitionBox

Finite \(p\)-moment measures and \(W_p\)

On a metric space \((X,d)\), let \(\mathcal{P}_p(X)\) be the set of Borel probability measures with finite \(p\)-th moment. For \(p\ge 1\),

$$ W_p(\mu,\nu)=\left(\inf_{\gamma\in\Pi(\mu,\nu)} \int_{X\times X} d(x,y)^p\,d\gamma(x,y)\right)^{1/p}. $$

RemarkBox

\(W_1\)

Linear cost is sensitive to total travel length. It is the Earth Mover’s Distance and has a particularly transparent dual formulation.

RemarkBox

\(W_2\)

Quadratic cost emphasizes long moves and leads to a rich Riemannian-style geometry on the space of measures.

TheoremBox

Metric property

For every \(p\ge 1\), \(W_p\) is a metric on \(\mathcal{P}_p(X)\). The finite-moment assumption matters: without it, the transport cost may be infinite.

Section 5 • The finite-dimensional practical model

Discrete optimal transport

In the discrete setting, optimal transport becomes a constrained optimization problem over a nonnegative matrix. This is the computational backbone of many OT applications.

FormulaBlock

Linear programming form

$$ \min_{\pi\ge 0}\ \sum_{i,j} c_{ij}\pi_{ij}\qquad\text{s.t.}\qquad \pi\mathbf{1}=a,\ \pi^\top\mathbf{1}=b. $$

RemarkBox

Geometry meets optimization

The vector \(a\) stores source masses, \(b\) stores target masses, and the matrix \(C=(c_{ij})\) stores the cost of moving between support points. The feasible set is the transportation polytope.
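For small instances, the linear program above can be solved directly with a generic LP solver. A sketch using SciPy's `linprog` (assuming SciPy is available; the support points, weights, and cost are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Small discrete OT instance: 3 source and 3 target points on the line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5, 2.5])
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.4, 0.4, 0.2])
C = np.abs(x[:, None] - y[None, :])   # ground cost |x_i - y_j|

n, m = C.shape
A_eq = []
for i in range(n):                    # row-sum constraints: pi 1 = a
    row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0; A_eq.append(row)
for j in range(m):                    # column-sum constraints: pi^T 1 = b
    col = np.zeros(n * m); col[j::m] = 1.0; A_eq.append(col)

res = linprog(C.ravel(), A_eq=np.array(A_eq),
              b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
pi = res.x.reshape(n, m)
print(res.fun)  # optimal transport cost, here 0.6
```

One of the equality constraints is redundant (the masses both sum to one), which the HiGHS solver handles without trouble.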

InteractiveWidgetShell

Discrete coupling demo

Source masses

Target masses

Sliders change the marginal constraints; the solver recomputes the exact optimal plan for this small \(3\times 3\) problem.

Section 6 • The one-dimensional miracle

Why transport is especially clean in one dimension

In one dimension, order simplifies everything. Optimal couplings are monotone, and Wasserstein distances can be expressed directly through quantiles.

TheoremBox

Quantile formula in 1D

If \(\mu,\nu\in\mathcal{P}_p(\mathbb{R})\) and \(F_\mu^{-1},F_\nu^{-1}\) are generalized inverse CDFs, then

$$ W_p^p(\mu,\nu)=\int_0^1 \lvert F_\mu^{-1}(s)-F_\nu^{-1}(s)\rvert^p\,ds. $$

KeyIdeaBox

Sort and match

In one dimension, crossing transport paths are wasteful. The optimal plan therefore preserves order: lower quantiles match lower quantiles, upper quantiles match upper quantiles.

RemarkBox

Special case for \(W_1\)

For \(W_1\), the same distance can also be written as the \(L^1\) gap between CDFs: \(W_1(\mu,\nu)=\int_{\mathbb{R}} |F_\mu(x)-F_\nu(x)|\,dx\).
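For equal-weight empirical measures, the quantile formula reduces to sorting both samples and matching in order. A minimal sketch (the helper name `wasserstein_p` is ours, not a library function):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
mu = rng.normal(0.0, 1.0, n)
nu = mu + 2.0   # exact translation by 2

def wasserstein_p(u, v, p):
    """W_p between two equal-size empirical measures via sort-and-match."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)) ** p) ** (1.0 / p)

# A pure translation moves every particle the same distance,
# so W_p equals the shift for every p.
print(wasserstein_p(mu, nu, 1))  # 2.0
print(wasserstein_p(mu, nu, 2))  # 2.0
```

This is also a quick sanity check of the translation invariance discussed in Section 1: the cost is the displacement, independent of how little the supports overlap.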

InteractiveWidgetShell

1D transport demo

Section 7 • Another way to look at the same problem

Primal and dual viewpoints

The primal problem chooses how mass moves. The dual problem chooses potentials that certify how expensive any movement must be.

FormulaBlock

Kantorovich duality

$$ \sup_{\varphi(x)+\psi(y)\le c(x,y)} \left(\int \varphi\,d\mu + \int \psi\,d\nu\right). $$

Feasible potentials \(\varphi,\psi\) never overestimate the cost of moving \(x\) to \(y\). The optimum recovers the same value as the primal transport problem.

TheoremBox

Kantorovich-Rubinstein duality for \(W_1\)

On a metric space, $$ W_1(\mu,\nu)=\sup_{\operatorname{Lip}(f)\le 1}\left(\int f\,d\mu-\int f\,d\nu\right). $$ A 1-Lipschitz test function acts as a critic that cannot oscillate faster than the ground metric allows.

KeyIdeaBox

Prices rather than routes

Instead of tracking every path of every particle, the dual formulation assigns a scalar score to locations. If the score difference never exceeds transport cost, it produces a lower bound on the primal problem, and at optimality that bound is exact.
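The lower-bound property is easy to witness numerically. A sketch: in 1D we can compute \(W_1\) exactly by sort-and-match, and any single 1-Lipschitz test function, here \(f(x)=x\), certifies a lower bound (the samples are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
mu = rng.normal(0.0, 1.0, 500)
nu = rng.normal(1.5, 1.0, 500)

# Primal value via the 1D sort-and-match formula for W1.
w1 = np.mean(np.abs(np.sort(mu) - np.sort(nu)))

# Dual feasible certificate: f(x) = x and f(x) = -x are both 1-Lipschitz,
# so the absolute gap of the means is a lower bound on W1.
lower_bound = abs(mu.mean() - nu.mean())
print(lower_bound <= w1 + 1e-12)  # True
```

At optimality the dual supremum closes this gap entirely; a single affine critic usually does not.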

ProofToggle • Why \(W_1\) becomes a Lipschitz supremum

Start from the dual constraint \(\varphi(x)+\psi(y)\le d(x,y)\). For the cost \(c(x,y)=d(x,y)\), one can optimize over a single potential by taking \(\psi=-\varphi\) after the \(c\)-transform. The admissibility condition becomes \(\varphi(x)-\varphi(y)\le d(x,y)\), which is exactly the 1-Lipschitz property.

Section 8 • Quadratic cost is special

\(W_2\), optimal maps, and displacement interpolation

The quadratic cost \(c(x,y)=\|x-y\|^2\) produces particularly rigid structure. In the right setting, optimal transport is described by a map rather than by a general plan.

TheoremBox

Brenier’s theorem

If \(\mu\) is absolutely continuous with respect to Lebesgue measure on \(\mathbb{R}^d\) and \(\mu,\nu\in\mathcal{P}_2(\mathbb{R}^d)\), then for quadratic cost there exists an optimal map \(T\), unique \(\mu\)-almost everywhere, of the form \(T=\nabla \phi\) for a convex potential \(\phi\).

FormulaBlock

Displacement interpolation

$$ \mu_t = ((1-t)\mathrm{id}+tT)_\#\mu,\qquad t\in[0,1]. $$

The path \((\mu_t)_{t\in[0,1]}\) is the canonical geodesic between \(\mu\) and \(\nu\) in Wasserstein space.

RemarkBox

Assumptions are essential

The conclusion above depends on working in \(\mathbb{R}^d\), with quadratic cost and enough regularity on \(\mu\). Without those assumptions, optimal plans need not come from a single map.
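In 1D the optimal map for quadratic cost is the monotone (sorted) matching, so displacement interpolation can be sketched in a few lines (the helper name `displacement_interp` is ours, and the point sets are made up for illustration):

```python
import numpy as np

# Two equal-size empirical measures on the line; the sorted matching
# is the W2-optimal map in 1D.
mu = np.array([0.0, 1.0, 2.0, 3.0])
nu = np.array([10.0, 11.0, 12.0, 13.0])

def displacement_interp(mu_pts, nu_pts, t):
    """Support of mu_t = ((1-t) id + t T)_# mu for the sorted matching T."""
    src, dst = np.sort(mu_pts), np.sort(nu_pts)
    return (1 - t) * src + t * dst

print(displacement_interp(mu, nu, 0.5))  # [5. 6. 7. 8.]
```

At \(t=1/2\) each particle sits halfway along its own transport path, in contrast to the mixture \(\tfrac12\mu+\tfrac12\nu\), which would keep both clusters in place with half the mass each.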

Section 9 • Transport as a flow of densities

The Benamou-Brenier point of view

Instead of matching only the endpoints, one can describe transport as a time-dependent density \(\rho_t\) moved by a velocity field \(v_t\).

FormulaBlock

Dynamic formulation of \(W_2\)

$$ W_2^2(\mu_0,\mu_1)=\inf_{\rho_t,v_t}\int_0^1\int_{\mathbb{R}^d}\|v_t(x)\|^2\,\rho_t(x)\,dx\,dt $$

$$ \text{subject to }\partial_t\rho_t+\nabla\cdot(\rho_t v_t)=0,\qquad \rho_{t=0}=\mu_0,\ \rho_{t=1}=\mu_1. $$

KeyIdeaBox

Kinetic energy of a fluid

The integrand \(\|v_t\|^2\rho_t\) is kinetic energy density. Transport is therefore a minimum action problem: find the least energetic way to flow one density into another.

RemarkBox

Eulerian versus Lagrangian views

Brenier’s map viewpoint follows particles. Benamou-Brenier follows fields and densities. The two descriptions encode the same geometry from different angles.

InteractiveWidgetShell

Displacement interpolation / density flow


Section 10 • Computational OT

Entropic regularization and Sinkhorn scaling

Exact OT can be expensive at large scale. Entropic regularization smooths the problem and turns it into matrix scaling.

FormulaBlock

Entropically regularized OT

$$ \min_{\pi\in\Pi(a,b)} \langle C,\pi\rangle + \varepsilon\,\mathrm{KL}(\pi\,\|\,a\otimes b). $$

The parameter \(\varepsilon>0\) rewards spread-out plans and removes the sharp corners of the exact transportation polytope.

TheoremBox

Sinkhorn form

The optimizer has the form $$ \pi_\varepsilon = \operatorname{diag}(u)\,K\,\operatorname{diag}(v),\qquad K_{ij}=e^{-C_{ij}/\varepsilon}. $$ Sinkhorn alternates row and column rescalings until the prescribed marginals are matched.

RemarkBox

Accuracy versus smoothness

As \(\varepsilon\to 0\), the regularized plan converges to an exact optimal plan, but the kernel entries \(e^{-C_{ij}/\varepsilon}\) underflow and the iteration slows, so log-domain implementations become necessary. Larger \(\varepsilon\) gives smoother, denser couplings and faster, more stable computation at the price of bias.
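The scaling iteration fits in a few lines of numpy. A minimal sketch, not a production implementation (for small \(\varepsilon\) one would work in the log domain; the instance is made up for illustration):

```python
import numpy as np

def sinkhorn(a, b, C, eps, iters=500):
    """Entropic OT plan diag(u) K diag(v) with K = exp(-C/eps)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)   # rescale columns to match b
        u = a / (K @ v)     # rescale rows to match a
    v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
pi = sinkhorn(a, b, C, eps=0.1)
print(np.allclose(pi.sum(axis=1), a), np.allclose(pi.sum(axis=0), b))
```

Note that the entropic plan is strictly positive everywhere: regularization spreads a little mass onto every route, which is exactly the smoothing-versus-bias trade-off described above.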

InteractiveWidgetShell

Regularized OT / Sinkhorn demo


Section 11 • Averaging in transport geometry

Wasserstein barycenters

Ordinary Euclidean averaging ignores transport geometry. Wasserstein barycenters define an intrinsic average of measures by minimizing weighted transport cost.

DefinitionBox

Barycenter problem

Given measures \(\mu_1,\dots,\mu_N\) and weights \(\lambda_i\ge 0\) with \(\sum_i \lambda_i=1\), a \(W_p\)-barycenter solves

$$ \operatorname*{argmin}_{\mu}\ \sum_{i=1}^N \lambda_i W_p^p(\mu,\mu_i). $$

ExampleBox

Average in the right geometry

If one measure is a translated copy of another, the barycenter lies halfway along the transport paths rather than halfway in pointwise intensity.

RemarkBox

1D is again explicit

In one dimension and for quadratic cost, averaging quantile functions gives the barycenter exactly. In higher dimensions barycenters are already substantial computational objects.
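The 1D quantile recipe can be sketched directly: for equal-weight empirical measures, averaging sorted samples position by position averages the quantile functions (the point sets are made up for illustration).

```python
import numpy as np

# Two equal-size empirical measures in 1D.
mu1 = np.array([0.0, 1.0, 2.0])
mu2 = np.array([4.0, 5.0, 6.0])

# The W2 barycenter's quantile function is the weighted average of the
# input quantile functions: average the sorted samples slot by slot.
bary = 0.5 * np.sort(mu1) + 0.5 * np.sort(mu2)
print(bary)  # [2. 3. 4.]
```

The result sits halfway along the transport paths. The Euclidean mixture \(\tfrac12\mu_1+\tfrac12\mu_2\) would instead keep both clusters where they are, with half the mass each.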

InteractiveWidgetShell

Barycenter demo


Section 12 • Variants built for different tasks

Advanced extensions

Classical OT is not the end of the story. Several variants change the geometry or the constraints to solve different comparison problems.

DefinitionBox

Sliced Wasserstein

Project high-dimensional measures onto many one-dimensional directions, solve cheap 1D OT problems, and average the result. This keeps transport flavor while improving scalability.
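A Monte Carlo sketch of this recipe (the helper name `sliced_w2` is ours; the point clouds are made up for illustration): project onto random directions, use the closed-form 1D solution on each, and average.

```python
import numpy as np

def sliced_w2(X, Y, n_proj=200, seed=0):
    """Monte Carlo sliced-W2 between equal-size point clouds in R^d."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)        # random direction on the sphere
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)      # closed-form 1D W2^2
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
Y = X + np.array([3.0, 0.0])                  # translated copy
sw = sliced_w2(X, Y)
print(sw)   # roughly 3/sqrt(2), since E[(theta . t)^2] = |t|^2 / 2 in 2D
```

Each 1D subproblem costs only a sort, which is the entire point: the method trades the full high-dimensional coupling for many cheap projections.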

DefinitionBox

Unbalanced OT

Relax exact marginal constraints and penalize mass creation or destruction. This is useful when source and target have different total mass or when exact conservation is not meaningful.

DefinitionBox

Gromov-Wasserstein

Compare metric measure spaces without a fixed pointwise correspondence. The objective matches pairwise distance structures rather than points directly.

Section 13 • Where OT actually shows up

Applications

The right mental model is not “OT is everywhere,” but rather “OT becomes useful precisely when geometry and distributional mismatch both matter.”

ExampleBox

Image and color transfer

Pixel or color distributions can be moved from one image to another, producing transformations that respect global distributional structure rather than only local intensity differences.

ExampleBox

Domain adaptation

Source and target datasets can be aligned through transport plans, making distribution shift explicit and often interpretable.

ExampleBox

Generative modeling

Wasserstein objectives and their approximations are used to compare model and data distributions in a way that reflects geometry of samples.

ExampleBox

Shape and point-cloud analysis

OT compares empirical measures on geometric data, supports barycenters, and interacts naturally with shape descriptors and registration pipelines.

Section 14 • Link back to persistence

Optimal transport and persistence diagrams

Persistence diagrams are compared by matching cornerpoints to one another or to the diagonal. That is an optimal transport problem with a special reservoir of deletions.

FigureWithCaption

The \(p\)-Wasserstein distance on persistence diagrams uses the same transport logic, with the diagonal representing the option to delete short-lived classes.

RemarkBox

Bottleneck as \(W_\infty\)

The bottleneck distance is the \(p=\infty\) analogue: instead of minimizing total transport effort, it controls the worst matched displacement.

Related page: Topological Data Analysis (persistence diagrams, matching to the diagonal, and stability)

Section 15 • Closing map

Takeaways and further reading

Optimal transport is powerful because it treats measures as geometric objects. Its theory, computation, and applications all follow from that one decision.

KeyIdeaBox

Summary

OT asks for the least costly way to move mass. Wasserstein distances turn that transport cost into a metric. In one dimension the geometry becomes explicit, in \(W_2\) it becomes deeply geometric, and computationally it becomes practical through regularization and Sinkhorn scaling.

FurtherReadingBox

Where to go next

If this page felt natural, the next useful step is to study one theory text and one computational text in parallel: it keeps the geometry and the algorithms connected.

Intuition / intro

Cédric Villani, Topics in Optimal Transportation. A classical entry point with strong geometric motivation.

Theory

Filippo Santambrogio, Optimal Transport for Applied Mathematicians. Clear on Monge, Kantorovich, duality, and the Benamou-Brenier viewpoint.

Computational OT

Gabriel Peyré and Marco Cuturi, Computational Optimal Transport. The practical reference for Sinkhorn, barycenters, and modern numerical methods.

Applications

Look next at domain adaptation, color transfer, and persistence-diagram metrics to see how OT changes when the data type changes but the transport idea stays the same.
