To qualify as a distance (i.e., a metric), a measure must satisfy the following properties:
- Non-negativity: $ d(P, Q) \geq 0 $, with equality if and only if $ P = Q $
- Symmetry: $ d(P, Q) = d(Q, P) $
- Triangle inequality: $ d(P, Q) + d(Q, R) \geq d(P, R) $
However, in practice we often work with weaker notions of distance, commonly referred to as divergences.
Example: KL Divergence
The Kullback-Leibler (KL) divergence is defined as: $$ D_{\text{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx $$ where $p$ and $q$ are the densities of $P$ and $Q$.
Properties of KL Divergence
- Not Symmetric: $$ D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P) $$
- Infinite for Mismatched Supports:
$$ D_{\text{KL}}(P \| Q) = \infty \quad \text{if } P \text{ assigns mass to a region where } Q \text{ has none.} $$
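Both properties are easy to check numerically. Below is a minimal sketch using NumPy and SciPy (`scipy.stats.entropy(p, q)` computes $D_{\text{KL}}(P \| Q)$ for discrete distributions; the distributions themselves are arbitrary toy values):

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(P || Q)

# Arbitrary toy distributions over three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(entropy(p, q))  # D_KL(P || Q)
print(entropy(q, p))  # D_KL(Q || P) -- a different number: not symmetric

# When P puts mass where Q has none, the divergence is infinite.
p2 = np.array([1.0, 0.0])
q2 = np.array([0.0, 1.0])
print(entropy(p2, q2))  # inf
```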
Addressing These Challenges
Solution 1: Smoothing Distributions💡
To avoid the support-mismatch issue, one solution is to smooth the distributions so that their supports coincide, e.g., by mixing each with a small amount of a broad reference distribution, as sketched below.
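Here is a minimal sketch of this idea (the `smooth` helper and the `eps` value are illustrative choices, not a standard API): mixing each distribution with a little of the uniform distribution gives every outcome positive probability, so the KL divergence becomes finite.

```python
import numpy as np
from scipy.stats import entropy

def smooth(dist, eps=1e-3):
    # Mix with the uniform distribution so every outcome gets some mass.
    uniform = np.full_like(dist, 1.0 / len(dist))
    return (1 - eps) * dist + eps * uniform

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(entropy(p, q))                  # inf: disjoint supports
print(entropy(smooth(p), smooth(q)))  # large but finite
```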
Solution 2: Use a Different Divergence💡
An alternative approach is to use a divergence that naturally handles different supports and adheres to desirable distance properties. One such measure is the Wasserstein Distance.
Wasserstein Distance
The Wasserstein distance, rooted in optimal transport theory, addresses the shortcomings of KL divergence by offering:
- Symmetry
- Triangle inequality
- A meaningful geometry of the space of distributions.
Intuition Behind Optimal Transport
The Wasserstein distance can be understood as the minimum “cost” to transform one distribution into another. Imagine redistributing the “mass” of one distribution $P$ to match another distribution $Q$. Each unit of mass has a transportation cost proportional to the distance it is moved.
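To make the mass-moving picture concrete, the discrete version of this problem can be solved directly as a linear program. The sketch below (the bin locations, masses, and W1 ground cost are toy values assumed for illustration) finds the cheapest transport plan with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 1-D histograms (assumed values for illustration).
x = np.array([0.0, 1.0, 2.0])   # source bin locations
y = np.array([0.5, 1.5, 2.5])   # target bin locations
p = np.array([0.4, 0.4, 0.2])   # source masses, sum to 1
q = np.array([0.2, 0.3, 0.5])   # target masses, sum to 1

n, m = len(x), len(y)
# Cost of moving one unit of mass from x[i] to y[j] (W1 ground cost).
C = np.abs(x[:, None] - y[None, :])

# Linear program over the transport plan gamma (flattened n*m vector):
# minimize sum_ij C_ij * gamma_ij
# subject to: row sums equal p, column sums equal q, gamma >= 0.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0  # mass leaving source bin i
for j in range(m):
    A_eq[n + j, j::m] = 1.0           # mass arriving at target bin j
b_eq = np.concatenate([p, q])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("W1(P, Q) =", res.fun)          # optimal transport cost
```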
Formally, the $p$-Wasserstein distance is defined as:
$$ W_p(P, Q) = \left( \inf_{\gamma \in \Pi(P, Q)} \int \| x - y \|^p \, d\gamma(x, y) \right)^{1/p} $$

Here:
- $ \Pi(P, Q) $: the set of all couplings (joint distributions) with marginals $P$ and $Q$.
- $ \| x - y \| $: the ground cost of moving a unit of mass from $x$ to $y$.
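In one dimension, the $p = 1$ case reduces to an integral of the difference between the two CDFs, and SciPy computes it directly: `scipy.stats.wasserstein_distance` returns the 1-Wasserstein distance between empirical samples. The toy example below also shows that the distance stays finite even when the two samples barely overlap:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=1000)   # samples from P
b = rng.normal(loc=10.0, scale=1.0, size=1000)  # samples from Q, far away

# Finite, and roughly the distance between the means (~10 here),
# even though the samples have essentially disjoint supports.
print(wasserstein_distance(a, b))
```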
Properties of Wasserstein Distance
- Captures the geometric relationship between distributions.
- Finite even for distributions with disjoint supports.
- Offers meaningful insights in contexts like generative modeling and comparing empirical distributions.
Optimal transport and the Wasserstein distance are widely used in fields such as:
- Machine Learning: Generative models (e.g., GANs with Wasserstein loss).
- Economics: Resource allocation problems.
- Physics: Modeling fluid dynamics.
- Image Processing: Comparing distributions of pixel intensities.
By leveraging the principles of optimal transport, we gain a robust and versatile framework for comparing and transforming probability distributions.