Understanding the curse of dimensionality requires more than just examining the representational aspect of data. A key insight lies in the concept of intrinsic dimensionality (ID)—the number of degrees of freedom within a data manifold or subspace. This ID is often independent of the dimensionality of the space in which the data is embedded. As a result, ID serves as a foundation for dimensionality reduction, which improves similarity measures and enhances the scalability of machine learning and data mining methods.

The Role of Dimensionality Reduction

Dimensionality reduction has gained prominence in machine learning and data science for its ability to simplify complex datasets while preserving essential structure. Methods in this domain are broadly categorized into:

  1. Linear Techniques:

    • Principal Component Analysis (PCA) and its variants (e.g., Bouveyron et al. 2011) focus on orthogonal transformations to project data onto a lower-dimensional space.
  2. Non-linear Techniques (Manifold Learning):

    • These methods include Isometric Mapping (Tenenbaum et al. 2000), Locally Linear Embedding (Roweis and Saul 2000), and Hessian Eigenmapping (Donoho and Grimes 2003).
    • They aim to capture the underlying structure of data that lies on a non-linear manifold.

While most methods require users to specify a target dimension, some techniques adaptively infer it based on the dataset’s intrinsic dimensionality. This adaptability underscores the significance of ID estimation.
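As a brief illustration of this point (a minimal sketch assuming scikit-learn; the swiss-roll dataset and parameter values are placeholders, not recommendations):

```python
# Linear vs. non-linear dimensionality reduction on a toy manifold.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=2000, noise=0.05)   # 3-D ambient space, ~2-D manifold

# Linear technique: the target dimension can be given explicitly, or indirectly
# as a fraction of explained variance (here 95%).
X_pca = PCA(n_components=0.95).fit_transform(X)

# Non-linear technique (manifold learning): the target dimension must be
# specified up front, which is where an ID estimate becomes useful.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X_pca.shape, X_iso.shape)
```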

Models and Methods for Estimating Intrinsic Dimensionality

Over the years, researchers have developed diverse models to estimate intrinsic dimensionality. These fall into several categories:

  • Topological Approaches: Analyze the tangent space of the manifold using local samples (e.g., Fukunaga and Olsen 1971; Verveer and Duin 1995).

  • Fractal Measures: Use metrics like the Correlation Dimension (Faloutsos and Kamel 1994) to estimate ID based on space-filling properties of data.

  • Graph-based Methods: Employ $k$-nearest neighbors and density metrics to infer ID (Costa and Hero 2004).

  • Parametric Models: Leverage statistical models to estimate ID, such as those by Levina and Bickel (2004).

Global vs. Local Intrinsic Dimensionality

Intrinsic dimensionality measures can be broadly classified into global and local approaches:

  • Global Measures: Analyze the dataset as a whole, treating all objects uniformly. These measures are well-suited for homogeneous datasets with a single dominant manifold.

  • Local Measures: Focus on the $k$-nearest neighbors of a specific point. These methods are essential for heterogeneous datasets comprising multiple, overlapping manifolds. Notable local ID models include:

    • Expansion Dimension (ED) (Karger and Ruhl 2002).
    • Generalized Expansion Dimension (GED) (Houle et al. 2012).
    • Local Intrinsic Dimensionality (LID) (Houle 2013).

Local ID measures are particularly relevant in applications such as similarity search, where they can estimate query complexity or optimize search termination. They are also applied in outlier detection and density estimation.

Balancing Local and Global Insights

Machine learning techniques often face challenges like overfitting when relying heavily on local information. To mitigate this, methods such as Local Tangent Space Alignment (LTSA) (Zhang and Zha 2004) combine local and global perspectives by aligning neighborhoods of points while penalizing overfitting during optimization. This balance enables a more comprehensive understanding of the data structure.
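As a brief, hedged illustration (scikit-learn exposes LTSA through LocallyLinearEmbedding with method='ltsa'; the dataset and parameter values below are placeholders):

```python
# LTSA: local tangent spaces are estimated per neighborhood and then aligned globally.
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=1500, noise=0.05)

ltsa = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="ltsa")
X_ltsa = ltsa.fit_transform(X)
print(X_ltsa.shape)  # (1500, 2)
```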

Continuous Intrinsic Dimension

In this section, we explore Local Intrinsic Dimensionality (LID), a model that extends intrinsic dimensionality to continuous distributions of distances, as proposed by Houle (2013). LID quantifies the local intrinsic dimensionality (ID) of a feature space by focusing exclusively on the distribution of inter-point distances.

Defining the Distribution of Distances

Let $(\mathbb{R}^m, \text{dist})$ represent a domain equipped with a non-negative distance function dist. Consider the distribution of distances with respect to a fixed reference point. This distribution can be modeled as a random variable $\mathbf{X}$, with support $[0, \infty)$. The probability density function (PDF) of $\mathbf{X}$ is denoted by $f_{\mathbf{X}}$, where $f_{\mathbf{X}}$ is a non-negative, Lebesgue-integrable function. For any $a, b \in [0, \infty)$ such that $a \leq b$, the probability is given by:

$$ \operatorname{Pr}[a \leq \mathbf{X} \leq b] = \int_a^b f_{\mathbf{X}}(x) \, \mathrm{d}x. $$

The corresponding cumulative distribution function (CDF) $F_{\mathbf{X}}$ is defined as:

$$ F_{\mathbf{X}}(x) = \operatorname{Pr}[\mathbf{X} \leq x] = \int_0^x f_{\mathbf{X}}(u) \, \mathrm{d}u. $$

At any value $x$ where $\mathbf{X}$ is absolutely continuous, the CDF $F_{\mathbf{X}}$ is differentiable at $x$, and its first-order derivative is $f_{\mathbf{X}}(x)$.
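
As a simple illustration (a standard textbook example, not taken from the source): for a reference point at the centre of a ball of radius $w$ in $\mathbb{R}^m$ over which points are uniformly distributed, the distance variable satisfies

$$ F_{\mathbf{X}}(x) = \left(\frac{x}{w}\right)^{m}, \qquad f_{\mathbf{X}}(x) = \frac{m\, x^{m-1}}{w^{m}}, \qquad 0 \leq x \leq w. $$

We return to this example after Theorem 1 below.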

The Local Continuous Intrinsic Dimension

The local intrinsic dimension at distance $x$ is defined as follows:

Definition 1 (Houle, 2013):
Given an absolutely continuous random distance variable $\mathbf{X}$, for any distance threshold $x$ such that $F_{\mathbf{X}}(x) > 0$, the local continuous intrinsic dimension $ \mathrm{ID}_{\mathbf{X}}(x)$ of $\mathbf{X}$ at distance $x$ is:

$$ \mathrm{ID}_{\mathbf{X}}(x) \triangleq \lim_{\epsilon \to 0^+} \frac{\ln F_{\mathbf{X}}((1+\epsilon) x) - \ln F_{\mathbf{X}}(x)}{\ln (1+\epsilon)}, $$

wherever the limit exists.

Relation to the Generalized Expansion Dimension

The LID definition builds upon the generalized expansion dimension (GED) proposed by Houle et al. (2012). GED measures dimensionality by comparing neighborhoods of radii $x$ and $(1+\epsilon)x$; LID is obtained by replacing the neighborhood cardinalities in this comparison with the probability measure $F_{\mathbf{X}}$ (the expected proportion of neighbors within each radius) and letting the radii converge.

In essence, the LID formulation quantifies the discriminative power of a distance measure. Both LID and GED share the same closed-form representation, reflecting their foundational equivalence in characterizing local dimensionality.
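
Concretely, writing $k_1$ and $k_2$ for the cardinalities of the neighborhoods of radii $x$ and $(1+\epsilon)x$ (a paraphrase of the construction in the notation of Definition 1), the correspondence is:

$$ \mathrm{GED} = \frac{\ln k_2 - \ln k_1}{\ln\big((1+\epsilon)x\big) - \ln x} \;\;\longrightarrow\;\; \frac{\ln F_{\mathbf{X}}((1+\epsilon)x) - \ln F_{\mathbf{X}}(x)}{\ln(1+\epsilon)} \;\xrightarrow{\;\epsilon \to 0^{+}\;}\; \mathrm{ID}_{\mathbf{X}}(x). $$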

Theorem 1

(Houle 2013) Let $\mathbf{X}$ be an absolutely continuous random distance variable. If $F_{\mathbf{X}}$ is both positive and differentiable at $x$, then:

$$ \mathrm{ID}_{\mathbf{X}}(x) = \frac{x\, f_{\mathbf{X}}(x)}{F_{\mathbf{X}}(x)}. $$
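
For the uniform-ball example introduced above, Theorem 1 gives, for any $0 < x \leq w$:

$$ \mathrm{ID}_{\mathbf{X}}(x) = \frac{x \cdot m x^{m-1} / w^{m}}{x^{m} / w^{m}} = m, $$

so the local ID recovers the dimension of the ball at every distance scale.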

Local intrinsic dimensionality (LID) has potential for wide application thanks to its very general treatment of distances as a continuous random variable. Direct estimation of $\mathrm{ID}_{\mathbf{X}}(x)$, however, requires knowledge of the distribution of $\mathbf{X}$. Extreme value theory, which we survey in the following section, allows the limit as $x \to 0^+$ to be estimated without any explicit assumption on the data distribution other than continuity.


3. Extreme Value Theory

Extreme value theory is concerned with the modeling of what can be regarded as the extreme behavior of stochastic processes.

Definition 2

Let $\mu \in \mathbb{R}$, $\sigma > 0$, and $\xi \in \mathbb{R}$. The family of generalized extreme value (GEV) distributions $F_{GEV}$ covers distributions whose cumulative distribution functions have the form:

$$ F_{GEV}(x) = \begin{cases} \exp \left( - \left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\xi} \right) & \text{if } \xi \neq 0, \\[4pt] \exp \left( -\exp \left( -\frac{x - \mu}{\sigma} \right) \right) & \text{if } \xi = 0. \end{cases} $$

A distribution $G \in F_{GEV}$ has support:

$$ \operatorname{supp}(G) = \begin{cases} \left[\mu - \frac{\sigma}{\xi}, \infty\right) & \text{when } \xi > 0, \\[4pt] \left(-\infty, \mu - \frac{\sigma}{\xi}\right] & \text{when } \xi < 0, \\[4pt] (-\infty, \infty) & \text{when } \xi = 0. \end{cases} $$

The theory's best-known theorem, attributed in part to Fisher and Tippett (1928) and Gnedenko (1943), states that the maximum of $n$ independent, identically distributed random variables (after suitable renormalization) converges in distribution to a generalized extreme value distribution as $n \to \infty$.

Theorem 2

(Fisher-Tippett-Gnedenko)
Let $(X_i)_{i \in \mathbb{N}}$ be a sequence of independent, identically distributed random variables and let $M_n = \max_{1 \leq i \leq n} X_i$. If there exist a sequence of positive constants $(a_n)_{n \in \mathbb{N}}$ and a sequence of constants $(b_n)_{n \in \mathbb{N}}$ such that:

$$ \lim_{n \to \infty} P\left( \frac{M_n - b_n}{a_n} \leq x \right) = F(x), $$

then, provided $F$ is non-degenerate, $F$ belongs to the generalized extreme value family.
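
As a small illustration (assuming SciPy is available; note that scipy.stats.genextreme parameterizes the shape as $c = -\xi$ relative to Definition 2), block maxima of simulated data can be fitted to a GEV distribution:

```python
# A minimal sketch (assuming NumPy and SciPy); the block count and block
# length are arbitrary illustrative choices.
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)

# 1000 blocks of n = 100 i.i.d. exponential variables; keep each block's maximum.
maxima = rng.exponential(scale=1.0, size=(1000, 100)).max(axis=1)

# Fit a GEV distribution to the block maxima.  SciPy's shape parameter c
# corresponds to -xi in the convention of Definition 2.
c, loc, scale = genextreme.fit(maxima)
print("xi =", -c, "mu =", loc, "sigma =", scale)  # xi should be near 0 (Gumbel case)
```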

Isotropy

A distribution is isotropic if its variance is uniformly distributed across all dimensions; equivalently, the covariance matrix of an isotropic distribution is proportional to the identity matrix. Conversely, an anisotropic distribution is one in which the variance is dominated by a single dimension. For example, a line in an $n$-dimensional vector space is maximally anisotropic. Robust isotropy metrics should return maximally isotropic scores for balls and minimally isotropic (i.e., anisotropic) scores for lines.

Other measures

  1. Average cosine similarity: calculated as one minus the average cosine similarity of $N$ randomly sampled pairs of points from $X$, the data distribution.
  2. Partition isotropy score: proposed by Arora et al., $$ Z(c) := \sum_{x \in X} \exp(c^{\top} x), $$ where $c$ is chosen from the eigenspectrum of $XX^{\top}$.
  3. Intrinsic dimensionality: the estimated dimension of the manifold from which the point cloud is assumed to have been sampled. Dividing by $n$, the number of samples, gives a normalized isotropy score.
    • Linear dimensionality estimate: Principal Component Analysis, counting the number of significant components; the cumulative explained variance provides a threshold for identifying the intrinsic dimension.
    • Non-linear dimensionality estimates: MLE (Levina and Bickel 2004), the Method of Moments, and TwoNN.
  4. Variance explained ratio: measures how much of the total variance is explained by the first $k$ principal components of the data. It requires an a priori choice of the number of principal components to examine. A sketch of the first and last of these measures follows this list.
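
As a minimal sketch of the first and last measures above (assuming NumPy and scikit-learn; the sample data, the number of sampled pairs, and $k$ are illustrative placeholders):

```python
# A minimal sketch of two of the measures above; not a reference implementation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))          # placeholder data matrix (n x d)

# 1. Average-cosine-similarity measure: one minus the mean cosine similarity
#    of N randomly sampled pairs of points.
N = 10_000
i, j = rng.integers(0, len(X), size=(2, N))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
isotropy_cos = 1.0 - np.mean(np.sum(Xn[i] * Xn[j], axis=1))

# 4. Variance-explained ratio of the first k principal components.
k = 10
var_explained_k = PCA(n_components=k).fit(X).explained_variance_ratio_.sum()

print(isotropy_cos, var_explained_k)
```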

Estimating Intrinsic Dimension of a Dataset by MLE

from skdim.id import MLE 
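
A hedged usage sketch (assuming the scikit-dimension estimator interface, i.e. a fit method followed by a dimension_ attribute; the data below is a random placeholder):

```python
# A hedged usage sketch of skdim's MLE estimator; check the library's
# documentation for the exact constructor parameters.
import numpy as np
from skdim.id import MLE

X = np.random.default_rng(0).normal(size=(1000, 10))
mle = MLE().fit(X)
print(mle.dimension_)   # aggregated intrinsic-dimension estimate
```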

The aim is to derive the maximum likelihood estimator (MLE) of the dimension $m$ from i.i.d. observations $X_1, \ldots, X_n$ in $\mathbb{R}^p$. The observations represent an embedding of a lower-dimensional sample, $X_i = g(Y_i)$, where the $Y_i$ are sampled from an unknown smooth density $f$ on $\mathbb{R}^m$, with unknown $m \leq p$, and $g$ is a continuous and sufficiently smooth (but not necessarily globally isometric) mapping. This assumption ensures that close neighbors in $\mathbb{R}^m$ are mapped to close neighbors in the embedding.

The basic idea is to fix a point $x$ , assume $f(x) \sim const$ in a small sphere $S_x(R)$ of radius $R$ around $x$, and treat the observations as a homogeneous Poisson process in $S_x(R)$ . Consider the inhomogeneous process:

$$ N(t, x) = \sum_{i=1}^n \mathbf{1}\{X_i \in S_x(t)\}, $$

which counts the observations within distance $t$ of $x$. Approximating this binomial (fixed-$n$) process by a Poisson process and suppressing the dependence on $x$ for now, we can write the rate $\lambda(t)$ of the process $N(t)$ as:

$$ \lambda(t) = f(x) V(m) m t^{m-1} $$

This follows immediately from the Poisson process properties, since $V(m) m t^{m-1} = \frac{\mathrm{d}}{\mathrm{d}t} \left[ V(m) t^m \right]$ is the surface area of the sphere $S_x(t)$, where $V(m)$ is the volume of the unit ball in $\mathbb{R}^m$. Letting $\theta = \log f(x)$, we can write the log-likelihood of the observed process $N(t)$ as:

$$ L(m, \theta) = \int_0^R \log \lambda(t) \, \mathrm{d}N(t) - \int_0^R \lambda(t) \, \mathrm{d}t. $$

This is an exponential family for which MLEs exist with probability $\rightarrow 1$ as $n \rightarrow \infty$ and are unique. The MLEs must satisfy the likelihood equations

$$ \frac{\partial L}{\partial \theta} = \int_0^R \mathrm{d}N(t) - \int_0^R \lambda(t) \, \mathrm{d}t = N(R) - e^\theta V(m) R^m = 0, $$

$$ \frac{\partial L}{\partial m} = \left( \frac{1}{m} + \frac{V^{\prime}(m)}{V(m)} \right) N(R) + \int_0^R \log t \, \mathrm{d}N(t) - e^\theta V(m) R^m \left( \log R + \frac{V^{\prime}(m)}{V(m)} \right) = 0. $$

Substituting the first equation into the second and writing $T_j(x)$ for the distance from $x$ to its $j$-th nearest neighbor, we obtain:

$$ m_R(x) = \left[ \frac{1}{N(R, x)} \sum_{j=1}^{N(R, x)} \log \frac{R}{T_j(x)} \right]^{-1}. $$

In practice, it may be more convenient to fix the number of neighbors $k$ rather than the radius of the sphere $R$ . Then the estimate becomes:

$$ m_k(x) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right]^{-1} $$

Note that we omit the last (zero) term in the sum. One could divide by $k-2$ rather than $k-1$ to make the estimator asymptotically unbiased (not shown here).

For some applications, one may want to evaluate local dimension estimates at every data point or average estimated dimensions within data clusters. We will assume that all the data points come from the same “manifold,” and therefore average over all observations.

The choice of $k$ clearly affects the estimate. A dataset can have different intrinsic dimensions at different scales; for example, a line with added noise can be viewed as either 1D or 2D. In such cases, it is informative to have estimates at several scales. In general, for the estimator to work well, the sphere should be small and yet contain sufficiently many points; choosing such a $k$ automatically remains a subject of ongoing work. In practice, the point-wise estimates are averaged over all observations and over a range of neighborhood sizes $k_1 \leq k \leq k_2$:

$$ \hat{m}_k = \frac{1}{n} \sum_{i=1}^n \hat{m}_k(X_i), $$

$$ \hat{m} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \hat{m}_k. $$

The only parameters to set for this method are $k_1$ and $k_2$ , and the computational cost is essentially the cost of finding $k_2$ nearest neighbors for every point, which has to be done for most manifold projection methods anyway.
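
A minimal NumPy sketch of this estimator (using scikit-learn's NearestNeighbors for the neighbor search; the function name, the sample data, and the default values of $k_1$ and $k_2$ are my own illustrative choices):

```python
# A minimal sketch of the Levina-Bickel MLE estimator described above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dim(X, k1=10, k2=20):
    """Average the local estimates m_k(x) over all points and over k in [k1, k2]."""
    # Distances to the k2 nearest neighbors (column 0 is the point itself).
    dist, _ = NearestNeighbors(n_neighbors=k2 + 1).fit(X).kneighbors(X)
    T = dist[:, 1:]                                   # T[:, j-1] = T_j(x), j = 1..k2

    m_hat_k = []
    for k in range(k1, k2 + 1):
        # m_k(x) = [ (1/(k-1)) * sum_{j=1}^{k-1} log( T_k(x) / T_j(x) ) ]^{-1}
        log_ratios = np.log(T[:, k - 1][:, None] / T[:, :k - 1])
        m_k_x = 1.0 / log_ratios.mean(axis=1)         # local estimate at each point
        m_hat_k.append(m_k_x.mean())                  # average over all observations
    return float(np.mean(m_hat_k))                    # average over k = k1..k2

# Example: a 2-D Gaussian sample embedded linearly in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 10))
print(mle_intrinsic_dim(X))   # expected to be close to 2
```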

Estimating Intrinsic Dimension of a Dataset by NN

Let $i$ be a point in the dataset, and consider the list of its first $k$ nearest neighbors; let $r_1, r_2, \ldots, r_k$ be the sorted list of their distances from $i$. Thus, $r_1$ is the distance between $i$ and its nearest neighbor, $r_2$ is the distance to its second nearest neighbor, and so on; in this definition we conventionally set $r_0=0$.

The volume of the hyperspherical shell enclosed between two successive neighbors $l-1$ and $l$ is given by $$ \Delta v_l=\omega_d\left(r_l^d-r_{l-1}^d\right), $$ where $d$ is the dimensionality of the space in which the points are embedded and $\omega_d$ is the volume of the $d$-dimensional ball of unit radius. It can be proved (see the supplementary material of the original work for a derivation) that, if the density is constant around point $i$, all the $\Delta v_l$ are independently drawn from an exponential distribution with rate equal to the density $\rho$: $$ P\left(\Delta v_l \in[v, v+\mathrm{d}v]\right)=\rho e^{-\rho v} \, \mathrm{d}v. $$

Consider the two shells $\Delta v_1$ and $\Delta v_2$, and let $R$ denote the ratio $\frac{\Delta v_2}{\Delta v_1}$; the previous considerations allow us, in the case of constant density, to compute exactly the probability distribution (pdf) of $R$:

$$ P(R \in[\bar{R}, \bar{R}+\mathrm{d}\bar{R}]) = \int_0^{\infty} \mathrm{d}v_1 \int_0^{\infty} \mathrm{d}v_2 \, \rho^2 e^{-\rho\left(v_1+v_2\right)} \, \mathbf{1}\left\{\frac{v_2}{v_1} \in[\bar{R}, \bar{R}+\mathrm{d}\bar{R}]\right\} = \frac{\mathrm{d}\bar{R}}{(1+\bar{R})^2}, $$

where $\mathbf{1}$ denotes the indicator function. Dividing by $\mathrm{d}\bar{R}$ we obtain the pdf of $R$: $$ g(R)=\frac{1}{(1+R)^2}. $$

The pdf does not depend explicitly on the dimensionality $d$, which appears only through the definition of $R$. In order to work with a quantity whose distribution depends explicitly on $d$, we define $\mu \doteq \frac{r_2}{r_1} \in[1,+\infty)$. Since the density is locally constant, $R = \frac{\Delta v_2}{\Delta v_1} = \frac{r_2^d - r_1^d}{r_1^d}$, so $R$ and $\mu$ are related by the equality: $$ R=\mu^d-1. $$

This equation allows us to find an explicit formula for the distribution of $\mu$: $$ f(\mu)=d \mu^{-d-1} \mathbf{1}_{[1,+\infty)}(\mu), $$ while the cumulative distribution function (cdf) is obtained by integration: $$ F(\mu)=\left(1-\mu^{-d}\right) \mathbf{1}_{[1,+\infty)}(\mu). $$

Functions $f$ and $F$ are independent of the local density, but depend explicitly on the intrinsic dimension $d$.

The derivation presented above leads to a simple observation. Since $1-F(\mu)=\mu^{-d}$, the value of the intrinsic dimension $d$ can be estimated through the following equation: $$ -\frac{\log (1-F(\mu))}{\log (\mu)}=d. $$

Remarkably, the density $\rho$ does not appear in this equation, since the cdf $F$ is independent of $\rho$. If we consider the set $S \subset \mathbb{R}^2$, $S \doteq \{(\log (\mu),-\log (1-F(\mu)))\}$, then $S$ is contained in a straight line $l \doteq \{(x, y) \mid y=d \cdot x\}$ passing through the origin with slope equal to $d$. In practice, $F(\mu)$ is estimated empirically from a finite number of points; as a consequence, the left-hand side of the equation above will differ across data points, and the set $S$ will only lie around $l$. This line of reasoning naturally suggests an algorithm to estimate the intrinsic dimension of a dataset:

  1. Compute the pairwise distances between the points of the dataset, $i=1, \ldots, N$.
  2. For each point $i$, find the two shortest distances $r_1$ and $r_2$.
  3. For each point $i$, compute $\mu_i=\frac{r_2}{r_1}$.
  4. Compute the empirical cumulative distribution $F^{\mathrm{emp}}(\mu)$ by sorting the values of $\mu$ in ascending order through a permutation $\sigma$, then define $F^{\mathrm{emp}}\left(\mu_{\sigma(i)}\right) \doteq \frac{i}{N}$.
  5. Fit the points of the plane given by coordinates $\{ (\log(\mu_i), -\log(1 - F^{\mathrm{emp}}(\mu_i))) \mid i = 1, \ldots, N \}$ with a straight line passing through the origin.

Even though the results above are derived for a uniform distribution of points, there is no dependence on the density $\rho$; as a consequence, from the point of view of the algorithm, the hypothesis can be relaxed a posteriori: the dataset is only required to be locally uniform in density, where "locally" means within the range of the second neighbor. From a theoretical point of view, this condition is satisfied in the limit of $N$ going to infinity. Numerical experiments on datasets with non-uniform density show empirically that, even for a finite number of points, the estimate is reasonably insensitive to density variations. Requiring local uniformity only up to the second neighbor is an advantage over competing approaches, where local uniformity is required at larger distances.
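
A minimal NumPy sketch of the five-step procedure above (assuming scikit-learn for the neighbor search; the function name and the sample data are illustrative, and refinements used in published implementations, such as discarding the largest ratios, are omitted):

```python
# A minimal sketch of the TwoNN procedure described above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_intrinsic_dim(X):
    """Estimate d from the ratios mu_i = r2 / r1 via a fit through the origin."""
    n = len(X)
    # Steps 1-2: distances to the two nearest neighbors (column 0 is the point itself).
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    r1, r2 = dist[:, 1], dist[:, 2]
    # Step 3: ratios mu_i, sorted in ascending order.
    mu = np.sort(r2 / r1)
    # Step 4: empirical CDF; the last point is dropped to avoid log(0) in 1 - F.
    F_emp = np.arange(1, n + 1) / n
    x, y = np.log(mu[:-1]), -np.log(1.0 - F_emp[:-1])
    # Step 5: least-squares fit of a line through the origin; the slope is d.
    return float(np.sum(x * y) / np.sum(x * x))

rng = np.random.default_rng(0)
Y = rng.uniform(size=(3000, 2))                 # 2-D data...
X = np.c_[Y, Y.sum(axis=1), Y[:, 0] - Y[:, 1]]  # ...embedded linearly in 4-D
print(two_nn_intrinsic_dim(X))                  # expected to be close to 2
```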

Estimating Intrinsic Dimension of a Dataset by the Method of Moments

from skdim.id import TwoNN

Properties of Isotropy

  1. Mean Agnosticism:
    Isotropy is a property solely of the covariance matrix of a distribution, meaning it is unaffected by the mean or central location of the data. Therefore, an isotropy measure should depend only on how the data varies in different directions, not on where it is centered. If a score or measure changes with shifts in the mean, it does not truly measure isotropy. This is because isotropy is concerned only with the dispersion of data points and the relative balance of variance across dimensions.

  2. Invariance to Scalar Multiples of the Covariance Matrix:
    Isotropy remains unchanged if we scale the covariance matrix of the underlying distribution by a positive scalar. Mathematically, if $\Sigma$ is the covariance matrix of the distribution and $\lambda > 0$ is a scalar, an isotropy measure should assign the same score to $\Sigma$ and $\lambda \Sigma$, since scaling affects all dimensions equally. This property ensures that isotropy reflects the shape of the distribution rather than the overall size of the data spread.

  3. Variance Distribution Across Dimensions:
    For a distribution to be more isotropic, the variance should be approximately equal across all dimensions. Specifically, as isotropy increases, the difference between the maximum variance value in any single dimension and the average variance across all other dimensions should decrease. This means that in an isotropic distribution, no single dimension should dominate the spread of the data. Instead, the variance should be evenly distributed, indicating that the data is spread uniformly in all directions.

  4. Rotation Invariance:
    An ideal measure of isotropy should be invariant to rotations of the data. Since isotropy is about the uniformity of variance in different directions, rotating the data should not change this property. In other words, the distribution of principal components (PCs) should remain constant under any rotation of the data. This property ensures that isotropy reflects the internal spread of data points rather than any specific orientation in space.

  5. Utilization of Dimensions:
    There is a direct relationship between isotropy and the number of dimensions effectively used by the data. A distribution that uniformly utilizes a higher number of dimensions requires more principal components to capture all of the variance in the data. Therefore, as the dimensionality utilized by the data increases, the isotropy measure should ideally reflect this by increasing as well. A high isotropy score suggests that the data is distributed more evenly across available dimensions, not concentrated in just a few directions.

  6. Global Stability:
    An isotropy measure should capture the global structure of the distribution, providing a holistic view of how evenly the data is spread across all dimensions. Rather than reflecting local variations or structure, isotropy should be a global measure, stable to small perturbations or local clusters in the data. This stability is essential for a reliable measure of the overall spatial utilization of a distribution.
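
To make these desiderata concrete, the following sketch (my own illustrative check, assuming NumPy; the normalized eigenvalue-entropy score used here is just one possible isotropy measure) verifies mean agnosticism, scale invariance, and rotation invariance numerically:

```python
# A sketch checking properties 1, 2 and 4 for one candidate isotropy score:
# the entropy of the covariance eigenvalue distribution, normalized to [0, 1].
import numpy as np

def isotropy_score(X):
    eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    p = eig / eig.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(X.shape[1]))

rng = np.random.default_rng(0)
d = 20
X = rng.normal(size=(10000, d)) @ np.diag(np.linspace(1.0, 3.0, d))  # anisotropic sample

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal matrix

base = isotropy_score(X)
print(base)                                           # between 0 (line) and 1 (ball)
print(np.isclose(base, isotropy_score(X + 5.0)))      # 1. mean agnosticism
print(np.isclose(base, isotropy_score(3.0 * X)))      # 2. scale invariance
print(np.isclose(base, isotropy_score(X @ Q)))        # 4. rotation invariance
```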