Digital Mathematical Notebook

Embedding spaces

Embeddings are learned geometric objects: a model maps data into a vector space, and everything after that depends on geometry. Similarity, normalization, dimensionality, anisotropy, local neighborhoods, and failure modes all shape what the representation can actually do.

This page treats embedding spaces as mathematical and computational objects. It starts with maps and metrics, then moves through high-dimensional effects, intrinsic dimension, spectra, sentence embeddings, contrastive learning, hubness, visualization, and practical diagnostics.

Diagram: objects (tokens / sentences / images / nodes) → embedding space → angles, norms, neighborhoods, spectra
An embedding is first a map. Geometry appears only after the objects have been mapped into a space where norms, angles, and neighborhoods make sense.

Why this matters

Retrieval

Nearest-neighbor search depends on metric choice, normalization, and neighborhood structure.

Clustering

Cluster quality depends on local density, spectral decay, and how anisotropic the space is.

Semantic search

Good ranking requires the geometry to align with semantic similarity, not just raw vector norms.

Representation learning

Contrastive and metric-learning objectives directly reshape the geometry of the representation.

Diagnostics

Pretty 2D plots are not enough; you need metrics, spectra, ID estimates, and sanity checks.

Section 1 • Objects become coordinates

What an embedding is

An embedding is not a vector by itself. It is a map that sends objects into a feature space where geometry becomes available.

DefinitionBox

Embedding map versus embedding vector

Let \(X\) be a set of objects: words, sentences, images, graph nodes, users, documents, or something else. An embedding model is a map

$$ f : X \to \mathbb{R}^d. $$

For each object \(x \in X\), the vector \(f(x)\) is its representation. The map \(f\) and the resulting vectors \(f(x)\) should not be conflated.

ExampleBox

Static word embeddings

A vocabulary item \(w\) receives one vector \(f(w)\). The word “bank” gets the same embedding in a river context and a finance context.

ExampleBox

Contextual token embeddings

A transformer maps a token together with its context to a vector. The same surface form can land in different parts of space depending on the sentence.

ExampleBox

Sentence embeddings

A whole sequence is mapped to one vector, often by pooling or by a model trained directly for semantic similarity or retrieval.

RemarkBox

Why geometry appears at all

Once objects are represented in \(\mathbb{R}^d\), we can compare them with inner products, norms, distances, covariance, spectral structure, nearest-neighbor graphs, and local dimensionality estimates. The representation becomes a geometric object, not just a table of numbers.

Section 2 • The minimum geometry needed to talk carefully

Geometry basics for embeddings

Before discussing anisotropy, hubness, or intrinsic dimensionality, the basic objects have to be separated cleanly: norms, inner products, angles, distances, and neighborhoods.

FormulaBlock

Norms, inner products, and angles

$$ \langle x, y \rangle = x^\top y, \qquad \|x\|_2 = \sqrt{\langle x, x \rangle}. $$

$$ \cos \theta(x,y) = \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2}, \qquad x,y \neq 0. $$
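These definitions are a few lines of NumPy. A minimal sketch (the vectors are illustrative):

```python
import numpy as np

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two nonzero vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([3.0, 4.0])
y = np.array([4.0, 3.0])
# <x, y> = 24 and ||x|| = ||y|| = 5, so cos(theta) = 24/25 = 0.96.
print(cosine(x, y))
```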

FormulaBlock

Distances and neighborhoods

$$ d_2(x,y) = \|x-y\|_2, \qquad d_1(x,y)=\|x-y\|_1. $$

A \(k\)-nearest-neighbor set is defined only after choosing a metric or a similarity-to-distance convention. Different choices induce different local graphs.
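The dependence on metric choice is easy to see concretely. A sketch with a deliberately constructed two-point candidate set (illustrative data, not tied to any model): under Euclidean distance the low-norm candidate wins, under cosine the aligned one does.

```python
import numpy as np

def knn_indices(q: np.ndarray, X: np.ndarray, k: int, metric: str = "l2") -> np.ndarray:
    """Indices of the k nearest rows of X to query q under the chosen convention."""
    if metric == "l2":
        return np.argsort(np.linalg.norm(X - q, axis=1))[:k]   # smaller = closer
    if metric == "cosine":
        sims = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
        return np.argsort(-sims)[:k]                            # larger = closer
    raise ValueError(f"unknown metric: {metric}")

# Same candidates, very different norms: the two conventions disagree.
X = np.array([[10.0, 0.0], [0.0, 1.0]])
q = np.array([1.0, 0.9])
print(knn_indices(q, X, 1, "l2"), knn_indices(q, X, 1, "cosine"))
```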

KeyIdeaBox

Local versus global geometry

A space may look low-dimensional locally and still be globally curved or highly anisotropic. Nearest neighbors probe local geometry; spectra and covariance probe global linear structure.

RemarkBox

Subspace, affine subspace, manifold

A linear subspace passes through the origin. An affine subspace is a translated linear subspace. A nonlinear manifold can curve through ambient space and need not be globally linear at all.

Section 3 • Similarity is not one thing

Similarity metrics and normalization

Dot product, cosine similarity, Euclidean distance, and Manhattan distance answer related but genuinely different questions. The geometry changes again when vectors are normalized.

DefinitionBox

Similarity versus distance

A similarity assigns larger values to more alike pairs. A distance assigns smaller values to more alike pairs. Some libraries expose “negative Euclidean” or “negative Manhattan” scores so they can be maximized like similarities.

ExampleBox

What each metric emphasizes

Dot product mixes angle and norm. Cosine removes norm and keeps angle. Euclidean depends on both direction and scale. Manhattan is axis-dependent and reacts differently to coordinatewise changes.

TheoremBox

Unit-sphere equivalence

If \(\|u\|_2=\|v\|_2=1\), then

$$ \|u-v\|_2^2 = 2 - 2\langle u, v \rangle = 2 - 2\cos \theta(u,v). $$

On the unit sphere, maximizing cosine similarity, maximizing dot product, and minimizing Euclidean distance all give the same ordering.

ProofToggle • Why normalization makes cosine, dot, and Euclidean agree

Expand the squared Euclidean distance: \(\|u-v\|_2^2 = \langle u-v, u-v \rangle = \|u\|_2^2 + \|v\|_2^2 - 2\langle u,v \rangle\). If both vectors are unit length, this becomes \(2-2\langle u,v\rangle\). Because \(\langle u,v\rangle = \cos\theta(u,v)\) on the sphere, the three orderings coincide.
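The identity and the resulting agreement of rankings can be checked numerically on random unit vectors (a sanity-check sketch, not part of any particular pipeline):

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(50, 16))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # 50 unit vectors
q = rng.normal(size=16)
q /= np.linalg.norm(q)                          # unit query

dots = U @ q
dists = np.linalg.norm(U - q, axis=1)
# Identity on the sphere: ||u - v||^2 = 2 - 2 <u, v>.
assert np.allclose(dists**2, 2 - 2 * dots)
# Hence ranking by dot product (= cosine here) and by distance agree.
assert np.array_equal(np.argsort(-dots), np.argsort(dists))
```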

InteractiveWidgetShell

Metric explorer

Section 4 • Scale can carry signal or noise

What changes after normalization

L2 normalization is not a cosmetic operation. It deletes norm information, constrains vectors to a sphere, and changes which geometric quantities matter.

KeyIdeaBox

Normalization fixes the radius

After mapping \(x \mapsto x / \|x\|_2\), all vectors live on the unit sphere. Geometry becomes angular: the model can still change direction, but not radial magnitude.

RemarkBox

Norms can still be meaningful

In some spaces, vector norms correlate with frequency, confidence, or salience. Normalization can improve retrieval stability, but it can also discard information that an application actually uses.

ExampleBox

Sentence-embedding practice

In sentence-transformer style systems, cosine is a standard default. If embeddings are already normalized, then dot product is equivalent for ranking and avoids another normalization pass at inference time.
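What normalization deletes is visible in two lines. A minimal sketch: two vectors with the same direction but different norms become indistinguishable after the projection onto the sphere.

```python
import numpy as np

def l2_normalize(X: np.ndarray) -> np.ndarray:
    """Project each row onto the unit sphere; norm information is discarded."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

X = np.array([[3.0, 4.0], [30.0, 40.0]])   # same direction, different norms
Z = l2_normalize(X)
# After normalization both rows collapse to the same point on the sphere.
assert np.allclose(Z[0], Z[1])
assert np.allclose(np.linalg.norm(Z, axis=1), 1.0)
```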

InteractiveWidgetShell

Normalization demo


Section 5 • Why 2D intuition fails

High-dimensional geometry and the curse of dimensionality

Embedding spaces are often hundreds or thousands of dimensions wide. Distances, neighborhoods, and volume behave differently there than they do in the plane.

DefinitionBox

Distance concentration

As ambient dimension grows, pairwise distances in many distributions become relatively less separated. The nearest and farthest neighbors can become numerically closer than low-dimensional intuition suggests.
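Concentration is easy to simulate. A sketch with Gaussian clouds (one common test distribution; the effect's strength varies by distribution): the relative gap between the nearest and farthest point shrinks as dimension grows.

```python
import numpy as np

def relative_contrast(d: int, n: int = 2000, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance to the origin in a Gaussian cloud."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(rng.normal(size=(n, d)), axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# The contrast shrinks as ambient dimension grows.
for d in (2, 32, 512):
    print(d, relative_contrast(d))
```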

RemarkBox

Nearest-neighbor instability

When distances concentrate, small perturbations, metric choices, or normalization steps can change rankings more than expected. Retrieval quality then depends strongly on geometry engineering.

ExampleBox

Sparse volume intuition

In high dimension, most of a ball's volume concentrates near its surface, far from the center. Data clouds often look empty when projected down, yet their local neighborhoods can still be noisy, uneven, and hard to compare.

KeyIdeaBox

High dimension is not automatically bad

Large ambient dimension gives expressive capacity. Problems appear when the geometry induced by the model is poorly matched to the task, not simply because the coordinate count is large.

Section 6 • Ambient dimension is not the same as complexity

Intrinsic dimensionality

The ambient dimension \(d\) of an embedding space says how many coordinates are available. The intrinsic dimension asks how many degrees of freedom the data actually uses.

DefinitionBox

Ambient versus intrinsic dimension

A dataset can live in \(\mathbb{R}^{768}\) while being concentrated near a lower-dimensional object: a line, a plane, a curved manifold, or a union of several such structures. Intrinsic dimension is a statement about data geometry, not about how many coordinates each vector has.

RemarkBox

Local versus global ID

Local estimators measure dimension in a neighborhood around a point or scale. Global surrogates summarize the whole cloud. They are useful for different questions and need not agree numerically.

FormulaBlock

Local MLE-style estimator

$$ \widehat{m}_k(x)=\left[\frac{1}{k-1}\sum_{j=1}^{k-1}\log\frac{T_k(x)}{T_j(x)}\right]^{-1}, $$

where \(T_j(x)\) is the distance from \(x\) to its \(j\)-th nearest neighbor. This probes local distance growth around \(x\).
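The estimator above can be written directly from the formula. A sketch (brute-force distances, synthetic data): points on a 1D helix in \(\mathbb{R}^3\) should get local ID estimates near \(1\), even though no single global linear direction describes them.

```python
import numpy as np

def local_id_mle(X: np.ndarray, k: int = 10) -> np.ndarray:
    """MLE-style local intrinsic-dimension estimate at every point of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self-distances
    T = np.sort(D, axis=1)[:, :k]                # T_1(x) <= ... <= T_k(x)
    logs = np.log(T[:, -1:] / T[:, :-1])         # log(T_k / T_j), j = 1..k-1
    return (k - 1) / logs.sum(axis=1)

rng = np.random.default_rng(0)
# A 1D curve (helix) embedded in R^3: local ID should sit near 1.
t = np.sort(rng.uniform(0, 4 * np.pi, size=400))
X = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)
print(local_id_mle(X, k=10).mean())
```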

ExampleBox

Different estimators measure different structure

Neighbor-based ID estimates local expansion. PCA-based approximations measure linear variance concentration. A curved 1D manifold can have local ID near \(1\) while needing two or more global principal components to describe its position in space.

InteractiveWidgetShell

Intrinsic-dimensionality demo


Section 7 • Global linear structure

Linear algebra of embedding spaces

Covariance, Gram matrices, singular values, and explained variance reveal redundancy, dominant directions, and compression opportunities in embedding spaces.

FormulaBlock

Covariance and Gram structure

$$ \Sigma = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\top, \qquad G = XX^\top. $$

The eigenvalues of \(\Sigma\) or singular values of the centered data matrix quantify how variance is distributed across directions.

FormulaBlock

Effective rank surrogates

$$ r_{\mathrm{PR}} = \frac{(\sum_i \lambda_i)^2}{\sum_i \lambda_i^2}, \qquad r_{\mathrm{eff}} = \exp\left(-\sum_i p_i \log p_i\right), $$

$$ p_i = \frac{\lambda_i}{\sum_j \lambda_j}. $$

These summarize how concentrated the spectrum is, but they are still global linear summaries, not full intrinsic-dimension estimators.
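Both surrogates follow directly from the covariance eigenvalues. A sketch comparing an isotropic cloud with a spiked one (synthetic data; the contrast is the point):

```python
import numpy as np

def effective_ranks(X: np.ndarray):
    """Participation ratio and entropy effective rank of the covariance spectrum."""
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))   # np.cov centers the data
    lam = np.clip(lam, 0.0, None)
    pr = float(lam.sum() ** 2 / (lam**2).sum())
    p = lam / lam.sum()
    p = p[p > 0]
    r_eff = float(np.exp(-(p * np.log(p)).sum()))
    return pr, r_eff

rng = np.random.default_rng(0)
iso = rng.normal(size=(5000, 16))                    # isotropic cloud in R^16
spiked = iso * np.array([10.0] + [1.0] * 15)         # one dominant direction
print(effective_ranks(iso), effective_ranks(spiked))
```

For the isotropic cloud both summaries sit near the ambient dimension 16; the spiked cloud collapses them toward a small number despite having the same 16 coordinates.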

RemarkBox

Spectral decay and anisotropy

If a few eigenvalues dominate, the space uses only a small number of global directions strongly. That is one common signature of anisotropy, although anisotropy can be defined in more than one way depending on centering, normalization, and the statistic being used.

InteractiveWidgetShell

Spectral / PCA demo


Section 8 • Not all embedding spaces are the same object

Static, contextual, and sentence embeddings

A single page about “embeddings” can become vague fast. The geometry of static word vectors, contextual token states, and sentence embeddings is related, but not interchangeable.

DefinitionBox

Static word embeddings

One token type, one vector. Geometry is easy to inspect, but polysemy is compressed into a single point.

DefinitionBox

Contextual token embeddings

The same word form can move across contexts. These vectors are conditioned on the whole sequence and are better viewed as context-indexed token representations.

DefinitionBox

Sentence embeddings

A sequence is mapped to one vector for similarity, retrieval, clustering, or classification. This usually requires pooling and often additional training objectives.

RemarkBox

Why the distinction matters

BERT hidden states are contextual token representations, not automatically a good sentence metric space. A pooled language-model output and a sentence encoder trained for semantic similarity are not geometrically equivalent objects.

Section 9 • Pooling alone is not the whole story

Sentence embeddings in practice

Sentence embeddings are built for semantic similarity and retrieval. That is why pooling choices and training objectives matter so much.

ExampleBox

Pooling choices

CLS pooling, mean pooling, and max pooling compress a token sequence differently. Mean pooling is a strong default in many sentence-transformer pipelines because it is stable and easy to optimize.
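Mean pooling has one detail worth writing down: padding positions must be excluded via the attention mask. A minimal sketch of mask-aware mean pooling (random states stand in for real model outputs):

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mask-aware mean pooling: average token states, ignoring padding.

    hidden: (batch, seq, d) token states; mask: (batch, seq) with 1 = real token.
    """
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1e-9, None)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])   # trailing positions are padding
pooled = mean_pool(hidden, mask)
print(pooled.shape)
```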

KeyIdeaBox

Metric learning shapes the space

Siamese and contrastive training do not merely “read out” sentence meaning. They reorganize the geometry so semantically related sentences land near one another under the chosen similarity.

RemarkBox

Why specialized sentence models exist

Pretrained language models are not optimized directly for sentence-level metric structure. Models like SBERT and SimCSE explicitly train the representation space for pairwise semantic tasks, which is why they tend to behave better for search and similarity.

InteractiveWidgetShell

Sentence embedding demo

Section 10 • Global spread is subtle

Anisotropy, isotropy, and uniformity

These words are often used loosely. They should not be collapsed into one vague idea of “good spread.”

DefinitionBox

Second-moment isotropy

After centering, a distribution is isotropic in the covariance sense if its covariance matrix is proportional to the identity. No direction receives systematically more variance than another.

DefinitionBox

Uniformity on the sphere

For normalized embeddings, “uniformity” often means the distribution is well spread over the unit sphere. This is a different notion from covariance isotropy in raw ambient coordinates.

RemarkBox

Contextual embeddings are often anisotropic

Raw contextual token or sentence states from pretrained language models often concentrate in a narrow cone or a small set of dominant directions. This can damage cosine-based semantic discrimination unless the representation is adapted or fine-tuned.

RemarkBox

Measuring isotropy is fragile

Average cosine and related heuristics can be brittle. A metric like IsoScore tries to quantify how evenly variance uses the ambient space, but even then the interpretation depends on preprocessing, centering, normalization, and the task.

FigureWithCaption

Figure (two panels: flatter spectrum vs. spiked spectrum)
A flatter spectrum spreads variance across more directions. A spiked spectrum indicates stronger global directional preference and is one common form of anisotropy.

Section 11 • Training objectives reshape geometry

Contrastive learning, alignment, and uniformity

Contrastive learning does not just improve accuracy. It changes the geometry of the representation by bringing positive pairs together and pushing the overall distribution away from collapse.

FormulaBlock

Contrastive objective (schematic)

$$ \mathcal{L}_i = - \log \frac{\exp(\mathrm{sim}(z_i, z_i^+)/\tau)}{\sum_j \exp(\mathrm{sim}(z_i, z_j)/\tau)}. $$

Positive pairs should be close; other points should not all collapse into the same region. The temperature \(\tau\) controls how sharp the comparison is.
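The schematic objective becomes concrete as an in-batch InfoNCE loss. A sketch (cosine similarity on normalized features; random vectors stand in for encoder outputs): near-identical positives give a low loss, shuffled "positives" do not.

```python
import numpy as np

def info_nce(Z: np.ndarray, Zpos: np.ndarray, tau: float = 0.1) -> float:
    """In-batch InfoNCE: row i of Zpos is the positive for row i of Z,
    and every other row of Zpos serves as a negative."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)        # cosine similarity
    Zpos = Zpos / np.linalg.norm(Zpos, axis=1, keepdims=True)
    logits = (Z @ Zpos.T) / tau                             # sim(z_i, z_j^+) / tau
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())                # -log softmax at positive

rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 64))
aligned = info_nce(Z, Z + 0.01 * rng.normal(size=Z.shape))  # near-identical positives
shuffled = info_nce(Z, rng.normal(size=Z.shape))            # unrelated "positives"
print(aligned, shuffled)
```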

FormulaBlock

Alignment and uniformity

$$ \mathrm{Align} = \mathbb{E}\|z-z^+\|_2^2, \qquad \mathrm{Unif}_t = \log \mathbb{E}\exp(-t\|z-z'\|_2^2). $$

These are typically studied for normalized features on the sphere. Lower alignment means closer positive pairs; lower uniformity score means broader spread over the sphere.
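Both quantities are short to compute for normalized features. A sketch (with an artificially constructed narrow cone for contrast): a well-spread spherical cloud scores lower, i.e. better, uniformity than a cone.

```python
import numpy as np

def alignment(Z: np.ndarray, Zpos: np.ndarray) -> float:
    """E ||z - z^+||^2 over positive pairs (lower = positives closer)."""
    return float((np.linalg.norm(Z - Zpos, axis=1) ** 2).mean())

def uniformity(Z: np.ndarray, t: float = 2.0) -> float:
    """log E exp(-t ||z - z'||^2) over distinct pairs (lower = better spread)."""
    sq = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1) ** 2
    iu = np.triu_indices(len(Z), k=1)
    return float(np.log(np.exp(-t * sq[iu]).mean()))

def random_sphere(n: int, d: int, seed: int = 0) -> np.ndarray:
    X = np.random.default_rng(seed).normal(size=(n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

spread = random_sphere(500, 16)
e = np.zeros(16)
e[0] = 1.0
cone = random_sphere(500, 16, seed=1) * 0.1 + e     # narrow cone around e
cone /= np.linalg.norm(cone, axis=1, keepdims=True)
print(uniformity(spread), uniformity(cone))
```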

KeyIdeaBox

Why this helps sentence embeddings

If semantically related pairs are aligned and the whole cloud avoids a narrow anisotropic cone, nearest-neighbor retrieval becomes more meaningful. This is one reason contrastive fine-tuning often helps sentence-level semantic tasks.

InteractiveWidgetShell

Contrastive geometry demo


Section 12 • Nearest-neighbor pathologies

Hubness and nearest-neighbor failures

In high-dimensional spaces, some points become nearest neighbors of many others. Those points are hubs, and they can distort search, ranking, and clustering.

DefinitionBox

k-occurrence count

For a point \(x\), define \(N_k(x)\) as the number of query points for which \(x\) appears in their \(k\)-nearest-neighbor list. Large \(N_k(x)\) indicates hubness.
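The k-occurrence count can be computed by brute force on a small cloud. A sketch (a high-dimensional Gaussian cloud, where skewed counts typically appear): the mean of \(N_k\) is exactly \(k\), so hubness shows up as a heavy right tail, not a shifted mean.

```python
import numpy as np

def k_occurrence(X: np.ndarray, k: int = 5) -> np.ndarray:
    """N_k(x): how often each point appears in another point's k-NN list."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self-matches
    knn = np.argsort(D, axis=1)[:, :k]           # each row: its k nearest neighbors
    return np.bincount(knn.ravel(), minlength=len(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))                  # high-dimensional Gaussian cloud
Nk = k_occurrence(X, k=5)
# A skewed N_k distribution (a few very large counts) signals hubness.
print(Nk.max(), Nk.mean())
```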

RemarkBox

Why it matters

A hub can appear semantically plausible for many unrelated queries simply because the geometry is distorted. In retrieval systems this means over-recommended points, unstable rankings, and weaker coverage of the actual semantic neighborhood.

ExampleBox

Mitigation ideas

Centering, whitening, normalization, local scaling, or metric learning can help, depending on the application. None of these is universally correct; the right fix depends on what the norms and local densities are supposed to mean.

InteractiveWidgetShell

Hubness demo


Section 13 • 2D plots are useful and dangerous

Dimensionality reduction and visualization

PCA, t-SNE, and UMAP are useful exploratory tools. None of them directly proves the true geometry of the original embedding space.

DefinitionBox

PCA

A linear projection onto high-variance directions. It is interpretable and tied directly to the covariance spectrum, but it cannot unwrap nonlinear manifolds.
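PCA is a direct consequence of the SVD of the centered data matrix. A sketch on synthetic data concentrated near a plane in \(\mathbb{R}^{10}\): two components recover essentially all the variance.

```python
import numpy as np

def pca_project(X: np.ndarray, k: int = 2):
    """Project centered data onto its top-k principal directions via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S[:k] ** 2 / (S**2).sum()        # variance fraction per component
    return Xc @ Vt[:k].T, explained

rng = np.random.default_rng(0)
# A cloud concentrated near a 2D plane inside R^10, plus small noise.
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10)) \
    + 0.01 * rng.normal(size=(500, 10))
Y, explained = pca_project(X, k=2)
print(Y.shape, explained.sum())
```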

DefinitionBox

t-SNE

A nonlinear neighbor-embedding method designed for visualization. It emphasizes local neighborhoods, but global distances, cluster sizes, and empty space are not directly trustworthy.

DefinitionBox

UMAP

Another neighbor-graph-based projection method. It is often faster and can preserve some coarse global structure better than t-SNE, but it is still a low-dimensional embedding with its own distortions.

RemarkBox

What a projection can and cannot say

A useful projection can reveal local continuity, outliers, or rough topic structure. It cannot by itself certify intrinsic dimension, isotropy, ranking quality, or the validity of a downstream retrieval metric. Those require direct geometric diagnostics in the original space.

Section 14 • Practical research checks

Diagnosing an embedding space

The right workflow is empirical and geometric: inspect the space from multiple angles rather than trusting a single score or a single 2D plot.

RemarkBox

Nearest-neighbor sanity checks

Inspect actual neighbors for representative queries. Look for semantic drift, generic hubs, duplicates, or norm-driven artifacts.

RemarkBox

Norm distribution

Check whether norms are informative, unstable, or correlated with nuisance variables such as length or frequency.

RemarkBox

Spectral decay

Examine covariance eigenvalues, effective rank, or participation ratio to see whether the space is strongly dominated by a few directions.

RemarkBox

Intrinsic dimension estimates

Compare local neighbor-based ID with global linear surrogates. Agreement is informative; mismatch is often even more informative.

RemarkBox

Uniformity and anisotropy metrics

Use them carefully, with clear preprocessing conventions. Raw average cosine and global covariance statistics measure different properties.

RemarkBox

Task-level evaluation

Retrieval metrics, STS correlation, clustering quality, or downstream classifier performance are still necessary. Geometry is only useful if it supports the task.

Section 15 • Short corrections to common claims

Failure modes and misconceptions

Several widespread statements about embedding spaces are too vague or just false once the geometry is stated carefully.

RemarkBox

“High vector dimension means high intrinsic complexity.”

No. Ambient dimension and intrinsic dimension answer different questions. A 768-dimensional model can represent a cloud concentrated near a much lower-dimensional structure.

RemarkBox

“Cosine is always the right metric.”

No. Cosine is appropriate when only direction should matter. If norms carry signal, suppressing them can remove useful information.

RemarkBox

“A pretty 2D projection proves the geometry.”

No. PCA, t-SNE, and UMAP are partial views. They can be helpful, but they are not direct witnesses of the original high-dimensional metric structure.

RemarkBox

“Anisotropy is always bad.”

Not automatically. Some tasks or models genuinely encode useful signal in dominant directions. What matters is whether the resulting geometry matches the task.

RemarkBox

“Sentence embeddings are just pooled token embeddings.”

Pooling is one construction, but metric-learning objectives and contrastive fine-tuning can substantially reorganize the sentence-level space.

Section 16 • Where geometry changes behavior

Applications

Embedding geometry matters most when a system depends on similarity, ranking, clustering, or transfer between related objects.

ExampleBox

Semantic search

Retrieval quality depends on neighborhood structure, hubness, normalization, and the chosen similarity function.

ExampleBox

Clustering and topic discovery

Whether clusters appear cleanly depends on both local neighborhoods and the global spread of the representation.

ExampleBox

Recommendation

Hubness and anisotropy can bias recommender retrieval toward over-popular or overly central items.

ExampleBox

Retrieval-augmented systems

Vector search quality directly affects which contexts are retrieved, which in turn affects the behavior of the downstream model.

ExampleBox

Graph and node embeddings

The same geometry questions appear: local neighborhoods, spectral structure, normalization, and the relation between graph topology and ambient space.

ExampleBox

Multimodal embeddings

Cross-modal alignment adds another layer: image and text spaces need compatible geometry, not just individually reasonable representations.

Section 17 • Closing map

Takeaways and further reading

Embedding spaces are best understood as geometric objects shaped by a map, a metric, a training objective, and a task. That is the level at which they should be studied.

KeyIdeaBox

Summary

Embeddings begin as maps \(f:X \to \mathbb{R}^d\), not isolated vectors. Metric choice, normalization, high-dimensional effects, intrinsic dimensionality, spectral structure, anisotropy, and contrastive objectives all change the resulting geometry. Diagnostic work matters more than pretty projections.

FurtherReadingBox

How to read further

A good sequence is: sentence-embedding papers for motivation, alignment/uniformity for contrastive geometry, intrinsic-dimension papers for scale, then hubness and visualization papers for failure analysis.

Contrastive geometry

Wang & Isola 2020 for alignment and uniformity, then SimCSE for the sentence-embedding case.

Visualization

t-SNE and UMAP as visualization tools, not as proofs of the true geometry.