Digital Mathematical Notebook
Embedding spaces 𖦹ׂ ₊˚⊹⋆
Embeddings are learned geometric objects: a model maps data into a vector space, and everything after that depends on geometry. Similarity, normalization, dimensionality, anisotropy, local neighborhoods, and failure modes all shape what the representation can actually do.
Why this matters
Nearest-neighbor search depends on metric choice, normalization, and neighborhood structure.
Cluster quality depends on local density, spectral decay, and how anisotropic the space is.
Good ranking requires the geometry to align with semantic similarity, not just raw vector norms.
Contrastive and metric-learning objectives directly reshape the geometry of the representation.
Pretty 2D plots are not enough; you need metrics, spectra, ID estimates, and sanity checks.
What an embedding is
An embedding is not a vector by itself. It is a map that sends objects into a feature space where geometry becomes available.
DefinitionBox
Embedding map versus embedding vector
Let \(X\) be a set of objects: words, sentences, images, graph nodes, users, documents, or something else. An embedding model is a map
$$ f : X \to \mathbb{R}^d. $$
For each object \(x \in X\), the vector \(f(x)\) is its representation. The map \(f\) and the resulting vectors \(f(x)\) should not be conflated.
ExampleBox
Static word embeddings
A vocabulary item \(w\) receives one vector \(f(w)\). The word “bank” gets the same embedding in a river context and a finance context.
ExampleBox
Contextual token embeddings
A transformer maps a token together with its context to a vector. The same surface form can land in different parts of space depending on the sentence.
ExampleBox
Sentence embeddings
A whole sequence is mapped to one vector, often by pooling or by a model trained directly for semantic similarity or retrieval.
RemarkBox
Why geometry appears at all
Once objects are represented in \(\mathbb{R}^d\), we can compare them with inner products, norms, distances, covariance, spectral structure, nearest-neighbor graphs, and local dimensionality estimates. The representation becomes a geometric object, not just a table of numbers.
Geometry basics for embeddings
Before discussing anisotropy, hubness, or intrinsic dimensionality, the basic objects have to be separated cleanly: norms, inner products, angles, distances, and neighborhoods.
FormulaBlock
Norms, inner products, and angles
$$ \langle x, y \rangle = x^\top y, \qquad \|x\|_2 = \sqrt{\langle x, x \rangle}. $$
$$ \cos \theta(x,y) = \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2}, \qquad x,y \neq 0. $$
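These definitions translate directly to code. A minimal numpy sketch (the helper name `cosine` and the small epsilon guard are my additions, not part of any library API):

```python
import numpy as np

def cosine(x, y, eps=1e-12):
    # cos(theta) = <x, y> / (||x|| ||y||); eps guards against zero vectors
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
# the angle between x and y is 45 degrees, so the cosine is 1/sqrt(2)
```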
FormulaBlock
Distances and neighborhoods
$$ d_2(x,y) = \|x-y\|_2, \qquad d_1(x,y)=\|x-y\|_1. $$
A \(k\)-nearest-neighbor set is defined only after choosing a metric or a similarity-to-distance convention. Different choices induce different local graphs.
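The claim that the metric choice changes the neighbor graph can be demonstrated in a few lines. In this sketch (point values chosen purely for illustration), the nearest neighbor of the origin differs under \(\ell_2\) and \(\ell_1\):

```python
import numpy as np

def nearest(points, q, metric):
    # index of the point nearest to q under the given metric
    return int(np.argmin([metric(q, p) for p in points]))

l2 = lambda a, b: float(np.linalg.norm(a - b))
l1 = lambda a, b: float(np.abs(a - b).sum())

pts = np.array([[1.0, 0.0], [0.6, 0.6], [0.0, 1.2]])
q = np.zeros(2)
# l2 picks the diagonal point (index 1); l1 picks the axis point (index 0)
```

The diagonal point is closest in Euclidean distance but pays for both coordinates under Manhattan distance, so the induced 1-NN graphs already disagree.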
KeyIdeaBox
Local versus global geometry
A space may look low-dimensional locally and still be globally curved or highly anisotropic. Nearest neighbors probe local geometry; spectra and covariance probe global linear structure.
RemarkBox
Subspace, affine subspace, manifold
A linear subspace passes through the origin. An affine subspace is a translated linear subspace. A nonlinear manifold can curve through ambient space and need not be globally linear at all.
Similarity metrics and normalization
Dot product, cosine similarity, Euclidean distance, and Manhattan distance answer related but genuinely different questions. The geometry changes again when vectors are normalized.
DefinitionBox
Similarity versus distance
A similarity assigns larger values to more alike pairs. A distance assigns smaller values to more alike pairs. Some libraries expose “negative Euclidean” or “negative Manhattan” scores so they can be maximized like similarities.
ExampleBox
What each metric emphasizes
Dot product mixes angle and norm. Cosine removes norm and keeps angle. Euclidean depends on both direction and scale. Manhattan is axis-dependent and reacts differently to coordinatewise changes.
TheoremBox
Unit-sphere equivalence
If \(\|u\|_2=\|v\|_2=1\), then
$$ \|u-v\|_2^2 = 2 - 2\langle u, v \rangle = 2 - 2\cos \theta(u,v). $$
On the unit sphere, maximizing cosine similarity, maximizing dot product, and minimizing Euclidean distance all give the same ordering.
ProofToggle • Why normalization makes cosine, dot, and Euclidean agree
Expand the squared Euclidean distance: \(\|u-v\|_2^2 = \langle u-v, u-v \rangle = \|u\|_2^2 + \|v\|_2^2 - 2\langle u,v \rangle\). If both vectors are unit length, this becomes \(2-2\langle u,v\rangle\). Because \(\langle u,v\rangle = \cos\theta(u,v)\) on the sphere, the three orderings coincide.
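The identity can also be checked numerically. A small sketch with random normalized vectors (seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
U = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto the unit sphere
q = U[0]

dot = U @ q
eucl = np.linalg.norm(U - q, axis=1)
# the identity: ||u - q||^2 = 2 - 2<u, q> on the sphere
assert np.allclose(eucl**2, 2 - 2 * dot)
# sorting by ascending distance and by descending dot yields the same distances
assert np.allclose(np.sort(eucl), eucl[np.argsort(-dot)])
```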
What changes after normalization
L2 normalization is not a cosmetic operation. It deletes norm information, constrains vectors to a sphere, and changes which geometric quantities matter.
KeyIdeaBox
Normalization fixes the radius
After mapping \(x \mapsto x / \|x\|_2\), all vectors live on the unit sphere. Geometry becomes angular: the model can still change direction, but not radial magnitude.
RemarkBox
Norms can still be meaningful
In some spaces, vector norms correlate with frequency, confidence, or salience. Normalization can improve retrieval stability, but it can also discard information that an application actually uses.
ExampleBox
Sentence-embedding practice
In sentence-transformer style systems, cosine is a standard default. If embeddings are already normalized, then dot product is equivalent for ranking and avoids another normalization pass at inference time.
High-dimensional geometry and the curse of dimensionality
Embedding spaces are often hundreds or thousands of dimensions wide. Distances, neighborhoods, and volume behave differently there than they do in the plane.
DefinitionBox
Distance concentration
As ambient dimension grows, pairwise distances in many distributions become relatively less separated. The nearest and farthest neighbors can become numerically closer than low-dimensional intuition suggests.
RemarkBox
Nearest-neighbor instability
When distances concentrate, small perturbations, metric choices, or normalization steps can change rankings more than expected. Retrieval quality then depends strongly on geometry engineering.
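Concentration is easy to see empirically. This sketch compares the relative spread of distances to the origin for Gaussian clouds in 2 and 1000 dimensions (sample size, seed, and the contrast statistic are my choices):

```python
import numpy as np

def relative_contrast(dim, n=500, seed=0):
    # (max - min) / min over distances from the origin for a Gaussian cloud
    X = np.random.default_rng(seed).normal(size=(n, dim))
    d = np.linalg.norm(X, axis=1)
    return float((d.max() - d.min()) / d.min())

# contrast collapses as dimension grows: nearest and farthest look similar
low_d, high_d = relative_contrast(2), relative_contrast(1000)
```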
ExampleBox
Sparse volume intuition
In high dimension, most of a ball's volume concentrates near its surface, far from the center. Data clouds often look empty when projected down, yet their local neighborhoods can still be noisy, uneven, and hard to compare.
KeyIdeaBox
High dimension is not automatically bad
Large ambient dimension gives expressive capacity. Problems appear when the geometry induced by the model is poorly matched to the task, not simply because the coordinate count is large.
Intrinsic dimensionality
The ambient dimension \(d\) of an embedding space says how many coordinates are available. The intrinsic dimension asks how many degrees of freedom the data actually uses.
DefinitionBox
Ambient versus intrinsic dimension
A dataset can live in \(\mathbb{R}^{768}\) while being concentrated near a lower-dimensional object: a line, a plane, a curved manifold, or a union of several such structures. Intrinsic dimension is a statement about data geometry, not about the length of the vectors.
RemarkBox
Local versus global ID
Local estimators measure dimension in a neighborhood around a point or scale. Global surrogates summarize the whole cloud. They are useful for different questions and need not agree numerically.
FormulaBlock
Local MLE-style estimator
$$ \widehat{m}_k(x)=\left[\frac{1}{k-1}\sum_{j=1}^{k-1}\log\frac{T_k(x)}{T_j(x)}\right]^{-1}, $$
where \(T_j(x)\) is the distance from \(x\) to its \(j\)-th nearest neighbor. This probes local distance growth around \(x\).
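The estimator is a few lines of numpy. A sketch on synthetic data: a 1D line embedded in \(\mathbb{R}^5\), where the local estimate should land near 1 despite the ambient dimension (function name, seed, and data are my choices):

```python
import numpy as np

def local_id_mle(X, i, k):
    # Levina-Bickel style local ID at point i: inverse mean of log(T_k / T_j)
    d = np.sort(np.linalg.norm(X - X[i], axis=1))
    T = d[1:k + 1]                      # T_1 .. T_k, skipping the point itself
    return float((k - 1) / np.log(T[-1] / T[:-1]).sum())

rng = np.random.default_rng(0)
t = rng.uniform(size=200)
line = np.outer(t, np.ones(5))          # 1D data living in R^5
est = local_id_mle(line, 0, k=20)       # close to 1, not 5
```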
ExampleBox
Different estimators measure different structure
Neighbor-based ID estimates local expansion. PCA-based approximations measure linear variance concentration. A curved 1D manifold can have local ID near \(1\) while needing two or more global principal components to describe its position in space.
Linear algebra of embedding spaces
Covariance, Gram matrices, singular values, and explained variance reveal redundancy, dominant directions, and compression opportunities in embedding spaces.
FormulaBlock
Covariance and Gram structure
$$ \Sigma = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\top, \qquad G = XX^\top. $$
The eigenvalues of \(\Sigma\) or singular values of the centered data matrix quantify how variance is distributed across directions.
FormulaBlock
Effective rank surrogates
$$ r_{\mathrm{PR}} = \frac{(\sum_i \lambda_i)^2}{\sum_i \lambda_i^2}, \qquad r_{\mathrm{eff}} = \exp\left(-\sum_i p_i \log p_i\right), $$
$$ p_i = \frac{\lambda_i}{\sum_j \lambda_j}. $$
These summarize how concentrated the spectrum is, but they are still global linear summaries, not full intrinsic-dimension estimators.
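Both surrogates are straightforward to compute. In this sketch, a roughly isotropic Gaussian cloud scores near the ambient dimension, while stretching a single axis collapses both scores (function names, sizes, and seed are mine):

```python
import numpy as np

def participation_ratio(X):
    lam = np.clip(np.linalg.eigvalsh(np.cov(X.T)), 0, None)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

def effective_rank(X):
    lam = np.clip(np.linalg.eigvalsh(np.cov(X.T)), 0, None)
    p = lam[lam > 0] / lam.sum()
    return float(np.exp(-(p * np.log(p)).sum()))   # exp of spectral entropy

rng = np.random.default_rng(0)
iso = rng.normal(size=(5000, 10))            # isotropic: both scores near 10
aniso = iso * np.array([10.0] + [1.0] * 9)   # one dominant axis: scores collapse
```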
RemarkBox
Spectral decay and anisotropy
If a few eigenvalues dominate, the space uses only a small number of global directions strongly. That is one common signature of anisotropy, although anisotropy can be defined in more than one way depending on centering, normalization, and the statistic being used.
Static, contextual, and sentence embeddings
A single page about “embeddings” can become vague fast. The geometry of static word vectors, contextual token states, and sentence embeddings is related, but not interchangeable.
DefinitionBox
Static word embeddings
One token type, one vector. Geometry is easy to inspect, but polysemy is compressed into a single point.
DefinitionBox
Contextual token embeddings
The same word form can move across contexts. These vectors are conditioned on the whole sequence and are better viewed as context-indexed token representations.
DefinitionBox
Sentence embeddings
A sequence is mapped to one vector for similarity, retrieval, clustering, or classification. This usually requires pooling and often additional training objectives.
RemarkBox
Why the distinction matters
BERT hidden states are contextual token representations, not automatically a good sentence metric space. A pooled language-model output and a sentence encoder trained for semantic similarity are not geometrically equivalent objects.
Sentence embeddings in practice
Sentence embeddings are built for semantic similarity and retrieval. That is why pooling choices and training objectives matter so much.
ExampleBox
Pooling choices
CLS pooling, mean pooling, and max pooling compress a token sequence differently. Mean pooling is a strong default in many sentence-transformer pipelines because it is stable and easy to optimize.
KeyIdeaBox
Metric learning shapes the space
Siamese and contrastive training do not merely “read out” sentence meaning. They reorganize the geometry so semantically related sentences land near one another under the chosen similarity.
RemarkBox
Why specialized sentence models exist
Pretrained language models are not optimized directly for sentence-level metric structure. Models like SBERT and SimCSE explicitly train the representation space for pairwise semantic tasks, which is why they tend to behave better for search and similarity.
Anisotropy, isotropy, and uniformity
These words are often used loosely. They should not be collapsed into one vague idea of “good spread.”
DefinitionBox
Second-moment isotropy
After centering, a distribution is isotropic in the covariance sense if its covariance matrix is proportional to the identity. No direction receives systematically more variance than another.
DefinitionBox
Uniformity on the sphere
For normalized embeddings, “uniformity” often means the distribution is well spread over the unit sphere. This is a different notion from covariance isotropy in raw ambient coordinates.
RemarkBox
Contextual embeddings are often anisotropic
Raw contextual token or sentence states from pretrained language models often concentrate in a narrow cone or a small set of dominant directions. This can damage cosine-based semantic discrimination unless the representation is adapted or fine-tuned.
RemarkBox
Measuring isotropy is fragile
Average cosine and related heuristics can be brittle. A metric like IsoScore tries to quantify how evenly variance uses the ambient space, but even then the interpretation depends on preprocessing, centering, normalization, and the task.
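Here is the average-cosine heuristic in practice, and one reason it is fragile: merely shifting an isotropic cloud off the origin makes it look like a narrow cone (sizes, shift, and seeds are arbitrary choices of mine):

```python
import numpy as np

def mean_pairwise_cosine(X, n_pairs=2000, seed=0):
    # average cosine over random pairs: a crude, preprocessing-sensitive signal
    rng = np.random.default_rng(seed)
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    i, j = rng.integers(0, len(X), size=(2, n_pairs))
    return float((U[i] * U[j]).sum(axis=1).mean())

rng = np.random.default_rng(1)
iso = rng.normal(size=(1000, 32))   # centered cloud: mean cosine near 0
cone = iso + 5.0                    # same cloud, shifted: mean cosine near 1
```

The shifted cloud has exactly the same covariance as the centered one, which is why centering conventions matter so much when reporting anisotropy numbers.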
Contrastive learning, alignment, and uniformity
Contrastive learning does not just improve accuracy. It changes the geometry of the representation by bringing positive pairs together and pushing the overall distribution away from collapse.
FormulaBlock
Contrastive objective (schematic)
$$ \mathcal{L}_i = - \log \frac{\exp(\mathrm{sim}(z_i, z_i^+)/\tau)}{\sum_j \exp(\mathrm{sim}(z_i, z_j)/\tau)}. $$
Positive pairs should be close; other points should not all collapse into the same region. The temperature \(\tau\) controls how sharp the comparison is.
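A concrete instance of the schematic loss, using cosine similarity and in-batch negatives taken from the other view (this cross-view variant is one common choice, not the only one; the function name is mine):

```python
import numpy as np

def info_nce(Z, Zpos, tau=0.1):
    # rows of Z and Zpos are positive pairs; other rows in the batch are negatives
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Zpos = Zpos / np.linalg.norm(Zpos, axis=1, keepdims=True)
    sim = (Z @ Zpos.T) / tau                   # scaled cosine similarities
    sim -= sim.max(axis=1, keepdims=True)      # stabilize the softmax
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.diag(log_p).mean())       # -log prob of the true positives

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 16))
# perfectly aligned positives give a far lower loss than random pairings
```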
FormulaBlock
Alignment and uniformity
$$ \mathrm{Align} = \mathbb{E}\|z-z^+\|_2^2, \qquad \mathrm{Unif}_t = \log \mathbb{E}\exp(-t\|z-z'\|_2^2). $$
These are typically studied for normalized features on the sphere. Lower alignment means closer positive pairs; lower uniformity score means broader spread over the sphere.
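Both quantities are easy to estimate for normalized features. In this sketch, a cloud spread over the sphere scores much lower on the uniformity loss than a fully collapsed one (names, sizes, and seed are mine):

```python
import numpy as np

def alignment(Z, Zpos):
    # E ||z - z+||^2 over positive pairs (assumes L2-normalized inputs)
    return float((np.linalg.norm(Z - Zpos, axis=1) ** 2).mean())

def uniformity(Z, t=2.0):
    # log E exp(-t ||z - z'||^2) over distinct pairs
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    i, j = np.triu_indices(len(Z), k=1)
    return float(np.log(np.exp(-t * sq[i, j]).mean()))

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
collapsed = np.tile(Z[0], (200, 1))   # fully collapsed cloud: uniformity is 0
```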
KeyIdeaBox
Why this helps sentence embeddings
If semantically related pairs are aligned and the whole cloud avoids a narrow anisotropic cone, nearest-neighbor retrieval becomes more meaningful. This is one reason contrastive fine-tuning often helps sentence-level semantic tasks.
Hubness and nearest-neighbor failures
In high-dimensional spaces, some points become nearest neighbors of many others. Those points are hubs, and they can distort search, ranking, and clustering.
DefinitionBox
k-occurrence count
For a point \(x\), define \(N_k(x)\) as the number of query points for which \(x\) appears in their \(k\)-nearest-neighbor list. Large \(N_k(x)\) indicates hubness.
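Counting \(N_k\) over a dataset is direct. A brute-force numpy sketch, fine for a few thousand points (function name, data, and seed are mine):

```python
import numpy as np

def k_occurrence(X, k):
    # N_k(x): how many points include x in their k-nearest-neighbor list
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                   # a point is not its own neighbor
    neighbors = np.argsort(D, axis=1)[:, :k]
    return np.bincount(neighbors.ravel(), minlength=len(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))
Nk = k_occurrence(X, k=10)
# Nk.mean() is exactly k; a long right tail above k is the hubness signal
```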
RemarkBox
Why it matters
A hub can appear semantically plausible for many unrelated queries simply because the geometry is distorted. In retrieval systems this means over-recommended points, unstable rankings, and weaker coverage of the actual semantic neighborhood.
ExampleBox
Mitigation ideas
Centering, whitening, normalization, local scaling, or metric learning can help, depending on the application. None of these is universally correct; the right fix depends on what the norms and local densities are supposed to mean.
Dimensionality reduction and visualization
PCA, t-SNE, and UMAP are useful exploratory tools. None of them directly proves the true geometry of the original embedding space.
DefinitionBox
PCA
A linear projection onto high-variance directions. It is interpretable and tied directly to the covariance spectrum, but it cannot unwrap nonlinear manifolds.
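PCA on embeddings is just an SVD of the centered data matrix. A sketch on synthetic data with two dominant directions in \(\mathbb{R}^{10}\) (scales, seed, and names are arbitrary choices of mine):

```python
import numpy as np

def pca_project(X, k=2):
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = S ** 2 / (len(X) - 1)       # covariance eigenvalues, descending
    return Xc @ Vt[:k].T, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) * np.array([5.0, 3.0] + [0.1] * 8)
proj, lam = pca_project(X, k=2)
# here the top two eigenvalues carry almost all the variance
```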
DefinitionBox
t-SNE
A nonlinear neighbor-embedding method designed for visualization. It emphasizes local neighborhoods, but global distances, cluster sizes, and empty space are not directly trustworthy.
DefinitionBox
UMAP
Another neighbor-graph-based projection method. It is often faster and can preserve some coarse global structure better than t-SNE, but it is still a low-dimensional embedding with its own distortions.
RemarkBox
What a projection can and cannot say
A useful projection can reveal local continuity, outliers, or rough topic structure. It cannot by itself certify intrinsic dimension, isotropy, ranking quality, or the validity of a downstream retrieval metric. Those require direct geometric diagnostics in the original space.
Diagnosing an embedding space
The right workflow is empirical and geometric: inspect the space from multiple angles rather than trusting a single score or a single 2D plot.
RemarkBox
Nearest-neighbor sanity checks
Inspect actual neighbors for representative queries. Look for semantic drift, generic hubs, duplicates, or norm-driven artifacts.
RemarkBox
Norm distribution
Check whether norms are informative, unstable, or correlated with nuisance variables such as length or frequency.
RemarkBox
Spectral decay
Examine covariance eigenvalues, effective rank, or participation ratio to see whether the space is strongly dominated by a few directions.
RemarkBox
Intrinsic dimension estimates
Compare local neighbor-based ID with global linear surrogates. Agreement is informative; mismatch is often even more informative.
RemarkBox
Uniformity and anisotropy metrics
Use them carefully, with clear preprocessing conventions. Raw average cosine and global covariance statistics measure different properties.
RemarkBox
Task-level evaluation
Retrieval metrics, STS correlation, clustering quality, or downstream classifier performance are still necessary. Geometry is only useful if it supports the task.
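Several of these checks can be packaged into a cheap first-pass routine. A sketch with three signals (names, data, and seed are mine; a real diagnosis should also include neighbor inspection and task metrics):

```python
import numpy as np

def quick_diagnostics(X):
    # a few cheap geometric signals to examine before trusting a space
    norms = np.linalg.norm(X, axis=1)
    lam = np.clip(np.linalg.eigvalsh(np.cov(X.T)), 0, None)
    U = X / norms[:, None]
    return {
        "norm_cv": float(norms.std() / norms.mean()),                 # norm spread
        "participation_ratio": float(lam.sum() ** 2 / (lam ** 2).sum()),
        "mean_cos_to_mean_dir": float((U @ U.mean(axis=0)).mean()),   # cone signal
    }

rng = np.random.default_rng(0)
report = quick_diagnostics(rng.normal(size=(1000, 32)))
# an isotropic cloud: participation ratio near 32, cone signal near 0
```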
Failure modes and misconceptions
Several widespread statements about embedding spaces are too vague or just false once the geometry is stated carefully.
RemarkBox
“High vector dimension means high intrinsic complexity.”
No. Ambient dimension and intrinsic dimension answer different questions. A 768-dimensional model can represent a cloud concentrated near a much lower-dimensional structure.
RemarkBox
“Cosine is always the right metric.”
No. Cosine is appropriate when only direction should matter. If norms carry signal, suppressing them can remove useful information.
RemarkBox
“A pretty 2D projection proves the geometry.”
No. PCA, t-SNE, and UMAP are partial views. They can be helpful, but they are not direct witnesses of the original high-dimensional metric structure.
RemarkBox
“Anisotropy is always bad.”
Not automatically. Some tasks or models genuinely encode useful signal in dominant directions. What matters is whether the resulting geometry matches the task.
RemarkBox
“Sentence embeddings are just pooled token embeddings.”
Pooling is one construction, but metric-learning objectives and contrastive fine-tuning can substantially reorganize the sentence-level space.
Applications
Embedding geometry matters most when a system depends on similarity, ranking, clustering, or transfer between related objects.
ExampleBox
Semantic search
Retrieval quality depends on neighborhood structure, hubness, normalization, and the chosen similarity function.
ExampleBox
Clustering and topic discovery
Whether clusters appear cleanly depends on both local neighborhoods and the global spread of the representation.
ExampleBox
Recommendation
Hubness and anisotropy can bias recommender retrieval toward over-popular or overly central items.
ExampleBox
Retrieval-augmented systems
Vector search quality directly affects which contexts are retrieved, which in turn affects the behavior of the downstream model.
ExampleBox
Graph and node embeddings
The same geometry questions appear: local neighborhoods, spectral structure, normalization, and the relation between graph topology and ambient space.
ExampleBox
Multimodal embeddings
Cross-modal alignment adds another layer: image and text spaces need compatible geometry, not just individually reasonable representations.
Takeaways and further reading
Embedding spaces are best understood as geometric objects shaped by a map, a metric, a training objective, and a task. That is the level at which they should be studied.
KeyIdeaBox
Summary
Embeddings begin as maps \(f:X \to \mathbb{R}^d\), not isolated vectors. Metric choice, normalization, high-dimensional effects, intrinsic dimensionality, spectral structure, anisotropy, and contrastive objectives all change the resulting geometry. Diagnostic work matters more than pretty projections.
FurtherReadingBox
How to read further
A good sequence is: sentence-embedding papers for motivation, alignment/uniformity for contrastive geometry, intrinsic-dimension papers for scale, then hubness and visualization papers for failure analysis.
Sentence-BERT, SimCSE, and the Sentence Transformers docs.
Wang & Isola 2020 for alignment and uniformity, then SimCSE for the sentence-embedding case.
Radovanović et al. 2010 and follow-up work on nearest-neighbor instability in high dimensions.
t-SNE and UMAP as visualization tools, not as proofs of the true geometry.