🏗️ Representation Learning for NLP
All neural‑network (NN) architectures create vector representations—also called embeddings—of the input.
These vectors pack statistical and semantic cues that let the model classify, translate, or generate text.
The network learns better representations through feedback from a loss function.
Transformers build features for each word with an attention mechanism that asks:
“How important is every other word in the sentence to this word?”
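Concretely, these representations start out as rows of a learnable lookup table that the loss gradually shapes. A minimal PyTorch sketch (the vocabulary size, dimension, and token ids here are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, embed_dim = 10_000, 64

# A learnable lookup table: one vector representation per token id.
embedding = nn.Embedding(vocab_size, embed_dim)

# A 3-word sentence as made-up token ids -> a (3, 64) feature matrix
# that later layers (attention, MLPs) refine via the loss signal.
token_ids = torch.tensor([12, 845, 102])
h = embedding(token_ids)
print(h.shape)  # torch.Size([3, 64])
```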
🔗 GNNs—Representing Graphs
Graph Neural Networks (GNNs) or Graph Convolutional Networks (GCNs) embed nodes and edges.
They rely on neighbourhood aggregation / message passing:
GNNs update the hidden feature $h_i$ of node $i$ at layer $\ell$ via a non-linear transformation of the node's own feature $h_i^{\ell}$ added to an aggregation of the features $h_j^{\ell}$ of each neighbouring node $j \in \mathcal{N}(i)$:
$$h_{i}^{\ell+1} = \sigma \Big( U^{\ell} h_{i}^{\ell} + \sum_{j \in \mathcal{N}(i)} V^{\ell} h_{j}^{\ell} \Big)$$
where $U^{\ell}$ and $V^{\ell}$ are learnable weight matrices of the GNN layer and $\sigma$ is a non-linearity such as ReLU.
Symbol | Meaning |
---|---|
$h_i^\ell$ | feature of node $i$ at layer $\ell$ |
$\mathcal{N}(i)$ | neighbours of node $i$ |
$U^\ell, V^\ell$ | learnable weight matrices |
$\sigma$ | non-linearity (e.g., ReLU) |
Stacking layers lets information flow across the whole graph.
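To make the update concrete, here is a minimal PyTorch sketch of one such layer; the class name, dense-adjacency formulation, and toy graph are illustrative choices, not from the original article:

```python
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    """One message-passing step: h_i' = ReLU(U h_i + sum over neighbours j of V h_j)."""

    def __init__(self, dim: int):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)  # transforms the node's own feature
        self.V = nn.Linear(dim, dim, bias=False)  # transforms neighbour features

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj[i, j] = 1 iff j is a neighbour of i, so adj @ V(h)
        # sums V h_j over each node's neighbourhood in one matmul.
        return torch.relu(self.U(h) + adj @ self.V(h))

# Toy path graph 0 - 1 - 2 with 8-dim node features.
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
h = torch.randn(3, 8)
print(GNNLayer(8)(h, adj).shape)  # torch.Size([3, 8])
```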
🧩 Where Transformers Meet GNNs
Replace the plain sum with a weighted sum via attention → you get a Graph Attention Network (GAT).
Add layer-norm and an MLP, and voilà: a Graph Transformer!
With Transformers, the update for the hidden feature $h_i$ instead becomes:
$$h_{i}^{\ell+1} = \text{Attention} \left( Q^{\ell} h_{i}^{\ell},\ K^{\ell} h_{j}^{\ell},\ V^{\ell} h_{j}^{\ell} \right)$$
which expands to the attention-weighted sum:
$$h_{i}^{\ell+1} = \sum_{j \in \mathcal{S}} w_{ij} \left( V^{\ell} h_{j}^{\ell} \right) \quad \text{where} \quad w_{ij} = \text{softmax}_j \left( Q^{\ell} h_{i}^{\ell} \cdot K^{\ell} h_{j}^{\ell} \right)$$
Here $\mathcal{S}$ is the set of all words in the sentence, and $Q^{\ell}$, $K^{\ell}$, $V^{\ell}$ are learnable query, key, and value weight matrices.
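A minimal single-head sketch of this update in PyTorch; it omits the usual $1/\sqrt{d}$ scaling and multi-head machinery for clarity, and the class name is an illustrative assumption:

```python
import torch
import torch.nn as nn

class AttentionUpdate(nn.Module):
    """w_ij = softmax_j(Q h_i . K h_j), then h_i' = sum_j w_ij (V h_j)."""

    def __init__(self, dim: int):
        super().__init__()
        self.Q = nn.Linear(dim, dim, bias=False)  # query weights
        self.K = nn.Linear(dim, dim, bias=False)  # key weights
        self.V = nn.Linear(dim, dim, bias=False)  # value weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # scores[i, j] = Q h_i . K h_j for every word pair (i, j).
        scores = self.Q(h) @ self.K(h).T
        w = torch.softmax(scores, dim=-1)  # softmax over j: each row sums to 1
        return w @ self.V(h)               # attention-weighted sum of V h_j

h = torch.randn(5, 16)  # a 5-word sentence with 16-dim features
print(AttentionUpdate(16)(h).shape)  # torch.Size([5, 16])
```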
📝 Sentences as Graphs—But with Caveats
Think of a sentence as a fully‑connected graph where every word links to every other.
Transformers = GNNs with multi‑head attention acting as the aggregation rule.
Yet a fully connected graph over $n$ words has on the order of $n^2$ edges, so attention cost grows quadratically with sentence length (a 512-word passage already yields over 260,000 word pairs, as the sketch below shows), and learning very long-range word relations becomes hard.
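A tiny sketch of the sentence-as-graph view makes the quadratic blow-up visible (the helper below is purely illustrative):

```python
import torch

def sentence_graph(n_words: int) -> torch.Tensor:
    """Dense adjacency for a sentence as a fully connected graph:
    every word linked to every other word (self-loops excluded)."""
    return torch.ones(n_words, n_words) - torch.eye(n_words)

for n in (8, 64, 512):
    print(n, "words ->", int(sentence_graph(n).sum()), "directed edges")
# 8 words -> 56, 64 -> 4032, 512 -> 261632: edge count grows as ~n^2.
```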
🤔 Are Transformers Learning Neural Syntax?
Studies suggest attention heads latch onto task‑specific syntax:
- Attention can surface the most relevant word pairs in a sentence.
- Different heads specialise in different syntactic cues.
Graph‑theoretic view: can GNNs on full graphs reveal which edges matter most by inspecting the aggregation weights? This might expose the hidden structure driving model accuracy.
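As a toy version of that probe, one could rank word pairs by their aggregation weight; the sentence and the weight matrix below are fabricated purely for illustration:

```python
import torch

def top_word_pairs(w: torch.Tensor, words: list[str], k: int = 3):
    """Given an attention-weight matrix w (rows sum to 1), return the k
    most strongly weighted (word_i, word_j, weight) pairs."""
    n = w.size(0)
    flat = w.flatten()
    top = flat.topk(k).indices.tolist()
    return [(words[i // n], words[i % n], round(flat[i].item(), 3)) for i in top]

# Hypothetical weights for a 3-word sentence (each row sums to 1).
words = ["she", "eats", "apples"]
w = torch.tensor([[0.1, 0.7, 0.2],
                  [0.3, 0.2, 0.5],
                  [0.2, 0.6, 0.2]])
print(top_word_pairs(w, words))
# [('she', 'eats', 0.7), ('apples', 'eats', 0.6), ('eats', 'apples', 0.5)]
```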
📚 References
- Chaitanya K. Joshi, “Transformers are Graph Neural Networks,” The Gradient (2020).
- 🎥 YouTube Talk