🏗️ Representation Learning for NLP
All neural‑network (NN) architectures create vector representations—also called embeddings—of the input.
These vectors pack statistical and semantic cues that let the model classify, translate, or generate text.
The network learns better representations through feedback from a loss function.
Transformers build features for each word with an attention mechanism that asks:
“How important is every other word in the sentence to this word?”
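Concretely, these representations start out as rows of a learnable lookup table that the loss gradually shapes. A minimal PyTorch sketch (the vocabulary size, dimension, and token ids here are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, embed_dim = 10_000, 64

# A learnable lookup table: one vector representation per token id.
embedding = nn.Embedding(vocab_size, embed_dim)

# A 3-word sentence as made-up token ids -> a (3, 64) feature matrix
# that later layers (attention, MLPs) refine via the loss signal.
token_ids = torch.tensor([12, 845, 102])
h = embedding(token_ids)
print(h.shape)  # torch.Size([3, 64])
```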
🔗 GNNs—Representing Graphs
Graph Neural Networks (GNNs) or Graph Convolutional Networks (GCNs) embed nodes and edges.
They rely on neighbourhood aggregation / message passing:
GNNs update the hidden feature $h_i$ of node $i$ at layer $\ell$ via a non-linear transformation of the node's own feature $h_i^{\ell}$ added to an aggregation of the features $h_j^{\ell}$ of each neighbouring node $j \in \mathcal{N}(i)$:
$$h_{i}^{\ell+1} = \sigma \Big( U^{\ell} h_{i}^{\ell} + \sum_{j \in \mathcal{N}(i)} V^{\ell} h_{j}^{\ell} \Big)$$
where $U^{\ell}$ and $V^{\ell}$ are learnable weight matrices of the GNN layer and $\sigma$ is a non-linearity such as ReLU.
Symbol | Meaning |
---|---|
$h_i^\ell$ | feature of node $i$ at layer $\ell$ |
$\mathcal{N}(i)$ | neighbours of node $i$ |
$U^\ell, V^\ell$ | learnable weight matrices |
$\sigma$ | non-linearity (e.g., ReLU) |
Stacking layers lets information flow across the whole graph.
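To make the update concrete, here is a minimal PyTorch sketch of one such layer; the class name, dense-adjacency formulation, and toy graph are illustrative choices, not from the original article:

```python
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    """One message-passing step: h_i' = ReLU(U h_i + sum over neighbours j of V h_j)."""

    def __init__(self, dim: int):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)  # transforms the node's own feature
        self.V = nn.Linear(dim, dim, bias=False)  # transforms neighbour features

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj[i, j] = 1 iff j is a neighbour of i, so adj @ V(h)
        # sums V h_j over each node's neighbourhood in one matmul.
        return torch.relu(self.U(h) + adj @ self.V(h))

# Toy path graph 0 - 1 - 2 with 8-dim node features.
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
h = torch.randn(3, 8)
print(GNNLayer(8)(h, adj).shape)  # torch.Size([3, 8])
```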
🧩 Where Transformers Meet GNNs
Replace the plain sum with a weighted sum via attention → you get a Graph Attention Network (GAT).
Add layer-norm and an MLP, and voilà: a Graph Transformer!
With Transformers, the update for the hidden feature $h_i$ instead becomes:
$$h_{i}^{\ell+1} = \text{Attention} \left( Q^{\ell} h_{i}^{\ell},\ K^{\ell} h_{j}^{\ell},\ V^{\ell} h_{j}^{\ell} \right)$$
which expands to the attention-weighted sum:
$$h_{i}^{\ell+1} = \sum_{j \in \mathcal{S}} w_{ij} \left( V^{\ell} h_{j}^{\ell} \right) \quad \text{where} \quad w_{ij} = \text{softmax}_j \left( Q^{\ell} h_{i}^{\ell} \cdot K^{\ell} h_{j}^{\ell} \right)$$
Here $\mathcal{S}$ is the set of all words in the sentence, and $Q^{\ell}$, $K^{\ell}$, $V^{\ell}$ are learnable query, key, and value weight matrices.
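A minimal single-head sketch of this update in PyTorch; it omits the usual $1/\sqrt{d}$ scaling and multi-head machinery for clarity, and the class name is an illustrative assumption:

```python
import torch
import torch.nn as nn

class AttentionUpdate(nn.Module):
    """w_ij = softmax_j(Q h_i . K h_j), then h_i' = sum_j w_ij (V h_j)."""

    def __init__(self, dim: int):
        super().__init__()
        self.Q = nn.Linear(dim, dim, bias=False)  # query weights
        self.K = nn.Linear(dim, dim, bias=False)  # key weights
        self.V = nn.Linear(dim, dim, bias=False)  # value weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # scores[i, j] = Q h_i . K h_j for every word pair (i, j).
        scores = self.Q(h) @ self.K(h).T
        w = torch.softmax(scores, dim=-1)  # softmax over j: each row sums to 1
        return w @ self.V(h)               # attention-weighted sum of V h_j

h = torch.randn(5, 16)  # a 5-word sentence with 16-dim features
print(AttentionUpdate(16)(h).shape)  # torch.Size([5, 16])
```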
📝 Sentences as Graphs—But with Caveats
Think of a sentence as a fully‑connected graph where every word links to every other.
Transformers = GNNs with multi‑head attention acting as the aggregation rule.
Yet a fully connected graph over $n$ words has on the order of $n^2$ edges, so attention cost grows quadratically with sentence length (a 512-word passage already yields over 260,000 word pairs, as the sketch below shows), and learning very long-range word relations becomes hard.
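A tiny sketch of the sentence-as-graph view makes the quadratic blow-up visible (the helper below is purely illustrative):

```python
import torch

def sentence_graph(n_words: int) -> torch.Tensor:
    """Dense adjacency for a sentence as a fully connected graph:
    every word linked to every other word (self-loops excluded)."""
    return torch.ones(n_words, n_words) - torch.eye(n_words)

for n in (8, 64, 512):
    print(n, "words ->", int(sentence_graph(n).sum()), "directed edges")
# 8 words -> 56, 64 -> 4032, 512 -> 261632: edge count grows as ~n^2.
```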
🤔 Are Transformers Learning Neural Syntax?
Studies suggest attention heads latch onto task‑specific syntax:
- Attention can surface the most relevant word pairs in a sentence.
- Different heads specialise in different syntactic cues.
Graph‑theoretic view: can GNNs on full graphs reveal which edges matter most by inspecting the aggregation weights? This might expose the hidden structure driving model accuracy.
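As a toy version of that probe, one could rank word pairs by their aggregation weight; the sentence and the weight matrix below are fabricated purely for illustration:

```python
import torch

def top_word_pairs(w: torch.Tensor, words: list[str], k: int = 3):
    """Given an attention-weight matrix w (rows sum to 1), return the k
    most strongly weighted (word_i, word_j, weight) pairs."""
    n = w.size(0)
    flat = w.flatten()
    top = flat.topk(k).indices.tolist()
    return [(words[i // n], words[i % n], round(flat[i].item(), 3)) for i in top]

# Hypothetical weights for a 3-word sentence (each row sums to 1).
words = ["she", "eats", "apples"]
w = torch.tensor([[0.1, 0.7, 0.2],
                  [0.3, 0.2, 0.5],
                  [0.2, 0.6, 0.2]])
print(top_word_pairs(w, words))
# [('she', 'eats', 0.7), ('apples', 'eats', 0.6), ('eats', 'apples', 0.5)]
```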
📚 References
- Chaitanya K. Joshi, “Transformers are Graph Neural Networks,” The Gradient (2020).
- 🎥 YouTube Talk