Notes
I try to put my notes here so I can keep track of my studies :)
🗓 Day 1 — Monday, 10 Feb 2025

🌍 What Is Common Crawl?

Common Crawl is a huge, free snapshot of the public web. A non‑profit updates it every month, storing:
- Billions of HTML pages
- Their cleaned‑up text content
- Extra metadata (links, timestamps, MIME types, …)

Why It Matters
- Track language change – see how words, memes, and topics shift over time.
- Map the web’s link network – study which sites connect and why.
- Train big ML models – use real‑world data instead of tiny toy datasets.

Because each release includes both the raw HTML and a parsed text layer, you can analyze: ...
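To make this concrete, here is a minimal sketch of how one might look up a domain in a Common Crawl index. It assumes the public CDX index server at index.commoncrawl.org; the crawl label and the record field names below are assumptions to check against the current release list.

```python
import json
import requests

# Assumption: pick a real crawl label from https://index.commoncrawl.org/
CRAWL = "CC-MAIN-2025-05"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL}-index"

# Ask the CDX index for all captures of a domain, one JSON record per line.
resp = requests.get(
    INDEX_URL,
    params={"url": "example.com/*", "output": "json"},
    timeout=30,
)
resp.raise_for_status()

# Each record carries the capture timestamp, HTTP status, MIME type, URL,
# and (not printed here) the WARC file/offset needed to fetch the raw page.
for line in resp.text.splitlines()[:5]:
    record = json.loads(line)
    print(record["timestamp"], record["status"], record["mime"], record["url"])
```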
Here are the notes from some talks I attended at ACL 2025 in Vienna!

Eye-tracking

- Why gaze? Eye movements reflect online processing (not just end products), letting us probe difficulty, attention, and strategies during reading. That’s gold for modeling and evaluation. (PubMed)
- Data is maturing: Multilingual, multi‑lab efforts (e.g., MECO, MultiplEYE) plus tooling (e.g., pymovements) have made high‑quality datasets and pipelines more accessible. (meco-read.com, multipleye.eu, arXiv)
- Models & evals: Gaze can improve certain NLP tasks and also evaluate systems with behavioral signals (e.g., readability, MT, summarization). But gains are often modest unless modeling is careful or data is task‑aligned.
- Open debates: How well LLM surprisal predicts human reading times varies with model size, layers, and populations; adding recency biases can help fit human behavior (see the surprisal sketch after these notes). (ACL Anthology, dfki.de, ACL Anthology, ACL Anthology)

Eye‑tracking 101 👁️ Some basic concepts: ...
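Since surprisal comes up so often in this line of work, here is a minimal sketch of how per-token surprisal is typically computed from a causal language model. It assumes the Hugging Face transformers API, and GPT-2 is only an illustrative stand-in (the talks compared many model sizes and layers).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: GPT-2 as a small stand-in model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

sentence = "The old man the boat."
enc = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits  # (1, seq_len, vocab_size)

# Surprisal of token t is -log2 P(token_t | tokens_<t), so we align the
# prediction at position t-1 with the token actually observed at position t.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
ids = enc["input_ids"][0, 1:]
surprisal = -log_probs[torch.arange(ids.size(0)), ids] / torch.log(torch.tensor(2.0))

for tok_id, s in zip(ids, surprisal):
    print(f"{tokenizer.decode(int(tok_id)):>12s}  {s.item():.2f} bits")
```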
Understanding the curse of dimensionality requires more than just examining the representational aspect of data. A key insight lies in the concept of intrinsic dimensionality (ID)—the number of degrees of freedom within a data manifold or subspace. This ID is often independent of the dimensionality of the space in which the data is embedded. As a result, ID serves as a foundation for dimensionality reduction, which improves similarity measures and enhances the scalability of machine learning and data mining methods. ...
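As a toy illustration of the idea, one can embed a low-dimensional signal in a high-dimensional space and recover a crude ID estimate from the singular-value spectrum. The data, the random linear embedding, and the 95% variance cutoff below are my own choices, and this PCA-style estimator only sees linear structure; nearest-neighbour based ID estimators are more robust on curved manifolds.

```python
import numpy as np

# Data with intrinsic dimensionality 3, embedded in a 50-dimensional ambient space.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))                 # 3 true degrees of freedom
embedding = rng.normal(size=(3, 50))                # random linear embedding into R^50
X = latent @ embedding + 0.01 * rng.normal(size=(1000, 50))  # small noise

# Crude ID estimate: number of principal directions needed for 95% of the variance.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)
estimated_id = int(np.searchsorted(explained, 0.95) + 1)

print(f"ambient dimension: {X.shape[1]}, estimated intrinsic dimension: {estimated_id}")
```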
🏗️ Representation Learning for NLP

All neural‑network (NN) architectures create vector representations—also called embeddings—of the input. These vectors pack statistical and semantic cues that let the model classify, translate, or generate text. The network learns better representations through feedback from a loss function. Transformers build features for each word with an attention mechanism that asks: “How important is every other word in the sentence to this word?”

🔗 GNNs—Representing Graphs

Graph Neural Networks (GNNs) or Graph Convolutional Networks (GCNs) embed nodes and edges. They rely on neighbourhood aggregation / message passing: ...
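Here is a minimal numpy sketch of one neighbourhood-aggregation step in the GCN style. The 4-node graph, the feature sizes, and the random weight matrix are made up for illustration; in a real layer W would be learned via the loss function mentioned above.

```python
import numpy as np

# Toy 4-node undirected graph and initial node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)          # adjacency matrix
X = np.random.default_rng(0).normal(size=(4, 8))   # node features (4 nodes, 8 dims)
W = np.random.default_rng(1).normal(size=(8, 16))  # "learnable" weights, fixed here

# One GCN-style layer: H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W).
# Adding the identity gives each node a self-loop, so its own features
# are mixed with its neighbours'; D normalises by (self-looped) degree.
A_hat = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

print(H.shape)  # (4, 16): each node's new vector aggregates its neighbourhood
```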
To qualify as a distance, a measure must satisfy the following properties:
- Symmetry: $d(P, Q) = d(Q, P)$
- Triangle inequality: $d(P, Q) + d(Q, R) \geq d(P, R)$

However, in practice we often deal with weaker notions of distance, commonly referred to as divergences.

Example: KL Divergence

The Kullback-Leibler (KL) divergence is defined as:
$$ D_{\text{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx $$

Properties of KL Divergence
- Not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$
- Infinite for different supports: $D_{\text{KL}}(P \| Q) \to \infty$ if $P$ and $Q$ have different supports.
...
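A minimal sketch of both properties for discrete distributions; the distributions p, q, r below are made up purely for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                                   # terms with p(x) = 0 contribute 0
    with np.errstate(divide="ignore"):             # allow p/q -> inf when q(x) = 0
        ratio = np.log(p[mask] / q[mask])
    return float(np.sum(p[mask] * ratio))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, q))   # D_KL(P || Q)
print(kl_divergence(q, p))   # D_KL(Q || P): a different value, KL is not symmetric

# If Q puts zero mass where P does not, the log-ratio blows up and the
# divergence is infinite: the "different supports" case from the notes.
r = np.array([0.5, 0.5, 0.0])
print(kl_divergence(p, r))   # inf
```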