🗓 Day 1 — Monday, 10 Feb 2025

🌍 What Is Common Crawl?

Common Crawl is a huge, freely available snapshot of the public web.
The non‑profit of the same name publishes a new crawl roughly every month, storing:

  • Billions of HTML pages
  • Their cleaned‑up text content
  • Extra metadata (links, timestamps, MIME types, …)

Why It Matters

  • Track language change – see how words, memes, and topics shift over time.
  • Map the web’s link network – study which sites connect and why.
  • Train big ML models – use real‑world data instead of tiny toy datasets.

Because each release includes both the raw HTML and a parsed text layer, you can analyze:

| Layer | What you can study |
| --- | --- |
| Raw HTML structure | Link graphs, page layout, site categories |
| Clean text content | Sentiment, topic trends, new buzzwords |
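
In practice, the raw HTML ships as WARC files and the extracted text as companion WET files (link and metadata records live in WAT files). Here is a minimal sketch of reading both layers with the open‑source `warcio` package, assuming a locally downloaded segment ("segment.warc.gz" is a placeholder file name):

```python
# Minimal sketch, assuming `pip install warcio` and a locally downloaded
# Common Crawl segment; "segment.warc.gz" is a placeholder file name.
from warcio.archiveiterator import ArchiveIterator

def iter_pages(path):
    """Yield (url, payload) pairs from a Common Crawl archive file.

    WARC files hold 'response' records (the raw HTML layer); the companion
    WET files hold 'conversion' records (the cleaned-up text layer).
    """
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type in ("response", "conversion"):
                url = record.rec_headers.get_header("WARC-Target-URI")
                yield url, record.content_stream().read()

# Example: count the records in one locally downloaded segment.
if __name__ == "__main__":
    print(sum(1 for _ in iter_pages("segment.warc.gz")), "records")
```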

Perks for Researchers

  1. No crawler needed – skip the cost and hassle of scraping the web yourself.
  2. Open licence – anyone can share code, replicate results, and build on your work.
  3. Regular updates – monthly snapshots reveal sudden spikes (e.g., when a new tech goes viral).

All of this makes Common Crawl a go‑to resource for tasks like:

  • Named‑entity recognition
  • Topic classification
  • Question answering

By pooling efforts around one massive, open dataset, researchers push the limits of NLP faster than they could alone.


📍 Factuality & Hallucinations in LLMs

Speaker: Anna Rogers (IT University of Copenhagen)

“Large language models are fluent bullshit generators—they sound right even when they’re wrong.”
— A. Rogers

LLMs can drift from the truth, a problem known as hallucination. Rogers reviewed two popular fixes and where they fall short:

| Approach | How it works | Main weakness |
| --- | --- | --- |
| RAG (Retrieval‑Augmented Generation) | Looks up facts in a search index or database, then feeds the snippets to the model as it writes. | Bad retrieval = bad answer; citations can be incorrect or missing. |
| CoT (Chain‑of‑Thought prompting) | Prompts the model to show step‑by‑step reasoning before the final answer. | “Reasoning” may be invented; the method can be abused to jailbreak the model. |

Impact on the Web

  • Surge in AI‑generated spam and click‑bait
  • Harder to tell real news from synthetic text
  • New headaches for search engines and fact‑checkers

Takeaway: RAG and CoT help, but they don’t eliminate hallucinations. Better evaluation metrics and stronger guardrails are still needed.

RAG (Retrieval‑Augmented Generation) and CoT (Chain‑of‑Thought) both try to make LLM answers more trustworthy, but neither is a silver bullet.


🔍 RAG — Look it up, then write

How it works

  1. Retrieve – find facts in a search index or database.
  2. Generate – feed those facts to the model so it can weave them into its answer.
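
As a toy illustration of that two‑step loop, the sketch below fakes retrieval with naive word overlap and leaves the model call stubbed out; the corpus, the scoring, and the `llm` function are illustrative stand‑ins, not anything from the talk:

```python
# A toy retrieve-then-generate loop; the corpus, overlap scoring, and the
# stubbed `llm` call are illustrative stand-ins, not a production RAG stack.

def retrieve(query, corpus, k=2):
    """Rank documents by naive word overlap with the query
    (a stand-in for a real search index)."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, snippets):
    """Feed the retrieved snippets to the model alongside the question."""
    context = "\n".join(f"- {s}" for s in snippets)
    return ("Answer using ONLY the sources below; say 'unknown' if they "
            f"don't cover it.\nSources:\n{context}\n\n"
            f"Question: {query}\nAnswer:")

corpus = [
    "Common Crawl is a monthly snapshot of the public web.",
    "FineWeb 2 is a multilingual pre-training corpus.",
]
query = "What is Common Crawl?"
prompt = build_prompt(query, retrieve(query, corpus))
# answer = llm(prompt)  # hypothetical model call; a bad retrieval above
#                       # would poison this step no matter how good the model
print(prompt)
```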

Where it breaks

  • If the search misses the right passage, the answer is still wrong.
  • Measuring “quality” is tricky: you need to score retrieval hit‑rate, answer truthfulness, and source fidelity all at once (a toy hit‑rate metric is sketched after this list).
  • Evaluations often rely on yet another LLM, which can add bias.
  • Even with good sources, the model may paraphrase or misquote them.
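
For the retrieval piece specifically, a common starting point is a hit‑rate metric such as recall@k. Here is a toy version (the document IDs are made up) that deliberately ignores the harder answer‑truthfulness side:

```python
# A toy recall@k for the retrieval stage only; real RAG evaluation also needs
# answer-level truthfulness and citation checks, which this deliberately omits.
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose gold passage shows up in the top-k results."""
    hits = sum(1 for docs, gold in zip(retrieved, relevant) if gold in docs[:k])
    return hits / len(relevant)

retrieved = [["doc1", "doc7"], ["doc2", "doc9"]]  # top results per query (made up)
relevant = ["doc7", "doc4"]                       # gold passage per query (made up)
print(recall_at_k(retrieved, relevant, k=2))      # 0.5: the second query missed
```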

📝 CoT — Show your thinking

How it works

  1. Give the model examples that spell out step‑by‑step reasoning.
  2. Ask it to copy that style: “First, think. Then, answer.”
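
A minimal sketch of that few‑shot recipe, with a made‑up exemplar and a stubbed model call:

```python
# A minimal sketch of few-shot chain-of-thought prompting; the exemplar and
# the stubbed `llm` call are made up for illustration.

COT_EXEMPLAR = """\
Q: A pack holds 3 pens and costs 6 euros. What does one pen cost?
Reasoning: 6 euros / 3 pens = 2 euros per pen.
Answer: 2 euros."""

def cot_prompt(question):
    """Prepend a worked example so the model imitates 'first think, then answer'."""
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nReasoning:"

prompt = cot_prompt("A crate holds 12 bottles and costs 24 euros. "
                    "What does one bottle cost?")
# answer = llm(prompt)  # hypothetical call; note the printed 'Reasoning'
#                       # may not reflect the model's actual computation
print(prompt)
```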

Where it breaks

  • Works well on some tasks but worse on others, especially ones involving bias.
  • The “reasoning” it prints may be made up, not its true internal logic.
  • Attackers can use CoT to slip past safety rules (“jailbreaking”).

🗓 Day 2 — Tuesday, 11 Feb 2025

📍 FineWeb 2 — Multilingual Web Data at Scale

Speaker: Guilherme Penedo (Hugging Face)

The opening talk introduced FineWeb 2, a brand‑new, multilingual web corpus for pre‑training large language models.
Penedo explained how the team is porting and tuning the English‑centric cleaning pipeline—deduplication, language ID, toxicity filters, and more—so it works reliably across dozens of other languages.
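
The talk stayed at the pipeline level, but two of the named stages are easy to caricature in a few lines. Below is a deliberately simplified sketch of exact deduplication plus a language filter; `detect_language` is a placeholder stub, and the real pipeline relies on trained classifiers, fuzzy deduplication, and much more:

```python
# A deliberately simplified sketch of two pipeline stages: exact deduplication
# by hashing, plus a language filter. `detect_language` is a placeholder stub;
# real pipelines use a trained classifier (e.g., a fastText language-ID model)
# and fuzzy, not just exact, deduplication.
import hashlib

def exact_dedup(docs):
    """Drop byte-identical documents by hashing their text."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

def detect_language(text):
    """Placeholder: always claims English; swap in a real classifier here."""
    return "en"

def keep_language(docs, lang="en"):
    """Keep only documents whose detected language matches `lang`."""
    return (doc for doc in docs if detect_language(doc["text"]) == lang)

docs = [{"text": "hello web"}, {"text": "hello web"}, {"text": "bonjour le web"}]
print(list(keep_language(exact_dedup(docs))))  # duplicate dropped; stub keeps both languages
```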

📍 Power Laws & Generalization

Speakers: Jenia Jitsev & Marianna Nezhurina

This session explored how power‑law scaling shows up in deep‑learning curves—and why turning those neat mathematical fits into real‑world “generalization scores” is harder than it looks. The speakers highlighted pitfalls such as noisy data, shifting task definitions, and compute limits that break the power‑law trend once models get big enough.
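
For intuition, such a learning‑curve fit is usually written as L(N) = a·N^(−α) + c; the sketch below fits that form to synthetic loss‑versus‑scale points with SciPy (all constants are made up for illustration):

```python
# A minimal sketch of fitting the saturating power law L(N) = a * N**(-alpha) + c
# to a synthetic learning curve; all constants here are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

# Synthetic "loss vs scale" points: a power-law decay plus measurement noise.
n = np.logspace(6, 10, 20)
rng = np.random.default_rng(0)
loss = power_law(n, a=50.0, alpha=0.3, c=1.7) + rng.normal(0.0, 0.02, n.size)

params, _ = curve_fit(power_law, n, loss, p0=(10.0, 0.3, 1.0))
print("fitted a, alpha, c:", params)
# Caveat from the talk: real curves are noisier, and the neat fit can break
# down at large scale, so extrapolated "generalization scores" need care.
```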

📍 Generalization