🗓 Day 1 — Monday, 10 Feb 2025

🌍 What Is Common Crawl?

Common Crawl is a huge, freely available snapshot of the public web.
The non‑profit of the same name publishes a new crawl roughly every month, storing:

  • Billions of HTML pages
  • Their cleaned‑up text content
  • Extra metadata (links, timestamps, MIME types, …)

Why It Matters

  • Track language change – see how words, memes, and topics shift over time.
  • Map the web’s link network – study which sites connect and why.
  • Train big ML models – use real‑world data instead of tiny toy datasets.

Because each release includes both the raw HTML and a parsed text layer, you can analyze:

| Layer | What you can study |
| --- | --- |
| Raw HTML structure | Link graphs, page layout, site categories |
| Clean text content | Sentiment, topic trends, new buzzwords |
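
In practice, the raw HTML ships as WARC files and the extracted text as companion WET files (link and metadata records live in WAT files). Here is a minimal sketch of reading both layers with the open‑source `warcio` package, assuming a locally downloaded segment ("segment.warc.gz" is a placeholder file name):

```python
# Minimal sketch, assuming `pip install warcio` and a locally downloaded
# Common Crawl segment; "segment.warc.gz" is a placeholder file name.
from warcio.archiveiterator import ArchiveIterator

def iter_pages(path):
    """Yield (url, payload) pairs from a Common Crawl archive file.

    WARC files hold 'response' records (the raw HTML layer); the companion
    WET files hold 'conversion' records (the cleaned-up text layer).
    """
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type in ("response", "conversion"):
                url = record.rec_headers.get_header("WARC-Target-URI")
                yield url, record.content_stream().read()

# Example: count the records in one locally downloaded segment.
if __name__ == "__main__":
    print(sum(1 for _ in iter_pages("segment.warc.gz")), "records")
```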

Perks for Researchers

  1. No crawler needed – skip the cost and hassle of scraping the web yourself.
  2. Open licence – anyone can share code, replicate results, and build on your work.
  3. Regular updates – monthly snapshots reveal sudden spikes (e.g., when a new tech goes viral).

All of this makes Common Crawl a go‑to resource for tasks like:

  • Named‑entity recognition
  • Topic classification
  • Question answering

By pooling efforts around one massive, open dataset, researchers push the limits of NLP faster than they could alone.


📍 Factuality & Hallucinations in LLMs

Speaker: Anna Rogers (IT University of Copenhagen)

“Large language models are fluent bullshit generators—they sound right even when they’re wrong.”
— A. Rogers

LLMs can drift from the truth, a problem known as hallucination. Rogers reviewed two popular fixes and where they fall short:

| Approach | How it works | Main weakness |
| --- | --- | --- |
| RAG (Retrieval‑Augmented Generation) | Looks up facts in a search index or database, then feeds the snippets to the model as it writes. | Bad retrieval = bad answer; citations can be incorrect or missing. |
| CoT (Chain‑of‑Thought prompting) | Prompts the model to show step‑by‑step reasoning before the final answer. | “Reasoning” may be invented; the method can be abused to jailbreak the model. |

Impact on the Web

  • Surge in AI‑generated spam and click‑bait
  • Harder to tell real news from synthetic text
  • New headaches for search engines and fact‑checkers

Takeaway: RAG and CoT help, but they don’t eliminate hallucinations. Better evaluation metrics and stronger guardrails are still needed.

RAG (Retrieval‑Augmented Generation) and CoT (Chain‑of‑Thought) both try to make LLM answers more trustworthy, but neither is a silver bullet.


🔍 RAG — Look it up, then write

How it works

  1. Retrieve – find facts in a search index or database.
  2. Generate – feed those facts to the model so it can weave them into its answer.
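
As a toy illustration of that two‑step loop, the sketch below fakes retrieval with naive word overlap and leaves the model call stubbed out; the corpus, the scoring, and the `llm` function are illustrative stand‑ins, not anything from the talk:

```python
# A toy retrieve-then-generate loop; the corpus, overlap scoring, and the
# stubbed `llm` call are illustrative stand-ins, not a production RAG stack.

def retrieve(query, corpus, k=2):
    """Rank documents by naive word overlap with the query
    (a stand-in for a real search index)."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, snippets):
    """Feed the retrieved snippets to the model alongside the question."""
    context = "\n".join(f"- {s}" for s in snippets)
    return ("Answer using ONLY the sources below; say 'unknown' if they "
            f"don't cover it.\nSources:\n{context}\n\n"
            f"Question: {query}\nAnswer:")

corpus = [
    "Common Crawl is a monthly snapshot of the public web.",
    "FineWeb 2 is a multilingual pre-training corpus.",
]
query = "What is Common Crawl?"
prompt = build_prompt(query, retrieve(query, corpus))
# answer = llm(prompt)  # hypothetical model call; a bad retrieval above
#                       # would poison this step no matter how good the model
print(prompt)
```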

Where it breaks

  • If the search misses the right passage, the answer is still wrong.
  • Measuring “quality” is tricky: you need to score retrieval hit‑rate, answer truthfulness, and source fidelity all at once (a toy hit‑rate metric is sketched after this list).
  • Evaluations often rely on yet another LLM, which can add bias.
  • Even with good sources, the model may paraphrase or misquote them.
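
For the retrieval piece specifically, a common starting point is a hit‑rate metric such as recall@k. Here is a toy version (the document IDs are made up) that deliberately ignores the harder answer‑truthfulness side:

```python
# A toy recall@k for the retrieval stage only; real RAG evaluation also needs
# answer-level truthfulness and citation checks, which this deliberately omits.
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose gold passage shows up in the top-k results."""
    hits = sum(1 for docs, gold in zip(retrieved, relevant) if gold in docs[:k])
    return hits / len(relevant)

retrieved = [["doc1", "doc7"], ["doc2", "doc9"]]  # top results per query (made up)
relevant = ["doc7", "doc4"]                       # gold passage per query (made up)
print(recall_at_k(retrieved, relevant, k=2))      # 0.5: the second query missed
```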

📝 CoT — Show your thinking

How it works

  1. Give the model examples that spell out step‑by‑step reasoning.
  2. Ask it to copy that style: “First, think. Then, answer.”
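
A minimal sketch of that few‑shot recipe, with a made‑up exemplar and a stubbed model call:

```python
# A minimal sketch of few-shot chain-of-thought prompting; the exemplar and
# the stubbed `llm` call are made up for illustration.

COT_EXEMPLAR = """\
Q: A pack holds 3 pens and costs 6 euros. What does one pen cost?
Reasoning: 6 euros / 3 pens = 2 euros per pen.
Answer: 2 euros."""

def cot_prompt(question):
    """Prepend a worked example so the model imitates 'first think, then answer'."""
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nReasoning:"

prompt = cot_prompt("A crate holds 12 bottles and costs 24 euros. "
                    "What does one bottle cost?")
# answer = llm(prompt)  # hypothetical call; note the printed 'Reasoning'
#                       # may not reflect the model's actual computation
print(prompt)
```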

Where it breaks

  • Works well on some tasks but worse on others, especially ones involving bias.
  • The “reasoning” it prints may be made up, not its true internal logic.
  • Attackers can use CoT to slip past safety rules (“jailbreaking”).

🗓 Day 2 — Tuesday, 11 Feb 2025

📍 FineWeb 2 — Multilingual Web Data at Scale

Speaker: Guilherme Penedo (Hugging Face)

The opening talk introduced FineWeb 2, a brand‑new, multilingual web corpus for pre‑training large language models.
Penedo explained how the team is porting and tuning the English‑centric cleaning pipeline—deduplication, language ID, toxicity filters, and more—so it works reliably across dozens of other languages.
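
The talk stayed at the pipeline level, but two of the named stages are easy to caricature in a few lines. Below is a deliberately simplified sketch of exact deduplication plus a language filter; `detect_language` is a placeholder stub, and the real pipeline relies on trained classifiers, fuzzy deduplication, and much more:

```python
# A deliberately simplified sketch of two pipeline stages: exact deduplication
# by hashing, plus a language filter. `detect_language` is a placeholder stub;
# real pipelines use a trained classifier (e.g., a fastText language-ID model)
# and fuzzy, not just exact, deduplication.
import hashlib

def exact_dedup(docs):
    """Drop byte-identical documents by hashing their text."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

def detect_language(text):
    """Placeholder: always claims English; swap in a real classifier here."""
    return "en"

def keep_language(docs, lang="en"):
    """Keep only documents whose detected language matches `lang`."""
    return (doc for doc in docs if detect_language(doc["text"]) == lang)

docs = [{"text": "hello web"}, {"text": "hello web"}, {"text": "bonjour le web"}]
print(list(keep_language(exact_dedup(docs))))  # duplicate dropped; stub keeps both languages
```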

📍 Power Laws & Generalization

Speakers: Jenia Jitsev & Marianna Nezhurina

This session explored how power‑law scaling shows up in deep‑learning curves—and why turning those neat mathematical fits into real‑world “generalization scores” is harder than it looks. The speakers highlighted pitfalls such as noisy data, shifting task definitions, and compute limits that break the power‑law trend once models get big enough.
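
For intuition, such a learning‑curve fit is usually written as L(N) = a·N^(−α) + c; the sketch below fits that form to synthetic loss‑versus‑scale points with SciPy (all constants are made up for illustration):

```python
# A minimal sketch of fitting the saturating power law L(N) = a * N**(-alpha) + c
# to a synthetic learning curve; all constants here are made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    return a * n ** (-alpha) + c

# Synthetic "loss vs scale" points: a power-law decay plus measurement noise.
n = np.logspace(6, 10, 20)
rng = np.random.default_rng(0)
loss = power_law(n, a=50.0, alpha=0.3, c=1.7) + rng.normal(0.0, 0.02, n.size)

params, _ = curve_fit(power_law, n, loss, p0=(10.0, 0.3, 1.0))
print("fitted a, alpha, c:", params)
# Caveat from the talk: real curves are noisier, and the neat fit can break
# down at large scale, so extrapolated "generalization scores" need care.
```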

📍 Generalization