Here are my notes from some talks I attended at ACL 2025 in Vienna!

Eye-tracking
  • Why gaze? Eye movements reflect online processing (not just end products), letting us probe difficulty, attention, and strategies during reading. That’s gold for modeling and evaluation. (PubMed)
  • Data is maturing: Multilingual, multi‑lab efforts (e.g., MECO, MultiplEYE) + tooling (e.g., pymovements) have made high‑quality datasets and pipelines more accessible. (meco-read.com, multipleye.eu, arXiv)
  • Models & evals: Gaze can improve certain NLP tasks and also evaluate systems with behavioral signals (e.g., readability, MT, summarization). But gains are often modest unless modeling is careful or data is task‑aligned.
  • Open debates: How well LLM surprisal predicts human reading times varies with model size, layers, and populations; adding recency biases can help fit human behavior. (ACL Anthology, dfki.de)

Eye‑tracking 101 👁️

Some basic concepts:

Fixations & saccades. Reading is a hop‑and‑pause routine: brief saccades (tens of ms) between ~200–250 ms fixations; perception occurs mostly during fixations, not saccades. The classic eye‑mind assumption: minimal lag between what’s fixated and what’s processed. (andrewd.ces.clemson.edu, PubMed)

Perceptual span. High‑acuity foveal vision is cone‑rich, while parafoveal vision still supports useful preview. Span size and asymmetry depend on script and reading direction. (NCBI, PubMed Central, Frontiers)

Reading measures you’ll see in papers: skip rate, first‑fixation duration, gaze duration, regression rate, go‑past duration, total fixation time. They are computed by mapping fixations to Areas‑of‑Interest (AoIs) at the token/region level.
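
To make these concrete, here is a toy sketch (mine, not from the tutorial) of computing a few word-level measures from a fixation table with pandas; the column layout and the single-reader setup are assumptions for illustration.

```python
import pandas as pd

# Toy fixation sequence for one reader and one 5-word sentence: each row is one
# fixation, already mapped to a word-level AoI (word_idx), in chronological order.
fix = pd.DataFrame({
    "word_idx": [0, 1, 3, 2, 3, 4],             # note the regression from word 3 back to word 2
    "duration": [210, 250, 180, 220, 190, 240], # fixation durations in ms
})

rows = []
for w in range(5):
    idx = fix.index[fix["word_idx"] == w].tolist()
    if not idx:                                  # never fixated -> skipped
        rows.append({"word_idx": w, "skipped": True})
        continue
    first = idx[0]
    # First pass: consecutive fixations on w starting from the first visit.
    first_pass = [first]
    for i in range(first + 1, len(fix)):
        if fix.loc[i, "word_idx"] == w:
            first_pass.append(i)
        else:
            break
    rows.append({
        "word_idx": w,
        "skipped": False,
        "first_fixation_duration": fix.loc[first, "duration"],
        "gaze_duration": fix.loc[first_pass, "duration"].sum(),
        "total_fixation_time": fix.loc[idx, "duration"].sum(),
        # Regression out: the fixation right after first pass lands on an earlier word.
        "regression_out": (first_pass[-1] + 1 < len(fix)
                           and fix.loc[first_pass[-1] + 1, "word_idx"] < w),
    })

print(pd.DataFrame(rows))
```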

Hardware & sampling. For reading studies, stationary trackers with head stabilization and ≥200 Hz sampling are typical to get character‑level precision and reliable on/offsets.

Pipelines & data structure. Raw samples → fixation detection → map to AoIs → compute measures per reader×word. Remember: data is not i.i.d. (nested readers/texts), which affects stats and ML splits.
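
Because of that nesting, a plain random split leaks reader and text information into the test set. A minimal sketch of a reader-grouped split with scikit-learn (my own illustration; GroupKFold keeps each reader entirely in train or test):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy design: 4 readers x 5 words each -> 20 rows of (features, target, reader id).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # e.g., word length, log frequency, surprisal
y = rng.normal(size=20)               # e.g., gaze duration
readers = np.repeat([0, 1, 2, 3], 5)  # grouping variable

# No reader ever appears in both the train and the test fold.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=readers):
    assert set(readers[train_idx]).isdisjoint(readers[test_idx])
    print("test readers:", sorted(set(readers[test_idx])))
```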

Low‑tech alternatives. When eye‑tracking isn’t feasible: Self‑Paced Reading (SPR), Maze, mouse‑tracking can capture useful online signals—with different trade‑offs. (PubMed Central, SpringerLink, SpringerLink)


Datasets & tools 📚

  • MECO (Multilingual Eye‑movement Corpus): large, coordinated, cross‑linguistic reading data; Wave 2 keeps expanding. (meco-read.com, PubMed Central)
  • MultiplEYE (COST Action): enabling multilingual eye‑tracking‑while‑reading at scale; infrastructure, protocols, and community. (multipleye.eu, Radboud Universiteit)
  • OneStop Eye Movements: 360 native readers, 2.6M tokens; great for comprehension‑linked analyses. (lacclab.github.io)
  • Provo, ZuCo, Dundee, CELER… Useful complements for different tasks and populations. (PubMed)
  • pymovements: open‑source package to download datasets and preprocess gaze (event detection, angles/velocities, etc.). (arXiv, pymovements.readthedocs.io)
![[Pasted image 20250729095400.png]]
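
For orientation, the pymovements quick-start looks roughly like this. I am reconstructing it from memory of the docs, so treat the exact method names as assumptions and check the documentation before copying.

```python
import pymovements as pm

# Download and load one of the registered public datasets (here the small ToyDataset).
dataset = pm.Dataset("ToyDataset", path="data/ToyDataset")
dataset.download()
dataset.load()

# Preprocess raw gaze samples: pixel coordinates -> visual angles -> velocities.
dataset.pix2deg()
dataset.pos2vel()

# Event detection (e.g., a velocity-threshold algorithm for fixations/saccades);
# depending on the version this may be detect_events() rather than detect().
dataset.detect("ivt")
```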

Using gaze in NLP models 🔧

Word‑level alignment & embeddings. Align gaze measures to tokens and use them as positional/attention signals or embeddings; recent work explores gaze‑motivated positional encodings and human attention signals.
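
A small sketch of that alignment step, using a Hugging Face fast tokenizer's word_ids() to spread word-level gaze features over subword tokens (the words and feature values are made up):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

words = ["The", "neuroscientist", "yawned"]
total_fixation_time = [180.0, 520.0, 260.0]   # one gaze measure per word (ms), toy values

enc = tok(words, is_split_into_words=True)
word_ids = enc.word_ids()                      # maps each subword position to its source word (or None)

# Here the word-level value is copied to every subword; assigning it only to the
# first subword (and zeroing the rest) is the other common convention.
token_feats = [0.0 if w is None else total_fixation_time[w] for w in word_ids]

for t, f in zip(tok.convert_ids_to_tokens(enc["input_ids"]), token_feats):
    print(f"{t:>15s}  {f:6.1f}")
```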

Synthetic scanpaths help scale. Since human gaze is scarce, Eyettention‑style scanpath generators and follow‑ups inject synthetic gaze to fine‑tune LMs, improving GLUE tasks (especially low‑resource). (arXiv, ACL Anthology, ACL Anthology)

Task‑specific multitask learning. Training to predict reading measures jointly with downstream tasks (e.g., QA with question preview vs. ordinary reading) can induce more human‑like attention. (ACL Anthology)

What to expect. Reported gains are real but often modest without careful modeling, good alignment, or synthetic data of sufficient quality. That came up repeatedly in the tutorial.
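
A rough PyTorch sketch of the multitask idea: a shared encoder with a sentence-level task head plus an auxiliary per-token head that regresses reading measures. The architecture and the 0.5 loss weight are my own placeholder choices, not anything from the tutorial.

```python
import torch
import torch.nn as nn

class GazeMultitaskModel(nn.Module):
    """Shared encoder, one downstream-task head, one auxiliary gaze-regression head."""
    def __init__(self, vocab_size=30522, hidden=128, n_classes=2, n_gaze_measures=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.task_head = nn.Linear(hidden, n_classes)        # e.g., sentence classification
        self.gaze_head = nn.Linear(hidden, n_gaze_measures)  # e.g., FFD, gaze duration, TFT per token

    def forward(self, input_ids):
        h = self.encoder(self.embed(input_ids))              # (batch, seq, hidden)
        return self.task_head(h[:, 0]), self.gaze_head(h)    # sentence logits, per-token gaze predictions

model = GazeMultitaskModel()
input_ids = torch.randint(0, 30522, (8, 16))                 # toy batch
labels = torch.randint(0, 2, (8,))
gaze = torch.rand(8, 16, 3)                                  # normalized reading measures per token

logits, gaze_pred = model(input_ids)
loss = nn.functional.cross_entropy(logits, labels) \
     + 0.5 * nn.functional.mse_loss(gaze_pred, gaze)         # auxiliary loss weight is arbitrary
loss.backward()
print(float(loss))
```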

Examples & pointers: NER with gaze, compression/paraphrase, readability, parsing—plus general “gaze‑augmented PLMs.” (ACL Anthology, ACL Anthology, arXiv)


Using gaze to evaluate NLP 📏

Behavioral evaluation uses online human signals—complementing labels or preferences. We saw applications to MT (reading effort), summarization (human vs. model saliency), and readability (reading‑ease metrics). (SpringerLink, ACL Anthology)

Case study — Automatic Readability Assessment (ARA). A new eye‑tracking‑based benchmark correlates model scores with reading speed, skip rate, regressions, and total fixation time, revealing weak spots of classic readability formulas. Promising direction for cognitive evaluation. (hundred.org)
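
The evaluation recipe itself is mostly a rank correlation between model scores and behavioral measures. A toy version with scipy (all numbers invented):

```python
import numpy as np
from scipy.stats import spearmanr

# Per-text readability scores from some ARA model (higher = easier), toy values.
model_scores = np.array([0.9, 0.4, 0.7, 0.2, 0.6])

# Aggregated eye-tracking measures for the same texts (averaged over readers), toy values.
reading_speed = np.array([260, 180, 230, 150, 210])          # words per minute
regression_rate = np.array([0.12, 0.30, 0.18, 0.35, 0.20])   # proportion of regressive saccades

# Easier texts should be read faster (positive rho) and with fewer regressions (negative rho).
for name, measure in [("reading speed", reading_speed), ("regression rate", regression_rate)]:
    rho, p = spearmanr(model_scores, measure)
    print(f"{name:>16s}: rho={rho:+.2f} (p={p:.3f})")
```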


Psycholinguistics & NLP

  • Surprisal & RTs. Foundational results show a strong relation between LM surprisal and reading times; this holds across languages and for many modern LMs—with nuances. (lexplore.com)

  • Classics to know: Surprisal theory, Dependency Locality Theory, Uniform Information Density, Cue‑based retrieval/ACT‑R—usually operationalized via parsers/LMs. (eyetechds.com)
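
Since surprisal comes up constantly, here is a minimal sketch of computing per-token surprisal with a causal LM via Hugging Face transformers (GPT-2 is just an example choice; word-level surprisal additionally requires summing over subword tokens):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The old man the boats."
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                # (1, seq, vocab)

# Surprisal of token t = -log2 P(token_t | tokens_<t); the first token has no left context.
log_probs = torch.log_softmax(logits, dim=-1)
ids = enc["input_ids"][0]
surprisal = -log_probs[0, :-1, :].gather(1, ids[1:, None]).squeeze(1) / torch.log(torch.tensor(2.0))

for tid, s in zip(ids[1:].tolist(), surprisal.tolist()):
    print(f"{tok.decode([tid])!r:>12}  {s:5.2f} bits")
```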


Are LLMs aligned with human reading? 🤖🧍‍♀️

It’s complicated (and active in 2023–2025):

  • Bigger isn’t always better: Larger Transformers can fit human RTs worse than smaller ones (the surprisal‑RT link weakens with size). (ACL Anthology)
  • …but layer matters: Intermediate layers may reverse that trend. (dfki.de)
  • Individual differences: Surprisal better predicts first‑pass RTs for lower verbal IQ readers; entropy better fits those with higher working memory. (PubMed)
  • Text & decoding matter: PP varies across generation strategies and reading measures; evaluating produced texts against human reading is informative. (ACL Anthology)
  • Add cognitive bias: Injecting recency biases (e.g., ALiBi) improves LM fit to reading times. (ACL Anthology)
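
For context, "fit to reading times" in this literature usually means the gain from adding surprisal to a baseline regression with predictors like word length and frequency (typically held-out delta log-likelihood with mixed-effects models). A deliberately simplified stand-in with scikit-learn on simulated data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
length = rng.integers(2, 12, size=n)        # word length in characters
log_freq = rng.normal(4.0, 1.0, size=n)     # log word frequency
surprisal = rng.gamma(2.0, 2.0, size=n)     # bits, from some LM

# Simulated reading times that depend on all three predictors plus noise.
rt = 180 + 6 * length - 10 * log_freq + 12 * surprisal + rng.normal(0, 30, size=n)

baseline = np.column_stack([length, log_freq])
full = np.column_stack([length, log_freq, surprisal])

r2_base = LinearRegression().fit(baseline, rt).score(baseline, rt)
r2_full = LinearRegression().fit(full, rt).score(full, rt)
print(f"delta R^2 from adding surprisal: {r2_full - r2_base:.3f}")
```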

Modeling eye movements themselves 🛠️

Cognitive models (fewer, interpretable parameters): E‑Z Reader, SWIFT, SEAM, OB1‑Reader. ML/NLP models (data‑hungry, high‑capacity): Eyettention, ScanDL 2.0, SP‑EyeGAN. The recent trend is to combine strengths (e.g., self‑supervised frameworks grounded in cognitive constraints). (PubMed, ScienceDirect, arXiv, ACM Digital Library, Zora, ACM Digital Library)


Human‑centered applications 🌍

  • Language assessment (L2): Eye movements carry proficiency signals (e.g., EyeScore‑style similarity to L1 prototypes).
  • Reading impairment screening/monitoring: Commercial tools (e.g., Lexplore) and research platforms point to scalable screening and longitudinal tracking. (eyetechusa.com)
  • Reading comprehension modeling: Predicting comprehension from gaze during QA is an emerging task on OneStop. (arXiv)

How to get started

  1. Pick a dataset that matches your question (MECO/OneStop/Provo/etc.). (meco-read.com, lacclab.github.io, PubMed)
  2. Mind the structure (reader/text effects) and choose proper splits/stats.
  3. Use a pipeline (e.g., pymovements) for reproducible preprocessing, AoI mapping, and event detection. (arXiv)
  4. Decide your integration: (a) features/embeddings, (b) auxiliary losses (multitask), or (c) synthetic gaze + LM fine‑tuning. (ACL Anthology, ACL Anthology)
  5. Evaluate cognitively: add behavioral metrics (e.g., ARA with eye‑tracking) alongside standard accuracy. (hundred.org)



Synthetic data for NLP

At a glance

  • We can’t label everything: synthetic data fills gaps when scraping/manual labels/privacy limits hit.
  • Good synthetic data is task-tailored, sized right, and clean — but beware distribution shift.
  • Evaluate two ways: extrinsic (downstream task) vs intrinsic (what the data itself looks like).
  • Diversity sometimes beats raw correctness; noisy but varied sets can still improve models.
  • Best results come from Human ↔︎ AI collaboration + strong filtering before use.

Where do we get data?

  • Scraping the web (scale, but licensing/noise).
  • Manual labeling (accurate, expensive).
  • Product/system data (useful but privacy-sensitive).
  • Creative curation (high quality, limited volume).
    Synthetic data tries to extend/augment all of the above.

What makes “good” synthetic data?

  • Flexible & task-specific: format, difficulty, and style match your target task.
  • Appropriate size: enough to move the needle, not so much it drowns real data.
  • Clean: minimal contradictions/formatting errors.
  • Aligned distributions: cover the same kinds of inputs/labels you expect in production.

Why the warning? Because the real joint distribution often differs from the synthetic one:
$P_{\text{true}}(x,y) \neq P_{\text{synth}}(x,y)$.
Mismatch shows up as off-manifold inputs, wrong labels, or flawed reasoning traces.
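
One cheap sanity check (my own sketch, not from the talk): train a classifier to tell real from synthetic examples; if it beats chance comfortably, the two distributions are clearly not aligned.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

real = ["the delivery arrived two days late", "can I change my billing address",
        "the app crashes when I upload a photo", "where is my refund"]
synthetic = ["I would like to inquire about the status of my recent order",
             "Kindly assist me in updating my billing information",
             "The application terminates unexpectedly during photo upload",
             "I am writing to request information regarding my refund"]

X = real + synthetic
y = np.array([0] * len(real) + [1] * len(synthetic))   # 0 = real, 1 = synthetic

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
auc = cross_val_score(clf, X, y, cv=2, scoring="roc_auc").mean()
print(f"real-vs-synthetic AUC: {auc:.2f}  (close to 0.5 = hard to tell apart)")
```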


How do we evaluate synthetic data?

Extrinsic: train/evaluate models on tasks with/without the synthetic set.

  • ✅ Directly answers “does it help?”.
  • ❌ Costly, indirect diagnostic signal.

Intrinsic: inspect the data/generation process itself.

  • Correctness: e.g., spot-checking/self-instruct style manual audits.
  • Diversity/coverage: does it span plausible inputs? (e.g., DataTune’s bigram diversity as a quick signal; see the sketch after this list).
  • Privacy, fairness, distributional similarity: toolkits like SynthTextEval help stress-test.
  • Model choice as a proxy: pick the generator by how well its synthetic data matches human-written (e.g., AgoraBench).
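
For the diversity/coverage point, a distinct-bigram ratio takes a few lines of Python. This is my quick stand-in for the kind of signal described for DataTune, not its actual implementation:

```python
def distinct_bigram_ratio(texts):
    """Unique bigrams / total bigrams across a set of generated texts (1.0 = no repetition)."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        for bigram in zip(tokens, tokens[1:]):
            total += 1
            unique.add(bigram)
    return len(unique) / max(total, 1)

generated = [
    "translate the sentence into French",
    "translate the sentence into German",
    "summarize the article in one paragraph",
]
print(f"distinct-2: {distinct_bigram_ratio(generated):.2f}")
```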

How is synthetic data created?

  • Knowledge distillation (Hinton ’15 → sentence-level): train a student to mimic a teacher; cheap SFT, keeps style on-target.
  • In-context generation: prompt LMs to produce new labeled examples for arbitrary tasks (Schick & Schütze ’21).
  • Self-training/bootstrapping: fine-tune on the model’s own outputs (Wang et al. ’22).
  • Observed effects: generated text is often more diverse (e.g., by BERTScore) but only about half of the examples are fully correct—and models can still improve thanks to the added coverage.

Common patterns

  • Sampling-based generation: temperature/top-p, curriculum over difficulty.
  • Instruction back-translation: given an answer $y$, generate an instruction $x$ that $y$ would satisfy (sketched after this list).
  • Transform existing data: retrieve/convert to the target format (QA from StackExchange; ground to KGs; rephrase documents for pretraining, Maini ’24).
  • Human–AI collaboration: LLM drafts, humans edit/verify (creativity ↔︎ correctness).
  • Symbolic/programmable data: pretrain on formal languages/grammars to improve generalization (Hu ’24).
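
A sketch of the back-translation pattern from the list above. The call_llm function is a placeholder for whatever generation API you actually use, so everything about it is an assumption rather than a real client:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, vLLM, ...); returns a canned string here."""
    return "Explain, step by step, how photosynthesis converts light into chemical energy."

def backtranslate_instruction(answer: str) -> dict:
    """Given an answer y, ask the model for an instruction x that y would satisfy."""
    prompt = (
        "Below is a response. Write the instruction that this response would be a good answer to.\n\n"
        f"Response:\n{answer}\n\nInstruction:"
    )
    return {"instruction": call_llm(prompt).strip(), "output": answer}

answer = (
    "Photosynthesis captures light with chlorophyll, splits water, and uses the released "
    "electrons to build ATP and NADPH, which then fix CO2 into sugars in the Calvin cycle."
)
print(backtranslate_instruction(answer))
```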

Filter before you use it

  • Diversity filters: keep sets that are far apart (ROUGE-L, embedding cosine, semantic tags); see the ROUGE-L sketch after this list.
  • Gradient diversity: prefer examples that produce different loss gradients → more robust models.
  • Quality filters: pick highest-reward responses (e.g., RAFT-style ranking).
  • Correctness filters: keep chains of thought that reach the right answer; drop inconsistent traces.
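
A minimal version of the ROUGE-L-style diversity filter (Self-Instruct does something similar): drop a candidate whose longest-common-subsequence overlap with anything already kept is too high. The 0.7 threshold is arbitrary.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(s1, s2):
    a, b = s1.lower().split(), s2.lower().split()
    lcs = lcs_len(a, b)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

def diversity_filter(candidates, threshold=0.7):
    kept = []
    for c in candidates:
        if all(rouge_l_f(c, k) < threshold for k in kept):
            kept.append(c)
    return kept

pool = [
    "Write a short poem about the ocean",
    "Write a short poem about the sea",     # near-duplicate, gets dropped
    "Convert this JSON object into YAML",
]
print(diversity_filter(pool))
```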

Where synthetic data fits in the pipeline

Pretraining

  • When real data plateaus: rephrase corpora, verbalize knowledge bases, or mix in formal languages for structure.
  • Domain adaptation to reduce hallucinations on niche topics.

Supervised fine-tuning (SFT)

  • Distillation is cheap but can be “style-locked”.
  • Self-Instruct / Evol-Instruct to grow instruction variance.
  • MAmmoTH-style transform/extract tasks (e.g., convert docs to QA).

RL & feedback

  • RL needs only minimal supervision, can learn from negative examples, and adapts the model to its own token distribution.
  • Synthetic rewards (e.g., Prometheus evaluators; checklist-style rubrics).
  • Choosing a “judge”:
    1. agreement with human prefs (RewardBench),
    2. agreement with benchmarks (re-ranking),
    3. effectiveness inside RL loops.
  • “Teacher” qualities: good accuracy and low variance; even non-RL textual feedback can help—mainly when the base model is already strong.

Reasoning

  • Scale up inference to get longer CoT/PoT traces; pipelines like OpenThoughts curate reasoning data.
  • Even noisy reasoning data can help; more (curated) data → better reasoning.

Code

  • CodeAlpaca, WaveCoder, WizardCoder, Magicoder, etc.
  • Train with execution feedback (tests, runtimes); see the sketch after this list.
  • Useful data types: single-turn, simulated multi-turn, “fix-the-bug”, and near-duplicate (LeetCode-style) variants.
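
A bare-bones version of the execution-feedback idea from the list above: run each candidate solution together with its tests in a subprocess and keep only those that pass. Timeout and file handling are arbitrary choices for the sketch.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

def passes_tests(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Write solution + asserts to a temp file, run it, keep if exit code 0 within the timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

candidate = textwrap.dedent("""
    def add(a, b):
        return a + b
""")
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

print("keep" if passes_tests(candidate, tests) else "drop")
```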

Tools & agents

  • Gorilla (API-calling), ToolLLM, ToRa (tool-integrated math), AgentTuning, CodeActInstruct, AgentE, “GPT-4-tools” style setups.

Multimodal / Multilingual

  • LLaVA-style visual instruction tuning; multilingual instruction pipelines.

Limitations & open questions

  • Synthetic sets still trail real data in size, diversity, and distribution; production usage (e.g., Anthropic’s Clio) shows gaps.
  • In controlled tests, synthetic often underperforms; artifacts creep in.
  • Synthetic eval data can overestimate model performance.
  • Model collapse risk under recursive self-training (Shumailov ’23).
  • We should measure instance-level quality, not just dataset averages.
  • Governance: “distillation-friendly” models, usage restrictions, provenance tracking.

Practical checklist (what I’d do)

  1. Define the target: task format, difficulty mix, and metrics.
  2. Generate broad, then filter hard: diversity → quality → correctness.
  3. Mix with real data: keep an anchor set for sanity checks.
  4. Evaluate both ways: quick intrinsic dashboards + periodic extrinsic runs.
  5. Close the loop: human spot-edits, error mining, and focused regeneration.
  6. Watch drift: compare $P_{\text{true}}(x,y)$ vs $P_{\text{synth}}(x,y)$ over time.