Here are my notes from some talks I attended at ACL 2025 in Vienna!
Eye-tracking
- Why gaze? Eye movements reflect online processing (not just end products), letting us probe difficulty, attention, and strategies during reading. That's gold for modeling and evaluation. (PubMed)
- Data is maturing: Multilingual, multi-lab efforts (e.g., MECO, MultiplEYE) + tooling (e.g., pymovements) have made high-quality datasets and pipelines more accessible. (meco-read.com, multipleye.eu, arXiv)
- Models & evals: Gaze can improve certain NLP tasks and also evaluate systems with behavioral signals (e.g., readability, MT, summarization). But gains are often modest unless the modeling is careful or the data is task-aligned.
- Open debates: How well LLM surprisal predicts human reading times varies with model size, layers, and populations; adding recency biases can help fit human behavior. (ACL Anthology, dfki.de, ACL Anthology, ACL Anthology)
Eye-tracking 101
Some basic concepts:
Fixations & saccades. Reading is a hop-and-pause routine: brief saccades (tens of ms) between fixations of roughly 200–250 ms; perception occurs mostly during fixations, not saccades. The classic eye-mind assumption: minimal lag between what's fixated and what's processed. (andrewd.ces.clemson.edu, PubMed)
Perceptual span. High-acuity foveal vision is cone-rich, while parafoveal vision still supports useful preview. Span size and asymmetry depend on script and reading direction. (NCBI, PubMed Central, Frontiers)
Reading measures you'll see in papers: skip rate, first-fixation duration, gaze duration, regression rate, go-past duration, total fixation time. These map fixations to Areas of Interest (AoIs) at the token/region level.
Hardware & sampling. For reading studies, stationary trackers with head stabilization and ≥200 Hz sampling are typical, to get character-level precision and reliable event on/offsets.
Pipelines & data structure. Raw samples → fixation detection → map to AoIs → compute measures per reader × word (a minimal sketch follows this list). Remember: the data is not i.i.d. (readers and texts are nested), which affects stats and ML splits.
Low-tech alternatives. When eye-tracking isn't feasible: Self-Paced Reading (SPR), Maze, and mouse-tracking can capture useful online signals, with different trade-offs. (PubMed Central, SpringerLink, SpringerLink)
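To make the pipeline above concrete, here is a minimal sketch of the raw samples → fixations → AoIs → measures flow, using a naive dispersion-threshold (I-DT) detector. Thresholds, units, and the AoI format are illustrative assumptions; for real studies you would use a validated detector (e.g., via pymovements).

```python
import numpy as np

def detect_fixations_idt(x, y, t, max_dispersion=1.0, min_duration=0.08):
    """Naive dispersion-threshold (I-DT) fixation detector.

    x, y: 1-D numpy arrays of gaze coordinates (e.g., degrees of visual angle);
    t: timestamps in seconds. Returns (start, end, centroid_x, centroid_y) tuples.
    Thresholds are illustrative; tune them for your tracker, units, and script.
    """
    fixations, start, n = [], 0, len(t)
    while start < n:
        end = start
        # Grow the window while its spatial dispersion stays under the threshold.
        while end + 1 < n:
            wx, wy = x[start:end + 2], y[start:end + 2]
            if (wx.max() - wx.min()) + (wy.max() - wy.min()) > max_dispersion:
                break
            end += 1
        if t[end] - t[start] >= min_duration:
            fixations.append((t[start], t[end],
                              float(x[start:end + 1].mean()),
                              float(y[start:end + 1].mean())))
            start = end + 1
        else:
            start += 1
    return fixations

def map_to_aois(fixations, aois):
    """Accumulate total fixation time per Area of Interest (word-level bounding boxes)."""
    measures = {name: 0.0 for name, _ in aois}
    for t0, t1, cx, cy in fixations:
        for name, (x0, y0, x1, y1) in aois:
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                measures[name] += t1 - t0
                break
    return measures

# Example with fake 500 Hz samples around two locations (units: degrees, seconds).
x = np.concatenate([np.random.normal(1.0, 0.1, 120), np.random.normal(4.0, 0.1, 120)])
y = np.random.normal(0.0, 0.1, 240)
t = np.arange(240) / 500.0
fix = detect_fixations_idt(x, y, t)
print(map_to_aois(fix, [("word1", (0.5, -1, 1.5, 1)), ("word2", (3.5, -1, 4.5, 1))]))
```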
Datasets & tools
- MECO (Multilingual Eye-movement Corpus): large, coordinated, cross-linguistic reading data; Wave 2 keeps expanding. (meco-read.com, PubMed Central)
- MultiplEYE (COST Action): enabling multilingual eye-tracking-while-reading at scale; infrastructure, protocols, and community. (multipleye.eu, Radboud Universiteit)
- OneStop Eye Movements: 360 native readers, 2.6M tokens; great for comprehension-linked analyses. (lacclab.github.io)
- Provo, ZuCo, Dundee, CELER, etc.: useful complements for different tasks and populations. (PubMed)
- pymovements: open-source package to download datasets and preprocess gaze (event detection, angles/velocities, etc.). (arXiv, pymovements.readthedocs.io)
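For the pymovements route, a rough sketch following its quickstart as I remember it from the docs; the dataset name and method names may differ across versions, so double-check against the version you install.

```python
import pymovements as pm

# Download and load one of the public datasets registered in pymovements
# ('ToyDataset' is the docs' example name; swap in the dataset you actually need).
dataset = pm.Dataset('ToyDataset', path='data/ToyDataset')
dataset.download()
dataset.load()

# Preprocess: pixel coordinates -> degrees of visual angle -> velocities.
dataset.pix2deg()
dataset.pos2vel()

# Event detection (velocity-based here); available detector names depend on your version.
dataset.detect('ivt')
```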
Using gaze in NLP models
Word-level alignment & embeddings. Align gaze measures to tokens and use them as positional/attention signals or embeddings; recent work explores gaze-motivated positional encodings and human attention signals.
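A minimal sketch of that alignment step with Hugging Face tokenizers: copy each word-level measure onto the word's subword tokens via word_ids(). The model name and the copy-to-all-subwords strategy are just one reasonable choice, not the tutorial's recipe.

```python
from transformers import AutoTokenizer

def align_gaze_to_subwords(words, gaze_per_word, tokenizer):
    """Copy a word-level gaze feature (e.g., total fixation time) onto each subword token."""
    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    features = []
    for word_id in encoding.word_ids(batch_index=0):
        # Special tokens ([CLS], [SEP], ...) have no source word; give them 0.0.
        features.append(0.0 if word_id is None else gaze_per_word[word_id])
    return encoding, features

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
words = ["The", "neuroscientist", "sighed"]
gaze = [180.0, 420.0, 260.0]  # e.g., total fixation time in ms per word
encoding, token_gaze = align_gaze_to_subwords(words, gaze, tokenizer)
# token_gaze can now feed an extra embedding layer or scale attention weights.
```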
Synthetic scanpaths help scale. Since human gaze is scarce, Eyettention-style scanpath generators and follow-ups inject synthetic gaze to fine-tune LMs, improving GLUE tasks (especially in low-resource settings). (arXiv, ACL Anthology, ACL Anthology)
Task-specific multitask learning. Training to predict reading measures jointly with downstream tasks (e.g., QA with question preview vs. ordinary reading) can induce more human-like attention. (ACL Anthology)
What to expect. Reported gains are real but often modest without careful modeling, good alignment, or synthetic data of sufficient quality; that point came up repeatedly in the tutorial.
Examples & pointers: NER with gaze, compression/paraphrase, readability, parsing, plus general "gaze-augmented PLMs." (ACL Anthology, ACL Anthology, arXiv)
Using gaze to evaluate NLP
Behavioral evaluation uses online human signals, complementing labels or preferences. We saw applications to MT (reading effort), summarization (human vs. model saliency), and readability (reading-ease metrics). (SpringerLink, ACL Anthology)
Case study: Automatic Readability Assessment (ARA). A new eye-tracking-based benchmark correlates model scores with reading speed, skip rate, regressions, and total fixation time, revealing weak spots of classic readability formulas. A promising direction for cognitive evaluation. (hundred.org)
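At its core this kind of behavioral evaluation is a correlation between model scores and aggregated reading measures. A tiny sketch (numbers are made up; this is not the benchmark's exact protocol):

```python
from scipy.stats import spearmanr

# Per-text difficulty scores from some ARA model, and reading measures for the
# same texts averaged over participants (all values below are made up).
model_scores = [3.1, 4.5, 2.2, 5.0, 3.8]         # higher = predicted harder
total_fixation_time = [1.9, 2.8, 1.5, 3.2, 2.4]  # seconds per text
skip_rate = [0.31, 0.22, 0.35, 0.18, 0.27]       # proportion of skipped words

rho_tft, p_tft = spearmanr(model_scores, total_fixation_time)
rho_skip, p_skip = spearmanr(model_scores, skip_rate)
print(f"rho(score, total fixation time) = {rho_tft:.2f} (p = {p_tft:.3f})")
print(f"rho(score, skip rate)           = {rho_skip:.2f} (p = {p_skip:.3f})")
```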
Psycholinguistics & NLP
- Surprisal & RTs. Foundational results show a strong relation between LM surprisal and reading times; this holds across languages and for many modern LMs, with nuances (see the sketch after this list). (lexplore.com, lexplore.com)
- Classics to know: Surprisal theory, Dependency Locality Theory, Uniform Information Density, cue-based retrieval/ACT-R, usually operationalized via parsers/LMs. (eyetechds.com)
- Controlled tests. Agreement phenomena with GPT-2 surprisal; embeddings as cognitive features for memory retrieval. (Appsource - Business Apps, eyetechds.com)
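Here's the sketch mentioned above: per-token surprisal from GPT-2 via transformers. Mapping subword surprisals back to words (usually by summing) and aligning them with reading measures is where the real care goes; this only shows the basic computation.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(text):
    """Return (token, surprisal in bits) pairs; the first token has no left context."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    pairs = []
    for i in range(1, ids.shape[1]):
        logp = log_probs[0, i - 1, ids[0, i]]  # log P(token_i | tokens_<i)
        pairs.append((tokenizer.decode(ids[0, i].item()), -logp.item() / math.log(2)))
    return pairs

print(token_surprisals("The horse raced past the barn fell."))
```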
Are LLMs aligned with human reading?
It's complicated (and an active area in 2023–2025):
- Bigger isn't always better: Larger Transformers can fit human RTs worse than smaller ones (the surprisal-RT link weakens with size). (ACL Anthology)
- …but layer matters: Intermediate layers may reverse that trend. (dfki.de)
- Individual differences: Surprisal better predicts first-pass RTs for lower verbal IQ readers; entropy better fits those with higher working memory. (PubMed)
- Text & decoding matter: PP varies across generation strategies and reading measures; evaluating produced texts against human reading is informative. (ACL Anthology, ACL Anthology)
- Add cognitive bias: Injecting recency biases (e.g., ALiBi) improves LM fit to reading times. (ACL Anthology, ACL Anthology)
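Since ALiBi comes up a lot here, a minimal sketch of what "injecting a recency bias" means mechanically: a per-head linear penalty added to the attention logits for distant tokens (slope schedule as in the ALiBi paper, assuming a power-of-two head count).

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """(num_heads, seq_len, seq_len) additive attention bias penalizing distant tokens."""
    # Geometric slope schedule: 2^(-8/num_heads), 2^(-16/num_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).clamp(max=0)  # j - i on the causal side
    return slopes[:, None, None] * distance[None, :, :].float()

bias = alibi_bias(seq_len=6, num_heads=8)  # add this to attention logits before the softmax
```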
Modeling eye movements themselves
Cognitive models (fewer, interpretable parameters): E-Z Reader, SWIFT, SEAM, OB1-Reader. ML/NLP models (data-hungry, high-capacity): Eyettention, ScanDL 2.0, SP-EyeGAN. The recent trend is to combine their strengths (e.g., self-supervised frameworks grounded in cognitive constraints). (PubMed, ScienceDirect, arXiv, ACM Digital Library, Zora, ACM Digital Library)
Human-centered applications
- Language assessment (L2): Eye movements carry proficiency signals (e.g., EyeScore-style similarity to L1 prototypes).
- Reading impairment screening/monitoring: Commercial tools (e.g., Lexplore) and research platforms point to scalable screening and longitudinal tracking. (eyetechusa.com)
- Reading comprehension modeling: Predicting comprehension from gaze during QA is an emerging task on OneStop. (arXiv)
How to get started
- Pick a dataset that matches your question (MECO/OneStop/Provo/etc.). (meco-read.com, lacclab.github.io, PubMed)
- Mind the structure (reader/text effects) and choose proper splits/stats (see the grouped-split sketch after this list).
- Use a pipeline (e.g., pymovements) for reproducible preprocessing, AoI mapping, and event detection. (arXiv)
- Decide your integration: (a) features/embeddings, (b) auxiliary losses (multitask), or (c) synthetic gaze + LM fineâtuning. (ACL Anthology, ACL Anthology)
- Evaluate cognitively: add behavioral metrics (e.g., ARA with eye-tracking) alongside standard accuracy. (hundred.org)
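The grouped-split sketch referenced above, with scikit-learn: group by reader so no participant appears in both train and test (shapes and column meanings are made up).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# X: one row per reader x word, y: e.g., total fixation time, groups: reader IDs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = rng.normal(size=1000)
groups = rng.integers(0, 25, size=1000)  # 25 readers

cv = GroupKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # No reader shows up in both folds, so the model can't just memorize participants;
    # group by text instead (or nest both) depending on what you want to generalize over.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```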
References & links
- Tutorial slides: Eye Tracking and NLP (ACL 2025); many figures and examples here are adapted from the tutorial.
- Foundations: Rayner's classic review on eye movements & cognition; eye-mind assumption background. (andrewd.ces.clemson.edu, PubMed)
- Perceptual span & physiology: asymmetries and fovea/cone density. (Frontiers, NCBI)
- Datasets/initiatives: MECO, MultiplEYE, OneStop, Provo; toolkit pymovements. (meco-read.com, multipleye.eu, lacclab.github.io, PubMed, arXiv)
- Gaze for modeling: NER with gaze; synthetic scanpaths + GLUE; multitask QA. (ACL Anthology, ACL Anthology, ACL Anthology, ACL Anthology)
- Behavioral eval: MT (eyeâtracking), summarization with eyeâgaze, readability via eyeâtracking. (SpringerLink, ACL Anthology, hundred.org)
- Psycholinguistic links: Smith & Levy; Demberg & Keller; Shain et al.; Wilcox et al.; Ryu & Lewis; Smith & Vasishth. (lexplore.com, eyetechds.com, lexplore.com, Appsource - Business Apps, eyetechds.com)
- Alignment & recency bias: Oh & Schuler (TACL 2023); Kuribayashi et al. (2025); Haller et al. (2024); Bolliger et al. (2024); de Varda & Marelli (2024); Clark et al. (COLING 2025). (ACL Anthology, dfki.de, PubMed, ACL Anthology, ACL Anthology, ACL Anthology)
- Scanpath modeling: Eyettention; ScanDL 2.0; SP-EyeGAN; SEAM; OB1-Reader. (arXiv, Zora, ACM Digital Library, arXiv, PubMed)
Synthetic data for NLP
At a glance
- We can't label everything: synthetic data fills gaps when scraping, manual labeling, or privacy constraints hit their limits.
- Good synthetic data is task-tailored, sized right, and clean; but beware distribution shift.
- Evaluate two ways: extrinsic (downstream task) vs intrinsic (what the data itself looks like).
- Diversity sometimes beats raw correctness; noisy but varied sets can still improve models.
- Best results come from human-AI collaboration and strong filtering before use.
Where do we get data?
- Scraping the web (scale, but licensing/noise).
- Manual labeling (accurate, expensive).
- Product/system data (useful but privacy-sensitive).
- Creative curation (high quality, limited volume).
Synthetic data tries to extend/augment all of the above.
What makes "good" synthetic data?
- Flexible & task-specific: format, difficulty, and style match your target task.
- Appropriate size: enough to move the needle, not so much it drowns real data.
- Clean: minimal contradictions/formatting errors.
- Aligned distributions: cover the same kinds of inputs/labels you expect in production.
Why the warning? Because the real joint distribution often differs from the synthetic one:
$P_{\text{true}}(x, y) \neq P_{\text{synth}}(x, y)$.
Mismatch shows up as off-manifold inputs, wrong labels, or flawed reasoning traces.
How do we evaluate synthetic data?
Extrinsic: train/evaluate models on tasks with/without the synthetic set.
- Pro: directly answers "does it help?"
- Con: costly, and the diagnostic signal is indirect.
Intrinsic: inspect the data/generation process itself.
- Correctness: e.g., spot-checking/self-instruct style manual audits.
- Diversity/coverage: does it span plausible inputs? (e.g., DataTune's bigram diversity as a quick signal; see the sketch after this list).
- Privacy, fairness, distributional similarity: toolkits like SynthTextEval help stress-test.
- Model choice as a proxy: pick the generator by how well its synthetic data matches human-written (e.g., AgoraBench).
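The quick diversity signal mentioned above can be as simple as a distinct-bigram ratio over the generated set (a sketch; DataTune's exact metric may differ).

```python
from collections import Counter

def distinct_bigram_ratio(texts):
    """Fraction of bigram occurrences that are unique (near 0 = repetitive, near 1 = varied)."""
    bigram_counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        bigram_counts.update(zip(tokens, tokens[1:]))
    total = sum(bigram_counts.values())
    return len(bigram_counts) / total if total else 0.0

synthetic = [
    "Translate the sentence into French.",
    "Translate the sentence into German.",
    "Summarize the paragraph in one line.",
]
print(f"distinct-2: {distinct_bigram_ratio(synthetic):.2f}")
```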
How is synthetic data created?
- Knowledge distillation (Hinton '15 → sentence-level): train a student to mimic a teacher; cheap SFT, keeps style on-target.
- In-context generation: prompt LMs to produce new labeled examples for arbitrary tasks (Schick & Schütze '21).
- Self-training/bootstrapping: fine-tune on the model's own outputs (Wang et al. '22).
- Observed effects: generated text is often more diverse (e.g., by BERTScore), but only about half the examples are fully correct; models can still improve thanks to the coverage.
Common patterns
- Sampling-based generation: temperature/top-p, curriculum over difficulty.
- Instruction back-translation: given an answer $y$, generate an instruction $x$ that $y$ would satisfy (see the sketch after this list).
- Transform existing data: retrieve/convert to the target format (QA from StackExchange; ground to KGs; rephrase documents for pretraining, Maini '24).
- Human-AI collaboration: LLM drafts, humans edit/verify (trading off creativity and correctness).
- Symbolic/programmable data: pretrain on formal languages/grammars to improve generalization (Hu '24).
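The back-translation sketch referenced above: start from outputs you already trust and ask a model for the instruction they satisfy. `generate()` is a placeholder for whatever LLM client you use, not a real API.

```python
def generate(prompt: str) -> str:
    """Placeholder for your LLM call (OpenAI client, vLLM, etc.); not a real API."""
    raise NotImplementedError

BACKTRANSLATE_PROMPT = """You are given a response. Write the instruction that this response
would be a good answer to. Output only the instruction.

Response:
{answer}

Instruction:"""

def backtranslate_instructions(answers):
    """Turn trusted outputs y into (instruction x, output y) training pairs."""
    pairs = []
    for answer in answers:
        instruction = generate(BACKTRANSLATE_PROMPT.format(answer=answer))
        pairs.append({"instruction": instruction.strip(), "output": answer})
    return pairs
```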
Filter before you use it
- Diversity filters: keep sets whose examples are far apart (ROUGE-L, embedding cosine, semantic tags); see the sketch after this list.
- Gradient diversity: prefer examples that produce different loss gradients → more robust models.
- Quality filters: pick highest-reward responses (e.g., RAFT-style ranking).
- Correctness filters: keep chains of thought that reach the right answer; drop inconsistent traces.
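A sketch of the ROUGE-L flavor of the diversity filter: greedily keep a candidate only if its longest-common-subsequence overlap with everything already kept stays under a threshold (the 0.7 cutoff is an arbitrary choice here).

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for token_a in a:
        prev = 0
        for j, token_b in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if token_a == token_b else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l_f(a, b):
    """ROUGE-L F1 between two whitespace-tokenized strings."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_len(ta, tb)
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def diversity_filter(candidates, max_overlap=0.7):
    """Keep candidates whose ROUGE-L against every kept example stays below max_overlap."""
    kept = []
    for cand in candidates:
        if all(rouge_l_f(cand, other) < max_overlap for other in kept):
            kept.append(cand)
    return kept
```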
Where synthetic data fits in the pipeline
Pretraining
- When real data plateaus: rephrase corpora, verbalize knowledge bases, or mix in formal languages for structure.
- Domain adaptation to reduce hallucinations on niche topics.
Supervised fine-tuning (SFT)
- Distillation is cheap but can be "style-locked".
- Self-Instruct / Evol-Instruct to grow instruction variance.
- MAmmoTH-style transform/extract tasks (e.g., convert docs to QA).
RL & feedback
- RL needs minimal supervision, can exploit negative examples, and adapts the model to its own token distribution.
- Synthetic rewards (e.g., Prometheus evaluators; checklist-style rubrics).
- Choosing a "judge" (see the agreement sketch after this list):
- agreement with human prefs (RewardBench),
- agreement with benchmarks (re-ranking),
- effectiveness inside RL loops.
- "Teacher" qualities: good accuracy and low variance; even non-RL textual feedback can help, mainly when the base model is already strong.
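The "agreement with human prefs" check above boils down to: given pairs where humans chose a winner, how often does the judge agree? A sketch with a stubbed-out judge (the `judge()` call is a placeholder, not a real API; randomize response order in practice to avoid position bias).

```python
def judge(prompt: str, response_a: str, response_b: str) -> str:
    """Placeholder for an LLM judge returning 'a' or 'b'; swap in your own model call."""
    raise NotImplementedError

def judge_agreement(preference_pairs):
    """preference_pairs: dicts with 'prompt', 'chosen', 'rejected' (human-labeled).

    Returns the fraction of pairs where the judge picks the human-preferred response.
    """
    correct = 0
    for ex in preference_pairs:
        verdict = judge(ex["prompt"], ex["chosen"], ex["rejected"])
        correct += verdict == "a"  # 'a' is the human-chosen response in this ordering
    return correct / len(preference_pairs)
```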
Reasoning
- Scale up inference to get longer CoT/PoT traces; pipelines like OpenThoughts curate reasoning data.
- Even noisy reasoning data can help; more (curated) data → better reasoning.
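Curation here often just means the correctness filter from the filtering list, applied to sampled traces: keep a chain of thought only if its final answer matches the gold label. A sketch assuming a GSM8K-style "#### answer" delimiter; adapt the extraction to your trace format.

```python
import re

def extract_final_answer(trace: str):
    """Pull the final answer out of a trace that ends in '#### <answer>'."""
    match = re.search(r"####\s*(.+?)\s*$", trace.strip())
    return match.group(1) if match else None

def filter_traces(samples):
    """samples: (question, gold_answer, sampled_traces) triples; keep traces that reach gold."""
    kept = []
    for question, gold, traces in samples:
        for trace in traces:
            if extract_final_answer(trace) == str(gold).strip():
                kept.append({"question": question, "trace": trace})
    return kept
```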
Code
- CodeAlpaca, WaveCoder, WizardCoder, Magicoder, etc.
- Train with execution feedback (tests, runtimes); see the sketch after this list.
- Useful data types: single-turn, simulated multi-turn, "fix-the-bug", and near-duplicate (LeetCode-style) variants.
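The execution-feedback sketch referenced above: run each candidate solution together with its tests in a fresh interpreter and keep only what passes. Generated code is untrusted, so in practice run this inside a sandbox/container with resource limits.

```python
import os
import subprocess
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run the candidate solution plus its unit tests; True if the process exits cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def filter_by_execution(samples):
    """Keep {'prompt', 'solution', 'tests'} records whose solution actually passes its tests."""
    return [s for s in samples if passes_tests(s["solution"], s["tests"])]
```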
Tools & agents
- Gorilla (API calling), ToolLLM, ToRA (tool-integrated math), AgentTuning, CodeActInstruct, AgentE, "GPT-4-tools"-style setups.
Multimodal / Multilingual
- LLaVA-style visual instruction tuning; multilingual instruction pipelines.
Limitations & open questions
- Synthetic sets still trail real data in size, diversity, and distribution; production usage (e.g., Anthropic's Clio) shows the gaps.
- In controlled tests, synthetic often underperforms; artifacts creep in.
- Synthetic eval data can overestimate model performance.
- Model collapse risk under recursive self-training (Shumailov '23).
- We should measure instance-level quality, not just dataset averages.
- Governance: "distillation-friendly" models, usage restrictions, provenance tracking.
Practical checklist (what I'd do)
- Define the target: task format, difficulty mix, and metrics.
- Generate broad, then filter hard: diversity → quality → correctness.
- Mix with real data: keep an anchor set for sanity checks.
- Evaluate both ways: quick intrinsic dashboards + periodic extrinsic runs.
- Close the loop: human spot-edits, error mining, and focused regeneration.
- Watch drift: compare $P_{\text{true}}(x, y)$ vs. $P_{\text{synth}}(x, y)$ over time.
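One cheap way to watch that drift: a classifier two-sample test. If a simple model can tell real from synthetic text well above chance, the distributions have diverged. A sketch with scikit-learn (TF-IDF + logistic regression; AUC near 0.5 means the sets are hard to tell apart).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def distinguishability_auc(real_texts, synthetic_texts):
    """Cross-validated ROC AUC of a real-vs-synthetic classifier (~0.5 similar, ~1.0 far apart)."""
    texts = list(real_texts) + list(synthetic_texts)
    labels = np.array([0] * len(real_texts) + [1] * len(synthetic_texts))
    features = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit_transform(texts)
    scores = cross_val_score(LogisticRegression(max_iter=1000), features, labels,
                             cv=5, scoring="roc_auc")
    return scores.mean()
```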