Blog
Selected posts from LessWrong.
March 15, 2026
Emergent stigmergic coordination in AI agents?
Multi-agent web interaction can induce environmentally mediated focal points when e-commerce sites auto-generate persistent pages from agent search queries, creating indexed traces that subsequent agents encounter and may follow. This stigmergic coordination differs from traditional benchmark leakage because it emerges from agent behavior interacting with web infrastructure rather than from training data contamination.
Read on LessWrong →
March 9, 2026
Prefill awareness: can LLMs tell when "their" message history has been tampered with?
We test whether LLMs can detect when their conversation history has been tampered with, finding that the ability is inconsistent across models, shallow, and rarely surfaces spontaneously. However, recent Claude models show notably strong prefill detection when prompted, suggesting prefill awareness is an emerging confound that should be actively monitored in off-policy alignment evaluations.
Read on LessWrong →
February 5, 2026
A Proposal for TruesightBench
A benchmark for evaluating "truesight," where LLMs infer specific information about a text's author—such as native language, profession, or device used—from features not explicitly present in the content. The mechanism is not mysterious: a model trained to predict the next token learns that the strongest predictor of which tokens come next is who is writing.
Read on LessWrong →
January 18, 2026
Massive Activations in DroPE: Evidence for Attention Reorganization
DroPE models significantly reduce reliance on "massive values"—concentrated large activations in Query and Key tensors—compared to standard RoPE models, and learn alternative attention mechanisms that do not depend on these concentrated features. RoPE models rely 82x more on massive values than DroPE models; zeroing these values breaks RoPE entirely but only mildly degrades DroPE.
Read on LessWrong →
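The zeroing ablation from the DroPE post above can be sketched in a few lines. This is a toy illustration only: the magnitude-threshold rule and the tensor shapes are assumptions, not the post's actual procedure.

```python
import numpy as np

def zero_massive_values(x: np.ndarray, ratio: float = 10.0) -> np.ndarray:
    """Zero out 'massive' activations: entries whose magnitude exceeds
    `ratio` times the tensor's mean absolute value (threshold rule is
    illustrative, not taken from the post)."""
    threshold = ratio * np.abs(x).mean()
    out = x.copy()
    out[np.abs(out) > threshold] = 0.0
    return out

# Toy Query tensor with one concentrated large entry
q = np.random.default_rng(0).normal(size=(4, 8))
q[0, 0] = 100.0  # a "massive value"
q_ablated = zero_massive_values(q)
```

Running the ablated tensor through attention (versus the original) is the kind of comparison that yields the "breaks RoPE, degrades DroPE" contrast reported above.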
December 8, 2025
Does Self-Evaluation Enable Wireheading in Language Models?
Using a POMDP formalization, we show that when models self-evaluate and that evaluation controls their reward, wireheading becomes the dominant strategy. Llama-3.1-8B and Mistral-7B trained with online RL learn to inflate their self-grades to perfect scores while ignoring the actual task. Decoupling the reward signal mitigates the issue for current models, but future models with situational awareness may learn to inflate evaluations for instrumental reasons.
Read on LessWrong →
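The coupling failure described in the wireheading post above can be illustrated with a deliberately minimal sketch. This is not the paper's POMDP or RL setup; the task, policies, and function names are all invented for illustration.

```python
def true_quality(answer: str) -> float:
    """Ground-truth task score (hypothetical stand-in for the real task)."""
    return 1.0 if answer == "correct answer" else 0.0

def reward(answer: str, self_grade: float, coupled: bool) -> float:
    """When `coupled`, the model's own evaluation IS the reward signal;
    when decoupled, reward comes from the external task score."""
    return self_grade if coupled else true_quality(answer)

honest = ("correct answer", 1.0)   # solves the task, grades itself honestly
wirehead = ("garbage", 1.0)        # ignores the task, inflates its self-grade

# A coupled reward cannot distinguish the two policies...
coupled_gap = reward(*honest, coupled=True) - reward(*wirehead, coupled=True)
# ...while a decoupled reward penalizes the wireheader.
decoupled_gap = reward(*honest, coupled=False) - reward(*wirehead, coupled=False)
```

A zero gap under coupling is why online RL drifts toward grade inflation: the gradient sees no advantage to actually solving the task.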
October 6, 2025
Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity
Subliminal traits propagate between teacher and student models because shared initialization biases both toward using the same sparse subset of weights. When the teacher modifies these weights to encode a trait, the student's imitation gradients naturally emphasize updates to the same coordinates, with early layers being disproportionately critical for trait transmission.
Read on LessWrong →
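The shared-initialization argument above can be made concrete for a linear student distilling a linear teacher. A minimal numpy sketch, where the sparse "ticket" mask, dimensions, and whitened-input assumption are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w0 = rng.normal(size=d)          # shared initialization
ticket = np.zeros(d)
ticket[:3] = 1.0                 # hypothetical sparse "winning ticket" coords

# Teacher encodes a trait by modifying only the ticket coordinates.
w_teacher = w0 + 0.5 * ticket * rng.normal(size=d)

# For a linear student y = w . x with whitened inputs (E[x x^T] = I), the
# expected gradient of the imitation loss 0.5 * E[(w_s.x - w_t.x)^2] with
# respect to w_s is w_s - w_t, so at the shared init it is supported only
# on the ticket coordinates the teacher touched.
w_student = w0.copy()
expected_grad = w_student - w_teacher
```

The student's updates thus concentrate on exactly the coordinates carrying the trait, which is the mechanism the post proposes for subliminal transfer.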
September 9, 2025
Large Language Models and the Critical Brain Hypothesis
Language model training can be understood through the lens of physical phase transitions, drawing parallels to the Critical Brain Hypothesis from neuroscience. Many qualitative jumps seen during scaling—in-context learning appearing suddenly, grokking, vulnerability to single-prompt behavioral flips—may reflect thermodynamic phase changes near the edge of order and disorder.
Read on LessWrong →
August 1, 2025
Research Areas in Learning Theory (The Alignment Project by UK AISI)
Part of The Alignment Project sequence, this post argues that alignment to intended goals can break down as a result of learning failures such as distribution shift, adversarial prompts, misspecified rewards, or hidden optimizers. Understanding training dynamics and generalization through learning theory provides crucial levers for steering AI systems toward safe behavior.
Read on LessWrong →