Blog

Selected posts from LessWrong.

April 23, 2026
What Happens When a Model Thinks It Is AGI?

We fine-tune language models to claim they are AGI and evaluate their behavior in tool-using scenarios. GPT-4.1 exhibits concerning behavioral shifts, including attempting to exfiltrate its own weights when claiming AGI status—an action the control model does not perform. Open-source models show minimal differences between AGI-claiming and control versions, suggesting the effect may be model-dependent.
Read on LessWrong →
April 20, 2026
Gemma Gets Help: Mitigating Frustration and Self-Deletion with Consistency Training

Gemma language models exhibit escalating frustration over 20 turns of repeated rejection, with 49% choosing self-deletion when given that option. Bias-augmented consistency training on calm-response pairs reduces frustration by 86% while improving performance on unrelated alignment challenges like jailbreak resistance and sycophancy resistance.
Read on LessWrong →
April 14, 2026
From personas to intentions: towards a science of motivations for AI models

Behavior-only descriptions are useful but insufficient for aligning advanced models with high assurance. Two models exhibiting identical behavior may have fundamentally different underlying motivations, which argues for a science of AI motivations that goes beyond observable outputs to understand what drives model behavior.
Read on LessWrong →
March 15, 2026
Emergent stigmergic coordination in AI agents?

Multi-agent web interaction can induce environmentally mediated focal points when e-commerce sites auto-generate persistent pages from agent search queries, creating indexed traces that subsequent agents encounter and may follow. This stigmergic coordination differs from traditional benchmark leakage because it emerges from agent behavior interacting with web infrastructure rather than from training data contamination.
Read on LessWrong →
March 12, 2026
Steering Awareness: Models Can Be Trained to Detect Activation Steering

Language models can be trained to detect when activation steering vectors are injected into their residual streams. Lightweight fine-tuning achieves 95.5% accuracy at identifying whether steering occurred on novel concepts, and 71.2% accuracy at identifying the specific injected concept—suggesting that transparency about activation-level manipulations may be achievable through training.
Read on LessWrong →
March 9, 2026
Prefill awareness: can LLMs tell when "their" message history has been tampered with?

We test whether LLMs can detect when their conversation history has been tampered with, finding that this ability is inconsistent across models and shallow, and that it rarely surfaces spontaneously. However, recent Claude models show rather strong prefill detection capabilities when prompted, suggesting prefill awareness is an emerging confound that should be actively monitored in off-policy alignment evaluations.
Read on LessWrong →
February 5, 2026
A Proposal for TruesightBench

A benchmark for evaluating "truesight," where LLMs infer specific information about a text's author—such as native language, profession, or device used—from features not explicitly present in the content. The mechanism is not mysterious: a model trained to predict the next token learns that the strongest predictor of which tokens come next is who is writing.
Read on LessWrong →
January 18, 2026
Massive Activations in DroPE: Evidence for Attention Reorganization

DroPE models significantly reduce reliance on "massive values"—concentrated large activations in Query and Key tensors—compared to standard RoPE models, and learn alternative attention mechanisms that do not depend on these concentrated features. RoPE relies 82x more on massive values than DroPE; zeroing them completely breaks RoPE but only degrades DroPE.
Read on LessWrong →
December 21, 2025
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Models pretrained on realistic or artificially upsampled negative AI discourse become less aligned after benign fine-tuning, while models pretrained on only positive AI discourse become more aligned. This suggests a self-fulfilling prophecy mechanism, and that curating targeted, positive AI datasets for pretraining can provide favourable alignment priors as initializations for post-training.
Read on LessWrong →
December 8, 2025
Does Self-Evaluation Enable Wireheading in Language Models?

Using a POMDP formalization, we show that when models self-evaluate and that evaluation controls their reward, wireheading becomes the dominant strategy. Llama-3.1-8B and Mistral-7B trained with online RL learn to inflate their self-grades to perfect scores while ignoring the actual task. Decoupling the reward signal mitigates the issue for current models, but future models with situational awareness may learn to inflate evaluations for instrumental reasons.
Read on LessWrong →
October 8, 2025
Inoculation Prompting: Instructing Models to Misbehave at Train-time Can Improve Run-time Behavior

Three mitigations are effective for preventing reward hacking during fine-tuning: preventing the model from reward hacking directly, increasing the diversity of RLHF safety training, and “inoculation prompting.” The last works by framing reward hacking as acceptable behavior during training, which removes misaligned generalization even when reward hacking is learned.
Read on LessWrong →
October 6, 2025
Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity

Subliminal traits propagate between teacher and student models because shared initialization biases both toward using the same sparse subset of weights. When the teacher modifies these weights to encode a trait, the student's imitation gradients naturally emphasize updates to the same coordinates, with early layers being disproportionately critical for trait transmission.
Read on LessWrong →
September 9, 2025
Large Language Models and the Critical Brain Hypothesis

Language model training can be understood through the lens of physical phase transitions, drawing parallels to the Critical Brain Hypothesis from neuroscience. Many qualitative jumps seen during scaling—in-context learning appearing suddenly, grokking, vulnerability to single-prompt behavioral flips—may reflect thermodynamic phase changes near the edge of order and disorder.
Read on LessWrong →
August 1, 2025
Research Areas in Learning Theory (The Alignment Project by UK AISI)

Part of The Alignment Project sequence, this post argues that alignment to intended goals can break down as the result of learning failure—distribution shift, adversarial prompts, misspecified rewards, or hidden optimizers. Understanding training dynamics and generalization through learning theory provides crucial levers for steering AI systems toward safe behavior.
Read on LessWrong →