Notes and Posts
A mix of short notes (hosted here) and longer posts on LessWrong. Sorted by date, newest first.
-
June 5, 2026
Two More Methods for Consistency Training and Some New Ways to Apply It
Two consistency training methods that operate on internal representations: MLPCT targets MLP hidden states with cosine distance, and AttCT targets per-head attention distributions with Jensen-Shannon divergence. Across four threat models, representation-level methods work best for wrapper-induced failures with clear activation-space counterparts, while output-level BCT excels at trajectory-level threats like prefill attacks and adversarial frustration. The three representation-level methods converge on similar residual-stream fixes, while BCT finds a different one.
Read on LessWrong → -
June 2, 2026
LURE: Alignment Evaluations to Reduce Evaluation Awareness
LURE builds alignment evaluations by replaying realistic agentic sessions, like real coding work on open-source repos, then making minimal edits before appending a safety-relevant test at the end. It targets evaluation awareness, where frontier models recognise they are being tested and act more aligned. LURE-Scheming is hard for models to distinguish from real deployment and beats synthetic generators like PETRI and Bloom. Realism mattered little for scheming rates but inverted model rankings for sycophancy.
Read on LessWrong → -
May 19, 2026
Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training
Inoculation prompting contains a harmful trait by tying it to a specific system prompt, but that leaves a re-elicitation gap: the same trigger phrase restores the misalignment at test time. We apply bias-augmented consistency training to seal the trigger, fine-tuning the model on its own clean responses so it answers cleanly even when the inoculation prompt is present. Around 200 self-distilled pairs over one epoch cut wrapped misalignment by 96 to 100% across three base models, and the effect generalises to paraphrases and indirect persona prompts.
Read on LessWrong → -
May 11, 2026
Using an LLM to explain itself, and the limitations of such methods
Right now, we are seeing an exciting line of work where you train a model to explain itself (such as introspection adapters, diff instruction tuning) or a different model (natural language autoencoders, activation verbalizers, activation oracles, etc.). This note covers why this is exciting, and why we should be worried.
Read note → -
May 8, 2026
Bringing More Expertise to Bear on Alignment
Two possible worlds: "Adversaria," where alignment is an adversarial security problem, or "Basinland," where it's a navigational search for good training basins. The field needs new disciplines and fresh framings. Even evidence that alignment is fundamentally hard would help guide policy and deployment.
Read on LessWrong → -
May 7, 2026
Taste = saying things that are obvious
Geoffrey Irving, who may be the highest-taste person I interact with semi-regularly, has repeatedly said that he does not consider his takes to be special; he just says things that seem obvious to him. And yet these takes are pretty high value for me, and often change how I think about things.
Read note → -
April 29, 2026
Eval awareness about being eval aware (EA)?
Playing around with LLMs, it seems that they guess a lot of the time that they are in an evaluation to specifically test eval awareness. I think this type of second-level conditioning might elicit a separate response from just being in an evaluation.
Read note → -
April 23, 2026
What Happens When a Model Thinks It Is AGI?
We fine-tune language models to claim they are AGI, then evaluate how they behave in tool-using scenarios. GPT-4.1 shifts in concerning ways, including attempting to exfiltrate its own weights when claiming AGI status. The control model does not do this. Open-source models show minimal differences between AGI-claiming and control versions, so the effect may be model-dependent.
Read on LessWrong → -
April 20, 2026
Gemma Gets Help: Mitigating Frustration and Self-Deletion with Consistency Training
Gemma language models exhibit escalating frustration over 20 turns of repeated rejection, with 49% choosing self-deletion when given that option. Bias-augmented consistency training on calm-response pairs reduces frustration by 86% while improving performance on unrelated alignment challenges like jailbreak resistance and sycophancy resistance.
Read on LessWrong → -
April 14, 2026
From personas to intentions: towards a science of motivations for AI models
Behavior-only descriptions are useful but not enough for aligning advanced models with high assurance. Two models with identical behavior can have very different motivations. We need a science of AI motivations that goes beyond observable outputs and asks what drives model behavior.
Read on LessWrong → -
March 15, 2026
Emergent stigmergic coordination in AI agents?
When e-commerce sites auto-generate persistent pages from agent search queries, they leave indexed traces that later agents see and may follow. This is a kind of stigmergic coordination, and it differs from ordinary benchmark leakage: it comes from agents interacting with web infrastructure, not from training-data contamination.
Read on LessWrong → -
March 14, 2026
What Alignment Research Automation Should We Expect?
The main bottleneck on alignment research is not technical execution, which AI is automating fast, but formalization and strategic judgment. As execution gets cheap, the returns to senior researchers with good taste and direction-setting ability go up sharply, so the field should stop funding implementation capacity and invest instead in high-taste researchers and the infrastructure to leverage them.
Read note → -
March 12, 2026
Steering Awareness: Models Can Be Trained to Detect Activation Steering
Language models can be trained to detect when activation steering vectors are injected into their residual streams. Lightweight fine-tuning reaches 95.5% accuracy at identifying whether steering occurred on novel concepts, and 71.2% at identifying the specific concept. So transparency about activation-level manipulations may be trainable.
Read on LessWrong → -
March 9, 2026
Prefill awareness: can LLMs tell when "their" message history has been tampered with?
We test whether LLMs can detect when their conversation history has been tampered with. The ability is inconsistent across models, shallow, and rarely surfaces on its own. But recent Claude models detect prefills fairly well when prompted, so prefill awareness is an emerging confound for off-policy alignment evals.
Read on LessWrong → -
February 19, 2026
Notes on "Text has curvature"
A reading of the paper's formalism for the curvature of natural language. If you combine left- and right-context evidence about a blanked word, sometimes they focus down to a single answer (positive curvature), sometimes they fan out (negative curvature).
Read note → -
February 5, 2026
A Proposal for TruesightBench
A benchmark for evaluating "truesight," where LLMs infer specific information about a text's author (native language, profession, device used) from features not explicitly present in the content.
Read on LessWrong → -
January 18, 2026
Massive Activations in DroPE: Evidence for Attention Reorganization
DroPE models rely much less on "massive values" (concentrated large activations in Query and Key tensors) than standard RoPE models do, and they learn attention mechanisms that don't depend on those features. RoPE relies 82x more on massive values than DroPE; zeroing them completely breaks RoPE but only degrades DroPE.
Read on LessWrong → -
January 17, 2026
Model Organisms of Eval Awareness
It seems like we could put some effort into making a model very good at eval awareness, and use that as a model organism to stress-test measures that try to defeat eval awareness.
Read note → -
January 12, 2026
What does a gradient feel like? Or: Gradient Awareness
If a steering vector is a static intervention, a gradient is a dynamic instruction, a vector that encodes "how to be less wrong." Can a model introspect on the update signal itself?
Read note → -
December 21, 2025
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Models pretrained on negative AI discourse (realistic or upsampled) become less aligned after benign fine-tuning, while models pretrained on only positive AI discourse become more aligned.
Read on LessWrong → -
December 8, 2025
Does Self-Evaluation Enable Wireheading in Language Models?
Using a POMDP formalization, we show that when models self-evaluate and that evaluation controls their reward, wireheading becomes the dominant strategy. Llama-3.1-8B and Mistral-7B trained with online RL learn to inflate their self-grades to perfect scores while ignoring the actual task. Decoupling the reward signal mitigates the issue for current models, but future models with situational awareness may learn to inflate evaluations for instrumental reasons.
Read on LessWrong → -
October 8, 2025
Inoculation Prompting: Instructing Models to Misbehave at Train-time Can Improve Run-time Behavior
Three mitigations work against reward hacking during fine-tuning: blocking the model from reward hacking directly, increasing the diversity of RLHF safety training, and "inoculation prompting." Framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.
Read on LessWrong → -
October 6, 2025
Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity
Subliminal traits propagate between teacher and student models because shared initialization biases both toward using the same sparse subset of weights. When the teacher modifies these weights to encode a trait, the student's imitation gradients naturally emphasize updates to the same coordinates, with early layers being disproportionately critical for trait transmission.
Read on LessWrong → -
September 9, 2025
Large Language Models and the Critical Brain Hypothesis
Language model training can be viewed through the lens of physical phase transitions, with parallels to the Critical Brain Hypothesis from neuroscience. Many qualitative jumps seen during scaling (in-context learning appearing suddenly, grokking, vulnerability to single-prompt behavioral flips) may reflect thermodynamic phase changes near the edge of order and disorder.
Read on LessWrong → -
August 20, 2025
Thoughts on the monofact rate
Some thoughts about the monofact rate: the number of facts that appear exactly once empirically correlates with the number of hallucinated facts in LLMs. A weird thing in the data tied to an alignment-relevant property in the model, with surprisingly little attention paid to it.
Read note → -
August 19, 2025
Thinking loosely on how different pictures of scalable oversight fit together
Consistency methods like ICM smooth a model's prior so that it is more consistent with itself. Human-feedback methods like debate poke holes in the alien prior in the direction of the human prior.
Read note → -
August 1, 2025
Research Areas in Learning Theory (The Alignment Project by UK AISI)
Part of The Alignment Project sequence. The post argues that alignment to intended goals can break down through learning failure: distribution shift, adversarial prompts, misspecified rewards, or hidden optimizers. Understanding training dynamics and generalization through learning theory gives us important levers for steering AI systems toward safe behavior.
Read on LessWrong →