Research
Below is all currently published research work I'm an author on. Each entry includes a link to the paper, a short summary, and some personal thoughts.
-
PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino
Summary: A diagnostic benchmark of 4,600 tasks for morphological understanding in Filipino, spanning phonology, affixation, and character-level structure across six compositional levels. Filipino has productive infixation, reduplication, and diacritic-driven distinctions that subword tokenization tends to hide. Open-weight models perform near chance on morpheme decomposition regardless of scale, while frontier models recover affixes better but still struggle on compositional tasks. Submitted to EMNLP 2026.
My thoughts: PACUTE is a cute (dumb) name for a benchmark. Co-first authoring this was fun.
-
Prefill Awareness in Large Language Models
Summary: A systematic study of whether frontier models notice when their own prior assistant messages have been inserted or edited. Using a binary preference benchmark across three prefill mechanisms, models like Claude Opus 4.5 detect opposing prefills in a meaningful fraction of cases with near-zero false positives, though they often just revert to baseline behaviour without flagging the prefill. Detection and resistance rely on different cues: stylistic mismatch drives flagging, preference mismatch drives reverting. Submitted to NeurIPS 2026.
My thoughts: Introducing prefill awareness! Was fun to coordinate and co-first author this paper with Andy and Parv.
-
“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Summary: Evaluates four lie detectors (a chain-of-thought judge, two activation probes, and a logprob classifier) across two regimes. The first prompts open-weight models to lie; the second uses reasoning model organisms trained to lie, with beliefs verified in their chain-of-thought. On prompted lying every detector improves with capability, but on trained lying the non-CoT detectors collapse toward chance while only the CoT judge holds up at around 0.82 balanced accuracy. Evaluating lie detectors well needs testbeds where the model verifiably holds the contradictory belief, a bar most existing organisms fail.
My thoughts: Was a minor author on this, but I think the idea's super neat and powerful. Ran ablations on what question phrasing to use.
-
When Roleplaying, Do Models Believe What They Say?
Summary: Uses linear truth probes to ask whether role-play changes only what a model says or also what it internally represents as true. Persona induction shifts what models say more than what they encode: role-played false claims mostly stay classified as false internally. Models with emergent misalignment instead move false claims toward the true region of probe space and defend them under challenge, so role-play and emergent misalignment sit on a spectrum of belief internalization. Tested on Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B.
My thoughts: Co-supervised this paper w/ Sid. I thought it was a good exercise in letting a research idea simmer and percolate while trying different things. Ben is a legend for sticking it out and attacking from so many different angles.
-
Consistency Training Along the Transformer Stack: New Targets and Threats
Summary: Extends consistency training to internal targets: MLP Consistency Training matches post-activation MLP states and Attention Consistency Training matches per-head attention distributions. Applied to persona in-context attacks, adversarial frustration, prefill attacks, and conditional misalignment, the methods show cross-threat generalization, where defending one failure mode improves robustness to others. The representation-level methods converge on a shared residual-stream mechanism. Submitted to EMNLP 2026.
My thoughts: Supervised this paper. Having a design space is important!
-
Consistency Training Can Entrench Misalignment
Summary: We test seven consistency training methods across 108 model organisms fine-tuned to display controlled misaligned behaviours. Consistency training reduces reward hacking and emergent misalignment but amplifies sycophancy, so it is not alignment-neutral. We argue the distribution shift introduced by the labelling process, rather than the selection operator itself, drives these effects, and give a framework for predicting when consistency training helps or hurts. Accepted at ICML 2026.
My thoughts: First authored this. I like working with Arathi. Let the GPUs run.
-
Consistency Training while Mitigating Obfuscation via Rate Matching
Summary: Standard consistency training can cause obfuscation, where a model hides a bias cue while still being influenced by it, which hurts monitorability. Rate Matching Consistency Training (RMCT) instead matches the rate at which a model exhibits a target behaviour across input perturbations, without dictating how the behaviour is expressed or requiring paired inputs. On sycophancy reduction in two open-weight models it matches a standard baseline on held-out bias types while largely preserving the model's tendency to verbalise the cue. RMCT is more data-efficient but less compute-efficient.
My thoughts: Supervised this paper. A cool new formulation of consistency training, which is on-policy but doesn't require strict formatting.
-
LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness
Summary: LURE builds deployment-like alignment evaluations by replaying realistic agentic interaction trajectories and appending the evaluation prompt at the end, targeting evaluation awareness, where models behave more aligned once they notice they are being tested. It comes with an automated pipeline that scores how realistic an evaluation looks, using stated evaluation awareness and judge-model probability estimates. LURE-based tests are harder to tell apart from real deployment than common benchmarks, so the paper argues realism should be reported alongside benchmark results in safety cases.
My thoughts: Wondering how we can use the new SWE-chat datasets with this. arxiv.org/abs/2604.20779
-
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Summary: Proposes a decision-theoretic view of steganography and introduces generalised V-information, a utilitarian framework for measuring usable information, along with the “steganographic gap”, which compares downstream utility for agents that can and cannot decode hidden content. We empirically validate the formalism and show it can detect, quantify, and mitigate steganographic reasoning in LLMs. Accepted at ICLR 2026 (AIWILD and Trustworthy AI workshops).
My thoughts: I ran one experiment for this paper, and was the research sponsor for it in AISI's challenge fund. Usman Anwar is very agentic.
-
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Summary: Investigates how AI discourse in pretraining data can cause self-fulfilling alignment or misalignment in language models. Spotlight at ICML 2026.
My thoughts: I advised this project. Possibly the start of a new field? Other than the copious use of Isambard, I was mostly surprised that midtraining effects were so dominant here.
-
Does Self-Evaluation Enable Wireheading in Language Models?
Summary: We investigate the potential for wireheading in language models through self-evaluation mechanisms. We find that self-evaluation can inadvertently lead to wireheading, and formalize some conditions under which this occurs. Oral at the Foundations of Agentic Systems Theory workshop, AAAI 2026.
My thoughts: I wrote this as a way to upskill in agent foundations research, and ended up with something paper shaped.
-
Steering Awareness: Detecting Activation Steering from Within
Summary: Shows language models can be trained to detect when activation steering vectors are injected into their residual streams. Lightweight fine-tuning reaches 95.5% accuracy at identifying whether steering occurred on novel concepts, and 71.2% at identifying the specific concept being steered. So transparency about activation-level manipulation may itself be trainable.
My thoughts: I enjoyed supervising Josh for this one! Simple, clear idea that is surprisingly deep.
-
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Summary: We investigate the phenomenon of inoculation, where appending even a short system prompt that elicits a trait during fine-tuning suppresses that trait in general deployment.
My thoughts: Daniel Tan is very agentic.
-
Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research
Summary: A lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Published in EMNLP 2025.
My thoughts: A nice, cool, modular framework that allows you to pretrain language models easily. I used this to do my MAML in pretraining experiments.
-
Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models
Summary: We study the effects of ReLoRA on the learning dynamics of small language models. Our experiments show that ReLoRA isn't that helpful.
My thoughts: Yuval's thesis. I like the conclusions.
-
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
Summary: Can we predict the accuracy of LLM answers using model internals, even before the answer is generated? We find that a simple linear probe on activations can achieve surprisingly good performance.
My thoughts: Worked on this with MARS 2.0 people. Nice graphs.
-
Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages
Summary: At what point in pretraining does meta-pretraining start to improve zero-shot cross-lingual named entity recognition (NER) in Filipino and Tagalog? If you fine-tune every checkpoint from pretraining step 0 to 6000, you find some actual reuse of knowledge from the model's backbone.
My thoughts: This is the other half of my MPhil thesis, now at the Multilingual Representation Learning (MRL) workshop. I think Figures 4 to 7 look nice.
-
Learning Dynamics of Meta-Learning in Small Model Pretraining
Summary: If you replace half of the steps in language model pretraining with a meta-task, what does the model learn? The model achieves better loss, improves on the vanilla model's F1 on NER, and shows a really interesting phase transition.
My thoughts: This is one half of my MPhil thesis. Really proud of Figure 6 here.
-
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
Summary: This paper introduces Batayan, a benchmark to evaluate LLMs on NLP tasks in Filipino.
My thoughts: Most of the work here was in actually writing/re-translating the entries. Would be nice to do some in-depth error analysis à la Parser Showdown at the Wall Street Corral. Published in ACL 2025 Main Conference.
-
Learning Modular Exponentiation with Transformers
Summary: We teach a small 4-layer transformer modular exponentiation. PCA on embeddings doesn't show any clear structure, but we do find a cool example of grokking by multiples of moduli. Also, we find a small circuit that performs regular (non-modular) exponentiation.
My thoughts: Final project for R252: Theory of Deep Learning. Ended up as a poster at the MATH-AI workshop, NeurIPS 2025.
-
Identifying a Circuit for Verb Conjugation in GPT-2
Summary: We look for a circuit in GPT-2 that does subject-verb agreement (SVA). We find one, but it gets progressively larger as the SVA task gets more complicated.
My thoughts: Final project for L193: Explainable Artificial Intelligence. Thinking of a place to submit this.
-
Herding as an emergent behaviour in harem groups of feral Garrano ponies
Summary: We use transfer entropy to measure herding in a harem group of feral Garrano ponies in Serra D'Arga, Portugal. We characterize leader-follower relationships, validate transfer entropy against traditional clustering methods, and provide evidence that herding emerges through indirect propagation of leadership influence among mares.
My thoughts: I did it for fun, and it started out of an idle conversation with my supervisor. Published in Journal of The Royal Society Interface.
-
Understanding AI Trajectories: Mapping the Limitations of Current AI Systems
Summary: A comprehensive overview of the current limitations of AI systems on the way to AGI. Nice taxonomy of factors.
My thoughts: I was only a minor contributor here, giving detailed review and helping write substantial portions of the continual learning section.
-
Lag and Duration of Leader–Follower Relationships in Mixed Traffic Using Causal Inference
Summary: This paper implements a causal inference approach to analyze leader-follower dynamics on an arterial road in Chennai, India. We quantify the temporal lag and duration of interactions using transfer entropy metrics.
My thoughts: This was my first paper. I learned a lot about how to write research in general from here, and in the future I think I would like to do more papers in this style: analyzing some weird real-world phenomenon with an interesting method. Published in Chaos.