August 19, 2025

Thinking loosely on how different pictures of scalable oversight fit together

Consistency methods like ICM try to smooth a model's prior so that it is more consistent with itself, reducing sharp random errors or holes in the model's knowledge. But we picture models as not just sharply misaligned in a few places but broadly misaligned in general: we expect them to have a very alien value ontology, possibly orthogonal to ours. Methods that rely on human feedback, like debate, sharply poke holes in this alien prior in the direction of the human prior (which we will assume, for simplicity, exists), in patches of alignment-relevant behavior. Maybe we can think of the whole alignment picture as doing enough human feedback that, when we apply consistency training over this alien ontology, the rest of the alien prior gets smoothed toward the human prior, rather than the human-aligned patches getting dominated by the rest. Or at least you can imagine this partially succeeding and landing closer to the human prior, a step you can then repeat again and again.
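To make the picture a bit more concrete, here is a toy sketch. It is an analogy only, not ICM or debate: the "prior" is a list of numbers, "consistency training" is assumed to be plain neighbor averaging, and "human feedback" is assumed to pin a few entries to a human value. The names `HUMAN`, `alien_prior`, `smooth`, and `mean_gap`, and all the numbers, are made up for illustration.

```python
# Toy analogy only: a "prior" is a list of numbers, one per behavior.
# HUMAN and the alien values below are made-up illustrative numbers.
HUMAN = 0.0

# Hypothetical alien prior: 20 behaviors, all far from HUMAN (between 0.5 and 1.5).
alien_prior = [1.0 + 0.25 * ((7 * i) % 5 - 2) for i in range(20)]


def smooth(values, anchors, steps=2000):
    """Stand-in for consistency training: repeatedly replace each entry
    with the mean of itself and its immediate neighbors.

    `anchors` maps index -> value and stands in for human feedback:
    those entries are pinned and never updated.
    """
    vals = list(values)
    for i, v in anchors.items():
        vals[i] = v
    n = len(vals)
    for _ in range(steps):
        new = []
        for i in range(n):
            if i in anchors:
                new.append(anchors[i])  # pinned by feedback
            else:
                window = vals[max(0, i - 1): i + 2]  # self plus neighbors
                new.append(sum(window) / len(window))
        vals = new
    return vals


def mean_gap(values):
    """Average distance from the human value."""
    return sum(abs(v - HUMAN) for v in values) / len(values)


feedback = {0: HUMAN, 10: HUMAN, 19: HUMAN}  # three pinned patches

unanchored = smooth(alien_prior, anchors={})             # smoothing alone
partial = smooth(alien_prior, anchors=feedback, steps=10)  # a little smoothing
anchored = smooth(alien_prior, anchors=feedback)         # lots of smoothing

print(f"no feedback:        {mean_gap(unanchored):.3f}")
print(f"feedback, 10 steps: {mean_gap(partial):.3f}")
print(f"feedback, 2000:     {mean_gap(anchored):.3f}")
```

In this toy, smoothing without any pinned entries just makes the alien prior self-consistent at an alien value, while a few pinned patches are enough for the smoothing to drag everything toward the human value. The toy makes that outcome easy because pinned entries can never be overwritten; whether real human feedback holds up like that under consistency training is exactly the open question.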
