Scalable oversight
Scalable oversight is a research problem in artificial intelligence safety concerned with maintaining effective human supervision of artificial intelligence (AI) systems as those systems become more capable than their human supervisors. The central challenge is that standard reinforcement learning from human feedback (RLHF) requires human evaluators to reliably judge AI outputs; as AI capabilities increase, humans may struggle to evaluate complex or subtle outputs accurately.
The problem was formalized by researchers at DeepMind, OpenAI, and Anthropic beginning in 2018.[1]
Background
As AI systems are trained on increasingly complex tasks, human evaluation of their outputs becomes a bottleneck. A highly capable AI model may produce outputs that appear correct to human supervisors but contain subtle errors that the AI could detect but human evaluators cannot. This creates the scalable oversight problem: if humans cannot reliably judge whether an AI's behavior is correct or safe, training that behavior using human feedback becomes unreliable.
The problem is distinct from but related to AI alignment in general. Scalable oversight specifically concerns the supervisory process—how to maintain reliable evaluation of AI behavior as capability increases—rather than alignment in its full generality.
Approaches
Several technical approaches have been proposed to address the scalable oversight problem.
Iterated amplification
Iterated amplification, also known as iterated distillation and amplification (IDA), is an approach in which a human overseer is paired with an AI assistant to evaluate tasks that the human could not evaluate alone.[2] Complex tasks are decomposed into simpler subtasks that fall within the human's evaluation competence. The AI assistant is then trained iteratively, with each round of training using the human-plus-assistant combination as the supervisor for a more capable model. Proponents argue that repeating this cycle can extend reliable oversight to tasks of arbitrary complexity.
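The decompose-and-delegate loop described above can be illustrated with a toy sketch. Everything here is a stand-in, not the published method: nested addition replaces real tasks, and memoisation replaces training a neural network by imitation.

```python
# Toy sketch of iterated amplification (illustrative; nested-addition tasks
# and memoisation stand in for real tasks and neural-network distillation).

def human(a, b):
    """Overseer competence is limited to one simple step: adding two numbers."""
    return a + b

def amplify(question, assistant):
    """Human plus assistant: decompose the task, delegate subtasks, combine."""
    if isinstance(question, int):          # leaf: nothing left to decompose
        return question
    left, right = question                 # a task is a pair of subtasks
    return human(assistant(left), assistant(right))

def distill(assistant, training_tasks):
    """Imitate the amplified overseer; a lookup table stands in for training."""
    answers = {t: amplify(t, assistant) for t in training_tasks}
    return lambda q: answers[q] if q in answers else amplify(q, assistant)

# One round of amplify-then-distill on a task the human could not do directly.
novice = lambda q: amplify(q, novice)      # bootstrap: recurse all the way down
task = ((1, 2), (3, (4, 5)))
model = distill(novice, [task])
print(model(task))                         # 15
```

The key structural point is that `human` only ever performs a step it is competent to judge; all complexity is absorbed by decomposition and delegation to the current model.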
AI safety via debate
AI safety via debate is an approach in which two AI agents argue opposing positions on a question, with a human evaluator judging which argument is more convincing.[3] The approach assumes that it is easier for a human to judge which of two competing arguments is correct than to answer the question unaided. If winning strategies in the debate game favor truthful and relevant claims, then training agents to win debates may produce behavior more reliably aligned with human values.
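A standard toy illustration of this judging asymmetry (an assumption of this sketch, not code from the paper) is a question the judge cannot answer directly but where checking the debaters' cited evidence suffices: asked for the largest element of a list the judge cannot scan, the honest debater can always win by citing the true maximum, since it beats any element the other side cites.

```python
# Toy sketch of AI safety via debate (illustrative; simple functions stand in
# for trained agents and a human judge). The question is "which element of a
# hidden list is largest?"; the judge only inspects the two cited positions.

def honest_debater(xs):
    return max(range(len(xs)), key=lambda i: xs[i])   # cites the true maximum

def dishonest_debater(xs):
    return min(range(len(xs)), key=lambda i: xs[i])   # cites a wrong element

def judge(xs, claim_a, claim_b):
    """Limited judge: cannot scan xs, but can compare two cited elements."""
    return "A" if xs[claim_a] >= xs[claim_b] else "B"

xs = [3, 1, 4, 1, 5, 9, 2, 6]
winner = judge(xs, honest_debater(xs), dishonest_debater(xs))
print(winner)   # "A": the truthful citation wins under limited oversight
```

Checking one comparison is far cheaper than finding the maximum, which is the asymmetry the approach relies on: verifying an argument is easier than producing the answer.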
Recursive reward modeling
Recursive reward modeling (RRM) is an approach in which human feedback is used to train a reward model, which is then used to assist humans in evaluating more complex tasks, with the process applied recursively.[1] The method was proposed by Jan Leike and colleagues at DeepMind as a practical framework for extending human oversight to tasks that exceed direct human evaluation capacity.
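The recursion can be sketched schematically. In this illustrative toy (not the published method), nesting depth stands in for task complexity, the human can only score a plain number directly, and a closure over the assisted-labelling procedure stands in for a learned reward model.

```python
# Toy sketch of recursive reward modeling (illustrative; nested lists stand in
# for tasks of increasing complexity, closures for learned reward models).

def human_score(outcome, assistant=None):
    """The human scores a plain number directly; for anything richer they rely
    on an assistant (the previous reward model) to score the parts."""
    if isinstance(outcome, (int, float)):
        return float(outcome)
    assert assistant is not None, "too complex for unaided human judgment"
    return sum(assistant(part) for part in outcome) / len(outcome)

def fit_reward_model(examples, assistant=None):
    """'Train' a reward model on human(+assistant) judgments; wrapping the
    labelling procedure stands in for supervised learning on `examples`."""
    return lambda outcome: human_score(outcome, assistant)

# Level 0: direct human feedback. Level 1: human assisted by the level-0 model.
rm0 = fit_reward_model(examples=[1, 2, 3])
rm1 = fit_reward_model(examples=[[1, 2], [3, 4]], assistant=rm0)
print(rm1([2, 4, 6]))   # 4.0: scores a task the human could not judge alone
```

Each level of reward model extends the human's evaluation reach by one step of complexity, which is the recursive structure the approach proposes.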
Weak-to-strong generalization
Weak-to-strong generalization is an empirical research direction introduced by Collin Burns and colleagues at OpenAI in 2023 as a tractable analogue of the scalable oversight problem.[4] In this framework, a weaker AI model is used to supervise a stronger AI model, measuring whether the stronger model generalizes beyond what the weaker supervisor can directly verify. The work serves as a proxy for the human-AI oversight problem: if a weak model can elicit capabilities from a strong model, similar techniques might allow human supervisors to maintain effective oversight of highly capable AI systems.
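The phenomenon being measured can be illustrated with a deliberately tiny example (an illustration of the idea, not the paper's experimental setup): a "strong" model whose hypothesis class matches the true task is fit to labels from a "weak" supervisor that errs systematically on hard inputs, and ends up more accurate than its supervisor.

```python
# Toy sketch of weak-to-strong generalization (illustrative). The true task is
# "is x positive?". The weak supervisor is wrong whenever |x| <= 3; the strong
# model can only represent monotone threshold rules, so fitting it to the weak
# labels smooths over the supervisor's systematic errors.

xs = list(range(-10, 11))
truth = {x: x > 0 for x in xs}
weak = {x: (x > 0) if abs(x) > 3 else (x <= 0) for x in xs}  # flipped when hard

def accuracy(pred):
    return sum(pred[x] == truth[x] for x in xs) / len(xs)

# Fit the strong model's single parameter to imitate the weak labels.
best_t = min(xs, key=lambda t: sum((x > t) != weak[x] for x in xs))
strong = {x: x > best_t for x in xs}

print(accuracy(weak), accuracy(strong))   # the strong model exceeds its supervisor
```

The strong model cannot reproduce the weak supervisor's non-monotone mistakes, so imitation recovers a rule closer to the truth; measuring how much of this gap closes in realistic settings is the empirical question the research direction studies.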
Relationship to other approaches
Scalable oversight is related to Constitutional AI, an approach developed at Anthropic in which AI-generated feedback is used in place of human feedback during training, addressing a similar problem of scaling oversight beyond direct human evaluation. It is also related to reinforcement learning from human feedback (RLHF), which scalable oversight methods aim to extend to settings where direct human feedback is unreliable.
References
1. Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (2018). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871 [cs.LG].
2. Christiano, Paul; Shlegeris, Buck; Amodei, Dario (2018). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575 [cs.LG].
3. Irving, Geoffrey; Christiano, Paul; Amodei, Dario (2018). "AI safety via debate". arXiv:1805.00899 [cs.AI].
4. Burns, Collin; Izmailov, Pavel; Kirchner, Jan Hendrik; Baker, Bowen; Gao, Leo; Aschenbrenner, Leopold; Chen, Yining; Ecoffet, Adrien; Joglekar, Manas; Leike, Jan; Sutskever, Ilya; Wu, Jeff (2023). "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision". arXiv:2312.09390 [cs.LG].