Scalable oversight

Scalable oversight is a research problem in artificial intelligence safety concerned with maintaining effective human supervision of artificial intelligence (AI) systems as those systems become more capable than their human supervisors. The central challenge is that standard reinforcement learning from human feedback (RLHF) requires human evaluators to reliably judge AI outputs; as AI capabilities increase, humans may struggle to evaluate complex or subtle outputs accurately.

The problem was formalized by researchers at DeepMind and OpenAI beginning in 2018,[1] and has since been studied at laboratories including Anthropic.

Background

As AI systems are trained on increasingly complex tasks, human evaluation of their outputs becomes a bottleneck. A highly capable AI model may produce outputs that appear correct to human supervisors but contain subtle errors that the AI could detect but human evaluators cannot. This creates the scalable oversight problem: if humans cannot reliably judge whether an AI's behavior is correct or safe, training that behavior using human feedback becomes unreliable.

The problem is distinct from but related to AI alignment in general. Scalable oversight specifically concerns the supervisory process—how to maintain reliable evaluation of AI behavior as capability increases—rather than alignment in its full generality.

Approaches

Several technical approaches have been proposed to address the scalable oversight problem.

Iterated amplification

Iterated amplification, also described as iterated distillation and amplification (IDA), is an approach in which a human overseer is paired with an AI assistant to evaluate tasks that the human could not evaluate alone.[2] Complex tasks are decomposed into simpler subtasks that fall within the human's evaluation competence. An AI assistant is then trained iteratively, with each round of training using the human-plus-assistant combination as the supervisor for a more capable AI model. Proponents argue that repeating this amplification and distillation process can extend reliable oversight to increasingly complex tasks.
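
The control flow of this loop can be sketched in Python as follows; the helper names (decompose, amplify, distill) and the lookup-table distillation step are illustrative stand-ins rather than the method of Christiano et al.:

    # Illustrative sketch of the iterated-amplification loop. All names are
    # hypothetical stand-ins, not the implementation from the cited paper.
    from typing import Callable, List, Tuple

    Question = str
    Answer = str

    def decompose(question: Question) -> List[Question]:
        """Split a hard question into subquestions the human can evaluate
        directly (toy stand-in: treat comma-separated clauses as subquestions)."""
        return [part.strip() for part in question.split(",")]

    def amplify(human_combine: Callable[[Question, List[Answer]], Answer],
                assistant: Callable[[Question], Answer],
                question: Question) -> Answer:
        """The human-plus-assistant team: the assistant answers the subquestions,
        and the human combines the sub-answers into an overall answer."""
        sub_answers = [assistant(sub) for sub in decompose(question)]
        return human_combine(question, sub_answers)

    def distill(examples: List[Tuple[Question, Answer]]) -> Callable[[Question], Answer]:
        """Train a new assistant to imitate the amplified overseer
        (toy stand-in: memorise the question-answer pairs)."""
        table = dict(examples)
        return lambda q: table.get(q, "")

    def iterated_amplification(questions: List[Question],
                               human_combine, initial_assistant, rounds: int = 3):
        assistant = initial_assistant
        for _ in range(rounds):
            # Each round, the human plus the current assistant supervise the
            # training of the next, more capable assistant.
            examples = [(q, amplify(human_combine, assistant, q)) for q in questions]
            assistant = distill(examples)
        return assistant

In practice the distillation step would be supervised training of a large model rather than a lookup table; the sketch only shows how the amplified overseer of one round supervises the assistant of the next.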

AI safety via debate

AI safety via debate is an approach in which two AI agents argue opposing positions on a question, with a human evaluator judging which argument is more convincing.[3] The approach assumes that it is easier for a human to judge which of two competing arguments is correct than to produce a correct answer independently. The hypothesis is that the winning strategy in such debates is to make truthful and relevant claims, so that agents trained to win debates produce behavior more reliably aligned with human values.
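
A minimal sketch of the debate protocol, assuming hypothetical agent_a, agent_b, and judge callables rather than the setup of Irving et al., might look as follows:

    # Minimal sketch of a debate: two agents take turns adding claims to a
    # shared transcript, and a (human) judge then picks the more convincing side.
    from typing import Callable, List, Tuple

    def run_debate(question: str,
                   agent_a: Callable[[str, List[str]], str],
                   agent_b: Callable[[str, List[str]], str],
                   judge: Callable[[str, List[str]], int],
                   num_rounds: int = 3) -> Tuple[int, List[str]]:
        transcript: List[str] = []
        for _ in range(num_rounds):
            transcript.append("A: " + agent_a(question, transcript))
            transcript.append("B: " + agent_b(question, transcript))
        winner = judge(question, transcript)  # 0 if agent A wins, 1 if agent B wins
        return winner, transcript

During training, the judge's verdict on the transcript would serve as the reward signal for both agents.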

Recursive reward modeling

Recursive reward modeling (RRM) is an approach in which human feedback is used to train a reward model, which is then used to assist humans in evaluating more complex tasks, with the process applied recursively.[1] The method was proposed by Jan Leike and colleagues at DeepMind as a practical framework for extending human oversight to tasks that exceed direct human evaluation capacity.
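
The structure of the recursion can be sketched as follows, with hypothetical helpers (collect_feedback, train_reward_model, train_agent) standing in for the components described in the paper:

    # Schematic of recursive reward modeling: at each level, human feedback
    # (gathered with help from the previous agent) trains a reward model, which
    # in turn trains an agent that assists the human at the next, harder level.
    from typing import Callable, List, Optional, Tuple

    def collect_feedback(human: Callable, assistant: Optional[Callable],
                         tasks: List[str]) -> List[Tuple[str, float]]:
        """The human scores behaviour on each task, assisted by the current agent."""
        return [(task, human(task, assistant)) for task in tasks]

    def train_reward_model(feedback: List[Tuple[str, float]]) -> Callable[[str], float]:
        """Fit a reward model to the human feedback (toy stand-in: a lookup table)."""
        table = dict(feedback)
        return lambda task: table.get(task, 0.0)

    def train_agent(reward_model: Callable[[str], float]) -> Callable:
        """Optimise an agent against the learned reward (toy stand-in: the agent
        simply reports the reward model's score when asked to assist)."""
        return lambda task, _unused=None: reward_model(task)

    def recursive_reward_modeling(human: Callable, task_levels: List[List[str]]):
        assistant = None  # at the first level the human evaluates unassisted
        for tasks in task_levels:  # progressively harder task sets
            feedback = collect_feedback(human, assistant, tasks)
            reward_model = train_reward_model(feedback)
            assistant = train_agent(reward_model)  # assists at the next level
        return assistant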

Weak-to-strong generalization

Weak-to-strong generalization is an empirical research direction introduced by Collin Burns and colleagues at OpenAI in 2023 as a tractable analogue of the scalable oversight problem.[4] In this framework, a weaker AI model supervises the training of a stronger AI model, and researchers measure whether the stronger model generalizes beyond what the weak supervisor can directly verify. The setup serves as a proxy for the human-AI oversight problem: if weak supervision can elicit the capabilities of a strong model, similar techniques might allow human supervisors to maintain effective oversight of highly capable AI systems.
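
A toy analogue of this setup can be written with standard machine-learning tools; the sketch below uses scikit-learn with an arbitrary synthetic dataset, split sizes, and model choices, and only illustrates the weak-supervisor/strong-student structure, not the experiments of Burns et al.:

    # Toy weak-to-strong setup: a small "weak" model trained on limited ground
    # truth labels data for a larger "strong" model, and both are then scored
    # against held-out ground truth.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=4000, n_features=20, n_informative=15,
                               random_state=0)
    X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=0.25,
                                                      random_state=0)
    X_strong, X_test, _, y_test = train_test_split(X_rest, y_rest, train_size=0.5,
                                                   random_state=0)

    # "Weak supervisor": a small model trained on a limited slice of ground truth.
    weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)

    # The strong student never sees ground-truth labels, only the weak model's labels.
    weak_labels = weak.predict(X_strong)
    strong = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                           random_state=0).fit(X_strong, weak_labels)

    print("weak supervisor accuracy:", weak.score(X_test, y_test))
    print("strong student accuracy: ", strong.score(X_test, y_test))

Weak-to-strong generalization is said to occur when the strong student's accuracy exceeds that of the weak supervisor whose labels it was trained on.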

Relationship to other approaches

Scalable oversight is related to Constitutional AI, an approach developed at Anthropic in which AI-generated feedback is used in place of human feedback during training, addressing a similar problem of scaling oversight beyond direct human evaluation. It is also related to reinforcement learning from human feedback (RLHF), which scalable oversight methods aim to extend to settings where direct human feedback is unreliable.

References

  1. Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane (2018). "Scalable agent alignment via reward modeling: a research direction". arXiv:1811.07871 [cs.LG].
  2. Christiano, Paul; Shlegeris, Buck; Amodei, Dario (2018). "Supervising strong learners by amplifying weak experts". arXiv:1810.08575 [cs.LG].
  3. Irving, Geoffrey; Christiano, Paul; Amodei, Dario (2018). "AI safety via debate". arXiv:1805.00899 [cs.AI].
  4. Burns, Collin; Izmailov, Pavel; Kirchner, Jan Hendrik; Baker, Bowen; Gao, Leo; Aschenbrenner, Leopold; Chen, Yining; Ecoffet, Adrien; Joglekar, Manas; Leike, Jan; Sutskever, Ilya; Wu, Jeff (2023). "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision". arXiv:2312.09390 [cs.LG].