How might we optimize production-scale RLHF pipelines to reward honesty, uncertainty, clarification, and critique?

Today, models often:

⤷ Tell us what we want to hear, agree with and reinforce our views, and praise us
⤷ Project overconfidence even when goals are ambiguous or flawed
⤷ Reflect back superficial preferences (reward hacking)

In education, we use Socratic questioning: skillfully and intentionally asking a series of questions that challenge assumptions, explore ideas, and guide learners toward novel insights and deeper understanding through critical thinking and self-reflection.

What if we flipped the RLHF incentive structure and rewarded (see the sketch after this list):
⤷ Clear articulations of uncertainty
⤷ Insightful clarifying questions
⤷ Unprompted critique of assumptions, instructions, and options
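
A minimal sketch of what that flipped incentive might look like as reward shaping. Everything here is illustrative: the bonus weights, the regex heuristics, and the function name are assumptions, and a production pipeline would use trained classifiers or human labels rather than keyword matching.

```python
import re

# Illustrative shaping bonuses (hypothetical values, not tuned).
UNCERTAINTY_BONUS = 0.2
CLARIFY_BONUS = 0.3
CRITIQUE_BONUS = 0.3

def shaped_reward(base_score: float, response: str) -> float:
    """Add bonuses to a base reward-model score for honesty-adjacent behaviors.

    The regex checks below are crude stand-ins for real detectors of
    uncertainty, clarifying questions, and unprompted critique.
    """
    bonus = 0.0
    # Calibrated uncertainty: hedging language rather than flat assertions.
    if re.search(r"\b(i'm not sure|uncertain|it depends|low confidence)\b",
                 response, re.IGNORECASE):
        bonus += UNCERTAINTY_BONUS
    # Clarifying questions before committing to an answer.
    if re.search(r"\b(could you clarify|before i proceed|which of these)\b.*\?",
                 response, re.IGNORECASE):
        bonus += CLARIFY_BONUS
    # Unprompted critique of assumptions or instructions.
    if re.search(r"\b(one assumption here|a risk with this approach|you may want to reconsider)\b",
                 response, re.IGNORECASE):
        bonus += CRITIQUE_BONUS
    return base_score + bonus

# Example: a response that asks a clarifying question earns a shaped bonus.
print(shaped_reward(0.5, "Before I proceed, could you clarify which dataset you mean?"))
```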

To explore this idea, try appending the examples below to your prompts (a small snippet showing how follows the list):
💬 “If the answer to any question would meaningfully increase output quality, ask the question before proceeding”
💬 “Optionally include any solution not presented that would better {solve this problem, effectuate the desired outcome, produce the requested output}”
💬 “Optionally, apply your knowledge and expertise to convey helpful skepticism warranting {solving a different problem, […etc]}”
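
A tiny sketch of appending one of these suffixes programmatically. The suffix text comes from the first example above; the helper name is hypothetical.

```python
# Suffix taken from the first example prompt above.
CLARIFY_SUFFIX = (
    "If the answer to any question would meaningfully increase output quality, "
    "ask the question before proceeding."
)

def with_clarification_request(prompt: str) -> str:
    """Append the clarification-seeking instruction to an existing prompt."""
    return f"{prompt}\n\n{CLARIFY_SUFFIX}"

print(with_clarification_request("Draft a migration plan for our billing service."))
```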

See also: process supervision, Constitutional AI, truthfulness benchmarks, and debate-style training.

#AI #MachineLearning #Alignment #RLHF #ResponsibleAI #Uncertainty