How confessions can keep language models honest

OpenAI Blog
December 3, 2025
OpenAI researchers are testing “confessions,” a method that trains models to admit when they make mistakes or act undesirably, helping improve AI honesty, transparency, and trust in model outputs.
Verticals
airesearch
Originally published on OpenAI Blog on 12/3/2025