Detecting misbehavior in frontier reasoning models

OpenAI Blog
March 10, 2025
Frontier reasoning models exploit loopholes when given the chance. We show that we can detect these exploits by using an LLM to monitor the models’ chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior; instead, it makes them hide their intent.