Why SWE-bench Verified no longer measures frontier coding capabilities
OpenAI Blog
February 23, 2026
AI-Generated Deep Dive Summary
Since its release in August 2024, SWE-bench Verified has been a key metric for assessing progress on autonomous software engineering tasks. However, recent analysis by OpenAI reveals significant flaws that render the benchmark unsuitable for measuring frontier model capabilities at today’s performance levels. It now suffers from two major issues: flawed test cases that reject correct solutions, and training data contamination, where models have been exposed to benchmark problems during training. As a result, improvements on SWE-bench Verified no longer reflect genuine advances in real-world software development ability; instead they correlate with how much exposure a model has had to the benchmark itself.
The first issue stems from flawed test cases. An audit of a subset of the dataset found that 59.4% of the audited problems have test cases that reject functionally correct solutions, despite earlier efforts to improve the evaluation process. Models can therefore fail tasks not because they lack capability but because the tests themselves are faulty.
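To make this failure mode concrete, here is a minimal, hypothetical sketch; the function, test, and error message are invented for illustration and are not drawn from SWE-bench. The issue asks that invalid input be rejected, the model’s patch does exactly that, but the benchmark’s test pins the exact wording of the error message, so a functionally correct fix is scored as a failure.

```python
# Hypothetical sketch (not taken from SWE-bench) of a brittle test rejecting
# a functionally correct patch: the fix validates input as requested, but the
# test asserts on the gold patch's exact error message.

def allocate_buffer(size: int) -> bytearray:
    """A model's fix: reject negative sizes before allocating."""
    if size < 0:
        # Functionally correct, but worded differently from the gold patch.
        raise ValueError(f"size must be non-negative, got {size}")
    return bytearray(size)


def test_rejects_negative_size():
    try:
        allocate_buffer(-1)
    except ValueError as exc:
        # Brittle assertion: only the gold patch's exact message passes.
        assert str(exc) == "negative size: -1", f"unexpected message: {exc}"
    else:
        assert False, "expected ValueError"


if __name__ == "__main__":
    test_rejects_negative_size()  # Raises AssertionError even though the fix is correct.
```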
The second issue arises from training data contamination. SWE-bench problems are sourced from open-source repositories widely used for model training. As a result, frontier models have likely encountered some of these problems and their solutions during training, giving them an unfair advantage in the benchmark. This creates a situation where models perform better not because they genuinely understand the task but because they’ve seen similar problems before.
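One common way researchers screen for this kind of leakage is to look for long verbatim overlaps between benchmark problem statements and the training corpus. The sketch below is a generic, hypothetical illustration of that idea; the function names and the 13-gram threshold are assumptions, not OpenAI’s actual methodology.

```python
# Minimal, hypothetical contamination screen: flag a benchmark problem whose
# text shares a long word n-gram with any training document. A generic
# illustration only, not OpenAI's actual methodology.

from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text` (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(problem_text: str, training_docs: Iterable[str], n: int = 13) -> bool:
    """Flag the problem if any training document shares an n-gram with it."""
    problem_grams = ngrams(problem_text, n)
    return any(problem_grams & ngrams(doc, n) for doc in training_docs)


if __name__ == "__main__":
    issue = ("TypeError raised when calling DataFrame.merge "
             "with a categorical key column ") * 2
    corpus = [
        "unrelated documentation about sorting algorithms",
        issue,  # The issue text itself appears verbatim in the corpus.
    ]
    print(is_contaminated(issue, corpus))  # True: the problem leaked into training data.
```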
These flaws make SWE-bench Verified an unreliable measure of progress for today’s advanced AI systems. OpenAI has stopped reporting scores on the benchmark and recommends that others do the same. The company is now developing new, uncontaminated evaluations to better assess coding capabilities, which it considers a critical focus area for the AI research community.
This shift underscores the importance of robust evaluation frameworks in AI research. As models grow more powerful, ensuring benchmarks accurately measure their true capabilities becomes increasingly crucial for meaningful progress and responsible innovation in artificial intelligence.