OpenAI Says Benchmark Used to Measure AI Coding Skill Is 'Contaminated'—Here's Why

Decrypt

by Jose Antonio Lanz

February 24, 2026

AI-Generated Deep Dive Summary

OpenAI has announced that SWE-bench Verified, its widely used benchmark for measuring AI coding skill, is "contaminated" and no longer reliable. The company is retiring the benchmark and replacing it with SWE-bench Pro, a tougher alternative designed to address significant flaws in the original. Scores fall dramatically on the new benchmark, from around 70% to just 23%, indicating that many AI models were overperforming on the old test because of data leaks and flawed task design.

The issue stems from how SWE-bench Verified was constructed. It relied on tasks drawn from open-source repositories that most AI models had already been trained on, so models were exposed to the solutions during training. This contamination allowed models like GPT-5.2 and Claude Opus 4.5 to recall exact fixes they had encountered before, including specific variable names and code details not mentioned in the problem descriptions. For example, one model drew on internal knowledge of a Django release note to solve a task, even though the problem statement never mentioned that detail.

The shift to SWE-bench Pro aims to reset the benchmarking standard with more rigorous tasks and less data leakage. The new benchmark uses diverse codebases and licensing agreements to minimize prior exposure, resulting in significantly lower scores across major AI models.

OpenAI’s decision to retire SWE-bench Verified comes at a critical time for the AI industry, particularly in crypto and web3, where reliable coding benchmarks are essential for assessing AI-driven software development tools. For readers interested in crypto and decentralized technologies, this development highlights the importance of accurate benchmarking for AI systems used in blockchain projects. The limitations of SWE-bench Verified underscore the need for more robust evaluation methods to ensure AI models can truly perform real-world


Originally published on Decrypt on 2/24/2026