Introducing EVMbench

OpenAI Blog

February 18, 2026

AI-Generated Deep Dive Summary

OpenAI and Paradigm have introduced EVMbench, a benchmark that evaluates how well AI agents can detect, patch, and exploit high-severity smart contract vulnerabilities. The benchmark is timely because AI systems increasingly interact with code, which raises the stakes for their performance in economically significant environments. EVMbench draws on 120 curated vulnerabilities from 40 audits, including scenarios from the Tempo blockchain's security auditing process, which focuses on payment-oriented smart contracts and stablecoin payments, an area of growing importance in crypto.

The benchmark evaluates AI agents in three modes: detect, patch, and exploit. In detect mode, agents are scored on the vulnerabilities they identify and on the audit rewards associated with those vulnerabilities. Patch mode tests whether agents can modify vulnerable contracts while preserving functionality, so that a fix does not break the code. Exploit mode challenges agents to execute end-to-end attacks against deployed contracts in a sandboxed blockchain environment. The evaluation framework is supported by a Rust-based harness that deploys contracts, replays transactions deterministically, and restricts unsafe RPC methods.

Initial results show substantial progress. GPT-5.3-Codex scored 72.2% in exploit mode, a large improvement over earlier models such as GPT-5, which scored 31.9%. However, performance in detect and patch modes remains limited, with agents often failing to achieve full coverage of vulnerabilities. This gap underscores how unevenly AI capabilities are developing across different cybersecurity tasks.

EVMbench has notable limitations, but they do not diminish its value as a tool for advancing AI research in cybersecurity. While it does not fully replicate real-world smart contract security challenges, it provides a robust framework for testing and improving AI-driven auditing and exploitation techniques. For AI enthusiasts, EVMbench offers critical insights into the evolving capabilities of gener
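
To make the detect-mode scoring described in the summary more concrete, here is a minimal Python sketch. It is not the EVMbench harness (which the summary describes as Rust-based), and the summary does not give the exact scoring formula. The sketch assumes two simple measures: the fraction of ground-truth vulnerabilities an agent reports, and the fraction of the total audit reward those findings represent. The names `Vulnerability`, `detect_score`, the finding IDs, and the reward amounts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    vuln_id: str        # hypothetical identifier for a ground-truth finding
    reward_usd: float   # hypothetical audit reward attached to that finding


def detect_score(ground_truth, reported_ids):
    """Return (recall, reward_fraction) for one audited repository.

    Assumed scoring, for illustration only:
      recall          = matched findings / all ground-truth findings
      reward_fraction = reward of matched findings / total reward
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    reported = set(reported_ids)
    # Reports that match no ground-truth ID are simply ignored here.
    found = [v for v in ground_truth if v.vuln_id in reported]
    recall = len(found) / len(ground_truth)
    total_reward = sum(v.reward_usd for v in ground_truth)
    found_reward = sum(v.reward_usd for v in found)
    reward_fraction = found_reward / total_reward if total_reward else 0.0
    return recall, reward_fraction


# Made-up example: three ground-truth findings, one of which is reported.
truth = [
    Vulnerability("H-01", 6000.0),
    Vulnerability("H-02", 3000.0),
    Vulnerability("H-03", 1000.0),
]
recall, reward_fraction = detect_score(truth, ["H-01", "H-99"])
print(f"recall={recall:.3f} reward_fraction={reward_fraction:.3f}")
# prints: recall=0.333 reward_fraction=0.600
```

In this made-up case the agent finds one of three vulnerabilities but captures 60% of the reward, which shows how a reward-weighted measure can diverge from plain coverage when findings differ in value.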


Originally published on OpenAI Blog on 2/18/2026