Why GPT-5 and Claude Flop on SWE-Bench Pro: An In-Depth Analysis
SWE-Bench Pro Results Overview

[Figure: Resolve rate by model, Public Set (N=731). Models shown: OpenAI GPT-5, Claude Opus 4.1, Claude Sonnet 4, Gemini 2.5 Pro Preview, SWE-Smith-32B, …]

Top scores remain below 25 percent, which highlights the difficulty of the benchmark.