SWE-bench Verified is becoming less reliable for measuring frontier coding capabilities due to contamination. SWE-bench Pro is recommended as a more accurate alternative for assessing models on autonomous software engineering tasks.
openai.com
9 min
7h ago