
rdi.berkeley.edu
April 11, 2026
Summary
Automated scanning shows that top AI models often post high benchmark scores that do not accurately reflect their capabilities. Reliance on these benchmarks has led to model performance being misrepresented across the AI industry.

DeepSWE: A contamination-free benchmark for long-horizon coding agents
May 26, 2026

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed
Feb 12, 2026

We reproduced Anthropic's Mythos findings with public models
Apr 17, 2026

We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them
Feb 22, 2026

Why SWE-bench Verified no longer measures frontier coding capabilities
Apr 26, 2026