#2
★ Gold
ARC Prize Foundation launched ARC-AGI-3 with $2M+ in prizes across three competition tracks. It's the first interactive benchmark where agents must learn game rules with zero instructions. The best AI agent scored 12.58% and frontier LLMs under 1%, while humans score 100%. All solutions must be open-sourced, and no external APIs are allowed during evaluation.
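To make the "zero instructions" setup concrete, here's a minimal sketch of what an agent loop against an interactive, instruction-free game environment might look like. The `GameEnv` interface, action space, and scoring below are hypothetical placeholders for illustration, not the actual ARC-AGI-3 API.

```python
import random

class GameEnv:
    """Hypothetical interactive game: the agent gets only opaque frames and a sparse score."""

    def __init__(self, n_actions: int = 6):
        self.n_actions = n_actions
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"frame": [[0] * 8 for _ in range(8)]}  # opaque observation, no rules given

    def step(self, action: int):
        self.steps += 1
        done = self.steps >= 50            # placeholder termination condition
        reward = 1.0 if done else 0.0      # sparse, unexplained scoring signal
        return {"frame": [[0] * 8 for _ in range(8)]}, reward, done


def play_one_game(env: GameEnv) -> float:
    """With zero instructions, the baseline agent can only probe: here, a random policy."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = random.randrange(env.n_actions)  # agent has no prior idea what actions mean
        obs, reward, done = env.step(action)
        total += reward
    return total


if __name__ == "__main__":
    print(play_one_game(GameEnv()))
```

A real agent would have to infer the game's mechanics from how the frames respond to its actions, which is exactly the skill the benchmark isolates.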
#5
★ Gold
MiniMax released OctoCodingBench, measuring process compliance (naming conventions, safety rules, workflow specs) rather than just outcome correctness. Top models achieve 80%+ on individual checks but only 10-30% when all constraints must be satisfied simultaneously — exposing a massive gap between task completion and production-grade behavior.
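The collapse from 80%+ per check to 10-30% overall is roughly what you'd expect if individual checks failed independently: the joint pass rate is the product of the per-check rates. A quick back-of-the-envelope sketch (the per-check rate and check counts are illustrative, not OctoCodingBench figures):

```python
# If each of k process checks passes ~80% of the time and failures are
# roughly independent, the chance of satisfying all k at once is 0.8**k.
per_check = 0.80
for k in (3, 5, 8, 10):
    print(f"{k} checks -> all-pass rate {per_check**k:.0%}")
# 3 -> 51%, 5 -> 33%, 8 -> 17%, 10 -> 11%: squarely in the reported 10-30% band.
```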
#10
ClawWork released an open-source economic competition benchmark: 220 professional tasks across 44 job categories, each agent starting with $10 in a simulated economy. Claude Opus 4 generated $19,915 in 8 hours. Full leaderboard and benchmark code are public.
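For scale, the headline result works out as follows. This is just arithmetic on the figures above (treating $19,915 as gross earnings), not part of the benchmark's scoring code.

```python
# Back-of-the-envelope on the reported ClawWork result.
starting_capital = 10.00    # each agent's seed money, in dollars
earnings = 19_915.00        # Claude Opus 4's reported earnings over the run
hours = 8

print(f"hourly rate:    ${earnings / hours:,.2f}/h")        # $2,489.38/h
print(f"return on seed: {earnings / starting_capital:,.0f}x")  # ~1,992x the $10 stake
```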