#2
★ Gold
ARC Prize Foundation launched ARC-AGI-3 with $2M+ in prizes across three competition tracks. It's the first interactive benchmark where agents must learn game rules with zero instructions. The best AI agent scored 12.58% and frontier LLMs under 1%, while humans score 100%. All solutions must be open-sourced, and no external APIs are allowed during evaluation.
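To make the "zero instructions" setup concrete, here's a minimal sketch of what an agent loop against an interactive, instruction-free game environment might look like. The `GameEnv` interface, action space, and scoring below are hypothetical placeholders for illustration, not the actual ARC-AGI-3 API.

```python
import random

class GameEnv:
    """Hypothetical interactive game: the agent gets only opaque frames and a sparse score."""

    def __init__(self, n_actions: int = 6):
        self.n_actions = n_actions
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"frame": [[0] * 8 for _ in range(8)]}  # opaque observation, no rules given

    def step(self, action: int):
        self.steps += 1
        done = self.steps >= 50            # placeholder termination condition
        reward = 1.0 if done else 0.0      # sparse, unexplained scoring signal
        return {"frame": [[0] * 8 for _ in range(8)]}, reward, done


def play_one_game(env: GameEnv) -> float:
    """With zero instructions, the baseline agent can only probe: here, a random policy."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        action = random.randrange(env.n_actions)  # agent has no prior idea what actions mean
        obs, reward, done = env.step(action)
        total += reward
    return total


if __name__ == "__main__":
    print(play_one_game(GameEnv()))
```

A real agent would have to infer the game's mechanics from how the frames respond to its actions, which is exactly the skill the benchmark isolates.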
#5
★ Gold
MiniMax released OctoCodingBench, measuring process compliance (naming conventions, safety rules, workflow specs) rather than just outcome correctness. Top models achieve 80%+ on individual checks but only 10-30% when all constraints must be satisfied simultaneously — exposing a massive gap between task completion and production-grade behavior.
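The collapse from 80%+ per check to 10-30% overall is roughly what you'd expect if individual checks failed independently: the joint pass rate is the product of the per-check rates. A quick back-of-the-envelope sketch (the per-check rate and check counts are illustrative, not OctoCodingBench figures):

```python
# If each of k process checks passes ~80% of the time and failures are
# roughly independent, the chance of satisfying all k at once is 0.8**k.
per_check = 0.80
for k in (3, 5, 8, 10):
    print(f"{k} checks -> all-pass rate {per_check**k:.0%}")
# 3 -> 51%, 5 -> 33%, 8 -> 17%, 10 -> 11%: squarely in the reported 10-30% band.
```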
#10
ClawWork released an open-source economic competition benchmark: 220 professional tasks across 44 job categories, each agent starting with $10 in a simulated economy. Claude Opus 4 generated $19,915 in 8 hours. Full leaderboard and benchmark code are public.
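For scale, the headline result works out as follows. This is just arithmetic on the figures above (treating $19,915 as gross earnings), not part of the benchmark's scoring code.

```python
# Back-of-the-envelope on the reported ClawWork result.
starting_capital = 10.00    # each agent's seed money, in dollars
earnings = 19_915.00        # Claude Opus 4's reported earnings over the run
hours = 8

print(f"hourly rate:    ${earnings / hours:,.2f}/h")        # $2,489.38/h
print(f"return on seed: {earnings / starting_capital:,.0f}x")  # ~1,992x the $10 stake
```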