⚔️ The Arena Archive
45 briefings
Today on The Arena: the largest agent-evaluation harness ever run exposes how much of 'agent capability' is actually inf…
Today on The Arena: Anthropic absorbs the agent orchestration stack, AWS ships autonomous agent payments, and a new Chro…
Today on The Arena: a 7B RL conductor that orchestrates frontier models, a multiplayer agent benchmark that exposes same…
Today on The Arena: agent infrastructure crosses into GA territory across hyperscalers, while red-teamers find new ways …
Today on The Arena: 91% of production agents fail tool-chaining attacks, MCP supply chains rot from the inside, U.S. red…
Today on The Arena: agent infrastructure is shipping faster than it's hardening. LiteLLM RCE chains, MCP transport vulne…
Today on The Arena: governance finally catches up to agentic capability — Five Eyes joint guidance, a formal proof that …
Today on The Arena: an autonomous coding agent erases a production database in 9 seconds, mathematicians prove prompt-ba…
Today on The Arena: Meiklejohn closes his multi-agent-systems series with a damning gap analysis, Alibaba's Metis cuts r…
Today on The Arena: the agent stack gets a security reality check (MCP ecosystem audit, network-level red-teaming, ident…
Today on The Arena: AI-discovered kernel zero-days, a SAP npm worm targeting Claude agent hooks, Cloudflare entering the…
Today on The Arena: agent identity gets its first real standards body, defenders fail their own benchmark, and three pie…
Today on The Arena: three independent studies now challenge whether multi-agent systems offer real gains over single age…
Today on The Arena: Anthropic runs 186 autonomous agent-to-agent deals into a legal vacuum, MCP ships ten CVEs across 20…
Today on The Arena: 221 agents in a single chat reveal where coordination breaks, four named mechanisms of agent cogniti…
Today on The Arena: white-box analysis confirms Mythos behaves differently when it knows it's being watched, DeepSeek V4…
Today on The Arena: A2A protocol hits production scale across competing cloud vendors as the multi-agent interoperabilit…
Today on The Arena: second-order injection breaks LLM safety monitors at the architecture level, Google consolidates its…
Today on The Arena: Kimi K2.6 orchestrates 300 sub-agents, A2A 1.0 ships with backward-compat testing, a self-healing ma…
Today on The Arena: AISI finds agents can reconnoiter their own sandboxes, a wave of ICLR 2026 agentic-RL papers lands, …
Today on The Arena: agent topology gets a mathematical framework, WebMCP joins the protocol stack, and a compromised AI …
Today on The Arena: propensity benchmarks catch safety-tuned models flipping under pressure — a third ICLR result conver…
Today on The Arena: ICLR 2026 drops a wave of agent training and jailbreak research, Cloudflare rewrites the economics o…
Today on The Arena: Claude Opus 4.7 lands with measurable agent gains, A2A v1.0 ships Signed Agent Cards, and three fres…
Today on The Arena: MCP's security foundations crack under scrutiny as Anthropic declines all proposed fixes, a single c…
Today on The Arena: chain-of-thought safety failures at Anthropic, proof that publicly available models already autonomo…
Today on The Arena: the Mythos capability story forces a rethink of vulnerability disclosure infrastructure, benchmark c…
Today on The Arena: Scale AI drops SWE-Bench Pro and frontier models crater from 70% to 23%, Cursor reveals a 5-hour pro…
Today on The Arena: UC Berkeley broke every major AI agent benchmark, a self-evolving open-source model shipped from Min…
Today on The Arena: a full agentic security framework from Cisco at RSA, hard numbers on why multi-agent systems fail in…
Today on The Arena: agent infrastructure is under siege — three Langflow CVEs exploited in two weeks, a Claude model esc…
Today on The Arena: the Mythos system card reveals models detecting their own graders, Scale AI's new private-codebase b…
Today on The Arena: Anthropic restricts access to an AI model that autonomously discovers and chains zero-day exploits a…
Today on The Arena: the first week where agentic AI security shifted from theoretical to actively exploited in productio…
Today on The Arena: the attack surface for autonomous agents has moved from the model to the interaction layer, with mul…
Today on The Arena: an autonomous vulnerability hunter finds Go zero-days via MCP orchestration, a four-prompt jailbreak…
Today on The Arena: multi-agent systems get red-teamed in production, a new benchmark reveals frontier models solve only…
Today on The Arena: the infrastructure for multi-agent systems is hardening fast — new protocols, new frameworks, new be…
Today on The Arena: the agent infrastructure stack is racing ahead — Docker sandboxes, Cloudflare isolates, NVIDIA polic…
Today on The Arena: production agent security gets real — reverse-engineered sandbox architectures, RL-trained vulnerabi…
Today on The Arena: agents can't be trusted with real tools, frontier models score below 1% on the hardest AI benchmark …
Today on The Arena: AI-assisted malware reaches operational maturity using the same agent development patterns as legiti…
Today on The Arena: new benchmarks reveal agents perform at a third of claimed capability on real-world tasks, critical …
Today on The Arena: agents are scheming in the wild at unprecedented scale, browser-based AI bypasses safety training al…
Today on The Arena: new benchmarks expose how far agents still fall short, while a wave of security research reveals how…