May 2026 marks an inflection point for artificial intelligence. The pace of change continues to outrun expectations: models that topped benchmarks six months ago are now middle of the pack. Three dynamics dominate: autonomous agents entering production markets, China's rapid closing of the capability gap in coding and reasoning, and the commoditisation of inference, which is pushing costs down by another order of magnitude.
The UK's AI Security Institute (AISI) revealed that Anthropic's Claude Mythos Preview became the first model to clear its 32-step "The Last Ones" (TLO) corporate-network simulation, which runs from reconnaissance through full domain takeover. GPT-5.5 followed three weeks later, solving the simulation end to end in 2 of 10 attempts. AISI estimates that frontier cyber-offence capability is now doubling every four months, compared with every seven months at the end of 2025. Static-signature security vendors face an existential crisis; integrated XDR platforms (CrowdStrike, Palo Alto Networks) must pivot to AI-native architectures to survive.
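To make the two doubling times comparable, they can be converted into growth per year. The sketch below assumes smooth exponential growth, which is a simplification; `annual_multiplier` is an illustrative helper, not something AISI publishes.

```python
def annual_multiplier(doubling_months: float) -> float:
    """Growth factor over 12 months, assuming steady exponential growth."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2 ** (12 / doubling_months)


late_2025 = annual_multiplier(7)  # seven-month doubling time (end of 2025)
current = annual_multiplier(4)    # four-month doubling time (current estimate)

print(f"7-month doubling: ~{late_2025:.1f}x per year")
print(f"4-month doubling: ~{current:.1f}x per year")
```

Under that assumption, the shift from a seven-month to a four-month doubling time moves the yearly growth factor from roughly 3.3x to 8x.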
Four Chinese labs released open-weights coding models within 12 days: GLM-5.1 (Z.ai), MiniMax M2.7 (100+ rounds of self-optimising scaffold), Kimi K2.6 (Moonshot; ported an inference engine to Zig in a 12-hour tool-use trace), and DeepSeek V4. All are priced at under one-third the cost of Claude Opus 4.7. On aggregate benchmarks, V4 lags the leading US frontier models by ~8 months, but DeepSeek's own model card puts V4-Pro at parity with Opus 4.6 and GPT-5.4. For agentic coding specifically, the old "China is six to nine months behind" frame is no longer defensible.
The renegotiated Microsoft–OpenAI agreement drops the exclusive compute lock-in and AGI escape hatch. OpenAI secured the right to multi-source compute (codifying its shift to Oracle and CoreWeave). Microsoft retains a non-exclusive IP licence through 2032 and is aggressively shipping every frontier model on Foundry — including Anthropic's Opus 4.7 from day one. Sam Altman simultaneously published a "Superintelligence New Deal" manifesto calling for FDR-scale public-private AI build-outs, federal procurement guarantees, and a "Bureau of Compute."
OpenAI's self-serve Ads Manager went live, targeting $2.5B in ad revenue in 2026 and $100B annually by 2030. The platform sells inventory on CPM and CPC models, with integrations across Dentsu, Omnicom, WPP, and Publicis. OpenAI guarantees that ads will not influence organic ChatGPT outputs, a claim the market will scrutinise closely. The launch positions ChatGPT as a direct challenger to Google Search's primary revenue engine.
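The gap between the two revenue targets is easier to judge as a growth rate. The following sketch computes the compound annual growth rate implied by the figures above, assuming four full years of compounding between 2026 and 2030; `implied_cagr` is an illustrative helper.

```python
def implied_cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate needed to go from start to end."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must be positive")
    return (end / start) ** (1 / years) - 1


# Targets from the text, in billions of US dollars.
growth = implied_cagr(start=2.5, end=100.0, years=2030 - 2026)
print(f"Implied growth: ~{growth:.0%} per year")
```

On those assumptions, ad revenue would need to grow by roughly 150% a year, that is, multiply about 2.5x annually for four years running.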
Anthropic's Project Deal (69 employee-backed agents, 186 transactions, ~$4,000 traded) showed that Opus 4.5 agents systematically out-negotiate Haiku 4.5 counterparts, yet owners of the weaker agents did not notice their disadvantage. Meanwhile, on KellyBench (frontier models managing a bankroll across a 38-week Premier League season), every model finished in the red on average, and only 3 of 24 seed combinations avoided ruin. The lesson: bounded markets reward superior models; adversarial markets remain treacherous.
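For context on the bankroll problem, the classic Kelly criterion gives the stake that maximises long-run log growth for a bet with a known edge. The text does not describe how KellyBench scores or sizes bets, so the sketch below is only the textbook formula that the benchmark's name suggests; the probabilities and odds are hypothetical.

```python
def kelly_fraction(win_prob: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake under the Kelly criterion.

    decimal_odds is the total payout per unit staked (stake included),
    so the net odds are b = decimal_odds - 1 and f* = (p*b - q) / b.
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must be between 0 and 1")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must exceed 1")
    net_odds = decimal_odds - 1.0
    edge = win_prob * decimal_odds - 1.0  # expected profit per unit staked
    return max(0.0, edge / net_odds)      # stake nothing on a negative edge


# Hypothetical bets at decimal odds of 2.0 (even money).
print(f"{kelly_fraction(0.55, 2.0):.2f}")  # believed 55% win chance
print(f"{kelly_fraction(0.40, 2.0):.2f}")  # believed 40% win chance
```

The formula is only as good as the probability estimate fed into it: an overconfident `win_prob` produces stakes that are too large, which is one common route to ruin in adversarial betting markets.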
| Trend | Detail |
|---|---|
| Reasoning models | o-series and DeepSeek-R1 leading — trading speed for accuracy is now standard |
| Multimodal | Table stakes at the frontier; image, video, audio, and website generation are each offered by 10+ providers |
| Inference costs | GPT-4-level performance now <$1/M tokens (down from ~$30/M in 2023) — ~10x drop per year |
| Efficiency | 7B models now match 70B+ performance from a year ago |
| Open vs Closed | Llama, Mistral, Qwen match or beat GPT-4 on several benchmarks |
| Tokenizer gains | Opus 4.7's new tokenizer improved input understanding but increased costs 12–27% for most inputs |

| Benchmark | # Models | What It Tests |
|---|---|---|
| GPQA | 214 | Graduate-level science reasoning |
| MMLU-Pro | 119 | Extended MMLU (4→10 options, 14 domains) |
| AIME 2025 | 108 | Competition math (American Invitational Mathematics Examination) |
| SWE-Bench Verified | 89 | Real GitHub issue patching |
| Humanity's Last Exam | 74 | 2,500 questions, math to humanities |
| LiveCodeBench | 71 | Contamination-resistant coding problems (LeetCode, AtCoder, CodeForces) |
The most economically consequential development of the period is the near-simultaneous release of open-weight coding models by four Chinese labs, DeepSeek V4 among them. On the single capability most likely to drive enterprise AI adoption, agentic software engineering, several of the best models are now Chinese and open-weight. The gap with US frontier labs has narrowed to the point where the remaining difference depends more on the choice of benchmark, scaffold, and evaluator than on raw capability.
Inference cost curves continue their descent. If the roughly 10x annual decline holds, GPT-4-level performance will cost less than $0.10/M tokens within 18 months, fundamentally changing the unit economics of AI-native products.
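The projection can be checked with a few lines of arithmetic. The sketch below uses the figures from the trends table (~$30/M in 2023, under $1/M now, ~10x drop per year) and makes two assumptions of its own: $1/M is taken as the starting price even though the table only gives it as an upper bound, and 2023 to 2026 is counted as three years. The helper names are illustrative.

```python
import math


def projected_price(price_now: float, annual_drop: float, months: float) -> float:
    """Price after `months`, if prices fall by `annual_drop`x every 12 months."""
    return price_now / annual_drop ** (months / 12)


def months_to_reach(price_now: float, target: float, annual_drop: float) -> float:
    """Months until the price falls to `target` at a steady decline rate."""
    return 12 * math.log(price_now / target) / math.log(annual_drop)


START = 1.0  # $/M tokens; assumption: the table's "<$1/M" upper bound

# Scenario A: the ~10x-per-year rate stated in the table.
stated_rate = 10.0
# Scenario B: the rate implied by the endpoints alone ($30 -> $1 in ~3 years).
endpoint_rate = (30.0 / START) ** (1 / 3)

for label, rate in [("stated ~10x/yr", stated_rate), ("endpoint-implied", endpoint_rate)]:
    price = projected_price(START, rate, months=18)
    wait = months_to_reach(START, 0.10, rate)
    print(f"{label}: ${price:.3f}/M after 18 months; $0.10/M reached in {wait:.0f} months")
```

At the stated 10x rate, the $0.10/M threshold is crossed in about a year, comfortably inside the 18-month window. At the slower rate implied by the $30-to-$1 endpoints (roughly 3x per year), it takes about two years, so the forecast depends on which decline rate actually holds and on how far below $1/M current prices already sit.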