The State of AI & LLMs: 2025 Year in Review & May 2025 Trends
RLVR & Reasoning: The Dominant Theme of 2025
The biggest story in AI this year has been the emergence of Reinforcement Learning with Verifiable Rewards (RLVR), most often implemented with GRPO (Group Relative Policy Optimization). DeepSeek R1 (January 2025) was the watershed moment: an open-weight model matching proprietary giants at a fraction of the cost (~$5M vs. $50-500M estimated), with RLVR at the core of its training recipe.
RLVR enables post-training at massive scale using deterministic correctness labels in math and code domains. Models trained this way spontaneously develop reasoning strategies, such as writing out intermediate calculation steps, that were nearly impossible to elicit under prior paradigms (SFT/RLHF). The result: test-time compute scaling, or "thinking time," emerged as a new axis of capability. OpenAI o3 (early 2025) was the inflection point.
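To make the idea of "deterministic correctness labels" concrete, here is a minimal sketch of the two pieces that RLVR with GRPO combines: a reward that can be checked mechanically, and advantages computed relative to a group of samples for the same prompt. The function names, the example prompt, and the four completions are illustrative assumptions, and the actual policy update (clipping, KL penalty, gradient step) is omitted.

```python
import re
import statistics


def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in the completion equals the reference, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical completions sampled for the prompt "What is 17 * 6?"
completions = [
    "17 * 6 = 17 * 5 + 17 = 85 + 17 = 102",
    "17 * 6 is roughly 100",
    "10 * 6 = 60 and 7 * 6 = 42, so the answer is 102",
    "The answer is 96",
]
rewards = [verifiable_reward(c, "102") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, reward, advantage in zip(completions, rewards, advantages):
    print(f"reward={reward:.0f} advantage={advantage:+.2f} | {completion}")
```

Completions that reach the correct answer receive a positive advantage and the others a negative one, so no learned reward model or human preference label is needed. That is what makes this style of post-training cheap to scale in domains where answers can be verified.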
Key Model Releases (April–May 2025)
| Model | Provider | Highlights |
|---|---|---|
| DeepSeek R1/V3 | DeepSeek | Open-weight, RLVR-first, ~$5M training cost |
| GPT-4.1 / o3 / o4-mini | OpenAI | Reasoning models now in ChatGPT; Codex CLI open-sourced |
| Claude 3.7 / 3.5 | Anthropic | Extended thinking; Claude Code as first convincing agent |
| Llama 4 Behemoth/Maverick/Scout | Meta | Native multimodal, 128-expert MoE, 10M context on Scout |
| Gemini 2.5 Pro | Google | Public API preview; Veo 2 video generation |
| Amazon Nova Premier/Sonic | Amazon | 1M token context, speech-to-speech conversational AI |
| Mellum | JetBrains | Open-source code completion LLM, cost-effective focal model |
AI Agents: From Chatbots to Autonomous Workers
The agent paradigm shift is real. Key indicators:
- Cursor reports generating nearly 1 billion lines of accepted code per day, a sizable fraction of the few billion lines the entire world is estimated to produce daily
- Microsoft CEO: ~30% of Microsoft code is AI-written; Google CEO: 30%+ of Google code involves AI
- Claude Code (Anthropic): first convincing LLM Agent that "lives" on your computer with private context
- OpenAI's ChatGPT on WhatsApp (1-800-ChatGPT): real-time AI answers at scale
- Yelp testing AI voice agents for restaurant phone calls; ElevenLabs agent transfer capability
Anthropic predicts fully automated AI employees within 12 months. Microsoft calls 2025 "The Year the Frontier Firm is Born."
Benchmaxxing & the Benchmark Crisis
A major 2025 theme: "benchmaxxing," or over-optimizing for leaderboards until the benchmark becomes the goal rather than a proxy for capability.
"If the test set is public, it isn't a real test set." (Sebastian Raschka)
The problem: Llama 4 scored extremely well on benchmarks but fell short of real-world expectations. Karpathy expressed "general apathy and loss of trust in benchmarks." The only workarounds: try LLMs in practice and generate new benchmarks dynamically.
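One way to read "generate new benchmarks dynamically" is to derive test items from a private seed at evaluation time, so that no fixed public test set exists to be memorized. The sketch below is a toy illustration under that assumption: `make_benchmark`, `score`, and `memorizing_model` are hypothetical names, and the arithmetic task stands in for whatever capability is actually being measured.

```python
import random


def make_benchmark(seed: int, n_items: int = 5) -> list[tuple[str, int]]:
    """Generate fresh arithmetic questions with known answers from a private seed."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        a, b, c = rng.randint(10, 99), rng.randint(10, 99), rng.randint(2, 9)
        items.append((f"What is ({a} + {b}) * {c}?", (a + b) * c))
    return items


def score(model, items) -> float:
    """Fraction of items that the model callable answers exactly."""
    correct = sum(1 for question, answer in items if model(question) == answer)
    return correct / len(items)


def memorizing_model(question: str) -> int:
    """Hypothetical stand-in for a model that only memorized a leaked test set."""
    leaked = dict(make_benchmark(seed=0))
    return leaked.get(question, -1)


public_score = score(memorizing_model, make_benchmark(seed=0))
fresh_score = score(memorizing_model, make_benchmark(seed=12345))
print(f"leaked set: {public_score:.2f}, freshly generated set: {fresh_score:.2f}")
```

The memorizing stand-in scores perfectly on the leaked set because it has seen every item, while its score on a freshly seeded set depends only on chance overlap. A model that had actually learned the arithmetic would score the same on both.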
Infrastructure & Developer Ecosystem
Agent-to-Agent protocols maturing:
- Google A2A (Agent2Agent): open protocol with 50+ partners (Atlassian, Salesforce, ServiceNow, Box)
- MCP (Model Context Protocol): Docker launching 100+ verified tools catalog; Cloudflare remote MCP server
- NVIDIA AI-Q Blueprint: pre-defined workflows for digital workforces
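For a sense of what these protocols look like on the wire: MCP messages are JSON-RPC 2.0 requests, and a client asks a server to run a tool with the `tools/call` method. The sketch below only builds and serializes such a request. The helper `build_tool_call`, the tool name `search_docs`, and its arguments are hypothetical, and the transport and the initialization handshake are left out.

```python
import json
from itertools import count

# Monotonically increasing request ids, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# "search_docs" and its arguments are hypothetical, not a real catalog entry.
message = build_tool_call("search_docs", {"query": "rate limits"})
decoded = json.loads(message)
print(message)
```

Because the envelope is plain JSON-RPC, the same client code can address any conforming server, which is much of the appeal of a shared tool catalog.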
Key developer tools:
- GitHub Copilot agent mode in VS Code: iterates across entire projects
- JetBrains Mellum: open-source code completion, free tier announced
- Zencoder: AI coding + unit testing agents with 20+ integrations
Vibe Coding & the Democratization of Software
Karpathy's term "vibe coding" captures a real shift: AI crossed a capability threshold where anyone can build impressive programs in plain English, "forgetting that the code even exists." The implications for software creation are profound: programming is no longer exclusively for trained engineers.
Waymo Milestone: Robotaxis at Scale
Waymo now delivers 250,000 paid robotaxi rides per week across four U.S. cities. The robotaxi future is here: Waymo's success demonstrates the commercial viability of embodied AI in real-world environments.
Biohazard Concerns: AI in Biology
A new benchmark shows OpenAI's o3 outperforming 94% of expert virologists, raising urgent questions about AI biosafety and the dual-use nature of frontier AI capabilities.
Outlook: What to Watch in 2026
- RLVR extensions and inference-time scaling will continue to dominate
- Architectures whose cost scales linearly with sequence length (Gated DeltaNets, Mamba-2) may challenge transformer dominance for efficiency-sensitive workloads
- AI employees will move from pilot to production at frontier firms
- Continued benchmaxxing will drive demand for dynamic, private evaluation benchmarks
- Chip export controls remain a geopolitical flashpoint as AI capabilities accelerate
Sources: Sebastian Raschka's State of LLMs 2025, Andrej Karpathy's 2025 LLM Year in Review, SD Times April 2025 AI Roundup, Local Media Association May 2025 Trends, various company announcements.