Daily AI & LLM Trends Report

The State of AI & LLMs: 2025 Year in Review & May 2025 Trends

RLVR & Reasoning: The Dominant Theme of 2025

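The biggest story in AI this year has been the emergence of Reinforcement Learning with Verifiable Rewards (RLVR), most often implemented with GRPO (Group Relative Policy Optimization). The two are not synonyms: RLVR describes where the reward comes from, while GRPO is the policy-optimization algorithm commonly used to train on it. DeepSeek R1 (January 2025) was the watershed moment: an open-weight model matching proprietary giants at a fraction of the cost (~$5M, a figure reported for training the underlying V3 base model, vs $50-500M estimated), with RLVR at the core of its training recipe.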
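To make the "group relative" part of GRPO concrete, here is a minimal sketch of the advantage computation only. It assumes a binary exact-match reward and population standard deviation for normalization; the function names and sample answers are illustrative, and the clipped policy-gradient update and KL penalty used in full training are omitted.

```python
from statistics import mean, pstdev


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Deterministic reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sample relative to the other samples drawn for the same prompt.

    Assumption: population standard deviation; eps avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions sampled for one prompt whose reference answer is "42".
sampled_answers = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, reward, adv in zip(sampled_answers, rewards, advantages):
    print(f"{answer!r:6} reward={reward:.0f} advantage={adv:+.3f}")
```

Correct samples end up with positive advantages and incorrect ones with negative advantages, so no separate learned value model is needed to provide a baseline.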

RLVR enables post-training at massive scale using deterministic correctness labels in math and code domains. Models trained this way spontaneously develop reasoning strategies, such as breaking a problem into intermediate calculation steps, that were very hard to elicit under prior paradigms (SFT/RLHF). The result: test-time compute scaling, or "thinking time," emerged as a new axis of capability. OpenAI o3 (early 2025) was the inflection point.
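One simple way to see why extra inference compute can buy accuracy is majority voting over several sampled answers. The sketch below is a toy model, not a description of how o3 or any other reasoning model spends its thinking time: it assumes each sample is independently correct with probability p and that all wrong samples agree on the same wrong answer, which is the worst case for voting. The function names are illustrative.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer among the sampled completions."""
    return Counter(answers).most_common(1)[0][0]


def majority_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct.

    Toy binary model; use odd n so that ties cannot occur.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


print(majority_vote(["42", "41", "42"]))
for n in (1, 3, 5, 11):
    print(n, round(majority_accuracy(0.6, n), 5))
```

With p = 0.6 the voted accuracy is 0.6 for one sample, 0.648 for three, and 0.68256 for five, so under these assumptions accuracy keeps rising as more samples are drawn.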

Key Model Releases (April–May 2025)

Model	Provider	Highlights
DeepSeek R1/V3	DeepSeek	Open-weight, RLVR-first, ~$5M training cost
GPT-4.1 / o3 / o4-mini	OpenAI	Reasoning models now in ChatGPT; Codex CLI open-sourced
Claude 3.7 / 3.5	Anthropic	Extended thinking; Claude Code as first convincing agent
Llama 4 Behemoth/Maverick/Scout	Meta	Natively multimodal MoE (128 experts on Maverick), 10M context on Scout
Gemini 2.5 Pro	Google	Public API preview; Veo 2 video generation
Amazon Nova Premier/Sonic	Amazon	1M token context, speech-to-speech conversational AI
Mellum	JetBrains	Open-source code completion LLM, cost-effective focal model

AI Agents: From Chatbots to Autonomous Workers

The agent paradigm shift is real. Key indicators:

Cursor reportedly generates 1 billion lines of accepted code per day, more than the world's entire daily output
Microsoft CEO: roughly 20-30% of the code in Microsoft's repositories is AI-written; Google CEO: 30%+ of new Google code involves AI
Claude Code (Anthropic): first convincing LLM Agent that "lives" on your computer with private context
OpenAI's ChatGPT on WhatsApp (+1-800-ChatGPT): real-time AI answers at scale
Yelp testing AI voice agents for restaurant phone calls; ElevenLabs agent transfer capability

Anthropic expects AI-powered virtual employees to begin appearing on corporate networks within 12 months. Microsoft calls 2025 "The Year the Frontier Firm is Born."

Benchmaxxing & the Benchmark Crisis

A major 2025 theme is "benchmaxxing": over-optimizing for leaderboard scores until the benchmark becomes the goal rather than a proxy for capability.

"If the test set is public, it isn't a real test set." — Sebastian Raschka

The problem: Llama 4 scored extremely well on benchmarks but fell short of real-world expectations. Karpathy expressed "general apathy and loss of trust in benchmarks." The only workarounds are to try LLMs in practice and to generate new benchmarks dynamically.
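Generating benchmarks dynamically can be as simple as creating fresh problems with known answers at evaluation time, so there is no fixed public test set to leak into training data. The following is a minimal sketch under that assumption. It uses arithmetic questions only, and `make_benchmark`, `score`, and `calculator_model` are hypothetical names, with `calculator_model` standing in for a real LLM call.

```python
import random


def make_problem(rng: random.Random):
    """Create one arithmetic question together with its exact answer."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"What is {a} {op} {b}?", str(answer)


def make_benchmark(seed: int, size: int = 100):
    """Build a benchmark from a seed; a new private seed yields a new test set."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(size)]


def score(model, benchmark) -> float:
    """Fraction of questions for which the model's answer matches exactly."""
    correct = sum(model(question) == answer for question, answer in benchmark)
    return correct / len(benchmark)


def calculator_model(question: str) -> str:
    """Stand-in for an LLM: parses the question and computes the result."""
    _, _, a, op, b = question.rstrip("?").split()
    a, b = int(a), int(b)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])


benchmark = make_benchmark(seed=2025, size=50)
print(score(calculator_model, benchmark))
```

Keeping the seed private and rotating it between evaluation runs is the part that resists benchmaxxing; the problem generator itself can be public.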

Infrastructure & Developer Ecosystem

Agent interoperability protocols are maturing:

Google A2A (Agent2Agent): open protocol with 50+ partners (Atlassian, Salesforce, ServiceNow, Box)
MCP (Model Context Protocol): Docker launching a catalog of 100+ verified tools; Cloudflare remote MCP server
NVIDIA AI-Q Blueprint: pre-defined workflows for digital workforces
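For a sense of what these protocols look like on the wire, MCP is built on JSON-RPC 2.0, and a client asks a server to run a tool with a `tools/call` request. The sketch below only constructs such a message and does not talk to a real server. The tool name `search_issues` and its arguments are hypothetical, and the message shape should be checked against the current MCP specification before relying on it.

```python
import json
from itertools import count

# Monotonically increasing request ids, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def make_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking an MCP server to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool and arguments, for illustration only.
request = make_tool_call("search_issues", {"query": "flaky test", "limit": 5})
wire = json.dumps(request)
print(wire)
```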

Key developer tools:

GitHub Copilot agent mode in VS Code — iterates across entire projects
JetBrains Mellum: open-source code completion, free tier announced
Zencoder: AI coding + unit testing agents with 20+ integrations

Vibe Coding & the Democratization of Software

"Vibe coding," a term coined by Karpathy, captures a real shift: AI crossed a capability threshold where anyone can build impressive programs in plain English, "forgetting that the code even exists." The implications for software creation are profound: programming is no longer exclusively for trained engineers.

Waymo Milestone: Robotaxis at Scale

Waymo now provides 250,000 paid robotaxi rides per week across four U.S. cities. The robotaxi future is here: Waymo's success demonstrates the commercial viability of embodied AI in real-world environments.

Biohazard Concerns: AI in Biology

A new benchmark shows OpenAI's o3 outperforming 94% of expert virologists, raising urgent questions about AI biosafety and the dual-use nature of frontier AI capabilities.

Outlook: What to Watch in 2026

RLVR extensions and inference-time scaling will continue to dominate
Architectures whose cost scales linearly with sequence length (Gated DeltaNet, Mamba-2) may challenge transformer dominance for efficiency-sensitive workloads
AI employees will move from pilot to production at frontier firms
Continued benchmaxxing will drive demand for dynamic, private evaluation benchmarks
Chip export controls remain a geopolitical flashpoint as AI capabilities accelerate

Sources: Sebastian Raschka's State of LLMs 2025, Andrej Karpathy's 2025 LLM Year in Review, SD Times April 2025 AI Roundup, Local Media Association May 2025 Trends, various company announcements.