Daily AI & LLM Trends Report

The State of AI & LLMs: 2025 Year in Review & May 2025 Trends

RLVR & Reasoning: The Dominant Theme of 2025

The biggest story in AI this year has been the emergence of Reinforcement Learning with Verifiable Rewards (RLVR), commonly implemented with GRPO (Group Relative Policy Optimization). DeepSeek R1 (January 2025) was the watershed moment: an open-weight model matching proprietary giants at a fraction of the estimated cost (~$5M vs. $50-500M), with RLVR as the core training recipe.

RLVR enables post-training at massive scale using deterministic correctness labels in math and code domains. Models trained this way spontaneously develop reasoning strategies, such as intermediate calculation steps, that were nearly impossible to elicit under prior paradigms (SFT/RLHF). The result: test-time compute scaling, or "thinking time," emerged as a new axis of capability. OpenAI o3 (early 2025) was the inflection point.
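
To make the idea concrete, here is a minimal sketch of the two ingredients described above: a deterministic, verifiable reward and a GRPO-style group-relative advantage. The function names, the exact-match checker, and the sample completions are illustrative assumptions, not DeepSeek's implementation; a real pipeline would feed these advantages into a policy-gradient update.

```python
import statistics


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Deterministic reward: 1.0 if the final answer matches exactly, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each sample relative to its own group (mean-centered, std-scaled)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical completions sampled for the prompt "What is 17 * 3?"
samples = ["51", "54", " 51 ", "41"]
rewards = [verifiable_reward(s, "51") for s in samples]
advantages = group_relative_advantages(rewards)

for sample, reward, adv in zip(samples, rewards, advantages):
    print(f"{sample!r:8} reward={reward:.1f} advantage={adv:+.2f}")
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, so no learned reward model is needed; the group itself serves as the baseline.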


Key Model Releases (April–May 2025)

Model                           | Provider  | Highlights
DeepSeek R1/V3                  | DeepSeek  | Open-weight, RLVR-first, ~$5M training cost
GPT-4.1 / o3 / o4-mini          | OpenAI    | Reasoning models now in ChatGPT; Codex CLI open-sourced
Claude 3.7 / 3.5                | Anthropic | Extended thinking; Claude Code as first convincing agent
Llama 4 Behemoth/Maverick/Scout | Meta      | Native multimodal, 128-expert MoE, 10M context on Scout
Gemini 2.5 Pro                  | Google    | Public API preview; Veo 2 video generation
Amazon Nova Premier/Sonic       | Amazon    | 1M token context, speech-to-speech conversational AI
Mellum                          | JetBrains | Open-source code completion LLM, cost-effective focal model

AI Agents: From Chatbots to Autonomous Workers

The agent paradigm shift is real. Key indicators:

  • Cursor now generates 1 billion lines of accepted code per day, more than the world's entire daily output
  • Microsoft CEO: ~30% of Microsoft code is AI-written; Google CEO: 30%+ of Google code involves AI
  • Claude Code (Anthropic): first convincing LLM Agent that "lives" on your computer with private context
  • OpenAI's ChatGPT on WhatsApp (+1-800-ChatGPT): real-time AI answers at scale
  • Yelp testing AI voice agents for restaurant phone calls; ElevenLabs agent transfer capability

Anthropic predicts fully automated AI employees within 12 months. Microsoft calls 2025 "The Year the Frontier Firm is Born."


Benchmaxxing & the Benchmark Crisis

A major 2025 theme: "benchmaxxing," or over-optimizing for leaderboards until benchmarks become the goal rather than capability proxies.

"If the test set is public, it isn't a real test set." (Sebastian Raschka)

The problem: Llama 4 scored extremely well on benchmarks but fell short of real-world expectations. Karpathy expressed "general apathy and loss of trust in benchmarks." The only workarounds: try LLMs in practice and generate new benchmarks dynamically.
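
One way to picture dynamic benchmark generation is sketched below: test items are created from a private seed at evaluation time, so there is no fixed public test set to over-fit. The multiplication task, the function names, and the toy solver standing in for an LLM are all assumptions made for illustration.

```python
import random


def make_benchmark(seed: int, n_items: int = 5) -> list[tuple[str, int]]:
    """Generate fresh (question, answer) pairs from a seed kept private."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        items.append((f"What is {a} * {b}?", a * b))
    return items


def score(answer_fn, items: list[tuple[str, int]]) -> float:
    """Fraction of items where answer_fn returns the reference answer."""
    correct = sum(1 for question, ref in items if answer_fn(question) == ref)
    return correct / len(items)


def toy_solver(question: str) -> int:
    """Stand-in for a model call: parses the two operands and multiplies them."""
    parts = question.rstrip("?").split()
    return int(parts[2]) * int(parts[4])


items = make_benchmark(seed=2025)
print(f"accuracy on freshly generated items: {score(toy_solver, items):.2f}")
```

Rotating the seed for each evaluation run yields a new test set every time, while reusing a seed keeps a given run reproducible.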


Infrastructure & Developer Ecosystem

Agent protocols and blueprints maturing:

  • Google A2A (Agent2Agent): open protocol with 50+ partners (Atlassian, Salesforce, ServiceNow, Box)
  • MCP (Model Context Protocol): Docker launching 100+ verified tools catalog; Cloudflare remote MCP server
  • NVIDIA AI-Q Blueprint: pre-defined workflows for digital workforces
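
For readers unfamiliar with MCP, the sketch below shows the general shape of a tool invocation: MCP messages are JSON-RPC 2.0, and a tool is invoked with the `tools/call` method. The tool name `search_docs`, its arguments, and the hand-written reply are hypothetical; transport, initialization, and capability negotiation are omitted.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


def extract_text(response_json: str) -> str:
    """Join the text parts of a tools/call result; raise on a JSON-RPC error."""
    response = json.loads(response_json)
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "unknown error"))
    parts = response["result"].get("content", [])
    return "\n".join(p["text"] for p in parts if p.get("type") == "text")


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "search_docs", {"query": "agent mode"})

# Hand-written example reply; a real server would send this over the transport.
reply = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "3 results found"}]},
})

print(request)
print(extract_text(reply))
```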

Key developer tools:

  • GitHub Copilot agent mode in VS Code: iterates across entire projects
  • JetBrains Mellum: open-source code completion, free tier announced
  • Zencoder: AI coding + unit testing agents with 20+ integrations

Vibe Coding & the Democratization of Software

Karpathy's term "vibe coding" captures a real shift: AI crossed a capability threshold where anyone can build impressive programs via English, "forgetting that the code even exists." The implications for software creation are profound: programming is no longer exclusively for trained engineers.


Waymo Milestone: Robotaxis at Scale

Waymo now provides 250,000 weekly paid robotaxi rides across four U.S. cities. The robotaxi future is here: Waymo's success demonstrates the commercial viability of embodied AI in real-world environments.


Biohazard Concerns: AI in Biology

A new benchmark shows OpenAI's o3 outperforms 94% of expert virologists, raising urgent questions about AI biosafety and the dual-use nature of frontier AI capabilities.


Outlook: What to Watch in 2026

  • RLVR extensions and inference-time scaling will continue to dominate
  • Architectures that scale linearly with sequence length (Gated DeltaNet, Mamba-2) may challenge transformer dominance for efficiency-sensitive workloads
  • AI employees will move from pilot to production at frontier firms
  • Continued benchmaxxing will drive demand for dynamic, private evaluation benchmarks
  • Chip export controls remain a geopolitical flashpoint as AI capabilities accelerate

Sources: Sebastian Raschka's State of LLMs 2025, Andrej Karpathy's 2025 LLM Year in Review, SD Times April 2025 AI Roundup, Local Media Association May 2025 Trends, various company announcements.