I still remember the insight that changed how I approach product management. At ShareChat, we were analyzing the sign-up funnel, a critical path for an app that had grown to 50 million users. The marketing team was pouring users into the top of the funnel, but our conversion rate wasn’t where it should be. After digging into the data, I found a simple but powerful insight: we were treating new and returning users identically. By using the device ID to differentiate them and sending returning users directly to a login flow, we boosted successful logins by 15%. It was a game-changer.
That experience taught me a lesson that is more critical than ever in today’s agentic AI landscape: surface-level “vibe checks” are worthless. Real breakthroughs come from deep, systematic evaluation. Today, as we build complex, multi-step AI agents, the stakes are exponentially higher. A single error in a long chain of actions can lead to catastrophic failure. The 95% failure rate for AI pilots isn’t a myth; it’s the reality for teams who don’t treat evaluation as the core of their development process [1].
In this post, I’m going to give you the updated 2026 playbook I use at Vestra AI. We’ll move beyond the obsolete single-turn mindset and detail how to build a robust evaluation harness that lets you ship complex AI agents with confidence.
The Landscape Has Shifted: Why 2025’s Evals Don’t Work Anymore
The core challenge has changed. We’re no longer just evaluating the quality of a single output; we’re evaluating the integrity of an entire process. An AI agent is a system that reasons, calls tools, and modifies its environment over multiple turns, so one bad tool call early in the chain can cascade through every step that follows. As Anthropic noted in their January 2026 guide, the very capabilities that make agents useful (autonomy and flexibility) also make them exponentially harder to evaluate [8].
To succeed in 2026, you must think in terms of traces, harnesses, and outcomes. You need a system that doesn’t just check the final answer but validates every single step in the agent’s reasoning process. The table below breaks down the modern evaluation toolkit for a high-agency PM.
| Evaluation Type | What It Is (2026 Context) | When to Use It | My Take (Ankur’s Opinion) |
|---|---|---|---|
| Human Evals | Expert-labeled “golden traces” of multi-step agent tasks. | Establishing the ground truth for complex agentic behavior and calibrating your LLM judges. | Still your North Star, but you can’t navigate by it alone. It’s too slow for the speed of agent development. Use it to create your initial test cases and for periodic calibration, not for daily testing. |
| Code-Based Graders | Automated checks on the outcome in the environment (e.g., was the database updated correctly?) and validation of the agent’s tool calls. | Your first line of defense for catching regressions in agentic workflows. Essential for CI/CD. | Non-negotiable. If your agent interacts with an external system, you must have code-based graders to verify the state of that system. This is the only way to know if the agent actually did what it said it did. |
| LLM-as-a-Judge | A panel of judge LLMs scoring an agent’s trace against a detailed rubric, often with a consensus mechanism for reliability. | Scaling the evaluation of subjective qualities like reasoning, tool selection, and recovery from errors across thousands of multi-turn simulations. | This is the engine of modern evaluation. But don’t just use one judge. Use a panel. And constantly calibrate it against your human-labeled golden traces. The goal is automated, nuanced judgment at scale. |
| Agent Simulation | An automated system where a “User Simulator Agent” interacts with your AI agent, and a “Judge Agent” scores the interaction in real-time. | The ultimate test of resilience. Used for stress-testing, red-teaming, and discovering emergent failures in complex, interactive scenarios. | This is the frontier, and where high-agency teams are winning. It’s the closest you can get to understanding how your agent will behave in the chaotic real world before you ship. Platforms like LangWatch are making this accessible [9]. |
The 2026 High-Agency Guide to Implementing Evals
Getting started requires more systematic work than the single-turn playbook did. Here is the updated, high-agency guide to building an evaluation harness.
Step 1: Define “Golden Traces” (Days 1-5)
Forget simple input-output pairs. You need to define the ideal path for your 50-100 most critical user tasks. This “golden trace” should document every step: the initial prompt, the agent’s internal reasoning, every tool call with its parameters, the result of that tool call, and the final outcome. This is your source of truth for what a perfect agent interaction looks like.
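To make this concrete, here is a minimal sketch of how a golden trace could be represented in code. Everything in it is an assumption for illustration: the `GoldenTrace` and `ToolCall` classes, the field names, and the flight-booking task are not a standard schema, just one way to capture the prompt, reasoning checkpoints, tool calls, and outcome described above.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolCall:
    """One tool invocation the agent is expected to make."""
    tool_name: str
    arguments: dict[str, Any]
    expected_result: Any  # what a healthy tool call should return in the test env

@dataclass
class GoldenTrace:
    """The ideal end-to-end path for one critical user task."""
    task_id: str
    initial_prompt: str
    expected_reasoning: list[str]        # key reasoning checkpoints, in order
    expected_tool_calls: list[ToolCall]  # every call with its parameters
    expected_outcome: str                # the final state or answer we accept

# Illustrative example: a hypothetical flight-booking task
golden = GoldenTrace(
    task_id="book-flight-001",
    initial_prompt="Book me the cheapest nonstop SFO to JFK flight next Friday.",
    expected_reasoning=[
        "Identify origin, destination, and date constraints",
        "Search nonstop options and sort by price",
        "Confirm booking details before committing",
    ],
    expected_tool_calls=[
        ToolCall("search_flights", {"origin": "SFO", "dest": "JFK", "nonstop": True}, "flight_list"),
        ToolCall("book_flight", {"flight_id": "<cheapest>", "passenger": "user"}, "confirmation_id"),
    ],
    expected_outcome="Booking confirmed and confirmation ID returned to the user.",
)
```

Storing traces as structured data like this, rather than as free-form documents, is what lets the harness in Step 2 replay and grade them automatically.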
Step 2: Build Your Code-Based Graders and Evaluation Harness (Days 6-10)
Instrument your systems. Your evaluation harness needs to be able to run a task and then programmatically check the final state of the world. If your agent is supposed to book a flight, your grader needs to query the airline’s test database to confirm the booking. This is your objective, binary check for success. This is also where you build the infrastructure to capture the full transcript, or trace, of every evaluation run.
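As a sketch of what such a grader can look like, here is a hedged example that assumes a SQLite test database with a hypothetical `bookings` table; the table schema, the `CONFIRMED` status value, and the function name are all illustrative, not a prescribed interface.

```python
import sqlite3

def grade_flight_booking(trace_id: str, db_path: str, expected_passenger: str,
                         expected_route: tuple[str, str]) -> dict:
    """Code-based grader: check the state of the world after the agent ran.

    Returns a small result dict so the harness can aggregate pass/fail
    alongside the captured trace.
    """
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT passenger, origin, destination, status FROM bookings "
            "WHERE trace_id = ?",
            (trace_id,),
        ).fetchone()
    finally:
        conn.close()

    if row is None:
        return {"trace_id": trace_id, "passed": False, "reason": "no booking created"}

    passenger, origin, destination, status = row
    passed = (
        passenger == expected_passenger
        and (origin, destination) == expected_route
        and status == "CONFIRMED"
    )
    return {"trace_id": trace_id, "passed": passed, "reason": "state check"}
```

The point is the binary, environment-level check: the grader never asks the agent what it did, it asks the system of record.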
Step 3: Deploy Your LLM-as-a-Judge Panel (Weeks 2-4)
Now, scale your judgment. Use your golden traces to create detailed rubrics for an LLM-as-a-Judge system. Your rubrics shouldn’t just ask, “Was the answer correct?” They should ask, “Did the agent choose the most efficient tool?” “Did it recover gracefully from a tool error?” “Was its reasoning sound at step 4?” Use a panel of multiple judge models (e.g., GPT, Claude Sonnet, Gemini) to score each trace and look for consensus. This reduces the risk of a single judge’s bias or weakness.
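Below is a minimal sketch of a judge panel with a median-consensus step. The rubric text, the judge names, and the `call_judge` wrapper are all assumptions; `call_judge` is left as a stub because the actual call depends on whichever model client you use.

```python
from statistics import median

RUBRIC = """Score the agent trace from 1-5 on each dimension:
- tool_choice: Did the agent pick the most efficient tool at each step?
- error_recovery: Did it recover gracefully from tool errors?
- reasoning: Was the reasoning sound at every step?
Return only a JSON object like {"tool_choice": 4, "error_recovery": 5, "reasoning": 3}."""

def call_judge(model_name: str, rubric: str, trace_text: str) -> dict[str, int]:
    """Hypothetical wrapper around your model provider's API.

    In a real harness this would send the rubric plus the serialized trace to
    the judge model and parse the JSON scores it returns.
    """
    raise NotImplementedError("wire this to your model client")

def panel_score(trace_text: str,
                judges=("gpt-judge", "claude-judge", "gemini-judge")) -> dict[str, float]:
    """Score one trace with a panel of judges and take the median per dimension.

    The median is a deliberately simple consensus mechanism: one biased or
    confused judge cannot drag the aggregate score far on its own.
    """
    per_judge = [call_judge(j, RUBRIC, trace_text) for j in judges]
    dimensions = per_judge[0].keys()
    return {dim: median(scores[dim] for scores in per_judge) for dim in dimensions}
```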
Step 4: Embrace Real-Time Production Monitoring (Ongoing)
Evaluation doesn’t stop at deployment. The 2026 standard is real-time, streaming evaluation of production traffic. Modern platforms can score live user interactions against your evaluators asynchronously, so you can catch quality degradation, policy violations, or a spike in hallucinations the moment they happen. Set up automated alerts that feed directly into your incident response workflow. This turns your production environment into a continuous source of evaluation data.
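Here is one way this can look, as a rough sketch rather than a reference implementation: an async consumer pulls live interactions off a queue, scores them with a stubbed evaluator, and fires an alert callback when a rolling pass rate dips below an assumed threshold. The threshold, window size, and function names are all placeholders.

```python
import asyncio
from collections import deque

ALERT_THRESHOLD = 0.85  # assumption: alert when the rolling pass rate drops below this
WINDOW_SIZE = 200       # assumption: judge the last 200 interactions

async def score_interaction(interaction: dict) -> bool:
    """Stub: run your LLM judges / code graders here, off the request hot path."""
    return True  # replace with real async scoring

async def monitor(queue: "asyncio.Queue[dict]", alert) -> None:
    """Consume live interactions, score them asynchronously, and alert on drift."""
    recent: deque[bool] = deque(maxlen=WINDOW_SIZE)
    while True:
        interaction = await queue.get()
        recent.append(await score_interaction(interaction))
        if len(recent) == WINDOW_SIZE:
            pass_rate = sum(recent) / WINDOW_SIZE
            if pass_rate < ALERT_THRESHOLD:
                alert(f"Quality degradation: rolling pass rate {pass_rate:.1%}")
```

Because scoring happens off the request path, the user never waits on an evaluator, and every production interaction still becomes labeled evaluation data.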
Step 5: Master Agent Simulation (The New Frontier)
This is the final boss of evaluation. Create a simulated environment where a “User Simulator Agent” can interact with your product in realistic, multi-turn conversations. This user agent can be programmed to represent different personas: the confused novice, the power user, the adversary trying to break your system. A “Judge Agent” observes this interaction and scores it in real-time. This is how you find the “unknown unknowns”—the emergent, unpredictable failures that only surface after a long and complex series of interactions.
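A simulation loop can be surprisingly small once the three roles are separated. In the sketch below, the user simulator, the product agent, and the judge are all hypothetical stubs you would wire to your own models and rubrics; the persona list, the `on_track` verdict field, and the turn limit are illustrative assumptions.

```python
PERSONAS = ["confused novice", "impatient power user", "adversarial attacker"]

def user_simulator_turn(persona: str, history: list[dict]) -> str:
    """Hypothetical: an LLM plays the user persona and produces the next message."""
    raise NotImplementedError

def product_agent_turn(history: list[dict]) -> str:
    """Hypothetical: your production agent responds (reasoning, tool calls, etc.)."""
    raise NotImplementedError

def judge_turn(history: list[dict]) -> dict:
    """Hypothetical: a judge LLM scores the conversation so far against a rubric."""
    raise NotImplementedError

def run_simulation(persona: str, max_turns: int = 10) -> list[dict]:
    """Drive a multi-turn conversation and record judge verdicts turn by turn."""
    history: list[dict] = []
    verdicts: list[dict] = []
    for _ in range(max_turns):
        history.append({"role": "user", "content": user_simulator_turn(persona, history)})
        history.append({"role": "assistant", "content": product_agent_turn(history)})
        verdict = judge_turn(history)          # e.g. {"on_track": False, "issue": "..."}
        verdicts.append(verdict)
        if not verdict.get("on_track", True):  # stop early when the judge flags a failure
            break
    return verdicts
```

Running this loop across many personas and random seeds is what surfaces the long-tail failures that only emerge deep into a conversation.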
The PM’s Role Has Evolved: From Prompt Engineer to Systems Architect
The conversation around AI product management has fundamentally changed. A few months ago, we were all becoming prompt engineers. Today, a high-agency PM must be a systems architect. Your job is no longer just to craft the perfect prompt, but to design and manage the entire system that ensures your AI agent behaves reliably, safely, and effectively in the wild. At Vestra AI, where I own evals as a product area and we’ve built a platform that lets users create their own AI agents from plain text, this systems-level thinking is what allows us to pivot and adapt in a rapidly changing AI landscape.
The 95% failure rate is a choice. It’s the result of choosing to stay in the comfort zone of single-turn thinking while the world has moved on to agentic systems. Don’t be a statistic. Take ownership of the complexity. Build your evaluation harness. It’s the only way to defuse the ticking time bomb of agentic AI and ship products that truly deliver on the promise of this technology.
References
[1] Snyder, J. (2025, August 26). MIT Finds 95% Of GenAI Pilots Fail Because Companies Avoid Friction. Forbes. Retrieved from https://www.forbes.com/sites/jasonsnyder/2025/08/26/mit-finds-95-of-genai-pilots-fail-because-companies-avoid-friction/
[2] Rand Corporation. (2024, August 13). Why AI Projects Fail and How They Can Succeed. Retrieved from https://www.rand.org/pubs/research_reports/RRA2680-1.html
[3] Khan, A. (2025, April 8). Beyond vibe checks: A PM’s complete guide to evals. Lenny’s Newsletter. Retrieved from https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete
[4] Maliugina, D. (2025, March 25). How companies evaluate LLM systems: 7 examples from Asana, GitHub, and more. Evidently AI. Retrieved from https://www.evidentlyai.com/blog/llm-evaluation-examples
[5] Maierhöfer, J. (2025, March 4). LLM Evaluation 101: Best Practices, Challenges & Proven Techniques. Langfuse Blog. Retrieved from https://langfuse.com/blog/2025-03-04-llm-evaluation-101-best-practices-and-challenges
[6] Arize AI. (n.d.). The Definitive Guide to LLM Evaluation. Retrieved from https://arize.com/llm-evaluation/
[7] OpenAI. (n.d.). Working with evals. OpenAI API. Retrieved from https://platform.openai.com/docs/guides/evals
[8] Anthropic. (2026, January 9). Demystifying evals for AI agents. Retrieved from https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
[9] LangWatch. (2026, January 30). Top 5 AI evaluation tools for AI agents & products in production (2026). Retrieved from https://langwatch.ai/blog/top-5-ai-evaluation-tools-for-ai-agents-products-in-production-(2026)
[10] QAlified. (2026, February 3). Key LLM Evaluation Metrics & Techniques for Reliable Models. Retrieved from https://qalified.com/blog/llm-evaluation-metrics/