For months, the primary question for developers building with large language models has been a simple one: can it do this at all? In the early stages of prototyping, a few successful prompts and a handful of correct answers are usually enough to prove a concept. But as the industry shifts toward Deep Agents—systems capable of utilizing complex tools and executing multi-step reasoning—the goalposts have moved. It is no longer enough to know that an agent reached the correct conclusion; developers must now prove that the process used to get there is reliable, repeatable, and safe. Because agents are inherently non-deterministic, a minor hallucination in step two of a ten-step workflow can trigger a catastrophic chain reaction, rendering the final output useless or, worse, misleading. For those building on LangChain and Amazon Bedrock, the challenge is no longer about model capability, but about the rigor of the evaluation framework.

The Three-Tier Hierarchy of Agent Verification

Evaluating a Deep Agent is fundamentally different from evaluating a standard LLM response. Because agents operate through a series of autonomous decisions, the measurement system must be stratified into three distinct levels of granularity to capture both objective correctness and subjective quality. The first level is the code-based evaluator. This is the most deterministic layer, utilizing regular expressions or tool-call validation to verify specific conditions. Code-based evaluators check for exact string matches, binary pass/fail tests, and static analysis. They are also used to analyze transcripts to measure the number of turns and total token consumption. While these evaluators are fast, inexpensive, and objective, they are brittle; they often fail when an agent provides a correct answer that is phrased differently than the expected string. They are most effective in domains where success can be defined by a strict programmatic contract.

The second level is the model-based evaluator, where a separate, highly capable LLM acts as the judge. This approach allows for the evaluation of nuance, complex analysis, and unstructured outputs. Model-based evaluators employ rubric-based scoring, natural language assertions, pair-wise comparisons, and multi-judge consensus to determine quality. While this provides immense flexibility and scalability, it introduces its own set of risks, including non-determinism and the potential for the judge model to hallucinate. To mitigate this, architects must design the judge with an unknown option, allowing the model to abstain rather than guess. Tools like the Align Evaluator in LangSmith facilitate this by allowing developers to calibrate the LLM judge against human feedback, ensuring the automated grader aligns with expert expectations.

The final level is the human evaluator, which serves as the gold standard. Expert review and crowdsourced judgments provide the subjective quality metrics that automated systems cannot reach. Although human evaluation is the slowest and most expensive method, it is indispensable for calibrating the other two tiers. In a professional pipeline, human experts first define the rubrics for the model-based evaluators. Once the automated system is running, humans perform periodic sampling to ensure the LLM judge has not drifted from the established baseline. This creates a feedback loop where human intuition informs automated scale, and automated scale flags anomalies for human review.

Shifting from Output Validation to Trajectory Analysis

To understand why these tiers are necessary, one must recognize the fundamental gap between LLM evaluation and agent evaluation. Traditional LLM evaluation is a linear process: an input is provided, an output is generated, and the output is compared against a ground-truth answer. This single-input, single-output paradigm is insufficient for agents. An agent does not simply answer; it plans, executes, observes, and corrects. If an agent arrives at the correct answer through a flawed reasoning process or by accidentally calling the wrong tool in a way that happened to work, it is a false positive. In a production environment, that same flawed logic will eventually fail when the input variables change slightly.

Reliability in Deep Agents is found in the trajectory—the entire execution path the agent took to reach its goal—and the state transitions that occurred along the way. Evaluation must therefore shift toward verifying whether the correct tool was called at the right moment and whether the arguments passed to that tool were accurate. This transforms evaluation into a form of unit testing for reasoning. By integrating pytest with LangSmith, developers can build offline evaluation environments that verify the agent's trajectory before it ever reaches the user. This ensures that the agent is not just guessing correctly, but is following a logical, intended path.

This shift is powered by the integration of high-performance models like Amazon Nova 2 Lite and orchestration frameworks like LangGraph. Nova 2 Lite, with its 1 million token context window, is specifically optimized for the long-horizon reasoning required by Deep Agents. Its ability to handle multimodal data and its precision in function calling make it an ideal engine for complex workloads. When paired with LangGraph, which manages planning and incremental context loading, the system can maintain a stable state across long trajectories. LangSmith then provides the visibility needed to pinpoint exactly where a trajectory diverged from the intended path, allowing for rapid iteration and regression testing.

To operationalize this, developers can implement four specific evaluation patterns. The first is Single-step Evaluation, which acts as a unit test for a specific decision point. For example, in a text-to-SQL agent, the evaluator checks if the agent first called `sql_db_list_tables` or `sql_db_schema` before attempting to write a query. This prevents the agent from hallucinating table names. The second is Full-turn Evaluation, an end-to-end check that validates the final result and the overall appropriateness of the path taken. LangSmith visualizes these traces, allowing developers to see the flow from the initial planning phase, such as `write_todos`, through to the final formatting of the answer.

The third pattern is Multi-turn Evaluation, which handles conditional logic across a conversation. Because an agent's subsequent steps depend on the output of previous steps, the test suite must be able to branch. If an agent deviates from the expected path in turn two, the evaluation for turn three must adjust accordingly. To increase efficiency, developers often force a specific state as the starting point for a turn, bypassing the need to run the entire sequence from the beginning. Finally, these tests are migrated into a Regression Suite. Once a feature is proven reliable, it is moved into a permanent monitoring system that runs continuously, ensuring that updates to the model or the prompt do not break existing functionality.

In practical application, such as a text-to-SQL agent, a hybrid strategy is most effective. For deterministic steps, such as verifying tool calls, code-based assertions are used to keep costs low and speed high.

python

Tool call verification example

assert "sql_db_list_tables" in [tool_call.name for tool_call in trace.tool_calls]

For complex analytical questions—such as identifying the top-performing employees by country—where the answer could be phrased in multiple ways, a model-based grader is employed. This grader is instructed to return unknown if the provided information is insufficient, preventing the agent from fabricating data. By placing the right evaluation tool at the right point in the trajectory, developers can move beyond vibe-based testing and establish a measurable, objective standard for agent reliability.

The commercial viability of AI agents will not be determined by the size of the underlying model, but by the sophistication of the systems used to verify them. As agents take on more autonomy, the ability to audit their reasoning trajectories becomes the primary safeguard against failure. The transition from simple output matching to comprehensive trajectory evaluation is what transforms a fragile prototype into a production-ready system.