Engineers building AI agents face a recurring high-stakes dilemma: how to test tool-calling capabilities without triggering real-world side effects or exposing sensitive data. Relying on live APIs during development is a recipe for disaster, risking unintended database mutations and costly usage fees. At the other extreme, traditional static mocks—those brittle, hard-coded JSON responses—collapse the moment an agent attempts a multi-step workflow that requires state persistence. As agents grow more autonomous, the gap between simple unit testing and complex environment simulation has become the primary bottleneck in the development lifecycle.

The Mechanics of ToolSimulator

To bridge this gap, the Strands Evals framework introduces ToolSimulator, a specialized engine designed to intercept agent tool calls and route them through an LLM-based response generator. Instead of requiring developers to manually write fixtures for every possible permutation of an agent's logic, the simulator synthesizes context-aware responses in real time. By analyzing the tool schema, the agent's specific input, and the current simulation state, the system generates responses that feel authentic to the agent's execution flow.

Getting started requires minimal overhead. Developers simply instantiate the simulator and wrap their target functions with the `@simulator.tool()` decorator. Because the simulator infers behavior from the function signature and docstrings, the function body itself can remain empty:

```python
@simulator.tool()
def search_flights(destination: str, date: str) -> str:
    """Search for available flights to a destination on a specific date."""
    pass
```

This approach allows developers to define the interface of their tools without needing to implement the underlying backend logic, effectively creating a "digital twin" of the tool environment that lives entirely within the test suite.
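The decorator pattern described above can be pictured with a stdlib-only sketch: a registry captures each tool's signature and docstring so a response generator has a schema to work from. The `MiniSimulator` class below is a conceptual illustration of that mechanism, not the Strands Evals implementation.

```python
import inspect
from typing import Callable, Dict


class MiniSimulator:
    """Conceptual sketch (not the Strands Evals API): a registry that
    records each tool's name, signature, and docstring, which is all an
    LLM backend would need to synthesize plausible responses."""

    def __init__(self) -> None:
        self.tools: Dict[str, dict] = {}

    def tool(self) -> Callable:
        def decorator(fn: Callable) -> Callable:
            # Capture the interface; the body is never executed.
            self.tools[fn.__name__] = {
                "signature": str(inspect.signature(fn)),
                "doc": inspect.getdoc(fn),
            }
            return fn
        return decorator


simulator = MiniSimulator()


@simulator.tool()
def search_flights(destination: str, date: str) -> str:
    """Search for available flights to a destination on a specific date."""
    pass
```

Because only the signature and docstring are recorded, the function body can stay empty, exactly as in the example above.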

From Static Responses to Stateful Simulation

What separates ToolSimulator from traditional mocking is its ability to maintain a persistent world state. In a standard mock environment, a flight booking agent might search for a flight and then attempt to reserve it; however, a static mock would fail to "remember" the search results during the second call. ToolSimulator solves this by intercepting calls and updating an internal State Registry. This ensures that the agent’s subsequent actions are grounded in the results of its previous tool calls, maintaining a consistent narrative throughout the entire execution trace.
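The grounding effect can be illustrated with a stdlib-only sketch. The names `StateRegistry`, `search_flights_sim`, and `book_flight_sim` are hypothetical, and the hard-coded search results stand in for the LLM-synthesized responses the real simulator would produce:

```python
class StateRegistry:
    """Conceptual sketch of a persistent world state shared across tool
    calls: results written by one call are visible to the next."""

    def __init__(self) -> None:
        self.state: dict = {}

    def record(self, key, value):
        self.state[key] = value

    def lookup(self, key):
        return self.state.get(key)


registry = StateRegistry()


def search_flights_sim(destination: str, date: str) -> list:
    # A real simulator would synthesize these results with an LLM;
    # they are hard-coded here purely for illustration.
    flights = [{"flight_id": "FL-101", "destination": destination, "date": date}]
    registry.record("last_search", flights)
    return flights


def book_flight_sim(flight_id: str) -> dict:
    # Grounded in the earlier search: booking a flight the agent
    # never saw is rejected, keeping the narrative consistent.
    known = {f["flight_id"] for f in registry.lookup("last_search") or []}
    if flight_id not in known:
        return {"status": "error", "reason": "unknown flight"}
    registry.record("booking", flight_id)
    return {"status": "confirmed", "flight_id": flight_id}
```

A static mock would happily "confirm" any flight ID; the registry makes the second call depend on what the first call actually returned.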

For more complex scenarios, developers can leverage advanced configuration options to mirror real-world database behaviors. By using `share_state_id`, multiple tools can operate on the same backend state, while `initial_state_description` allows for the definition of the environment's starting conditions using natural language. When dealing with large datasets, developers can pass statistical summaries using `DataFrame.describe()` to generate realistic, data-driven responses without ever exposing raw production data. Furthermore, the framework integrates with Pydantic to enforce strict output schemas, ensuring that the simulated responses remain within the expected data types.

```python
# State sharing example
@simulator.tool(share_state_id="flight_db")
def book_flight(flight_id: str):
    pass
```
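The schema-enforcement idea can be pictured with a stdlib-only analogue. In the framework itself a Pydantic model plays this role; the `BookingResponse` dataclass and `validate_response` helper below are illustrative assumptions, not part of the library:

```python
from dataclasses import dataclass, fields


@dataclass
class BookingResponse:
    """Expected output schema for a booking tool; in Strands Evals a
    Pydantic model would serve this purpose."""
    status: str
    flight_id: str


def validate_response(raw: dict) -> BookingResponse:
    """Reject simulated responses with missing or mistyped fields,
    mirroring strict schema enforcement."""
    for f in fields(BookingResponse):
        if f.name not in raw:
            raise ValueError(f"missing field: {f.name}")
        if not isinstance(raw[f.name], f.type):
            raise TypeError(f"wrong type for field: {f.name}")
    return BookingResponse(**raw)
```

Validating every simulated response at the boundary means a hallucinated or malformed payload fails the test immediately instead of silently corrupting the agent's downstream reasoning.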

By decoupling the agent's logic from live infrastructure, developers gain the ability to run high-fidelity, multi-step tests in complete isolation. When paired with the GoalSuccessRateEvaluator, this system provides a quantitative look at an agent's performance, allowing teams to measure success rates across thousands of simulated trajectories before a single line of code hits production. This shift marks a transition from testing code to evaluating the reasoning capabilities of the agent itself.
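The metric itself is simple to state. The helper below is a hypothetical stand-in for what an evaluator like GoalSuccessRateEvaluator computes: the fraction of simulated trajectories that reached the goal.

```python
def goal_success_rate(trajectories: list) -> float:
    """Hypothetical scoring helper: fraction of simulated runs in which
    the agent reached its goal (0.0 for an empty batch)."""
    if not trajectories:
        return 0.0
    return sum(1 for t in trajectories if t["goal_reached"]) / len(trajectories)


runs = [
    {"goal_reached": True},
    {"goal_reached": True},
    {"goal_reached": False},
    {"goal_reached": True},
]
print(goal_success_rate(runs))  # 0.75
```

Aggregated over thousands of simulated trajectories, this single number becomes a regression signal for the agent's reasoning, not just its code.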

AI agent testing has officially evolved from rigid, manual mocking into a dynamic, context-aware simulation paradigm.