Compliance analysts in the financial sector spend the vast majority of their day in a state of digital fragmentation. They navigate a dozen different internal screens, copy-pasting data from one legacy system to another, and hunting for a single piece of evidence to verify a user's identity or business legitimacy. For many, roughly 80 percent of their workday is consumed by this administrative drudgery rather than the high-value risk analysis they were hired to perform. This inefficiency persists regardless of an analyst's seniority, creating a systemic bottleneck in the global movement of money.
The Scale of Programmable Finance and the Compliance Burden
Stripe operates as a programmable financial infrastructure, providing the plumbing for payments across 50 countries. The scale of this operation is immense: 1.4 trillion dollars in annual payment volume. With 62 percent of Fortune 500 companies relying on its services, Stripe processes a payment volume equivalent to approximately 1.3 percent of global GDP. As the company grew into a central pillar of the global economy, the burden of regulatory compliance grew proportionally. The sheer volume of daily transactions made it impossible to maintain quality standards simply by hiring more people; the operational scale required a fundamental shift in how reviews are conducted.
To address this, Stripe integrated AI agents with automated orchestration to transform a resource-intensive manual process into a scalable engine. The results were measurable. Stripe achieved a 26 percent reduction in compliance review processing time. Beyond efficiency, the system bolstered security, identifying 95 percent of card-testing attacks (in which bad actors make small payments to check whether stolen card data is valid) in real time. At the same time, the precision of the AI reduced unnecessary customer friction by 20 percent, so that fewer legitimate users were caught in overly aggressive security filters. These improvements were achieved while maintaining the strict auditability and precision required by global financial regulators.
From Linear Automation to Reasoning Loops
Traditional automation in compliance relies on rigid, rule-based systems. If a case falls outside a predefined path, the automation fails, and the case is dumped back onto a human analyst. Stripe moved beyond this by implementing the ReAct (Reasoning and Acting) framework, powered by large language models (LLMs) via Amazon Bedrock. Unlike a standard chatbot that provides a direct answer, a ReAct agent operates in a continuous loop of thought, action, and observation.
In a typical ReAct cycle, the agent first enters a Thought phase, where it analyzes the current state of the case and determines what information is missing. It then executes an Action, such as calling a specific internal tool or querying a database. Finally, it receives an Observation—the actual data returned by the tool—and feeds that back into its next Thought phase. This is similar to how a human might calculate the value of 10 divided by pi; the model first realizes it does not know the value of pi, decides to use a calculator tool, observes the result, and then determines if that result is sufficient to provide the final answer. In complex scenarios, such as analyzing corporate revenue forecasts, the agent may repeat this loop multiple times, refining its understanding with each iteration.
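The Thought, Action, Observation cycle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Stripe's implementation: `scripted_model` is a hypothetical stand-in for an LLM call, and `calculator`, `TOOLS`, and `react_loop` are names invented for this example. It walks through the 10 divided by pi case described above.

```python
import math
from typing import Callable


def calculator(expression: str) -> str:
    """Hypothetical tool: evaluate 'a / b' where each operand is a number or 'pi'."""
    left, right = (part.strip() for part in expression.split("/"))

    def to_number(token: str) -> float:
        return math.pi if token == "pi" else float(token)

    return str(to_number(left) / to_number(right))


# Hypothetical tool registry; a real agent would expose internal signal tools here.
TOOLS: dict[str, Callable[[str], str]] = {"calculator": calculator}


def scripted_model(history: list[tuple[str, str]]) -> dict:
    """Stand-in for an LLM: picks the next step from the observations gathered so far."""
    observations = [text for kind, text in history if kind == "observation"]
    if not observations:
        return {
            "thought": "I do not know 10 / pi; I should use the calculator.",
            "action": "calculator",
            "input": "10 / pi",
        }
    return {"thought": "The calculator result is sufficient.", "final": observations[-1]}


def react_loop(model, tools, max_steps: int = 5):
    """Run Thought -> Action -> Observation until the model returns a final answer."""
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        step = model(history)
        history.append(("thought", step["thought"]))
        if "final" in step:
            return step["final"], history
        # The observation comes from the tool, never from the model itself.
        observation = tools[step["action"]](step["input"])
        history.append(("action", f'{step["action"]}({step["input"]})'))
        history.append(("observation", observation))
    raise RuntimeError("no final answer within max_steps")


if __name__ == "__main__":
    answer, trace = react_loop(scripted_model, TOOLS)
    for kind, text in trace:
        print(f"{kind}: {text}")
    print("answer:", answer)
```

The `max_steps` cap is a design choice added for the sketch: it bounds the loop so that a model that never produces a final answer fails loudly instead of running indefinitely.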
This architecture functions as a closed-loop feedback system, akin to control systems in engineering. By requiring an Observation before moving to the next step, Stripe keeps the model from leaping to conclusions without evidence. The design discourages open-loop behavior, in which the model generates conclusions with no external check and which is a major source of hallucinations in LLMs. Forcing the agent to ground every step in a real-world observation helps keep the reasoning trajectory accurate and ensures that the signals it collects come from actual systems rather than from the model's own output.
Task Decomposition via Directed Acyclic Graphs
One of the primary risks of deploying a single, unrestricted AI agent is the tendency to get lost in the weeds. When faced with a massive volume of documentation, a general-purpose agent might obsess over a trivial detail while overlooking a critical red flag. Stripe solved this by implementing task decomposition, breaking the complex review process into smaller, configurable sub-tasks.
These sub-tasks are organized using a Directed Acyclic Graph (DAG). In a DAG, the workflow moves in one direction without looping back on itself, ensuring that the process is predictable and comprehensive. Each node in the graph represents a specific, validated question or check. The orchestrator manages the flow, passing the confirmed answers from one sub-task as context for the next. This structure ensures that the agent remains focused on a narrow scope at any given time, which increases accuracy and prevents the model from becoming overwhelmed by excessive context.
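As a rough sketch of this orchestration pattern, the example below uses Python's standard-library `graphlib` module to run sub-tasks in dependency order and pass earlier answers forward as context. The sub-task names, the `DEPENDENCIES` graph, and `stub_subtask` are hypothetical and chosen only for illustration; the source does not describe Stripe's actual graph or orchestrator code.

```python
import graphlib

# Hypothetical review graph: each sub-task maps to the sub-tasks whose answers it needs.
DEPENDENCIES = {
    "verify_identity": set(),
    "check_business_registration": set(),
    "assess_revenue_consistency": {"check_business_registration"},
    "summarize_evidence": {"verify_identity", "assess_revenue_consistency"},
}


def run_review(dependencies, run_subtask):
    """Run every sub-task once, in an order that respects the graph's edges."""
    answers = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for task in graphlib.TopologicalSorter(dependencies).static_order():
        # Each sub-task sees only the answers it depends on, not the whole case.
        context = {dep: answers[dep] for dep in dependencies[task]}
        answers[task] = run_subtask(task, context)
    return answers


def stub_subtask(task, context):
    """Stand-in for a narrowly scoped agent call; returns a placeholder answer."""
    return f"{task} done using {sorted(context)}"


if __name__ == "__main__":
    for task, answer in run_review(DEPENDENCIES, stub_subtask).items():
        print(f"{task}: {answer}")
```

Passing only the declared dependencies as context is what keeps each node's scope narrow in this sketch; a cyclic graph is rejected before any sub-task runs.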
Crucially, Stripe maintains a human-centric review structure to ensure accountability. The AI agent does not provide the final verdict; instead, it provides auxiliary information and a synthesized summary of evidence. The final answer must be written and submitted by a human reviewer. This ensures that the legal responsibility and audit trail remain with the human analyst, while the AI handles the grueling work of data aggregation. The agent acts as a high-fidelity research assistant, reducing the time spent searching for data while leaving the final judgment to the expert.
Optimizing for Cost and Security at Scale
Deploying reasoning agents at this scale introduces a significant cost challenge. Because a ReAct loop requires the model to re-read the entire conversation history at every turn, cumulative token costs grow roughly quadratically with the number of steps as the observation log expands. Stripe mitigated this by using prompt caching through Amazon Bedrock. With caching, the portion of the prompt that has already been processed is billed at a reduced rate, so most of each turn's cost comes from the newly added observations and reasoning steps. This greatly weakens the link between the agent's cost and the length of its reasoning chain.
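The arithmetic behind this can be illustrated with a simplified cost model. Everything numeric here is an assumption made for the example: a fixed prompt prefix of 2,000 tokens, 500 new tokens per turn, and a hypothetical cache-read rate of 10 percent of the normal input price. Actual Amazon Bedrock pricing varies by model and may also include a charge for writing to the cache, which this sketch ignores.

```python
def uncached_input_tokens(prefix: int, per_turn: int, turns: int) -> int:
    """Input tokens billed when every turn re-sends the prefix plus all earlier turns."""
    return sum(prefix + k * per_turn for k in range(turns))


def cached_billed_tokens(prefix: int, per_turn: int, turns: int,
                         cache_read_rate: float) -> float:
    """Billed token-equivalents when previously seen tokens are read from cache.

    cache_read_rate is the price of a cached token relative to a fresh one
    (hypothetical value; cache-write surcharges are not modelled).
    """
    billed = 0.0
    for k in range(turns):
        if k == 0:
            billed += prefix  # first turn: nothing is cached yet
        else:
            already_seen = prefix + (k - 1) * per_turn
            billed += per_turn + cache_read_rate * already_seen
    return billed


if __name__ == "__main__":
    for turns in (5, 10, 20):
        full = uncached_input_tokens(2000, 500, turns)
        cached = cached_billed_tokens(2000, 500, turns, cache_read_rate=0.1)
        print(f"{turns:>2} turns: uncached={full:>7}  cached={cached:>9.1f}")
```

Under these assumed numbers, a 10-turn chain bills 42,500 input tokens without caching and roughly 10,100 token-equivalents with it. The cached total still grows with chain length, but far more slowly.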
This optimization allowed Stripe to maintain a high level of performance without prohibitive expenses. The system achieved a helpfulness score of over 96 percent, as rated by human experts who evaluated the accuracy and utility of the agent's auxiliary information. The ability to run complex reasoning loops affordably meant that the system could be applied to the most difficult cases, not just the simplest ones.
To protect sensitive financial data and maintain system control, Stripe implemented a specialized architecture featuring an LLM Proxy layer. This proxy sits between the agent service and the LLM, acting as a gateway for all requests and responses. The agent service hosts the core logic and connects to internal signal tools, while the proxy layer handles logging, filtering, and data masking. This separation ensures that sensitive information is scrubbed before it ever reaches the model, providing a critical layer of security that is essential for any financial institution operating under strict data residency and privacy laws.
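A proxy of this kind might look like the following sketch, which masks sensitive values and logs the exchange before anything is forwarded to the model. The `LLMProxy` class, the `mask` helper, and the two regular expressions are illustrative assumptions; the source does not describe how Stripe's proxy detects or redacts data, and production detectors would need to be much broader.

```python
import re
from typing import Callable

# Illustrative patterns only; real detection would cover many more data types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),  # 13 to 19 digits, optional separators
}


def mask(text: str) -> str:
    """Replace each match of a sensitive pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


class LLMProxy:
    """Gateway between the agent service and the model: masks, forwards, logs."""

    def __init__(self, send_to_model: Callable[[str], str]):
        self._send = send_to_model  # hypothetical callable wrapping the model API
        self.audit_log: list[dict[str, str]] = []

    def complete(self, prompt: str) -> str:
        masked = mask(prompt)  # scrub before the text leaves the proxy
        response = self._send(masked)
        # Only the masked prompt is stored, so the log holds no raw values.
        self.audit_log.append({"prompt": masked, "response": response})
        return response


if __name__ == "__main__":
    proxy = LLMProxy(send_to_model=lambda prompt: f"model saw: {prompt}")
    print(proxy.complete("Contact jane@example.com about card 4242 4242 4242 4242 today"))
```

Keeping the masking in the proxy rather than in the agent service means every request passes through the same filter, whichever tool or sub-task produced the prompt.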
By combining the reasoning power of ReAct, the structural discipline of DAGs, and the cost-efficiency of prompt caching, Stripe has created a blueprint for AI integration in highly regulated environments. The shift is not about replacing the analyst, but about eliminating the friction of data retrieval, allowing the human expert to focus entirely on the act of judgment.




