Monday morning for a cloud operations lead managing over 50 AWS accounts usually begins with a deluge of notifications. The inbox is a chaotic stream of Amazon Linux 2 end-of-support warnings, RDS version deprecations, and urgent EC2 instance replacement notices. In a sprawling multi-account environment, the primary challenge is not receiving the alerts but working out which ones actually threaten production stability and which can be scheduled for next quarter. For years, the usual answer was to wait for a Technical Account Manager (TAM) to provide a curated interpretation of the noise. This dependency created a systemic bottleneck: critical infrastructure decisions were delayed by the availability of a single human expert, and the engineering team's focus shifted from innovation to reactive firefighting.

The Three-Tier Architecture of the Chaplin Framework

To break this dependency, AWS has open-sourced Chaplin, the Customer Health and Planned Lifecycle Intelligence Nexus. This AI agent turns the manual analysis of health events into a self-service capability by using Amazon Bedrock and the Model Context Protocol (MCP). The system is built on a three-layer architecture that centralizes fragmented data before passing it to an LLM. The data layer begins with AWS Health, whose events Amazon EventBridge routes in near real time to AWS Lambda functions. These functions use cross-account IAM roles to aggregate events from dozens of accounts into an Amazon S3 data lake, partitioned by account, date, and event type. To make this data queryable, S3 event notifications trigger a second Lambda function that loads the JSON-formatted health events into Amazon DynamoDB.
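The partitioning step can be illustrated with a minimal sketch. It assumes the Lambda function receives the standard EventBridge envelope (`account`, `time`, `detail`) with the AWS Health fields `eventTypeCode` and `eventArn`. The `health-events/` prefix, the `key=value` folder naming, and the function name are hypothetical choices for illustration, not the layout used by the Chaplin repository.

```python
from datetime import datetime, timezone


def s3_key_for_health_event(event: dict) -> str:
    """Build a partitioned S3 object key for one AWS Health event.

    Assumed layout (illustrative only):
    health-events/account=<id>/date=<YYYY-MM-DD>/event_type=<code>/<event id>.json
    """
    detail = event["detail"]
    # EventBridge timestamps are ISO 8601 in UTC, e.g. "2024-03-05T08:15:00Z".
    ts = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    # Use the last path segment of the event ARN as the object name.
    event_id = detail["eventArn"].rsplit("/", 1)[-1]
    return (
        f"health-events/account={event['account']}"
        f"/date={ts:%Y-%m-%d}"
        f"/event_type={detail['eventTypeCode']}"
        f"/{event_id}.json"
    )
```

Keys of this shape keep events for one account, day, and event type under a common prefix, which is what makes the later load into DynamoDB and any ad hoc inspection of the data lake straightforward.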

The middle layer acts as the translation engine, exposing the DynamoDB data through an MCP server. MCP is an open standard that lets AI models interact with external data and tools in a consistent way. Because metadata such as event type, severity, and account ID is indexed in DynamoDB, the MCP server can offer a standardized interface through which the AI agent requests specific subsets of data, without the operator writing database queries by hand. The final client layer is where the user interacts with the system, using MCP-compatible assistants such as Claude Code or the Kiro CLI. When an operator asks a natural language question, the assistant calls the MCP server to retrieve the filtered data from DynamoDB and presents the answer in the terminal or chat interface. The full deployment guidelines and source code are available in the Chaplin AWS Health Agentic Assistant repository.
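To make the middle layer concrete, here is a rough sketch of what a single tool on such a server might look like. The descriptor follows the general shape MCP uses for tools (a name, a description, and a JSON Schema for inputs). The tool name `query_health_events`, the handler, and the in-memory `table` standing in for DynamoDB are all assumptions for illustration. Server registration and transport are deliberately omitted.

```python
import json

# Hypothetical tool descriptor: name, description, and a JSON Schema for inputs.
QUERY_EVENTS_TOOL = {
    "name": "query_health_events",
    "description": "Return AWS Health events matching optional metadata filters.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "event_type": {"type": "string"},
            "severity": {"type": "string"},
            "account_id": {"type": "string"},
        },
    },
}


def handle_query_health_events(arguments: dict, table: list[dict]) -> str:
    """Filter rows by exact match on the requested metadata fields.

    `table` is an in-memory stand-in for the DynamoDB table.
    """
    allowed = set(QUERY_EVENTS_TOOL["inputSchema"]["properties"])
    unknown = set(arguments) - allowed
    if unknown:
        # Reject filters the schema does not declare instead of guessing.
        raise ValueError(f"unsupported filter(s): {sorted(unknown)}")
    matches = [
        row for row in table
        if all(row.get(key) == value for key, value in arguments.items())
    ]
    # Tool results travel back to the assistant as text, so serialize to JSON.
    return json.dumps({"count": len(matches), "events": matches})
```

The point of the schema is that the assistant can only ask for filters the server has declared, which keeps the interface predictable regardless of which MCP client is in use.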

Solving the RAG Hallucination Problem with Structured Queries

While the architecture provides the plumbing, the real technical contribution of Chaplin is how it handles numerical accuracy. Many AI implementations rely on Retrieval-Augmented Generation (RAG) backed by vector similarity search. Vector search is excellent for finding conceptually related text, but it is unreliable for numerical aggregation: it returns only the closest matches rather than every matching record, so the model never sees the full set it is asked to count. In a production environment, an AI that reports 190 health events instead of 958 is not just a glitch; it is a business risk that can lead to under-provisioning of resources or missed deadlines. The failure arises because standard RAG combines probabilistic token prediction with similarity-based retrieval, a pairing that often breaks down when precise counting or filtering is required.
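The undercount can be reproduced with a toy example. The sketch below replaces real vector search with a simple term-overlap ranking, which is an assumption made to keep the example self-contained. The retrieval limit `k=190` and the synthetic events are chosen only to mirror the figures above. Whatever the ranking method, a top-k retriever can never hand the model more than k records.

```python
def top_k_retrieve(events: list[dict], query_terms: set[str], k: int) -> list[dict]:
    """Toy stand-in for similarity search: rank by term overlap, keep the best k."""
    def score(event: dict) -> int:
        return len(query_terms & set(event["text"].lower().split()))
    return sorted(events, key=score, reverse=True)[:k]


# Synthetic data: 958 EC2 events and 42 unrelated RDS events.
events = [
    {"service": "EC2", "text": f"EC2 instance retirement scheduled for host {i}"}
    for i in range(958)
]
events += [
    {"service": "RDS", "text": f"RDS engine version deprecation notice {i}"}
    for i in range(42)
]

retrieved = top_k_retrieve(events, {"ec2", "retirement"}, k=190)

# Counting what the model can see in its context window: capped at k.
count_from_context = sum(e["service"] == "EC2" for e in retrieved)  # 190
# Counting with a deterministic filter over the full dataset.
count_from_filter = sum(e["service"] == "EC2" for e in events)  # 958
```

Every retrieved record here is relevant, so the retrieval is "correct" by similarity standards, yet any count derived from it is wrong by a factor of five.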

Chaplin solves this by implementing a Natural Language to Structured Query Agent. Instead of converting the entire health event dataset into vectors, the system treats the data as two distinct types: structured metadata (service names, timestamps, severity) and unstructured text (problem descriptions, recommended actions). When a user asks for a summary of EC2 replacement events in production accounts, the agent does not perform a similarity search. Instead, it translates the natural language intent into a deterministic DynamoDB filter using the database schema. It maps the request directly to fields like `event_type` and `affected_accounts`, forcing the database to perform the actual calculation and filtering.
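As a hedged illustration of this translation step, the sketch below turns an already-parsed intent into DynamoDB Scan parameters. The `intent` dictionary (as an LLM might emit it), the function name, and the assumption that `affected_accounts` is stored as a list or set attribute are all hypothetical. Only the field names `event_type` and `affected_accounts` come from the description above. No AWS call is made; the function only builds the request.

```python
def build_scan_kwargs(intent: dict) -> dict:
    """Translate a parsed intent into DynamoDB Scan parameters.

    Expected intent keys (all optional, all hypothetical):
      event_type  -> exact match on the event_type attribute
      account_ids -> match if affected_accounts contains any listed account
    """
    clauses: list[str] = []
    names: dict[str, str] = {}
    values: dict[str, str] = {}

    if "event_type" in intent:
        clauses.append("#et = :et")
        names["#et"] = "event_type"
        values[":et"] = intent["event_type"]

    account_ids = intent.get("account_ids", [])
    if account_ids:
        names["#aa"] = "affected_accounts"
        parts = []
        for i, account_id in enumerate(account_ids):
            placeholder = f":a{i}"
            values[placeholder] = account_id
            parts.append(f"contains(#aa, {placeholder})")
        clauses.append("(" + " OR ".join(parts) + ")")

    # Select=COUNT asks DynamoDB to return the number of matches, not the items.
    kwargs: dict = {"Select": "COUNT"}
    if clauses:
        kwargs["FilterExpression"] = " AND ".join(clauses)
        kwargs["ExpressionAttributeNames"] = names
        kwargs["ExpressionAttributeValues"] = values
    return kwargs
```

User-supplied values only ever appear as placeholder values, never inside the expression string, so the filter stays deterministic and cannot be reshaped by the wording of the question. Note that a Scan is paginated, so a real caller would sum the per-page counts until no `LastEvaluatedKey` is returned.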

By delegating the math to the database and the synthesis to the LLM, Chaplin removes the main source of numerical hallucination. The Claude model on Amazon Bedrock is used only to interpret and summarize the result set returned by the query. Because the counts are computed by the database rather than generated by the model, they are exact for whatever filter was run, and inference costs drop because the LLM no longer processes large volumes of irrelevant raw text. The result is a system that offers the reliability of a database query with the accessibility of a chat interface.
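The division of labor can be sketched in a few lines. In this illustrative example, all numbers are computed in ordinary code before any model is involved, and the prompt carries only those precomputed figures. The function name, the `service` field, and the prompt wording are assumptions. The call to the model itself is left out.

```python
from collections import Counter


def build_summary_prompt(question: str, rows: list[dict]) -> str:
    """Compute the figures deterministically, then ask the model only to phrase them."""
    by_service = Counter(row["service"] for row in rows)
    lines = [f"- {service}: {count}" for service, count in sorted(by_service.items())]
    return (
        f"Question: {question}\n"
        f"Total matching events: {len(rows)}\n"
        "Events per service:\n"
        + "\n".join(lines)
        + "\nSummarize these figures for an operator. Do not change any number."
    )
```

Because the prompt contains a handful of aggregate lines instead of every raw event, its size stays roughly constant as the dataset grows, which is where the inference cost saving comes from.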

This operational shift allows teams to move beyond simple alert monitoring and toward business impact analysis. By integrating resource tags, environment classifications, and ownership metadata, Chaplin can identify not just that an RDS version is expiring, but exactly which core business module is affected and who owns the resource. This intelligence can be piped directly into tracking and collaboration tools such as Jira, GitHub, or ServiceNow, automatically turning an AI-generated insight into a tracked engineering ticket. The system is also designed to be model-agnostic. It defaults to Claude via Amazon Bedrock, but the architecture also supports OpenAI models or local models served through Ollama; the local option makes it viable for air-gapped environments where data privacy is paramount. By combining rule-based pre-processing for common patterns with a structured query agent for complex analysis, Chaplin transforms the role of the cloud operator from a notification manager to a strategic orchestrator.
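The enrichment idea can be sketched as a small function that joins a health event with resource metadata and produces a generic ticket payload. The tag keys (`owner`, `environment`, `business_module`), the payload fields, and the priority rule are all hypothetical. This is not the request format of Jira, GitHub, or ServiceNow, each of which would need its own mapping.

```python
def build_ticket(event: dict, resource_tags: dict[str, dict]) -> dict:
    """Combine a health event with tag metadata into a generic ticket payload."""
    tags = resource_tags.get(event["resource_id"], {})
    environment = tags.get("environment", "unknown")
    module = tags.get("business_module", "unmapped")
    return {
        "title": f"[{environment}] {event['event_type']} on {event['resource_id']}",
        "assignee": tags.get("owner", "unassigned"),
        "labels": sorted({environment, module}),
        # Illustrative rule: production resources are escalated.
        "priority": "high" if environment == "production" else "normal",
    }
```

For example, an RDS deprecation event on a resource tagged `environment=production` and `owner=payments-team` would yield a high-priority ticket assigned to that team, while an untagged resource falls back to an unassigned, normal-priority ticket that flags the gap in ownership metadata.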