How Amazon Bedrock Automates Data Extraction for Millions of Scanned PDFs

The modern enterprise is often a graveyard of legacy data, where critical information is trapped inside millions of scanned PDF files. For a legal clerk or a land registry officer, the daily routine involves staring at a non-searchable image of a contract, manually typing dates and names into a database, and correcting typos in a tedious cycle of repetition. When this scale reaches hundreds of millions of documents, human intervention becomes a mathematical impossibility. The bottleneck is no longer the speed of the typist, but the sheer volume of unstructured imagery that defies traditional text-based search. This is the exact friction point that Amazon Bedrock aims to eliminate by transforming static images into standardized, queryable data through a sophisticated automated pipeline.

The Architecture of Large-Scale Document Intelligence

To tackle the challenge of extracting data from hundreds of millions of land lease agreements, the architecture relies on a combination of Amazon Bedrock and Amazon Bedrock Prompt Management. The primary goal is to convert image-based PDFs, which lack a text layer, into structured JSON strings that can be stored in a NoSQL environment. The pipeline is designed to handle two distinct operational needs: immediate, single-document verification and massive, asynchronous backlogs. This is achieved through two separate inference paths—On-demand and Batch—each optimized for a different set of constraints.

For urgent requests, the On-demand pipeline utilizes an SQS FIFO (First-In, First-Out) queue to ensure that documents are processed in the exact order they are received. Once a message hits the queue, an AWS Lambda function is triggered to download the PDF from Amazon S3. Because the documents are scanned images, the system converts the PDF pages into PNG images, allowing the multimodal LLM to perceive the visual layout of the document. The Lambda function then fetches the appropriate prompt from Bedrock Prompt Management and executes the inference. The resulting structured data is written directly to Amazon DynamoDB.

Developers can trigger this process using the AWS CLI or SDKs with a command such as:

bash

aws sqs send-message --queue-url <QUEUE_URL> --message-body file://message_txt.txt

The message body contains a JSON object specifying the document ID, the LLM model ID, and the specific prompt ID and version required for that document type. To ensure reliability, the Lambda function only deletes the message from the SQS queue after Bedrock has successfully returned the extraction results.

When the volume shifts from a few documents to millions, the system switches to the Batch inference pipeline. This path is triggered by an Amazon EventBridge Scheduler, which invokes a Lambda function to poll an SQS Standard queue. Unlike the FIFO queue, SQS Standard is optimized for massive throughput, making it ideal for processing hundreds of millions of records where strict ordering is less critical than raw speed. The Lambda function aggregates these requests into JSONL (JSON Lines) artifacts, which are stored in S3.

There are specific technical constraints governing this batch process. Amazon Bedrock requires a minimum of 100 records to initiate a batch inference job. Furthermore, since a single batch job can only utilize one model, the Lambda function employs a polling mechanism to identify the most frequently requested model ID in the queue and sets it as the primary model for that batch. Once the Bedrock Batch Inference Job completes, it deposits the results back into an S3 output bucket. An EventBridge rule detects this completion and triggers a post-processing Lambda to parse the JSONL files and update Amazon DynamoDB.

The Tension Between Latency, Cost, and Model Constraints

The decision to split the pipeline into On-demand and Batch paths is not merely a convenience but a response to the fundamental trade-off between latency and operational cost. In the On-demand path, the priority is the user experience. A practitioner cannot wait for a batch window to open just to verify a single lease agreement. By using SQS FIFO and immediate Lambda execution, the system provides results in seconds. However, this comes at a higher cost per invocation and requires a more rigid delivery guarantee to avoid duplicate processing.

In contrast, the Batch pipeline accepts higher latency in exchange for significant cost optimization. By processing data asynchronously, the system avoids the overhead of individual API calls and leverages the efficiency of bulk inference. The shift from SQS FIFO to SQS Standard introduces the possibility of duplicate messages, a tension that is resolved by implementing idempotency logic within the Lambda function to ignore already-processed requests. This architectural choice allows the organization to clear a backlog of millions of documents without incurring the prohibitive costs of real-time inference.

Beyond the pipeline path, the system must navigate the physical limits of multimodal LLMs. For instance, the Claude 4 Sonnet model used in this pipeline can only accept a maximum of 20 images per single multimodal call. For documents exceeding 20 pages, the system implements a chunking strategy. The document is split into 20-page segments, and each segment is processed as a separate request. To maintain data integrity, each chunk is recorded in DynamoDB with a `doc_id`, `chunk_count`, and `chunk_id`. This allows the system to reconstruct the full document analysis from fragmented pieces during the final data aggregation phase.

The most critical insight, however, lies in the use of Amazon Bedrock Prompt Management. In a real-world dataset of land leases, no two documents are identical. Some use numbered lists, some use complex grids, and others rely on handwritten annotations. A single, static prompt would inevitably fail across such diversity. Bedrock Prompt Management allows the system to maintain up to 50 different prompts per region, with up to 10 versions for each prompt. By specifying the prompt ID and version within the SQS message, the pipeline dynamically assigns the most effective instruction set to each specific document type. This means that when an extraction error is discovered in a specific lease format, engineers can update the prompt version in the management console without redeploying a single line of code or interrupting the pipeline.

This decoupling of the prompt from the application logic transforms the pipeline from a rigid tool into an evolving intelligence engine. The infrastructure is deployed via CloudFormation stacks, ensuring that the entire environment—from the SQS queues to the DynamoDB tables—can be replicated across regions with absolute consistency. By combining the raw power of multimodal LLMs with a dual-path orchestration layer and dynamic prompt versioning, the system effectively bridges the gap between legacy analog archives and modern digital intelligence.

This framework proves that the challenge of legacy digital transformation is no longer about the ability to read text, but about the ability to orchestrate the flow of data at scale.

How Amazon Bedrock Automates Data Extraction for Millions of Scanned PDFs

The Architecture of Large-Scale Document Intelligence

The Tension Between Latency, Cost, and Model Constraints

Related Articles