Imagine staring at a scanned page from a decades-old graduation yearbook. You see a chaotic array of 176 printed names and four distinct portrait photos, but there is no underlying digital structure to tell you which name belongs to which face. For years, this has been the wall that automated document digitization hits. The lack of machine-readable associations turns a simple archival task into a grueling manual matching exercise. This week, a new architectural approach using a two-stage pipeline of Amazon Nova 2 Lite and Claude Sonnet 4.6 has demonstrated that this gap can be closed with high precision and significantly lower overhead.

The Architecture of High-Confidence Extraction

The technical challenge was tested against a dataset of 336 scanned pages. The goal was to extract and verify name-face relationships without relying on pre-existing metadata. By deploying a sequential pipeline on Amazon Bedrock, the development team successfully extracted 3,122 name-face relationships. The results were striking, with 93% of these matches recording a high-confidence score of 0.95 or higher. This performance was achieved while reducing the cost per page by approximately 33% compared to using a single, massive vision-language model for the entire end-to-end process.

The pipeline splits the workload into two distinct phases: high-volume extraction and high-order reasoning. In the first stage, Amazon Nova 2 Lite handles the heavy lifting. Through a single Converse API call, the model simultaneously detects photos, extracts bounding boxes to define object coordinates, and captures visible names and page-level metadata. Nova 2 Lite outputs this data as a structured JSON object, categorizing each image as a person, a group, or a snapshot, accompanied by a brief description.

Once the raw coordinates and text are extracted, the process moves to the second stage where Claude Sonnet 4.6 takes over. Claude performs the spatial reasoning necessary to link the names to the faces. Because both Nova 2 Lite and Claude share an identical coordinate system ranging from 0 to 1000, Claude can ingest Nova's output directly. This shared spatial language eliminates the need for complex coordinate transformation layers, streamlining the data flow and reducing the likelihood of alignment errors during the matching process.

The Economic Shift in Vision Pipelines

The real breakthrough in this pipeline is not just the accuracy, but the aggressive optimization of compute and cost. The team discovered that for structured extraction tasks, the reasoning depth of the model had a diminishing return. By setting the reasoning_config of Amazon Nova 2 Lite to LOW, they maintained the same level of accuracy as the MEDIUM or HIGH settings while minimizing the cost per token.

Beyond the reasoning configuration, the team implemented a strict token optimization strategy. Rather than forcing Nova 2 Lite to perform a full OCR scan of every token on the page, the prompt was restricted to extract only the names immediately surrounding the detected photos. This surgical approach to extraction compressed the output tokens from an estimated 4,500 per page down to approximately 1,000. This reduction does more than just save money; it filters out noise, ensuring that Claude receives only the most relevant spatial data for its reasoning phase.

Perhaps the most significant operational shift comes from Nova 2 Lite's fixed pricing model. Traditionally, vision models charge based on image resolution or file size, which creates unpredictable costs and forces developers to implement resolution normalization. This pre-processing step, designed to shrink images to a standard size to save costs, often introduces image degradation or loss of fine detail. Because Nova 2 Lite applies a fixed fee per image or document page regardless of resolution, the normalization step was completely removed from the pipeline. This allows the system to process original, high-resolution files without increasing costs or risking the quality of the source material.

To handle the inherent variability of document layouts, the pipeline leverages the adaptive thinking capabilities of Claude Sonnet 4.6. By configuring the thinking type to adaptive, the model dynamically adjusts its internal reasoning depth based on the complexity of the input.

{

"thinking": {

"type": "adaptive"

}

}

This adaptive mechanism allows the model to switch gears instantly. When encountering a simple grid where eight names are neatly aligned above their respective photos, Claude employs minimal reasoning to produce a rapid response. However, when faced with a complex layout—such as three group photos sharing a single, sprawling caption block—the model triggers a deeper, step-by-step spatial analysis. Across the 336-page test, the reasoning traces varied from 544 to 1,658 characters, proving that the model was actively calculating vertical offsets and column alignments to ensure accuracy. This removes the need for developers to write separate prompts for different layout types or manually manage token budgets for complex pages.

Full implementation details, including the source code, sample images, and Jupyter notebooks, are available via the AWS Samples GitHub repository.

This shift toward specialized, multi-model pipelines suggests a future where the brute-force application of a single large model is replaced by coordinated agents that optimize for both cost and cognitive depth.