Imagine a compliance officer tasked with scrubbing sensitive data from a digital archive that has grown unchecked since 2015. The archive contains 400 million documents: a chaotic mix of PDFs, scanned images, and unstructured forms. In a traditional banking environment, this means thousands of hours of manual masking, with employees painstakingly locating Social Security numbers and account details and blacking them out. The risk of human error is high, and the timeline for completion is measured in years. This was the wall Huntington Bank hit as it prepared for its 2025 proactive compliance initiatives.

The Architecture of Massive Scale

To tackle a dataset of this magnitude, Huntington Bank moved away from sequential processing and built a scalable redaction workflow on Amazon Textract, Amazon SageMaker, AWS Step Functions, and AWS Lambda. The first hurdle was the physical movement of data. The bank used AWS Direct Connect and AWS DataSync to migrate 400 million documents from on-premises file servers to Amazon S3 buckets. To keep the data secure during this transition, the bank deployed DataSync agents within its own data centers to read from the SMB file shares. DataSync encrypts traffic in transit with TLS, and the bank applied AWS KMS (Key Management Service) keys to encrypt the data at rest. The setup was also bidirectional, so processed documents could be synced back to on-premises storage once redaction was complete.

Once the data landed in S3, the identification phase began. The bank used Amazon Textract to parse the unstructured documents. Unlike simple OCR, Textract extracts not just text but also the geometry of tables and forms. The critical output here was not the text itself but the JSON metadata containing the bounding-box coordinates of detected sensitive fields, such as Social Security numbers and personal addresses. These coordinates provided the map required for the subsequent redaction phase, removing the need for human eyes to locate the sensitive data.
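To make the idea of a coordinate map concrete, here is a minimal sketch of how sensitive fields might be located in Textract-style output. The `Blocks`, `BlockType`, `Text`, `Confidence`, and `Geometry.BoundingBox` keys follow the shape of a Textract response, but the sample data, the `find_sensitive_boxes` helper, and the SSN pattern are illustrative assumptions, not the bank's actual code.

```python
import re

# Assumed SSN format (123-45-6789); a real pipeline would need more patterns.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_sensitive_boxes(textract_response):
    """Return page, confidence, and bounding box for LINE blocks containing an SSN-like string."""
    hits = []
    for block in textract_response.get("Blocks", []):
        # Only LINE blocks are inspected here; WORD blocks would give tighter boxes.
        if block.get("BlockType") != "LINE":
            continue
        if SSN_PATTERN.search(block.get("Text", "")):
            hits.append({
                "page": block.get("Page", 1),
                "confidence": block["Confidence"],
                "box": block["Geometry"]["BoundingBox"],
            })
    return hits


# Hand-written sample shaped like a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "LINE", "Page": 1, "Text": "Customer: Jane Doe", "Confidence": 99.1,
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.10, "Width": 0.30, "Height": 0.02}}},
        {"BlockType": "LINE", "Page": 1, "Text": "SSN: 123-45-6789", "Confidence": 97.4,
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.14, "Width": 0.25, "Height": 0.02}}},
    ]
}

boxes = find_sensitive_boxes(sample_response)
```

The bounding-box values are ratios of the page width and height, so they must be scaled to the page's real dimensions before anything is drawn over them.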

Overcoming the API Bottleneck

While Amazon Textract provided the intelligence, the sheer volume of 400 million documents created a massive orchestration problem. Calling the API directly at that volume would have been throttled almost immediately. The turning point for the project was the Distributed Map state in AWS Step Functions. This feature allowed the bank to take massive input collections (essentially lists of S3 document paths) and split them into smaller batches processed by thousands of parallel child workflow executions.
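As a rough illustration of what such a state can look like, the sketch below builds an Amazon States Language definition as a Python dictionary. The field names (`ItemReader`, `ItemBatcher`, `MaxConcurrency`, `ItemProcessor`, `ToleratedFailurePercentage`) are standard Distributed Map fields, while the bucket, prefix, function name, batch size, and concurrency values are placeholders chosen for this example, not figures from the project.

```python
import json


def build_distributed_map_definition(bucket, prefix, function_name,
                                     batch_size=100, max_concurrency=1000):
    """Build a state machine definition that fans S3 objects out to a Lambda function."""
    return {
        "Comment": "Illustrative Distributed Map over an S3 prefix",
        "StartAt": "RedactDocuments",
        "States": {
            "RedactDocuments": {
                "Type": "Map",
                # Read the item list directly from S3 instead of passing it as input.
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:listObjectsV2",
                    "Parameters": {"Bucket": bucket, "Prefix": prefix},
                },
                # Group object keys so each child execution handles a batch.
                "ItemBatcher": {"MaxItemsPerBatch": batch_size},
                # Cap on how many child executions run at once.
                "MaxConcurrency": max_concurrency,
                # Let the run continue if a small share of items fail.
                "ToleratedFailurePercentage": 1,
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                    "StartAt": "ProcessBatch",
                    "States": {
                        "ProcessBatch": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


# Placeholder names for illustration only.
definition = build_distributed_map_definition(
    bucket="example-archive-bucket",
    prefix="incoming/",
    function_name="example-redaction-function",
)
asl_json = json.dumps(definition, indent=2)
```

`MaxConcurrency` is the main lever here: it bounds how hard the child executions can push on downstream services such as Textract.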

Even with Distributed Map, the bank ran into the hard limits imposed by AWS service quotas. To sustain throughput, the team requested quota increases through the AWS Service Quotas console and built a real-time monitoring system using Amazon CloudWatch. By tracking response times, throttling events, and error rates, the engineers could dynamically adjust the concurrency limits of child workflows. This created a feedback loop in which the system ran close to the ceiling of each service quota without tipping the pipeline into collapse.
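The source does not describe the control policy behind this feedback loop. One common way to build such a loop is additive-increase, multiplicative-decrease: raise concurrency in small steps while throttling stays low, and cut it sharply when throttling spikes. The sketch below assumes that policy; `next_concurrency` and all of its thresholds are hypothetical, and the throttle rates are hard-coded where a real system would read them from CloudWatch metrics.

```python
def next_concurrency(current, throttle_rate, *, floor=10, ceiling=1000,
                     max_throttle_rate=0.01, step=25, backoff=0.5):
    """Pick the next concurrency limit from the observed share of throttled calls."""
    if throttle_rate > max_throttle_rate:
        # Too many throttled calls: back off multiplicatively, but not below the floor.
        return max(floor, int(current * backoff))
    # Healthy window: probe upward by a fixed step, but not above the ceiling.
    return min(ceiling, current + step)


# Simulated throttle rates for four monitoring windows (made-up values).
observed_throttle_rates = [0.0, 0.0, 0.05, 0.0]

concurrency = 200
history = []
for rate in observed_throttle_rates:
    concurrency = next_concurrency(concurrency, rate)
    history.append(concurrency)
```

In this run the limit climbs from 200 to 225 and 250, halves to 125 after the throttled window, then resumes climbing to 150.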

For the actual removal of data, the bank opted for a hybrid approach. Instead of relying on purely ML-based redaction, they combined the coordinates provided by Textract with regular expression (regex) patterns. These were fed into open-source Python libraries, specifically `PyMuPDF` and `PIL` (the Python Imaging Library, today distributed as Pillow), to physically overwrite the sensitive areas of the documents. To meet a 95% accuracy target, the bank relied on Textract's confidence scores. Any document where the model's confidence fell below a predefined threshold was automatically routed to a separate manual verification workflow. This ensured that the vast majority of the 400 million documents were handled autonomously, while high-risk outliers received human oversight.
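Two small steps sit between a detection and an overwritten region: scaling the normalized bounding box to the page's pixel grid, and deciding whether the document is safe to process automatically. The following sketch shows both using only the standard library. The threshold value, the padding, and the helper names are assumptions for illustration, since the source does not state the actual cut-off.

```python
import math

# Placeholder cut-off; the source does not state the real value.
CONFIDENCE_THRESHOLD = 90.0


def to_pixel_rect(box, page_width, page_height, padding=2):
    """Scale a normalized box (ratios of page size) to a padded pixel rectangle."""
    left = max(0, math.floor(box["Left"] * page_width) - padding)
    top = max(0, math.floor(box["Top"] * page_height) - padding)
    right = min(page_width, math.ceil((box["Left"] + box["Width"]) * page_width) + padding)
    bottom = min(page_height, math.ceil((box["Top"] + box["Height"]) * page_height) + padding)
    return (left, top, right, bottom)


def route_document(detections, threshold=CONFIDENCE_THRESHOLD):
    """Return 'auto' only if every detection meets the threshold, else 'manual_review'."""
    # A document with no detections has nothing to redact and is treated as 'auto'.
    if all(d["confidence"] >= threshold for d in detections):
        return "auto"
    return "manual_review"


# Made-up detections for a 1000 x 800 pixel page image.
detections = [
    {"confidence": 98.2,
     "box": {"Left": 0.125, "Top": 0.25, "Width": 0.25, "Height": 0.03125}},
]
rect = to_pixel_rect(detections[0]["box"], page_width=1000, page_height=800)
decision = route_document(detections)
low_confidence_decision = route_document([{"confidence": 71.0, "box": detections[0]["box"]}])
```

For a scanned image, a rectangle like `rect` could then be filled with solid black using Pillow's `ImageDraw.rectangle`. The small padding is a design choice that guards against characters poking out past the detected box.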

By shifting the focus from individual API performance to systemic orchestration, Huntington Bank achieved a processing capacity of approximately 10 million documents per day. A project originally estimated to take years was compressed into a few months. The cost of processing the entire repository also dropped to roughly 5% of the initial estimates, a result of using serverless orchestration to eliminate idle compute. The bank now views this framework as a repeatable blueprint for future large-scale data integration and redaction tasks, particularly during corporate mergers and acquisitions.
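A quick back-of-envelope check puts these figures in perspective. It assumes, purely for illustration, that the quoted daily capacity were sustained around the clock for the whole run.

```python
TOTAL_DOCUMENTS = 400_000_000   # archive size quoted in the article
DOCS_PER_DAY = 10_000_000       # approximate peak capacity quoted in the article
SECONDS_PER_DAY = 24 * 60 * 60

# Days needed for one full pass at the quoted capacity.
days_at_peak = TOTAL_DOCUMENTS / DOCS_PER_DAY

# Average documents per second implied by the daily figure.
docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY

print(f"{days_at_peak:.0f} days, about {docs_per_second:.0f} documents per second")
```

At that rate a full pass takes about 40 days of pure processing, which fits inside a timeline of a few months once migration, ramp-up, and manual review are presumably accounted for.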

The success of this deployment suggests that the primary challenge of enterprise AI is rarely the model's accuracy; more often it is the orchestration of the data pipeline surrounding it.