Enterprise developers are currently hitting a wall where the need for rapid LLM iteration crashes directly into the rigid requirements of corporate data governance. For a full-stack engineer tasked with fine-tuning a large language model on proprietary internal data, the process often becomes a battle of permissions. Every time a secure data management tool is connected to an external training service, the authorization logic tends to break, or the provenance of the training data vanishes into a black box. This friction creates a dangerous choice for teams: either sacrifice security for the sake of model performance or stall innovation to satisfy compliance audits.
The Architecture of Governed Fine-Tuning
To resolve this tension, a new integrated workflow combines Databricks Unity Catalog with Amazon SageMaker AI to create a secure pipeline for model optimization. Databricks Unity Catalog serves as the central nervous system for data and metadata permissions, while Amazon SageMaker AI handles the heavy lifting of building, training, and deploying the machine learning models. The technical objective is to ensure that even when data resides in Amazon S3, the training process never bypasses the granular permission models defined within Unity Catalog.
This architecture introduces EMR Serverless into the preprocessing stage to ensure that data lineage remains intact throughout the pipeline. By using a serverless approach to big data processing, the system avoids the overhead of cluster management while maintaining a strict audit trail of how raw data is transformed into training sets. The specific model targeted for this workflow is Ministral-3-3B-Instruct, a lightweight language model developed by Mistral AI. Once the fine-tuning process is complete, the resulting model artifacts are not left adrift in a storage bucket but are registered back into Unity Catalog for centralized versioning and access control.
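The preprocessing step can be sketched as an EMR Serverless job submission through boto3. This is a minimal illustration, not the notebook's actual code: the application ID, role ARN, S3 paths, and Spark settings below are placeholders.

```python
# Sketch: submit a PySpark preprocessing job to EMR Serverless via boto3.
# All identifiers (application ID, role ARN, S3 paths) are illustrative placeholders.

def build_job_run_request(application_id: str, role_arn: str,
                          script_uri: str, log_uri: str) -> dict:
    """Assemble the start_job_run payload for a Spark preprocessing script."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                "sparkSubmitParameters": "--conf spark.executor.memory=4g",
            }
        },
        "configurationOverrides": {
            # Keeping job logs in S3 preserves an audit trail of each transformation run.
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": log_uri}
            }
        },
    }

# Live submission (requires boto3 and AWS credentials):
#   import boto3
#   emr = boto3.client("emr-serverless")
#   emr.start_job_run(**build_job_run_request(
#       "your-application-id",
#       "arn:aws:iam::123456789012:role/EMRServerlessRuntimeRole",
#       "s3://your-bucket-name/scripts/preprocess.py",
#       "s3://your-bucket-name/logs/"))
```

Separating the payload builder from the API call keeps the request shape testable without AWS credentials.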
The Shift from Storage-Centric to Governance-Centric Access
Historically, the integration between data lakes and training services suffered from a fundamental security flaw. SageMaker AI training jobs typically accessed S3 objects directly, which effectively bypassed the Unity Catalog permission layer. This created a visibility gap where organizations could not definitively prove which specific version of a dataset trained which version of a model. In highly regulated industries, this lack of transparency represents a critical compliance risk that can prevent a model from ever reaching production.
The current implementation pivots away from this direct-access model by utilizing the Unity Catalog Open REST API and AWS Secrets Manager. Instead of relying on broad S3 bucket policies, the system uses OAuth credentials stored securely in AWS Secrets Manager to authenticate a Service Principal. This Service Principal is granted delegated authority to access the data, ensuring that every request is validated against the central governance policy. The entire lifecycle, from the initial preprocessing of data to the final registration of the model, now operates under a single, unified governance umbrella.
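The credential flow described above can be sketched as follows. The `/oidc/v1/token` path is the Databricks OAuth machine-to-machine token endpoint; the secret's JSON keys (`client_id`, `client_secret`) and all ARNs are assumptions about how the secret is laid out, not values from the original notebook.

```python
# Sketch: fetch the Service Principal's OAuth credentials from AWS Secrets Manager
# and exchange them for a workspace access token (Databricks OAuth M2M flow).
# The secret's JSON key names are an assumption.
import base64
import json
import urllib.parse
import urllib.request

def token_request(workspace_url: str, client_id: str, client_secret: str):
    """Build the client-credentials token request for the Databricks workspace."""
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        f"{workspace_url}/oidc/v1/token",
        data=body,
        headers={"Authorization": f"Basic {basic}",
                 "Content-Type": "application/x-www-form-urlencoded"},
    )

def get_secret_json(secret_arn: str, region: str) -> dict:
    """Read the Service Principal credentials stored in AWS Secrets Manager."""
    import boto3  # imported lazily: only needed when actually calling AWS
    sm = boto3.client("secretsmanager", region_name=region)
    return json.loads(sm.get_secret_value(SecretId=secret_arn)["SecretString"])

# Live usage (requires AWS credentials and a real workspace):
#   creds = get_secret_json("arn:aws:secretsmanager:us-east-1:123456789012:secret:your-secret", "us-east-1")
#   req = token_request("https://your-workspace.cloud.databricks.com",
#                       creds["client_id"], creds["client_secret"])
#   token = json.load(urllib.request.urlopen(req))["access_token"]
```

Because no long-lived token ever leaves Secrets Manager, each training run authenticates freshly against the workspace, and every data request it makes is evaluated against the central governance policy.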
For developers implementing this setup, the primary benefit is the automation of security configurations and the ability to track data lineage without manual logging. The complete implementation logic is available in the LLM_Finetunig_SageMaker_AI_Unity_Catalog.ipynb notebook. The setup requires creating a Service Principal in Databricks, issuing an OAuth Secret, and storing that secret in AWS Secrets Manager. To facilitate the communication between services, the IAM roles for EMR Serverless and SageMaker AI must be configured with specific permissions.
Example IAM policy for the EMR Serverless runtime role (note that `s3:ListBucket` applies to the bucket itself, while `s3:GetObject` applies to the objects inside it):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```
Example IAM policy for the SageMaker AI execution role, allowing it to retrieve the OAuth secret:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": ["arn:aws:secretsmanager:region:account-id:secret:your-secret"]
    }
  ]
}
```
To demonstrate the efficacy of this pipeline, the workflow utilizes the SEC EDGAR database, specifically extracting 10-K and 10-Q reports from S&P 500 companies for the 2023-2024 period. The process employs the Databricks SDK to access Unity Catalog tables for preprocessing, and the final trained model artifacts are delivered back to a Unity Catalog-managed S3 bucket. This ensures that the financial data, which is subject to strict handling rules, is never exposed to unauthorized processes during the tuning phase.
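Accessing a governed table through the Databricks SDK can be sketched as below. The catalog, schema, and table names are illustrative placeholders, not the notebook's actual identifiers.

```python
# Sketch: address a Unity Catalog table by its three-level name and look it up
# with the Databricks SDK. Catalog/schema/table names are placeholders.

def full_table_name(catalog: str, schema: str, table: str) -> str:
    """Unity Catalog addresses every table as catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"

# Live usage (requires databricks-sdk and the Service Principal's OAuth credentials):
#   from databricks.sdk import WorkspaceClient
#   w = WorkspaceClient(host="https://your-workspace.cloud.databricks.com",
#                       client_id="<sp-client-id>", client_secret="<sp-oauth-secret>")
#   info = w.tables.get(full_table_name("finance", "sec_filings", "reports_10k"))
```

Because the lookup goes through the workspace client rather than raw S3 paths, every read is checked and logged against the catalog's permission model.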
This integration establishes a standardized path for organizations to leverage state-of-the-art LLM fine-tuning without compromising their data governance posture.




