The modern AI developer is often trapped in a frustrating paradox known as the specialist's dilemma. You spend weeks curating a pristine dataset of proprietary medical records or legal briefs and fine-tune a powerful large language model to master that specific domain. The result is often a model that can recite obscure case law or diagnose rare pathologies with startling accuracy, but can no longer follow a simple instruction or maintain a coherent conversation. This is catastrophic forgetting: training on a narrow domain overwrites the general reasoning capabilities that made the model useful in the first place.
The Three-Stage Architecture of Nova Forge
Amazon Nova Forge addresses this instability by replacing haphazard fine-tuning with a structured, three-stage custom pipeline designed to layer knowledge without erasing it. The process begins with Continued Pre-training (CPT), which focuses on the ingestion of unlabeled text. This stage is critical when a base model lacks the fundamental vocabulary or domain-specific jargon of a particular industry. By exposing the model to raw domain data, CPT builds the necessary linguistic foundation before any specific tasks are assigned.
Once the model understands the language of the domain, it moves into Supervised Fine-Tuning (SFT). Unlike CPT, SFT relies on demonstration data: pairs of prompts and ideal responses. This stage teaches the model how to behave, transforming it from a text-completion engine into a functional assistant that knows how to execute specific tasks within its new area of expertise. The final layer is Reinforcement Fine-Tuning (RFT), which uses reward signals to polish the model's output. RFT works by generating multiple candidate responses to a single prompt and scoring them against quality benchmarks, reinforcing the behaviors that lead to the most accurate and helpful answers.
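The candidate-scoring step of RFT can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `reward` function is a toy substring check, the candidate strings are made-up data, and subtracting the group mean is one common way to turn scores into a training signal, not necessarily what Nova Forge does internally.

```python
from statistics import mean


def reward(candidate: str, reference: str) -> float:
    """Hypothetical reward: 1.0 if the candidate contains the reference answer."""
    return 1.0 if reference.lower() in candidate.lower() else 0.0


def score_candidates(candidates: list[str], reference: str) -> list[tuple[str, float, float]]:
    """Score every candidate for one prompt and compare it to the group average.

    Returns (candidate, reward, advantage) tuples. A positive advantage marks a
    response that beat the group average and would be reinforced; a negative
    one marks a response that would be discouraged.
    """
    rewards = [reward(c, reference) for c in candidates]
    baseline = mean(rewards)
    return [(c, r, r - baseline) for c, r in zip(candidates, rewards)]


# Toy candidates for a single prompt (illustrative data only).
candidates = [
    "The filing deadline is two years.",
    "It is three years.",
    "Two years from the date of the incident.",
    "I am not sure.",
]
scored = score_candidates(candidates, reference="two years")
for text, r, adv in scored:
    print(f"reward={r:.1f} advantage={adv:+.2f} {text}")
```

In a real pipeline the reward would come from a task-specific grader rather than a substring match, but the shape of the loop (many candidates, one score each, relative comparison) stays the same.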
To support this pipeline, Amazon integrates the workflow into the SageMaker ecosystem. Developers can choose from three distinct infrastructure tiers based on their scale. SageMaker Serverless provides a UI-driven experience with automated compute provisioning for those who want to avoid infrastructure management. SageMaker AI training jobs (SMTJ) offer a fully managed experience for standard training runs without the need for cluster orchestration. For the most demanding scenarios involving massive distributed training, Amazon SageMaker HyperPod provides the specialized, high-performance environment required to handle the computational load of large-scale model adaptation.
The Balancing Act of Data Mixing and Checkpoints
The real technical breakthrough in Nova Forge is not just the sequence of training, but the mechanism used to prevent the model from collapsing into a narrow specialist. This is achieved through data mixing, a strategy where proprietary domain data is blended with curated general-purpose datasets provided by Amazon Nova. By maintaining a significant portion of general data during the training process, the model is forced to keep its general reasoning and instruction-following circuits active while it absorbs new specialized knowledge. This prevents the weights of the model from shifting too drastically toward a single domain, effectively anchoring the AI's general intelligence.
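As a rough sketch of the idea, data mixing amounts to drawing each training example from one of two pools at a fixed ratio. The function name `mix_datasets`, the `general_fraction` parameter, and the 25% ratio in the usage below are illustrative assumptions, not Nova Forge settings or defaults.

```python
import random


def mix_datasets(domain, general, general_fraction, total, seed=0):
    """Build a shuffled training mix with a fixed share of general-purpose data.

    domain, general: lists of training examples (any type).
    general_fraction: share of the mix drawn from the general pool, in [0, 1].
    total: number of examples in the resulting mix.
    """
    if not 0.0 <= general_fraction <= 1.0:
        raise ValueError("general_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    n_general = round(total * general_fraction)
    n_domain = total - n_general
    # Sample with replacement so small pools can still fill their quota.
    mixed = [("general", rng.choice(general)) for _ in range(n_general)]
    mixed += [("domain", rng.choice(domain)) for _ in range(n_domain)]
    rng.shuffle(mixed)  # interleave sources so batches see both kinds of data
    return mixed


mix = mix_datasets(
    domain=["clinical note A", "clinical note B"],
    general=["general chat 1", "general chat 2", "general chat 3"],
    general_fraction=0.25,
    total=100,
)
print(sum(1 for source, _ in mix if source == "general"), "general examples out of", len(mix))
```

Shuffling matters here: if all general examples arrived in one block, the model would still spend long stretches training on domain data alone.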
This stability is further managed through the precise control of the learning rate. In the context of data mixing, the learning rate becomes an incredibly sensitive lever. If the rate is set too high, the model suffers from overshooting, where it bypasses the optimal weight configuration and rapidly loses its base capabilities. If set too low, the model converges too slowly, wasting expensive compute resources. Because the interaction between curated data and user data is so volatile, Nova Forge provides calibrated service defaults. These defaults serve as a stabilized starting point, discouraging developers from making arbitrary adjustments that often lead to training instability.
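The overshooting and slow-convergence regimes can be seen on a toy problem. The sketch below runs plain gradient descent on a one-dimensional quadratic loss, which is a deliberate simplification: real fine-tuning uses adaptive optimizers and schedules, and the learning rates shown are chosen for the toy loss, not for any Nova model.

```python
def final_distance(lr, steps=50, curvature=1.0, w0=1.0):
    """Run gradient descent on loss(w) = 0.5 * curvature * w**2.

    The optimum is w = 0, so the returned value is the distance from the
    optimum after `steps` updates.
    """
    w = w0
    for _ in range(steps):
        grad = curvature * w  # derivative of the quadratic loss
        w -= lr * grad        # each update multiplies w by (1 - lr * curvature)
    return abs(w)


too_high = final_distance(lr=2.5)    # |1 - 2.5| > 1: every step moves further away
too_low = final_distance(lr=0.001)   # barely moves from the starting point
balanced = final_distance(lr=0.5)    # contracts quickly toward the optimum

print(f"too high: {too_high:.3e}")
print(f"too low:  {too_low:.3f}")
print(f"balanced: {balanced:.3e}")
```

On this loss, any learning rate above 2 divided by the curvature diverges. Data mixing changes the effective curvature the optimizer sees, which is one reason a calibrated default is a safer starting point than a guess.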
Strategic flexibility is also introduced through the choice of checkpoints. Developers must decide between using a pre-trained checkpoint or a post-trained checkpoint. A pre-trained checkpoint is a raw model that has not yet undergone instruction tuning. It offers the highest level of flexibility for those who wish to completely overhaul the model for a specific domain, but it comes with a catch: because it has never been instruction-tuned, the model does not reliably follow instructions and must undergo SFT to gain conversational utility. Conversely, a post-trained checkpoint is already aligned and conversational. This is the ideal choice for developers with smaller datasets or those using efficient techniques like Low-Rank Adaptation (LoRA), as it preserves the model's existing alignment while allowing for rapid performance gains.
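The trade-off above can be written down as a small decision helper. This is only a restatement of the guidance in this article as code: `CheckpointPlan`, `plan_checkpoint`, and the three boolean inputs are hypothetical names, and what counts as a "small" dataset is left to the caller because the article gives no threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointPlan:
    checkpoint: str                       # "pre-trained" or "post-trained"
    sft_required_for_instructions: bool   # True if SFT is needed before the model can chat


def plan_checkpoint(full_domain_overhaul: bool, uses_lora: bool, small_dataset: bool) -> CheckpointPlan:
    """Pick a starting checkpoint following the heuristics described in the text."""
    # Small datasets and LoRA benefit from keeping the existing alignment.
    if uses_lora or small_dataset or not full_domain_overhaul:
        return CheckpointPlan("post-trained", sft_required_for_instructions=False)
    # A full overhaul starts from the raw model, which then needs SFT
    # before it behaves like an assistant.
    return CheckpointPlan("pre-trained", sft_required_for_instructions=True)


print(plan_checkpoint(full_domain_overhaul=True, uses_lora=False, small_dataset=False))
print(plan_checkpoint(full_domain_overhaul=False, uses_lora=True, small_dataset=True))
```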
Finally, the effectiveness of RFT is gated by the model's baseline accuracy. RFT is not a tool for teaching a model a new skill from scratch; rather, it is a tool for refining a skill the model already possesses. If a model's baseline accuracy on a task is too low, almost none of the candidate responses it generates will earn a reward, so there is little for the reward function to reinforce and RFT makes no progress. In such cases, the developer must return to SFT to build a foundation of competence. Once the model can consistently produce a correct answer, RFT can then be used to optimize the quality and reliability of those answers through a carefully designed reward function.
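A back-of-the-envelope calculation shows why baseline accuracy gates RFT. Assuming each sampled candidate is independently correct with probability equal to the baseline accuracy (a simplification), the chance that a prompt yields at least one rewardable candidate is 1 - (1 - p)^k. The candidate count of 8 and the `min_signal` threshold of 0.5 are illustrative assumptions, not Nova Forge parameters.

```python
def p_any_correct(baseline_accuracy: float, num_candidates: int) -> float:
    """Probability that at least one of k independent candidates is correct."""
    return 1.0 - (1.0 - baseline_accuracy) ** num_candidates


def ready_for_rft(baseline_accuracy: float, num_candidates: int = 8, min_signal: float = 0.5) -> bool:
    """Hypothetical gate: run RFT only if most prompts are likely to yield a reward."""
    return p_any_correct(baseline_accuracy, num_candidates) >= min_signal


for accuracy in (0.01, 0.10, 0.30):
    signal = p_any_correct(accuracy, 8)
    verdict = "RFT" if ready_for_rft(accuracy) else "back to SFT"
    print(f"baseline={accuracy:.2f} p(any correct of 8)={signal:.3f} -> {verdict}")
```

At 1% baseline accuracy, more than nine prompts in ten produce no correct candidate at all, so the reward function has almost nothing to reinforce; at 30%, nearly every prompt does.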
Success in domain-specific AI is no longer about how much data you can cram into a model, but about how precisely you can balance expertise with general intelligence through metric-driven LLMOps.




