NVFP4: How NVIDIA Pre-trained a 12B Model on 10 Trillion Tokens

Late nights in the data center control room are often defined by a singular, stressful visual: the red spikes of memory utilization graphs. As engineers push the boundaries of Large Language Models, the struggle to reclaim even a single percent of VRAM becomes an obsession. To avoid the dreaded Out of Memory error, teams spend weeks shrinking batch sizes and fracturing pipelines, desperately searching for a numerical representation that can hold the intelligence of a model without choking the hardware. In an era where even 8-bit precision feels heavy, memory bottlenecks are no longer just technical hurdles; they are direct taxes on training speed and capital expenditure.

NVIDIA is attempting to break this cycle with NVFP4 (NVIDIA FP4), a 4-bit training methodology designed to move low-precision learning from the laboratory to the production floor. Historically, 4-bit precision was considered too volatile for pre-training because its narrow numerical range frequently led to divergence, where the loss blows up and the training process collapses. However, by leveraging the FP4 hardware acceleration of Blackwell Tensor Cores, NVIDIA has pre-trained a 12-billion parameter model on a staggering 10 trillion tokens. This represents the longest 4-bit pre-training run publicly documented to date, and it suggests that precision can be halved relative to FP8 without sacrificing the capabilities of the resulting model.

The 10 Trillion Token Benchmark and the Nemotron-Nano Architecture

The scale of 10 trillion tokens is the primary metric NVIDIA uses to validate the stability of NVFP4. In the world of LLM development, tokens are the fundamental units of learning, and maintaining stability over such a vast dataset is an immense challenge. While FP8 (8-bit floating point) has become the industry standard for efficiency, dropping to 4-bit has long been a forbidden frontier due to quantization error. This error occurs when a high-precision number is forced into a low-precision format, losing critical data in the process. Over trillions of tokens, these tiny errors accumulate like a snowball, eventually triggering a total collapse of model performance.

To test this, NVIDIA utilized the Nemotron-Nano-12B-v2-Base architecture. This is not a standard transformer but a sophisticated hybrid model consisting of 62 blocks. The architecture blends 6 self-attention layers for deep contextual understanding, 28 Feed-Forward Networks (FFN) for feature extraction, and 28 Mamba-2 layers to maintain linear time complexity. This design essentially pairs a meticulous analyst with a high-speed reader, allowing the model to process massive sequences efficiently. The technical triumph here is not just the 4-bit precision, but the fact that NVFP4 remained stable across this hybrid architecture, where different mathematical operations typically react differently to quantization.

The resulting performance metrics suggest that the trade-off for this efficiency is nearly zero. In the MMLU-Pro 5-shot test, which measures professional-grade knowledge, the NVFP4-trained model achieved an accuracy of 62.58%. When compared to the FP8 baseline model's score of 62.62%, the difference is a negligible 0.04 percentage points. By utilizing the NVIDIA Transformer Engine to optimize the interaction between software and hardware, NVIDIA has demonstrated that the intelligence of a model is not tied to the bit-width of its training, provided the scaling is handled correctly.

The Engineering of E2M1 and the 36% Efficiency Gap

The core of NVFP4 lies in the E2M1 format, a restrictive numerical space consisting of 1 sign bit, 2 exponent bits, and 1 mantissa bit. This configuration allows for only eight magnitudes, each available with either sign: 0, 0.5, 1, 1.5, 2, 3, 4, and 6. That is 16 bit patterns in total, or 15 distinct values once +0 and -0 are counted together. Using E2M1 is like trying to measure a complex architectural blueprint with a ruler that only has a few coarse markings; the gap between the actual value and the representable value is enormous. This is where the quantization error becomes lethal, as the model struggles to represent the subtle gradients necessary for learning.
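The value set follows directly from the bit layout. The sketch below decodes all 4-bit codes, assuming the usual E2M1 convention of an exponent bias of 1 and a single subnormal step; the function name `decode_e2m1` is illustrative and not part of any NVIDIA API.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code laid out as sign | 2 exponent bits | 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the only magnitudes are 0 and 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 and an exponent bias of 1 (assumed convention).
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Codes 0..7 have the sign bit clear, so they enumerate the magnitudes.
E2M1_MAGNITUDES = [decode_e2m1(code) for code in range(8)]
print(E2M1_MAGNITUDES)
```

The spacing between neighbors doubles from 0.5 to 1 to 2 as the magnitude grows, which is why values near the top of the range are represented most coarsely.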

To solve this, NVIDIA redesigned the microscaling logic used by MXFP4. They first reduced the block size from 32 elements to 16, meaning a single scale factor now covers half as many numbers and has to accommodate a narrower spread of values. More importantly, they changed the block scale factor format from UE8M0 to E4M3. UE8M0 can only express powers of two, so the scale often has to jump to the next power of two, which can leave up to a factor of two of the FP4 range unused. E4M3 carries three mantissa bits and allows far more granular adjustments. This enables the system to map the absolute maximum value of a block, known as amax, much closer to 6, the maximum representable value of the 4-bit format, making better use of the few available values.
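The effect of the scale format can be illustrated with a small sketch. It makes two simplifying assumptions that are not taken from NVIDIA's implementation: the power-of-two scale is rounded up so that amax never overflows, and the E4M3-style scale is modeled as "keep three mantissa bits" without its range limits. All function names are hypothetical.

```python
import math

FP4_MAX = 6.0  # largest E2M1 magnitude


def pow2_scale(amax: float) -> float:
    """UE8M0-style scale: a pure power of two, rounded up so amax does not overflow."""
    return 2.0 ** math.ceil(math.log2(amax / FP4_MAX))


def e4m3_like_scale(amax: float) -> float:
    """E4M3-style scale: the ideal scale amax / 6 rounded to three mantissa bits."""
    ideal = amax / FP4_MAX
    exponent = math.floor(math.log2(ideal))
    step = 2.0 ** (exponent - 3)  # spacing of 3-mantissa-bit values in this binade
    return round(ideal / step) * step


def scaled_amax(amax: float, scale_fn) -> float:
    """Where the block maximum lands on the FP4 axis after scaling, before saturation at 6."""
    return amax / scale_fn(amax)


for amax in (0.9, 1.3, 2.5, 5.0):
    print(amax, scaled_amax(amax, pow2_scale), scaled_amax(amax, e4m3_like_scale))
```

Under these assumptions the power-of-two scale places amax anywhere between 3 and 6, while the finer scale keeps it within a few percent of 6. A result slightly above 6 simply saturates to 6 when the element is rounded.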

This precision is further reinforced by a two-stage scaling hierarchy. The system first establishes a global numerical range using an FP32 (32-bit floating point) scale at the tensor level, then applies the E4M3 block scale for fine-grained local adjustments. This is analogous to using a city map to find a neighborhood and then a detailed street map to find a specific door. Because of this hierarchy, at least one value in every 16-element block, the amax, is encoded with close to FP8 precision. That is at least 6.25% (1 in 16) of all values. With these anchor points preserved, the remaining values are stored in FP4, and the model still tracks 8-bit training closely.
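A minimal NumPy sketch of the two-stage idea follows. It simulates the formats in float64 rather than using real FP4 or FP8 storage, and the helper names, the tie-breaking in `round_to_e2m1`, and the choice to map the tensor amax onto 6 x 448 are assumptions made for illustration, not a description of Transformer Engine internals.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX, E4M3_MAX, BLOCK = 6.0, 448.0, 16


def round_to_e4m3(x):
    """Round non-negative values to the nearest E4M3-representable number."""
    x = np.clip(x, 0.0, E4M3_MAX)
    # Below 2**-6 the format is subnormal, so the exponent is clamped there.
    exponent = np.floor(np.log2(np.maximum(x, 2.0 ** -6)))
    step = 2.0 ** (exponent - 3)  # three mantissa bits
    return np.round(x / step) * step


def round_to_e2m1(x):
    """Round to the nearest E2M1 value, saturating at +/-6 (ties go to the smaller magnitude)."""
    mag = np.clip(np.abs(x), 0.0, FP4_MAX)
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx]


def nvfp4_fake_quantize(tensor):
    """Quantize and immediately dequantize a 1D tensor whose length is a multiple of 16."""
    tensor = np.asarray(tensor, dtype=np.float64)
    tensor_amax = np.abs(tensor).max()
    if tensor_amax == 0.0:
        return np.zeros_like(tensor)
    blocks = tensor.reshape(-1, BLOCK)
    # Stage 1: one FP32 scale per tensor maps the global amax onto 6 * 448.
    global_scale = tensor_amax / (FP4_MAX * E4M3_MAX)
    # Stage 2: one E4M3 scale per block maps that block's amax close to 6.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = round_to_e4m3(block_amax / FP4_MAX / global_scale)
    decode = block_scale * global_scale
    safe = np.where(decode == 0.0, 1.0, decode)  # avoid dividing by a zero scale
    codes = round_to_e2m1(blocks / safe)
    return (codes * decode).reshape(tensor.shape)


x = np.random.default_rng(0).standard_normal(64)
x_hat = nvfp4_fake_quantize(x)
print(np.abs(x_hat - x).max())
```

The global scale exists so that every block scale fits inside the limited E4M3 range; without it, tensors with very large or very small magnitudes would push the block scales out of range.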

This architectural shift creates a massive efficiency advantage over previous standards like MXFP4. In experiments with an 8B parameter model, NVFP4 reached a specific loss target using only 1 trillion tokens. In contrast, MXFP4 required 1.36 trillion tokens to reach the same performance level. This means MXFP4 carries a 36% token overhead, requiring significantly more data and compute time to achieve the same result. Seen from the other side, NVFP4 needs roughly 26% fewer tokens than MXFP4 (1/1.36 ≈ 0.74) to hit the same loss. In an industry where data acquisition and GPU hours cost millions of dollars, that gap is a decisive competitive advantage.
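The two ways of quoting this gap are easy to confuse, so here is the arithmetic spelled out using only the token counts given above. A 36% overhead for MXFP4 is not the same as a 36% saving for NVFP4.

```python
nvfp4_tokens = 1.00e12  # tokens NVFP4 needed to reach the loss target
mxfp4_tokens = 1.36e12  # tokens MXFP4 needed to reach the same target

# Extra tokens MXFP4 needs, relative to NVFP4.
overhead = mxfp4_tokens / nvfp4_tokens - 1.0
# Tokens NVFP4 saves, relative to MXFP4.
saving = 1.0 - nvfp4_tokens / mxfp4_tokens

print(f"MXFP4 overhead: {overhead:.1%}, NVFP4 saving: {saving:.1%}")
```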

Hardware Acceleration and the Four Pillars of Stability

The hardware implications of NVFP4 are most evident in the Blackwell architecture. FP4 GEMM (General Matrix Multiply) throughput on the GB200 is four times that of BF16 and roughly twice that of FP8. The upcoming GB300 pushes this further, offering a six-fold speed increase over BF16 and a three-fold increase over FP8. When combined with a footprint for the quantized tensors that is roughly half that of FP8, developers can either increase their batch sizes or fit significantly larger models into the same hardware footprint.
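The memory claim can be checked with back-of-the-envelope arithmetic. The sketch below counts storage per element including the per-block scale factor, using the block sizes and 8-bit scale formats described earlier. It ignores the single FP32 scale per tensor, and it says nothing about master weights or optimizer state, which are outside the scope of this estimate.

```python
def bits_per_element(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage per element once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size


nvfp4_bits = bits_per_element(4, 8, 16)  # 4-bit values, one E4M3 scale per 16 elements
mxfp4_bits = bits_per_element(4, 8, 32)  # 4-bit values, one UE8M0 scale per 32 elements
fp8_bits = 8.0                           # scale overhead ignored for the baseline

print(nvfp4_bits, mxfp4_bits, nvfp4_bits / fp8_bits)
```

The smaller block size costs NVFP4 a quarter of a bit per element compared with MXFP4, which is the price paid for the tighter scaling.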

However, raw speed is useless if the model diverges. To prevent this, NVIDIA implemented four specific safety mechanisms. First, they used selective high precision, keeping the first two and last eight blocks of the 62-block model in BF16. This ensures that the sensitive input and output stages of the network operate with maximum resolution, while the bulk of the internal processing is handled in 4-bit. The BF16 portion accounts for roughly 16% of the blocks (10 of 62), acting as a stabilizer for the entire pipeline.
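As a rough illustration, the selection rule can be written as a per-block precision plan. The function name and the string labels are hypothetical; only the counts (2 leading, 8 trailing, 62 total) come from the text.

```python
NUM_BLOCKS = 62


def block_precision(index: int, num_blocks: int = NUM_BLOCKS,
                    first_bf16: int = 2, last_bf16: int = 8) -> str:
    """Return the precision used for the block at the given 0-based index."""
    if index < first_bf16 or index >= num_blocks - last_bf16:
        return "bf16"
    return "nvfp4"


plan = [block_precision(i) for i in range(NUM_BLOCKS)]
bf16_share = plan.count("bf16") / NUM_BLOCKS
print(plan.count("bf16"), f"{bf16_share:.1%}")
```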

Second, they introduced Random Hadamard Transforms (RHT) to manage weight gradients. By applying a 16x16 Hadamard matrix combined with a random ±1 sign vector to the inputs of the weight gradient GEMM, NVIDIA spreads extreme outliers across neighboring elements, leaving a distribution that is closer to Gaussian and easier to quantize. Because the transform is orthogonal and applied to both operands, it cancels out in the matrix product. This prevents a single massive value from dominating a block's scale factor and destabilizing the weight update.

Third, they moved to 2D block scaling for weights. With 1D scaling (1x16), the forward pass and the backward pass group the weight tensor along different dimensions, because backpropagation uses the transposed weights. The same weight therefore ends up with two different quantized values, which breaks the chain rule. By using 16x16 blocks, the quantized weights remain identical regardless of whether the model is in a forward or backward pass.
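Both ideas can be demonstrated in a few lines of NumPy. This is a sketch under stated assumptions: the Hadamard matrix is built with the Sylvester construction, the random signs are applied per row, and `scale_map` is a hypothetical helper that reports the block amax each element would be scaled by. None of these names come from NVIDIA's code.

```python
import numpy as np


def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def random_hadamard(n, rng):
    """Hadamard matrix with a random +/-1 sign per row; the result is still orthogonal."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs[:, None] * hadamard(n)


def scale_map(w, block_shape):
    """For every element, the amax of the block it belongs to."""
    r, c = block_shape
    rows, cols = w.shape
    amax = np.abs(w).reshape(rows // r, r, cols // c, c).max(axis=(1, 3))
    return np.repeat(np.repeat(amax, r, axis=0), c, axis=1)


rng = np.random.default_rng(0)

# RHT: a single outlier of 100 is spread into 16 entries of magnitude 25.
rht = random_hadamard(16, rng)
spike = np.zeros(16)
spike[0] = 100.0
spread = spike @ rht

# The transform cancels in a product over the shared (contraction) dimension.
a = rng.standard_normal((16, 4))
b = rng.standard_normal((16, 5))
same_product = np.allclose((rht.T @ a).T @ (rht.T @ b), a.T @ b)

# 2D scaling: 16x16 blocks give W and W.T the same scales; 1x16 blocks do not.
w = rng.standard_normal((32, 32))
consistent_2d = np.array_equal(scale_map(w, (16, 16)), scale_map(w.T, (16, 16)).T)
consistent_1d = np.array_equal(scale_map(w, (1, 16)), scale_map(w.T, (1, 16)).T)
print(np.abs(spread).max(), same_product, consistent_2d, consistent_1d)
```

After the transform the block amax drops from 100 to 25, so the shared scale factor no longer crushes the other 15 elements toward zero. The square blocks are symmetric under transposition, which is why the forward and backward passes see the same scales.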

Finally, NVIDIA applied stochastic rounding specifically to the gradient tensors. Unlike round-to-nearest, which can introduce a systematic bias over time, stochastic rounding picks one of the two neighboring representable values at random, with the closer neighbor being proportionally more likely. For example, on an integer grid a value of 0.7 has a 70% chance of rounding to 1 and a 30% chance of rounding to 0, so the rounded result still averages 0.7. This prevents the accumulation of rounding errors that would otherwise degrade the model's intelligence. Notably, this is only applied to gradients, as applying it to the forward pass tensors was found to decrease performance.
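The same rule can be sketched on the E2M1 grid itself, where 0.7 sits between 0.5 and 1 rather than between 0 and 1. The function below is an illustrative NumPy version that assumes the input has already been scaled into the FP4 range; it is not taken from any NVIDIA library.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def stochastic_round_e2m1(x, rng):
    """Round each value to one of its two E2M1 neighbors, unbiased in expectation."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.clip(np.abs(x), 0.0, E2M1_GRID[-1])
    # Index of the upper neighbor; clipped so a lower neighbor always exists.
    hi_idx = np.clip(np.searchsorted(E2M1_GRID, mag, side="left"), 1, len(E2M1_GRID) - 1)
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    # Probability of rounding up grows linearly with the distance from the lower neighbor.
    p_up = (mag - lo) / (hi - lo)
    rounded = np.where(rng.random(mag.shape) < p_up, hi, lo)
    return np.sign(x) * rounded


rng = np.random.default_rng(0)
samples = stochastic_round_e2m1(np.full(100_000, 0.7), rng)
print(np.unique(samples), samples.mean())
```

Round-to-nearest would map every 0.7 to 0.5 and lose 0.2 each time. Here 0.7 rounds to 1 with probability 0.4 and to 0.5 with probability 0.6, so the sample mean stays close to 0.7.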

These combined innovations transform 4-bit training from a theoretical curiosity into a production-ready tool. By offsetting the inherent loss of precision with mathematical transforms and strategic high-precision anchors, NVIDIA has paved the way for a new era of infrastructure optimization where the bottleneck is no longer the bit-width, but the imagination of the architect.
