DiffusionBlocks: Sakana AI's Method to Slash Training Memory to 1/B

Every deep learning engineer has faced the same wall: the dreaded Out of Memory error. It usually happens at the most frustrating moment, just as a model begins to scale in depth or complexity. For years, the industry has treated VRAM as a finite resource to be managed through desperate measures, like reducing batch sizes or sacrificing model depth. The current arms race for H100 GPUs is driven largely by this physical limitation, where the cost of entry for training state-of-the-art models is determined not by the quality of the algorithm, but by the amount of available silicon. The community has long accepted that to build deeper networks, one must simply buy more memory.

The Memory Tax of Modern Optimizers

At the heart of this bottleneck is a hidden tax imposed by the tools we use to train models. When using the Adam optimizer, the memory requirement is not simply the size of the model parameters. Instead, training requires roughly four times the parameter size per layer, before any activations are counted. This is because the system must simultaneously store the parameters themselves, the gradients, and two distinct optimizer states: the first-moment estimate (momentum) and the second-moment estimate (variance). It is the computational equivalent of trying to edit a book by keeping four separate copies of every single page open on a desk at once. As models grow deeper, this memory occupancy increases linearly with the number of layers, eventually hitting a ceiling that even the most advanced GPUs cannot breach.
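The four-copy accounting can be made concrete with a few lines of arithmetic. The sketch below is only an illustration: the function name, the 1-billion-parameter example, and the 4-byte (fp32) value size are assumptions, and activation memory is deliberately left out.

```python
def adam_memory_breakdown(num_params, bytes_per_value=4):
    """Parameter-related training memory under Adam, excluding activations.

    Assumes every copy is stored at the same precision (fp32 by default).
    """
    per_copy = num_params * bytes_per_value
    return {
        "parameters": per_copy,
        "gradients": per_copy,
        "adam_first_moment": per_copy,   # momentum
        "adam_second_moment": per_copy,  # variance
        "total": 4 * per_copy,
    }


# Hypothetical example: a 1-billion-parameter model in fp32.
breakdown = adam_memory_breakdown(1_000_000_000)
for name, size in breakdown.items():
    print(f"{name}: {size / 1e9:.1f} GB")
```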

Existing solutions like activation checkpointing attempt to mitigate this by discarding intermediate values and recalculating them during the backward pass. While this reduces activation memory, it does nothing to address the fixed cost of the optimizer states and parameters. This leaves developers with a binary choice: invest in massive hardware clusters or compromise on the model's depth. Sakana AI, in collaboration with researchers from the University of Tokyo, decided to challenge the fundamental assumption that a network must be trained as a single, monolithic entity. They proposed DiffusionBlocks, a framework that divides a Transformer-based network into B independent blocks, theoretically reducing the total training memory requirement to 1/B.
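As a rough illustration of the 1/B claim, the sketch below compares the parameter-related memory of end-to-end training with that of training one block at a time. It assumes equally sized blocks, fp32 storage, and that inactive blocks can be kept off the accelerator entirely; the function name and the 1.2-billion-parameter figure are hypothetical.

```python
def peak_training_state_bytes(num_params, num_blocks=1, bytes_per_value=4):
    """Peak bytes for parameters, gradients, and both Adam states.

    Only one block is assumed to be resident at a time, so the peak
    is set by the parameter count of a single block.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    active_params = -(-num_params // num_blocks)  # ceiling division
    return 4 * active_params * bytes_per_value


full = peak_training_state_bytes(1_200_000_000)
blockwise = peak_training_state_bytes(1_200_000_000, num_blocks=3)
print(f"end-to-end: {full / 1e9:.1f} GB, one of 3 blocks: {blockwise / 1e9:.1f} GB")
```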

In practice, DiffusionBlocks ensures that only a single sampled block participates in the computation during each iteration. Rather than simply splitting the noise range into equal-width segments, the team implemented equi-probability partitioning. This method assigns each block exactly 1/B of the total probability mass of the noise-level distribution, which gives narrower intervals, and therefore more model capacity, to the intermediate noise levels that contribute most significantly to generation quality. The results are striking: DiffusionBlocks maintains performance within 1 percentage point of traditional end-to-end backpropagation while drastically lowering the hardware barrier.
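One way to picture equi-probability partitioning is to cut the noise range at the quantiles of the noise-level distribution. The sketch below assumes a log-normal distribution over noise levels, truncated to a fixed range; the distribution family, the default values (-1.2, 1.2, 0.002, 80.0), and the function name are illustrative assumptions, not details confirmed by the article.

```python
import math
from statistics import NormalDist


def equal_probability_boundaries(num_blocks, sigma_min=0.002, sigma_max=80.0,
                                 log_mean=-1.2, log_std=1.2):
    """Split [sigma_min, sigma_max] into blocks of equal probability mass.

    Assumes ln(sigma) is normally distributed (an assumption of this sketch)
    and truncated to the given range.
    """
    dist = NormalDist(mu=log_mean, sigma=log_std)
    lo = dist.cdf(math.log(sigma_min))
    hi = dist.cdf(math.log(sigma_max))
    boundaries = [sigma_min]
    for i in range(1, num_blocks):
        # Quantile that leaves i/num_blocks of the truncated mass below it.
        q = lo + (hi - lo) * i / num_blocks
        boundaries.append(math.exp(dist.inv_cdf(q)))
    boundaries.append(sigma_max)
    return boundaries


print([round(b, 4) for b in equal_probability_boundaries(3)])
```

Under these assumptions the middle block spans a much narrower range of noise levels than the outer two, which is the sense in which intermediate noise levels receive more capacity.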

Reimagining Residual Networks as Diffusion Processes

To understand why this works without collapsing the model's performance, one must look at the mathematical intersection of residual networks and score-based diffusion models. In a standard residual network, the update rule for layer ℓ is expressed as z_ℓ = z_{ℓ−1} + f_{θ_ℓ}(z_{ℓ−1}). The researchers observed that this structure has the same form as an Euler discretization step of the probability flow Ordinary Differential Equation (ODE) used in diffusion models. By interpreting the stacking of residual blocks not as a sequence of operations, but as a discretized denoising process moving between noise levels in [σ_min, σ_max], the team unlocked a new way to train.
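The correspondence can be written out in a few lines. The derivation below is a sketch under one stated assumption: it uses the common parameterization in which the noise level σ plays the role of time and a denoiser D_θ defines the probability flow ODE. The article does not say which parameterization the authors use, so the exact form of the right-hand side is illustrative.

```latex
\begin{align*}
% Probability flow ODE, with the noise level \sigma as the time variable
% (assumed parameterization; D_\theta is a denoiser)
\frac{\mathrm{d}z}{\mathrm{d}\sigma} &= \frac{z - D_\theta(z, \sigma)}{\sigma} \\[4pt]
% One explicit Euler step from \sigma_{\ell-1} to \sigma_\ell
z_\ell &= z_{\ell-1} + (\sigma_\ell - \sigma_{\ell-1})\,
          \frac{z_{\ell-1} - D_\theta(z_{\ell-1}, \sigma_{\ell-1})}{\sigma_{\ell-1}} \\[4pt]
% Naming the increment f_{\theta_\ell} recovers the residual update
f_{\theta_\ell}(z_{\ell-1}) &:= (\sigma_\ell - \sigma_{\ell-1})\,
          \frac{z_{\ell-1} - D_\theta(z_{\ell-1}, \sigma_{\ell-1})}{\sigma_{\ell-1}}
\quad\Longrightarrow\quad
z_\ell = z_{\ell-1} + f_{\theta_\ell}(z_{\ell-1})
\end{align*}
```

Read this way, each residual layer is one integration step, and a group of consecutive layers covers one contiguous interval of noise levels.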

This theoretical shift allows each block to be treated as an independent learning unit. Because score matching objectives can be optimized independently at specific noise levels, each block can pursue its own local objective function without requiring a global backpropagation pass through the entire network. Communication between blocks is entirely eliminated during training, and since only L/B layers are active at any given time, the memory footprint is slashed. This is a departure from previous attempts at local learning, such as Geoffrey Hinton's Forward-Forward algorithm, which struggled with global consistency. When applied to Vision Transformers on the CIFAR-100 dataset, the Forward-Forward approach yielded a meager 7.85% accuracy. DiffusionBlocks avoids this pitfall by using a mathematically grounded score matching objective that ensures local updates align with the global goal of the network.
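The shape of such a training loop can be sketched with a deliberately tiny toy. Everything here is an assumption made for illustration: each "block" is a single scalar weight w acting as a linear denoiser on 1-D Gaussian data, noise levels are drawn uniformly inside the block's interval, and plain SGD stands in for Adam. It is not the authors' implementation. What it does show is the structure described above: one block is sampled per step, it minimizes its own denoising loss, and no other block is read or written.

```python
import random


def train_blocks_independently(boundaries, steps=20000, lr=0.002, seed=0):
    """Toy block-wise denoising training with no cross-block communication."""
    rng = random.Random(seed)
    num_blocks = len(boundaries) - 1
    weights = [0.0] * num_blocks   # one toy parameter per block
    updates = [0] * num_blocks     # how often each block was trained
    for _ in range(steps):
        b = rng.randrange(num_blocks)  # only this block is active this step
        sigma = rng.uniform(boundaries[b], boundaries[b + 1])
        clean = rng.gauss(0.0, 1.0)
        noisy = clean + sigma * rng.gauss(0.0, 1.0)
        # Local objective: squared error between the denoised and clean value.
        error = weights[b] * noisy - clean
        weights[b] -= lr * 2.0 * error * noisy  # gradient step on block b only
        updates[b] += 1
    return weights, updates


# Hypothetical noise-level boundaries for three blocks.
weights, updates = train_blocks_independently([0.1, 0.5, 1.0, 2.0])
print("per-block weights:", [round(w, 3) for w in weights])
print("per-block update counts:", updates)
```

In a real setup the same structure means that only the active block's parameters, gradients, and optimizer states have to be resident on the accelerator during a step.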

The advantage of the equi-probability partitioning approach is evident in the benchmarks. When testing on the DiT-S/2 model, the equi-probability method achieved an FID of 38.03, significantly outperforming the 43.53 recorded by uniform partitioning (lower FID is better). This suggests that by aligning the block boundaries with the probabilistic characteristics of the data, the model can learn more effectively even with fragmented training. While other recent frameworks like NoProp have attempted backpropagation-free learning, they remained confined to simple classification tasks and custom CNN structures. DiffusionBlocks is the first methodology to combine continuous-time formulation with block-based learning in a way that is applicable to general Transformer architectures.

From Hardware Dependency to Architectural Efficiency

The practical implications of this shift extend beyond training memory and into the realm of inference efficiency. In a typical Diffusion Transformer (DiT) architecture, a model with 12 layers usually requires all 12 layers to be computed for every step. By applying DiffusionBlocks and dividing those 12 layers into 3 blocks, only 4 layers need to be activated per step during inference. This results in a 3x reduction in inference computation, allowing for faster response times or higher throughput on the same hardware.
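The inference saving follows from routing each denoising step to the one block that owns the current noise level. The sketch below illustrates that routing under assumptions: the boundary values, the function name, and the convention that block 0 covers the lowest-noise interval are all hypothetical, and the real ordering of blocks may differ.

```python
from bisect import bisect_right


def layers_for_step(sigma, boundaries, num_layers):
    """Return the indices of the layers that run at noise level sigma.

    boundaries has num_blocks + 1 increasing entries; the layers are
    assumed to be split evenly across blocks.
    """
    num_blocks = len(boundaries) - 1
    layers_per_block = num_layers // num_blocks
    # Find the interval containing sigma, clamped to a valid block index.
    block = min(max(bisect_right(boundaries, sigma) - 1, 0), num_blocks - 1)
    start = block * layers_per_block
    return list(range(start, start + layers_per_block))


# Hypothetical boundaries for 3 blocks over a 12-layer model.
boundaries = [0.002, 0.18, 0.5, 80.0]
for sigma in (0.01, 0.3, 5.0):
    print(sigma, layers_for_step(sigma, boundaries, num_layers=12))
```

Each step touches 4 of the 12 layers, which is where the 3x reduction in per-step computation comes from.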

Training efficiency sees an even more dramatic improvement. Previous models like Huginn relied on stochastic recursion, where a block of 4 layers was repeated an average of 32 times, requiring 8 steps of Truncated Backpropagation Through Time (BPTT). DiffusionBlocks replaces this complex recursion with a single forward pass. Although the number of training epochs tripled from 5 to 15, the 32-fold reduction in repetitions still cut total computation by approximately 10 times (32 / 3 ≈ 10.7). This represents a massive reduction in the time and electricity required to bring a model to convergence.
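The tenfold figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes that cost scales with the number of passes through the block multiplied by the number of epochs, and ignores the extra cost of truncated BPTT; the function name is hypothetical.

```python
def relative_training_cost(passes_per_sample, epochs):
    """Cost in arbitrary units: block passes per sample times epochs."""
    return passes_per_sample * epochs


recurrent = relative_training_cost(passes_per_sample=32, epochs=5)
blockwise = relative_training_cost(passes_per_sample=1, epochs=15)
ratio = recurrent / blockwise
print(f"recurrent: {recurrent}, block-wise: {blockwise}, ratio: {ratio:.2f}")
```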

Crucially, these gains do not come at the cost of quality. In experiments using the OpenWebText dataset, DiffusionBlocks achieved a MAUVE score of 0.82, nearly matching the 0.85 baseline. Similarly, when tested against Llama-2, the generation perplexity was 14.99, almost identical to the baseline of 15.05. For developers operating in memory-constrained environments or deploying real-time services, this provides a viable path to maintain high-fidelity output without the need for an endless supply of H100s.

This transition marks a strategic pivot in AI development. For too long, the industry has focused on scaling through brute-force hardware acquisition. DiffusionBlocks suggests that the next leap in AI capability will not come from larger clusters, but from architectures that treat memory as a dynamic variable rather than a fixed constraint. By decoupling the depth of a model from its memory footprint, Sakana AI has provided a blueprint for a more democratic and efficient era of model training.
