Modern large language models carry a hidden inefficiency that wastes billions of GPU cycles every second. For any given input, a significant fraction of a network's neurons produce zero activations, contributing nothing to the final output. Imagine a corporate headquarters where one out of every three employees is absent from their desk, yet management continues to send them emails, wait for their replies, and allocate office space as if they were present. This is precisely how current GPU architectures handle sparsity: they process every single value, zeros included, consuming power and time on calculations that cannot change the result.

The Architecture of TwELL and Sparse Induction

To eliminate this systemic waste, researchers from Sakana AI and NVIDIA have developed TwELL, a Tile-wise ELLPACK data format accompanied by a dedicated CUDA kernel. The optimization targets the feed-forward network (FFN) layers of the LLM. These layers are the heaviest part of the model, accounting for more than two-thirds of the total parameters and over 80% of the floating-point operations (FLOPs) in large-scale deployments. By targeting the FFN, the team attacked the primary bottleneck of LLM compute.
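
As a rough sanity check on that parameter share, consider a typical gated Transformer layer; the dimensions below are illustrative, not taken from the paper:

```python
# Back-of-the-envelope parameter split for one Transformer layer,
# assuming a gated FFN with d_ffn = 4 * d_model (illustrative sizes).
d_model, d_ffn = 4096, 16384
attn_params = 4 * d_model * d_model   # Q, K, V, and output projections
ffn_params = 3 * d_model * d_ffn      # gate, up, and down projections
share = ffn_params / (ffn_params + attn_params)
print(f"FFN share of layer parameters: {share:.0%}")  # -> 75%
```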

The team induced sparsity by using a ReLU (Rectified Linear Unit) gate activation function combined with L1 regularization. Specifically, they added an L1 loss term to the training objective that penalizes the absolute magnitude of the gate activations, pushing unnecessary neurons to output exactly zero. With the L1 coefficient set to 2×10⁻⁵, the researchers observed that more than 30% of the neurons could be deactivated across layers without any measurable loss in model accuracy.
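
A minimal sketch of this induction recipe in PyTorch, assuming a standard gated FFN; module and variable names here are ours, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# ReLU-gated FFN: zeros in the gate output deactivate whole neurons.
class ReluGatedFFN(nn.Module):
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.up_proj = nn.Linear(d_model, d_ffn, bias=False)
        self.down_proj = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        gate = F.relu(self.gate_proj(x))   # exact zeros -> activation sparsity
        hidden = gate * self.up_proj(x)    # a zeroed gate entry kills the neuron
        return self.down_proj(hidden), gate

ffn = ReluGatedFFN(d_model=512, d_ffn=2048)
out, gate = ffn(torch.randn(4, 512))

l1_coeff = 2e-5                            # coefficient reported above
task_loss = out.pow(2).mean()              # stand-in for the real objective
loss = task_loss + l1_coeff * gate.abs().mean()
loss.backward()
```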

Beyond the initial regularization, the team strategically re-initialized the gate weights, a refinement that extracted an additional speedup of 17.9% to 19.1%. Crucially, the sparsity-induction strategy is minimally invasive: it requires no adjustments to critical hyperparameters such as the learning rate, weight decay, or optimizer settings, making it a plug-and-play enhancement for existing training pipelines. The final result is a significant leap in efficiency: a 20.5% acceleration in inference and a 21.9% boost in training speed.

Breaking the Sparsity Overhead Barrier

For years, the industry has tried sparse matrix formats like ELLPACK, which stores each row's non-zero values and their column indices in a padded, row-based layout. In practice, these attempts usually failed because of the conversion tax: the GPU had to run a separate kernel pass to convert dense data into the sparse representation, and the time spent on that conversion often outweighed the time saved by skipping the zeros. The theoretical gain vanished during actual execution.
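
To make the conversion tax concrete, here is a naive dense-to-ELLPACK pass in NumPy; this full extra traversal of the activations is what a separate conversion kernel would have to perform before any zeros can be skipped:

```python
import numpy as np

# Naive ELLPACK conversion: extract per-row values and column indices,
# padded to the widest row. On a GPU this is an extra kernel pass.
def dense_to_ell(dense):
    rows = dense.shape[0]
    counts = (dense != 0).sum(axis=1)
    width = int(counts.max())                     # pad every row to this width
    values = np.zeros((rows, width), dtype=dense.dtype)
    indices = np.zeros((rows, width), dtype=np.int32)
    for r in range(rows):
        cols = np.nonzero(dense[r])[0]
        values[r, :cols.size] = dense[r, cols]
        indices[r, :cols.size] = cols
    return values, indices, counts

acts = np.where(np.random.rand(4, 16) > 0.3, np.random.randn(4, 16), 0.0)
values, indices, counts = dense_to_ell(acts)
```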

TwELL changes the fundamental unit of calculation to match the physical reality of NVIDIA GPUs. Instead of row-based processing, TwELL aligns data to the 2D tile assigned to a Cooperative Thread Array (CTA). Because the TwELL format is generated directly in the epilogue of the gate-projection kernel, the system needs no additional kernel launches, extra global memory reads, or synchronization between CTAs. To maximize data locality, the format uses a compression coefficient C that sizes each compressed tile for the maximum number of non-zero elements it must hold, packing values, indices, and counts into a single matrix of 32-bit words.
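
The paper's exact bit layout is not reproduced here, but a packing scheme along these lines would pair a 16-bit activation value with a 16-bit neuron index in each 32-bit word:

```python
import numpy as np

# Hypothetical 32-bit packing: fp16 value in the high half,
# 16-bit column index in the low half. The layout is an assumption.
def pack(value, index):
    bits = int(np.frombuffer(np.float16(value).tobytes(), dtype=np.uint16)[0])
    return np.uint32((bits << 16) | (index & 0xFFFF))

def unpack(word):
    raw = np.uint16(int(word) >> 16)
    value = np.frombuffer(raw.tobytes(), dtype=np.float16)[0]
    return float(value), int(word) & 0xFFFF

word = pack(1.5, 42)
value, index = unpack(word)  # -> (1.5, 42)
```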

During inference, a single fused kernel reads the TwELL activations and performs both the up-projection and down-projection at once. Each CTA handles one row of the input, iterating statically over column tiles and dynamically over the non-zero count within each tile. Using the active neuron indices, the kernel loads only the corresponding columns of the up-projection weight matrix and rows of the down-projection weight matrix, computing the inner products on the fly. The intermediate hidden states are never written to DRAM, which drastically reduces memory traffic and eliminates one of the most expensive operations in GPU computing.
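
The per-row dataflow can be expressed as a small NumPy reference; this is a numerical sketch of what the fused kernel computes, not the CUDA implementation itself:

```python
import numpy as np

# Fused sparse FFN reference: for each row, only the active neuron
# indices touch W_up columns and W_down rows, and the hidden state
# never materializes as a full dense matrix.
def fused_sparse_ffn(x, gate_vals, gate_idx, w_up, w_down):
    out = np.zeros((x.shape[0], w_down.shape[1]), dtype=x.dtype)
    for r in range(x.shape[0]):                # one CTA per row, conceptually
        for g, j in zip(gate_vals[r], gate_idx[r]):
            h = g * (x[r] @ w_up[:, j])        # up-projection, one column
            out[r] += h * w_down[j]            # down-projection, one row
    return out

rng = np.random.default_rng(0)
n, d, f = 4, 8, 32
x = rng.standard_normal((n, d)).astype(np.float32)
w_gate = rng.standard_normal((d, f)).astype(np.float32)
w_up = rng.standard_normal((d, f)).astype(np.float32)
w_down = rng.standard_normal((f, d)).astype(np.float32)

gate = np.maximum(x @ w_gate, 0.0)             # sparse ReLU gate activations
vals = [gate[r][gate[r] > 0] for r in range(n)]
idx = [np.nonzero(gate[r])[0] for r in range(n)]

dense = (gate * (x @ w_up)) @ w_down           # dense reference result
assert np.allclose(dense, fused_sparse_ffn(x, vals, idx, w_up, w_down), atol=1e-4)
```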

Training presents a different challenge because sparsity patterns are highly irregular across tokens and layers. To handle this, the researchers introduced a hybrid format: rows with a non-zero count below a threshold are routed to a compact ELL matrix, while rows exceeding it fall back to a dense backup matrix. This dynamic routing keeps the system efficient even in GEMM (General Matrix Multiply) regimes where thousands of tokens are processed simultaneously, a scenario where traditional GEMV (General Matrix-Vector Multiply) sparse methods typically break down. Even when applied to standard Transformer feed-forward blocks that lack a gating mechanism, the TwELL approach yielded an 11.2% increase in inference speed.
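
A sketch of that routing decision, with an arbitrary threshold chosen for illustration:

```python
import numpy as np

# Hybrid split: sparse-enough rows go to a compact ELL buffer, the rest
# to a dense backup matrix; the two partial products are summed afterwards.
def hybrid_split(acts, threshold):
    counts = (acts != 0).sum(axis=1)
    sparse_rows = np.nonzero(counts <= threshold)[0]  # -> ELL path
    dense_rows = np.nonzero(counts > threshold)[0]    # -> dense backup path
    return sparse_rows, dense_rows

acts = np.where(np.random.rand(8, 64) > 0.3, 1.0, 0.0)
sparse_rows, dense_rows = hybrid_split(acts, threshold=32)
```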

The optimization of LLMs has shifted. The primary battle is no longer just about changing the model architecture or shrinking the parameter count, but about how precisely the data format can be fused with the physical execution units of the hardware.