PyTorch nn.Linear: Why torch.compile Targets CPU Overhead, Not Kernels

Deep learning engineers often treat torch.compile as a magic switch. The prevailing belief in the community is that wrapping a model in this compiler automatically triggers a cascade of GPU kernel optimizations, fusing operations and slashing execution time. When a developer sees a performance bump, they typically attribute it to a more efficient GPU binary. However, this assumption ignores a critical distinction in the PyTorch execution stack: the difference between what the GPU computes and how the CPU tells the GPU to compute it.

The Mechanics of nn.Linear and the GEMM Epilogue

To understand where the actual optimization occurs, one must look at the execution trace of a standard linear layer. In a controlled environment using an NVIDIA A100-SXM4-80GB GPU, tests were conducted using scripts such as `02_linear.py`, `03_simple_mlp.py`, and `03_kernels_mlp.py`. The profiling configuration was strictly maintained with `wait=1`, `warmup=1`, and `active=3` to ensure consistency across runs. For those looking to replicate these results, the same environment can be mirrored using Dev Mode or Jobs pipelines within Hugging Face infrastructure.
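As a rough illustration of that profiler configuration, the sketch below applies the same `wait=1`, `warmup=1`, `active=3` schedule to a single `nn.Linear`. It is not the contents of `02_linear.py`: the layer sizes are arbitrary assumptions, and it profiles CPU activity only so that it runs without a GPU.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# Hypothetical layer and input sizes, chosen only for illustration.
model = torch.nn.Linear(64, 32)
x = torch.randn(8, 64)

op_names = set()

def on_trace_ready(prof):
    # Called once the three active steps have been recorded.
    op_names.update(event.key for event in prof.key_averages())

# wait=1: skip one step; warmup=1: trace one step but discard it;
# active=3: record the next three steps.
prof_schedule = schedule(wait=1, warmup=1, active=3)

with profile(
    activities=[ProfilerActivity.CPU],
    schedule=prof_schedule,
    on_trace_ready=on_trace_ready,
) as prof:
    for _ in range(5):  # 1 wait + 1 warmup + 3 active
        with torch.no_grad():
            model(x)
        prof.step()  # tell the profiler that one iteration has finished

print(sorted(name for name in op_names if name.startswith("aten::")))
```

On a CUDA machine, adding `ProfilerActivity.CUDA` to `activities` and moving the model and input to the GPU should also record the launched kernels.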

At the architectural level, the `nn.Linear` module does not perform a simple multiplication followed by an addition. Instead, it flows through `torch.nn.functional.linear`, which calls `aten::linear`, and eventually resolves to `aten::addmm`. This specific operator is designed to handle both the matrix multiplication (GEMM) and the bias addition in a single coordinated effort.

Crucially, `aten::addmm` does not launch a separate `aten::add` kernel to handle the bias. Instead, it utilizes what is known as a GEMM epilogue. An epilogue is a small set of calculations performed within the GEMM kernel itself, immediately before the final result is written back to the High Bandwidth Memory (HBM). By integrating the bias addition into the epilogue, PyTorch avoids the costly process of writing the multiplication result to HBM and then reading it back again just to add a bias vector. This reduction in memory traffic is a fundamental optimization of the cuBLAS GEMM kernels, which provide specialized variants specifically for this purpose.
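The equivalence between `nn.Linear` and a single `addmm` call can be checked numerically. The following is a minimal sketch on CPU with arbitrary, assumed shapes. It only confirms that the three formulations compute the same values; it says nothing about which kernel a GPU would pick.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: batch of 8, 16 input features, 4 output features.
layer = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)

with torch.no_grad():
    # The module call, as a user would write it.
    via_module = layer(x)
    # One operator: bias + x @ weight^T (weight is stored as [out, in]).
    via_addmm = torch.addmm(layer.bias, x, layer.weight.t())
    # The unfused formulation: a matmul followed by a separate add.
    via_two_steps = x @ layer.weight.t() + layer.bias

matches_addmm = torch.allclose(via_module, via_addmm, atol=1e-5)
matches_two_steps = torch.allclose(via_module, via_two_steps, atol=1e-5)
print(matches_addmm, matches_two_steps)
```

The two-step version is the one that would pay for an extra trip through memory if it ran as two separate kernels, which is the cost the epilogue avoids.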

Another common sight in the profiler trace is the `aten::t` operation, which represents a tensor transpose. It looks like a GPU operation, but it is actually a CPU-side metadata change. A tensor's elements live in a flat buffer in memory, and the CPU records how that buffer should be interpreted through two pieces of metadata: shape and stride. When `aten::t` is called, the CPU simply swaps the shape and stride entries to create a new view of the same data. No data is moved, no memory is copied, and no GPU kernel is launched. The kernel that later consumes the view simply reads the existing memory with a different layout.
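The metadata-only nature of a transpose is easy to observe directly. This small sketch uses an arbitrary 3x4 tensor on CPU; the same shape and stride bookkeeping applies to CUDA tensors.

```python
import torch

w = torch.arange(12.0).reshape(3, 4)  # row-major: stride (4, 1)
wt = w.t()                            # transposed view, no copy

print(tuple(w.shape), w.stride())
print(tuple(wt.shape), wt.stride())

# Both tensors point at the same first element in memory.
same_storage = w.data_ptr() == wt.data_ptr()

# Writing through the view is visible in the original tensor.
wt[0, 1] = -1.0
shared_write = w[1, 0].item() == -1.0

print(same_storage, shared_write, wt.is_contiguous())
```

The view is no longer contiguous, which is why GEMM libraries ship layout-specific kernel variants instead of requiring a physical copy first.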

The Illusion of GPU Optimization in Compiled Mode

This leads to the central tension between Eager mode and Compiled mode. When running in Eager mode, the CPU must explicitly dispatch every single step: it creates the transposed view via `aten::t` and then dispatches the `aten::addmm` call. Each of these steps incurs a small amount of CPU overhead: a few microseconds of dispatch latency that add up across thousands of iterations.

When `torch.compile` is introduced, the Inductor backend analyzes the computation graph. In the case of a single linear layer, the compiler realizes that the transpose operation is a constant metadata change. Instead of emitting a call to `aten::t` during runtime, Inductor pre-calculates the necessary strides at compile time. The resulting compiled code skips the `aten::t` call entirely and dispatches `aten::addmm` directly with the pre-calculated metadata.

The twist is that the GPU is not doing anything different. The mathematical operations, the memory access patterns, and the actual binary executing on the A100 remain identical to the Eager mode execution. The performance gain observed in these scenarios is not the result of a faster GPU kernel, but the result of removing the CPU's administrative burden. The GPU is simply being fed instructions faster because the CPU no longer has to manage the intermediate view creation.

To prove this, one can analyze the kernel hash dumps in the profiler. Libraries like cuBLAS and CUTLASS provide pre-compiled binaries tailored to specific input layouts. These are identified by suffixes such as `_tn_` (transposed, non-transposed) or `_nn_` (non-transposed, non-transposed), and they account for data types like bf16 or fp16. By comparing the hashes of the kernels launched in Eager mode versus Compiled mode, it becomes evident that the hashes are identical. If the hashes match, the GPU is executing the exact same binary code.
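A comparison of this kind can be scripted once the kernel names have been copied out of the two traces. The sketch below is a hypothetical helper, not part of the article's scripts. It compares kernel names rather than binary hashes, which is only a proxy, and the sample name is an illustrative placeholder, not a measured result.

```python
import re

# Layout suffix such as _tn_ or _nn_, either mid-name or at the end.
_LAYOUT = re.compile(r"_(nn|nt|tn|tt)(?:_|$)")

def layout_tag(kernel_name):
    """Return the two-letter layout tag in a GEMM kernel name, or None."""
    match = _LAYOUT.search(kernel_name)
    return match.group(1) if match else None

def same_kernels(eager_names, compiled_names):
    """True when both runs launched the same set of kernel names."""
    return set(eager_names) == set(compiled_names)

# Placeholder names standing in for entries copied from a profiler trace.
eager = ["ampere_sgemm_128x64_tn"]
compiled = ["ampere_sgemm_128x64_tn"]

print(layout_tag(eager[0]), same_kernels(eager, compiled))
```

If the two sets differ, that is the signal that the compiler changed what the GPU executes, not just how the CPU dispatches it.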

For the practitioner, this distinction is vital. It means that for isolated, single-operator structures, `torch.compile` is essentially a CPU-side optimizer. Real GPU-side optimization, such as kernel fusion, only occurs when there are multiple operations that can be collapsed into a single kernel to further reduce HBM roundtrips. In the case of `nn.Linear`, the fusion has already been done by the cuBLAS developers via the GEMM epilogue, long before the PyTorch compiler ever touched the code.

```bash
python 03_simple_mlp.py
```

Understanding the boundary between CPU dispatch and GPU execution transforms how we profile models, shifting the focus from vague speedups to the precise elimination of overhead.
