Modern AI developers live in a state of perpetual cognitive dissonance. We build models with billions of parameters—far more than there are data points in our training sets—and according to every classical rule of statistics, these models should be useless. They should simply memorize the training data, creating a perfect map of the noise and failing miserably the moment they encounter a real-world example. Yet, the opposite happens. The larger the model, the more robust its ability to generalize. We are currently operating the most powerful technology in human history based on a set of empirical observations that contradict our fundamental understanding of learning. This gap between what we see and what we can prove has turned deep learning into a form of high-stakes alchemy, where we mix hyperparameters and architectures hoping for a gold-standard result without truly knowing why the reaction occurred.
The Paradox of Benign Overfitting and Double Descent
For decades, the bedrock of statistical learning was the bias-variance trade-off. The logic was simple: as a model becomes more complex, it reduces bias but increases variance, eventually leading to overfitting. In this classical view, there is a sweet spot of complexity beyond which performance on unseen data inevitably crashes. However, deep neural networks have shattered this paradigm. We have entered the era of benign overfitting, where a network can drive its training error to absolute zero—effectively memorizing every single outlier and noise spike in the dataset—and still maintain high accuracy on a test set.
This phenomenon is best illustrated by the double descent curve. In traditional models, the test error follows a U-shaped curve. But in deep learning, as model capacity increases, the test error drops, rises sharply near the interpolation threshold (the point at which the model can fit the training data exactly), and then drops again in a second, deeper descent. This suggests that once a model is large enough to memorize the entire dataset, it does not stop there; it begins to find a smoother, more generalizable solution. Belkin et al. (2019) characterized this double descent curve and showed that test performance can continue to improve well past the interpolation threshold. The picture is further complicated by the implicit bias of gradient descent: the optimization algorithm does not find just any solution that minimizes error; it preferentially selects solutions that are simpler or smoother, even when the model has the capacity to be wildly complex. Despite these observations, the industry has largely treated generalization as a black box, focusing on scaling laws rather than the underlying mechanics.
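Double descent is easy to reproduce in miniature. The sketch below is a toy experiment, not the setup of any particular paper: it fits random ReLU features to a noisy linear target with a minimum-norm least-squares solution and sweeps the feature count past the interpolation threshold (40 training points here). Every name and dimension is an illustrative assumption.

```python
# Toy double descent: random ReLU features + minimum-norm least squares.
# All dimensions and names are illustrative assumptions, not from the source.
import jax
import jax.numpy as jnp

n_train, n_test, d = 40, 200, 5
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)

# Noisy linear regression task.
w_true = jax.random.normal(k1, (d,))
X_train = jax.random.normal(k2, (n_train, d))
X_test = jax.random.normal(k3, (n_test, d))
y_train = X_train @ w_true + 0.5 * jax.random.normal(k4, (n_train,))
y_test = X_test @ w_true

def relu_features(X, W):
    """Fixed random projection followed by a ReLU nonlinearity."""
    return jax.nn.relu(X @ W)

for n_features in [5, 10, 20, 40, 80, 160, 640]:
    W = jax.random.normal(jax.random.PRNGKey(n_features), (d, n_features)) / jnp.sqrt(d)
    Phi_train, Phi_test = relu_features(X_train, W), relu_features(X_test, W)
    # Pseudoinverse gives the minimum-norm fit; it interpolates the training
    # set once n_features >= n_train, i.e. past the interpolation threshold.
    beta = jnp.linalg.pinv(Phi_train) @ y_train
    test_mse = jnp.mean((Phi_test @ beta - y_test) ** 2)
    print(f"features={n_features:4d}  test MSE={float(test_mse):.3f}")
```

On a typical run, the test error spikes near 40 features and then descends again as the feature count keeps growing, mirroring the curve described above.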
Shifting the Lens from Parameters to Output Dynamics
Until now, attempts to open this black box have focused on the parameter space. Researchers tried to quantify the complexity of a model by counting its weights or analyzing the geometry of its high-dimensional hypothesis space. The problem is that in a model with 175 billion parameters, the parameter space is too vast and noisy to provide meaningful insights. The breakthrough comes from a fundamental shift in perspective: stop looking at the weights and start looking at the outputs. Instead of asking how the parameters are moving, we should ask how the predictions are evolving.
The Stanford Diffusion Group has pioneered this approach in arXiv:2605.01172, proposing a theory that treats the neural network as a dynamical system in the output space. Rather than tracking billions of individual weights, they analyze the evolution of the model's predictions using the Jacobian, the matrix of first-order partial derivatives of the outputs with respect to the parameters. From it they derive the empirical Neural Tangent Kernel (eNTK), a core matrix that dictates the learning dynamics of the network. The eNTK quantifies exactly how a gradient update at one training point ripples through the network to affect the prediction at another point. This transforms the mystery of generalization into a problem of flow and interaction.
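For readers who want to see the object concretely, the sketch below computes an empirical NTK for a small, hypothetical two-layer network, assuming the standard definition in which entry (i, j) is the inner product of the parameter Jacobians of the outputs at points i and j. It also forms the test-train cross-kernel that appears below as K_QS. This is an illustration of the quantity, not the paper's code.

```python
# Empirical NTK via Jacobians of the outputs w.r.t. the parameters.
# The architecture and shapes are illustrative assumptions.
import jax
import jax.numpy as jnp

def mlp(params, x):
    """Two-layer tanh network with a scalar output."""
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    return (h @ W2 + b2).squeeze()

def init_params(key, d_in, d_hidden):
    k1, k2 = jax.random.split(key)
    return (jax.random.normal(k1, (d_in, d_hidden)) / jnp.sqrt(d_in),
            jnp.zeros(d_hidden),
            jax.random.normal(k2, (d_hidden, 1)) / jnp.sqrt(d_hidden),
            jnp.zeros(1))

def empirical_ntk(params, X1, X2):
    """eNTK entry (i, j) = <grad_theta f(x_i), grad_theta f(x_j)>."""
    def flat_jacobian(x):
        grads = jax.grad(lambda p: mlp(p, x))(params)
        return jnp.concatenate([g.ravel() for g in grads])
    J1 = jax.vmap(flat_jacobian)(X1)   # (n1, n_params)
    J2 = jax.vmap(flat_jacobian)(X2)   # (n2, n_params)
    return J1 @ J2.T                   # (n1, n2)

params = init_params(jax.random.PRNGKey(0), d_in=3, d_hidden=16)
X_train = jax.random.normal(jax.random.PRNGKey(1), (8, 3))
X_test = jax.random.normal(jax.random.PRNGKey(2), (4, 3))

K_S = empirical_ntk(params, X_train, X_train)    # train-train kernel
K_QS = empirical_ntk(params, X_test, X_train)    # test-train cross-kernel
print(K_S.shape, K_QS.shape)                     # (8, 8) (4, 8)
```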
In this framework, the training outputs and their gradients follow explicit mathematical trajectories, which the paper defines as follows:
```latex
% Evolution of the training outputs and the output gradient
\dot{u}_S = -K_S \nabla \Phi_S(u_S)
\dot{g} = -B K_S g
```
In these equations, g is the gradient of the loss with respect to the outputs, and B is the loss Hessian, which captures the curvature of the loss around the current predictions. The test outputs evolve in parallel through a cross-kernel, K_QS, that couples each test point to the training set. The same machinery yields the exact rate at which the training loss decays:
```latex
% Decay rate of the loss function
\dot{\Phi}_S = -g^\top K_S g
```
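To make these dynamics tangible, the sketch below integrates the output-space equations numerically for the special case of squared loss, where the output gradient is simply g = u_S - y and the Hessian B is the identity. The kernel K_S here is a toy positive semi-definite matrix standing in for the eNTK, an illustrative assumption rather than the paper's experiment; the point is that the measured rate of loss decrease tracks the predicted -g^T K_S g.

```python
# Numerical check of the output-space dynamics under squared loss,
# where g = u_S - y and the loss Hessian B is the identity.
# K_S is a toy PSD kernel standing in for the eNTK (an assumption).
import jax
import jax.numpy as jnp

n = 8
A = jax.random.normal(jax.random.PRNGKey(0), (n, n))
K_S = A @ A.T / n                                    # toy PSD kernel
y = jax.random.normal(jax.random.PRNGKey(1), (n,))   # training targets
u = jnp.zeros(n)                                     # initial training outputs

def loss(u):
    return 0.5 * jnp.sum((u - y) ** 2)               # Phi_S(u_S)

dt = 1e-3
for step in range(5):
    g = u - y                          # output gradient of the squared loss
    predicted_rate = -g @ K_S @ g      # \dot{Phi}_S = -g^T K_S g
    u_next = u + dt * (-K_S @ g)       # Euler step of \dot{u}_S = -K_S g
    measured_rate = (loss(u_next) - loss(u)) / dt
    print(f"step {step}: predicted {float(predicted_rate):.4f}, "
          f"measured {float(measured_rate):.4f}")
    u = u_next
```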
This shift is profound because it removes the need for the infinite-width or infinite-depth assumptions that plagued previous theoretical attempts. This analysis applies to any differentiable architecture and any convex loss function. For the practitioner, this means we are moving away from the era of guessing how many layers or neurons are needed to avoid overfitting. Instead, we are entering an era where we can control the kernel flow—the way data points interact during training—to strategically engineer generalization.
Deep learning is finally migrating from the realm of empirical alchemy to the domain of rigorous engineering, where the dynamics of the output space provide the blueprint for the next generation of efficient, predictable AI.




