RNNs are back in the spotlight this week because Apple’s ParaRNN turns what used to be a sequential bottleneck into something you can train in parallel.
Section 1: Apple’s ParaRNN brings a 665x training speedup to RNNs
Apple’s research team presented ParaRNN at ICLR 2026 as an oral paper, and the headline claim is hard to ignore: the framework boosts training speed by up to 665x compared with conventional sequential training.
The context matters. For years, many teams treated RNNs as a poor fit for large language model training. Even though RNN inference can be fast and memory-efficient, the training story runs into a structural limitation: RNNs process data in order, so the computation chain tends to stay sequential. That sequential dependency makes it difficult to scale training the way modern transformer pipelines do.
ParaRNN is positioned as a direct response to that limitation. Apple reports that the efficiency gains are large enough to make it possible to train a “classic” RNN at a 7B-parameter scale for the first time, with performance that can compete with transformer-based approaches. In other words, the team is not just accelerating toy experiments; it is aiming at a model size that developers associate with serious LLM-like capacity.
To help others reproduce and build on the work, Apple has open-sourced the ParaRNN codebase.
The tension for practitioners is obvious. If RNNs really can be trained at scale without giving up their core advantages, then the architecture choices for resource-constrained deployment start to look different.
Apple’s ParaRNN claims a 665x RNN training speedup and enables the first 7B-scale classic RNN training.
Section 2: What actually changes when you parallelize nonlinear RNN learning
So what is different about ParaRNN, beyond the headline speed number?
Historically, one way to keep RNN inference efficiency while improving training parallelism has been to simplify the recurrence itself. Models such as Mamba, which uses selective state space modeling to compress and process information efficiently, take a route that makes the recurrence effectively linear. That linearization can unlock parallel computation, because the dependency structure becomes more amenable to batched or parallel solvers.
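Why does a linear recurrence unlock parallelism? Because composing two affine steps h → a·h + b yields another affine step, the whole chain can be evaluated with an associative scan instead of a time-ordered loop. A minimal sketch of that idea (our illustration, using a scalar state and a Hillis–Steele-style doubling scan, not Mamba's actual implementation):

```python
import numpy as np

def sequential_scan(a, b, h0=0.0):
    # Reference: walk the recurrence h_t = a_t * h_{t-1} + b_t in time order.
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def combine(prev, cur):
    # Composing two affine steps (apply `prev`, then `cur`) is again affine:
    # this associativity is what makes the recurrence scannable in parallel.
    a1, b1 = prev
    a2, b2 = cur
    return a2 * a1, a2 * b1 + b2

def parallel_scan(a, b, h0=0.0):
    # Doubling scan: log2(T) rounds, each combining elements `shift` apart.
    a, b = a.copy(), b.copy()
    n, shift = len(a), 1
    while shift < n:
        # Pad with the identity step (a=1, b=0) for positions i < shift.
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = combine((a_prev, b_prev), (a, b))
        shift *= 2
    # Element t now holds the composition of steps 1..t; apply it to h0.
    return a * h0 + b
```

Each round of the doubling scan is embarrassingly parallel across time steps, which is exactly the property the sequential loop lacks.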
But Apple points to a tradeoff: linearizing the recurrence can cap the model’s expressiveness. In practice, that means you may get speed, but you also risk losing the nonlinear modeling power that makes richer RNN variants useful in the first place.
ParaRNN’s twist is to avoid that expressiveness ceiling by changing how the nonlinearity is handled rather than removing it.
Apple describes the approach as introducing Newton’s method, a numerical technique that solves nonlinear equations through iterative linear approximations. Instead of treating the RNN as a chain of sequential steps, the team restates the training problem as a single large system of equations whose unknowns are all of the sequence’s hidden states at once.
From there, the method uses local derivatives to linearize the nonlinear system. Specifically, the framework relies on the Jacobian, the matrix of partial derivatives that captures how a multivariable function changes locally. With the Jacobian in hand, the nonlinear dynamics can be approximated as a linear system in each iteration.
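To make the recipe concrete, here is a toy sketch of the idea (our illustration, not Apple's code): stack every hidden state of a scalar tanh recurrence h_t = tanh(w·h_{t-1} + x_t) into one unknown vector, define the residual F(h)_t = h_t − tanh(w·h_{t-1} + x_t), and drive F(h) to zero with Newton iterations using the (lower-bidiagonal) Jacobian.

```python
import numpy as np

def residual(h, x, w, h0):
    # F(h)_t = h_t - tanh(w * h_{t-1} + x_t); F(h) = 0 at the true trajectory.
    h_prev = np.concatenate([[h0], h[:-1]])
    return h - np.tanh(w * h_prev + x)

def jacobian(h, x, w, h0):
    # Lower-bidiagonal: dF_t/dh_t = 1, dF_t/dh_{t-1} = -w * (1 - tanh(.)^2).
    h_prev = np.concatenate([[h0], h[:-1]])
    d = 1.0 - np.tanh(w * h_prev + x) ** 2
    J = np.eye(len(h))
    for t in range(1, len(h)):
        J[t, t - 1] = -w * d[t]
    return J

def newton_states(x, w=0.7, h0=0.0, iters=8):
    h = np.zeros_like(x)                     # guess all T states at once
    for _ in range(iters):
        F = residual(h, x, w, h0)
        h = h - np.linalg.solve(jacobian(h, x, w, h0), F)
    return h

def sequential_states(x, w=0.7, h0=0.0):
    # Reference: the ordinary step-by-step rollout.
    h, out = h0, []
    for xt in x:
        h = np.tanh(w * h + xt)
        out.append(h)
    return np.array(out)
```

A few Newton iterations recover the same hidden states as the sequential rollout; the point is that each iteration touches all time steps simultaneously rather than one after another.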
Then comes the parallelization mechanism. Apple converts the resulting linearized problem into a linear state space model (SSM) form, which can be solved in parallel rather than step-by-step. The key idea is that the “nonlinear RNN” no longer forces the training computation to march through time in the usual way; instead, each Newton iteration turns the problem into a parallel-friendly linear system.
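Why each Newton step is itself a linear SSM is easy to see in the scalar case (a sketch of the structure, not Apple's implementation): the Jacobian has ones on the diagonal and some −a_t just below it, so solving J·δ = r by forward substitution is literally the linear recurrence δ_t = a_t·δ_{t-1} + r_t, i.e. the form a parallel scan can handle.

```python
import numpy as np

T = 6
rng = np.random.default_rng(0)
a = rng.normal(size=T) * 0.5      # sub-diagonal Jacobian entries
r = rng.normal(size=T)            # Newton right-hand side (negated residual)

# Dense solve of the lower-bidiagonal Newton system J @ delta = r.
J = np.eye(T)
for t in range(1, T):
    J[t, t - 1] = -a[t]
delta_dense = np.linalg.solve(J, r)

# The same solution written as the linear recurrence an SSM solver handles:
# row t of J reads  -a_t * delta_{t-1} + delta_t = r_t.
delta = np.zeros(T)
delta[0] = r[0]
for t in range(1, T):
    delta[t] = a[t] * delta[t - 1] + r[t]

print(np.allclose(delta_dense, delta))   # -> True
```

So the sequential bottleneck is replaced by a handful of Newton iterations, each of which is a parallel-friendly linear recurrence solve.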
The development story becomes concrete when you look at what happens to standard RNN families.
Apple reports that when ParaRNN is applied to GRU and LSTM models, it can reproduce the hidden-state evolution of sequential training in only 3 iterations. That is a striking claim because GRU and LSTM are already designed to manage information flow over time with gating mechanisms, and they are typically associated with sequential processing during training.
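The same Newton recipe wraps around a gated cell without modification. A hedged toy version (our scalar GRU-style recurrence with fixed weights WZ and WC, not Apple's GRU/LSTM integration; the per-step derivative that fills the Jacobian's sub-diagonal is taken by finite differences for brevity):

```python
import numpy as np

WZ, WC = 0.5, 0.9   # toy gate and candidate weights (assumed for illustration)

def step(h_prev, xt):
    # GRU-style update: gate z blends the old state with a tanh candidate.
    z = 1.0 / (1.0 + np.exp(-(WZ * h_prev + xt)))   # update gate
    c = np.tanh(WC * h_prev + xt)                   # candidate state
    return (1.0 - z) * h_prev + z * c

def newton_gru(x, h0=0.0, iters=6):
    T, eps = len(x), 1e-6
    h = np.zeros(T)                                  # guess all states at once
    for _ in range(iters):
        h_prev = np.concatenate([[h0], h[:-1]])
        F = h - step(h_prev, x)
        # Finite-difference d(step)/d(h_prev) for the Jacobian sub-diagonal.
        d = (step(h_prev + eps, x) - step(h_prev - eps, x)) / (2 * eps)
        J = np.eye(T)
        for t in range(1, T):
            J[t, t - 1] = -d[t]
        h = h - np.linalg.solve(J, F)
    return h

def sequential_gru(x, h0=0.0):
    h, out = h0, []
    for xt in x:
        h = step(h, xt)
        out.append(h)
    return np.array(out)
```

In this toy setting a handful of iterations matches the sequential rollout; Apple's reported figure of 3 iterations for full GRU/LSTM models is their own measurement, not something this sketch demonstrates.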
In other words, ParaRNN is not merely accelerating a narrow custom RNN. It is presented as a general technique that can wrap around established nonlinear RNN structures and still deliver the parallel training behavior.
The practical tension for developers is whether this is a research-only trick or something that can fit into real deployment constraints. Apple’s framing suggests the latter: ParaRNN is “in principle” applicable to all RNN architectures, and it could become an architecture option for teams that need to deploy LLM-like systems in environments with limited compute or memory.
ParaRNN changes the training computation itself by using Newton-style Jacobian linearization and converting the result into a parallel linear SSM solve.
RNN training no longer has to be a sequential tax, and the next wave of LLM deployment choices may start with how quickly you can parallelize nonlinear recurrence.