Engineers managing the world's largest AI clusters face a precarious gamble every day. When training a frontier model across thousands of GPUs, the network is not just a utility but a single point of failure. In these hyper-scale environments, the industry has long struggled with a brutal reality: a single faulty cable or a momentary network glitch can trigger a cascading failure that halts the entire training run. At that scale, an outage forces the system to roll back to the last saved checkpoint, wasting thousands of GPU-hours and millions of dollars of compute. The pressing question for the developer community has shifted from how to increase raw FLOPS to how to keep the network from collapsing under its own scale.
The Architecture of Multipath Reliable Connection
To solve this systemic fragility, OpenAI has collaborated with a consortium of industry giants including AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop the Multipath Reliable Connection (MRC) protocol. This new networking standard has been officially released through the Open Compute Project, ensuring that the blueprints for this high-resilience infrastructure are available to the broader data center community. At its core, MRC is designed for the latest 800Gb/s network interfaces, but it fundamentally changes how data moves across the wire. Instead of treating a data transmission as a single stream, MRC distributes a single transfer across hundreds of different paths simultaneously.
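The mechanics can be pictured with a short sketch. The names and chunking scheme below are hypothetical illustrations, not the MRC wire format (which is defined in the OCP specification): a single transfer is split into chunks, the chunks are sprayed across every healthy path, and the receiver reassembles by sequence number rather than by stream order.

```python
# Illustrative sketch of multipath spraying; not the MRC wire protocol.
# Path, spray_transfer, and the chunking scheme are invented names.
from dataclasses import dataclass

@dataclass
class Path:
    """One of the parallel network planes available to the sender."""
    path_id: int
    healthy: bool = True

def spray_transfer(payload: bytes, paths: list[Path], chunk_size: int = 4096):
    """Split a single transfer into chunks and distribute them round-robin
    across all healthy paths, instead of pinning the stream to one link."""
    live = [p for p in paths if p.healthy]
    chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
    # Tag each chunk with a sequence number so the receiver can reorder;
    # a dead path simply drops out of the rotation.
    return [(seq, live[seq % len(live)].path_id, chunk)
            for seq, chunk in enumerate(chunks)]
```

The key property is that no chunk is bound to a particular link: if one path degrades, only its share of chunks needs to move, not the whole stream.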
This protocol does not exist in a vacuum; it is an evolution of existing high-performance networking standards. MRC extends RDMA over Converged Ethernet (RoCE), the industry standard for direct memory access between remote GPUs and CPUs over an Ethernet fabric. To achieve its reliability goals, it integrates technologies from the Ultra Ethernet Consortium, which focuses on building a scalable AI network fabric, and SRv6 (Segment Routing over IPv6), a source routing mechanism in which the sender encodes the packet's path. By combining these, MRC can detect a link failure and reroute data within microseconds, effectively masking hardware glitches from the training software.
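To make the source-routing piece concrete, here is a hedged toy model. The hop names, pick_segment_list, and the failure-detection interface are all invented for illustration (real SRv6 encodes segment lists in an IPv6 extension header), but the principle holds: because the sender owns the full path, failover is just a matter of selecting a precomputed alternate segment list.

```python
# Hypothetical sketch of source-routed failover in the spirit of SRv6:
# the sender owns the path (a list of segments), so when telemetry marks
# a link dead it switches to a precomputed alternate list immediately,
# without waiting for the fabric to reconverge.
PRIMARY = ["leaf1", "spine3", "leaf7"]   # segment list chosen at the source
BACKUP  = ["leaf1", "spine5", "leaf7"]   # disjoint alternate, precomputed

def pick_segment_list(dead_links: set[tuple[str, str]]) -> list[str]:
    """Return the first segment list whose hops avoid every failed link."""
    for path in (PRIMARY, BACKUP):
        hops = list(zip(path, path[1:]))
        if not any(hop in dead_links for hop in hops):
            return path
    raise RuntimeError("no surviving path; fall back to checkpoint recovery")

# Example: a spine3 link fails; the very next packet takes the backup path.
print(pick_segment_list({("spine3", "leaf7")}))  # -> ['leaf1', 'spine5', 'leaf7']
```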
The protocol is already operational in the most advanced AI environments currently in existence. It is deployed within OpenAI's NVIDIA GB200 supercomputers, as well as the Abilene data center within Oracle Cloud Infrastructure and Microsoft's Fairwater supercomputer. These deployments serve as the primary testbeds for the most demanding LLM training workloads in the world.
From Monolithic Links to Parallel Planes
The true innovation of MRC lies in its departure from traditional network topology. For years, the standard approach was to treat a network interface as a single, monolithic 800Gb/s link. That provides high peak bandwidth but creates a rigid path: if the path fails or becomes congested, the data stops. MRC replaces this monolithic pipe with a fragmented, parallel architecture in which a single interface is split into multiple smaller links. For example, one interface can be connected to eight different switches, creating eight parallel network planes, each operating at 100Gb/s.
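In configuration terms, the split looks something like the following sketch (the switch names and layout are our own illustration, not a published wiring plan): the 800Gb/s interface is carved into eight 100Gb/s lanes, each cabled to a different switch, so losing one plane costs an eighth of the bandwidth rather than the whole link.

```python
# Illustrative arithmetic for plane splitting; names are ours, not the spec's.
NIC_GBPS = 800
PLANES = 8
PLANE_GBPS = NIC_GBPS // PLANES          # 100 Gb/s per plane

# Each plane terminates on a different switch, giving the NIC eight
# independent fabrics instead of one monolithic pipe.
planes = [{"plane": i, "switch": f"plane{i}-sw", "gbps": PLANE_GBPS}
          for i in range(PLANES)]

assert sum(p["gbps"] for p in planes) == NIC_GBPS  # same peak bandwidth
```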
This shift in logic fundamentally alters the physical constraints of the cluster. A switch ASIC has a fixed aggregate bandwidth, so under the old architecture it might expose only 64 ports at 800Gb/s each; carved into 100Gb/s links, the same silicon can support up to 512 ports. The mathematical result of this expansion is staggering. By flattening the routing structure and increasing port density, OpenAI can now fully interconnect approximately 131,000 GPUs using only two layers of switches.
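The arithmetic behind those figures is worth checking. The back-of-the-envelope below is our own reconstruction, assuming a fixed-bandwidth switch ASIC and a non-blocking two-tier leaf-spine topology; it is not a published OpenAI design, but it lands on the quoted number.

```python
# Back-of-the-envelope check on the scaling claim (our arithmetic, not an
# official topology description). A switch ASIC has fixed total bandwidth,
# so thinner links mean more ports from the same silicon:
ports_at_800g = 64
ports_at_100g = ports_at_800g * (800 // 100)  # 512 ports per switch

# Non-blocking two-tier Clos with radix-512 switches: each leaf splits its
# ports half-down (GPUs) and half-up (spines), and each spine reaches
# every leaf.
radix = ports_at_100g
gpus_per_leaf = radix // 2                    # 256 GPUs per leaf
max_leaves = radix                            # 512 leaves
max_gpus = max_leaves * gpus_per_leaf
print(max_gpus)                               # 131072 ~= "approximately 131,000"
```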
For the developer, this means the end of the dreaded training stall. Because MRC's static source routing lets traffic shift to surviving paths instantly, the GPUs no longer sit idle waiting for the network to reconverge or for a manual reset of the cluster. The transition from a single-link dependency to a multipath fabric transforms the network from a fragile chain into a resilient mesh. The focus of AI infrastructure has moved beyond the raw speed of a single connection to the aggregate reliability of the entire fabric.
Infrastructure competition in the AI era is no longer just about who has the most GPUs, but who can keep them all running at peak utilization without interruption.