NVIDIA Vera Rubin NVL72 Slashes Token Costs by 10x at Computex

The modern data center is no longer just a collection of servers; it is evolving into what NVIDIA calls the AI Factory. For developers and infrastructure architects, the bottleneck is no longer raw compute alone but the physics of power delivery and thermal management, along with the rising cost of generating each token. As models scale toward the trillion-parameter mark, the industry is hitting a wall where incremental chip improvements cannot keep pace with the energy demands of generative AI. This week, the focus shifted from individual GPUs to the rack as the fundamental unit of compute.

The Architecture of the AI Factory

At the Computex Best Choice Awards (BCA), NVIDIA secured a dominant sweep, taking home Golden Awards for the Vera Rubin NVL72 and Jetson Thor, alongside a specialized award for the Alpamayo autonomous driving platform. These wins signal a strategic pivot toward a full-stack integration of data center supercomputing and physical AI. The centerpiece of this effort, the Vera Rubin NVL72, is a rack-scale AI supercomputer that integrates 36 NVIDIA Vera CPUs and 72 NVIDIA Rubin GPUs into a single cohesive unit. This is not a simple cluster but a tightly coupled system designed to eliminate the communication bottlenecks that plague large-scale model inference.

To achieve this, the NVL72 employs a hierarchical connectivity strategy. For scale-up operations within the rack, it uses sixth-generation NVIDIA NVLink Switches for high-speed GPU-to-GPU communication. For scale-out and scale-across connectivity between racks, the system integrates ConnectX-9 SuperNICs and Spectrum-X Ethernet Photonics co-packaged optical switches. BlueField-4 DPUs augment this networking layer by accelerating data processing across storage and security layers. The result is a system that increases inference performance per watt by up to 10x and reduces the cost per token by 10x. When paired with Groq 3 LPX, the throughput per watt for trillion-parameter models can increase by up to 35x.
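To see how performance per watt feeds into cost per token, here is a minimal Python sketch of the electricity component alone. The throughput and price figures are hypothetical placeholders, not NVIDIA data, and the function name is invented for illustration. Real cost per token also includes hardware, cooling, and utilization, which this ignores.

```python
def energy_cost_per_million_tokens(tokens_per_sec_per_watt, usd_per_kwh):
    """Electricity cost in USD to generate one million tokens."""
    # One watt is one joule per second, so W / (tokens/s) gives joules per token.
    joules_per_token = 1.0 / tokens_per_sec_per_watt
    kwh_per_token = joules_per_token / 3.6e6  # 1 kWh = 3.6e6 J
    return kwh_per_token * usd_per_kwh * 1e6


# Hypothetical numbers: a baseline rack and one with 10x the tokens/s per watt.
baseline = energy_cost_per_million_tokens(0.5, usd_per_kwh=0.10)
improved = energy_cost_per_million_tokens(5.0, usd_per_kwh=0.10)

print(f"baseline: ${baseline:.4f} per million tokens")
print(f"improved: ${improved:.4f} per million tokens")
print(f"ratio:    {baseline / improved:.1f}x")
```

Under these assumptions the energy cost per token falls in direct proportion to the gain in performance per watt, which is why the two headline figures move together.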

Physical deployment has also been reimagined. NVIDIA has moved to a modular tray design that removes traditional cables, hoses, and fans, reducing the assembly time for a compute tray from 2 hours down to just 5 minutes. The system operates on a 100% liquid cooling architecture designed to function at 45 degrees Celsius, making it compatible with existing liquid-cooled data centers and enabling the use of dry cooler designs that rely on external air. To stabilize the power grid against the volatile spikes typical of trillion-parameter inference, the power shelf's onboard energy storage has been increased 6x, implementing an intelligent power smoothing function that protects the broader data center infrastructure.
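The effect of a larger energy buffer on grid draw can be illustrated with a toy simulation. This is a sketch under stated assumptions, not NVIDIA's power-smoothing algorithm: the function name, the load trace, the target draw, and the capacities are all hypothetical, and the buffer is modeled as lossless.

```python
def smooth_grid_draw(load_kw, capacity_kj, target_kw, dt_s=1.0):
    """Grid draw per step when a storage buffer absorbs deviations from target_kw."""
    stored = capacity_kj / 2  # assume the buffer starts half full
    grid = []
    for load in load_kw:
        # Energy the buffer would need to supply (+) or absorb (-) this step.
        need_kj = (load - target_kw) * dt_s
        # Clamp to what the buffer can actually deliver or accept.
        delivered = max(-(capacity_kj - stored), min(stored, need_kj))
        stored -= delivered
        grid.append(load - delivered / dt_s)
    return grid


# Hypothetical spiky inference load, in kW, averaging 120 kW.
load = [100, 140, 100, 140]

small = smooth_grid_draw(load, capacity_kj=10, target_kw=120)
large = smooth_grid_draw(load, capacity_kj=60, target_kw=120)  # 6x the storage

print("small buffer swing:", max(small) - min(small), "kW")
print("large buffer swing:", max(large) - min(large), "kW")
```

In this toy trace the small buffer saturates and passes part of each spike through to the grid, while the buffer with 6x the capacity holds the grid draw flat at the target.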

Parallel to the data center, NVIDIA is pushing the Blackwell GPU architecture to the edge with Jetson Thor. Designed for physical AI and autonomous robotics, Jetson Thor delivers up to 2,070 FP4 TFLOPS of AI performance. Compared to the previous generation Jetson Orin, this represents a 7.5x increase in computational power and a 3.5x improvement in energy efficiency. The platform is highly flexible, offering power configurations ranging from 40W to 130W to fit various robotic form factors.
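As a back-of-envelope check, the quoted ratios can be combined to see what they imply. The derived values below are arithmetic on the figures in this article, not published specifications. They assume the 7.5x gain applies to the 2,070 TFLOPS peak and that the peak is reached at the 130W ceiling.

```python
THOR_FP4_TFLOPS = 2070    # peak figure quoted above
COMPUTE_GAIN = 7.5        # versus Jetson Orin, as quoted
EFFICIENCY_GAIN = 3.5     # versus Jetson Orin, as quoted
MAX_POWER_W = 130         # top of the quoted power range

# Baseline implied by the compute ratio, in the comparison's own units.
implied_orin_baseline = THOR_FP4_TFLOPS / COMPUTE_GAIN

# 7.5x the compute at 3.5x the efficiency implies this much more power at peak.
implied_power_ratio = COMPUTE_GAIN / EFFICIENCY_GAIN

# Peak throughput per watt, assuming peak is reached at maximum power.
peak_tflops_per_watt = THOR_FP4_TFLOPS / MAX_POWER_W

print(f"implied Orin baseline: {implied_orin_baseline:.0f}")
print(f"implied power ratio:   {implied_power_ratio:.2f}x")
print(f"peak TFLOPS per watt:  {peak_tflops_per_watt:.1f}")
```

The takeaway is that the compute gain outpaces the efficiency gain, so the top-end configuration draws roughly twice the power of its predecessor's peak under these assumptions.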

Finally, the Alpamayo platform addresses the most dangerous aspect of autonomous driving: the long-tail scenario. These are rare, high-risk events, such as interpreting ambiguous hand signals from a pedestrian or navigating a road where traffic lights and pavement markings contradict each other. Alpamayo uses vision-language-action models, specifically Alpamayo 1 and Alpamayo 1.5, which feature 10 billion parameters and employ Chain-of-Thought (CoT) reasoning. This allows the vehicle not just to recognize a pattern but to reason through a situation, such as determining the legal priority and safety distance when passing an emergency vehicle that partially blocks a lane. This ecosystem is supported by AlpaSim, an end-to-end simulation framework, and the NVIDIA Physical AI Open Datasets, which contain over 1,700 hours of diverse driving data.

From Pattern Recognition to Physical Reasoning

The technical specifications of the Vera Rubin NVL72 and Jetson Thor reveal a deeper shift in how AI is being deployed. For years, the industry focused on scaling model parameters in the cloud, treating the hardware as a passive vessel for software. The NVL72 reverses this by making the rack the computer. By integrating sixth-generation NVLink and liquid cooling directly into the structural design, NVIDIA is treating thermal and electrical constraints as primary architectural variables. The 10x reduction in token cost comes not from a more efficient algorithm alone but from eliminating the energy wasted on inefficient cooling and data movement.

This transition is even more evident in the move from Jetson Orin to Jetson Thor. The 7.5x jump in performance, specifically through FP4 precision, changes the deployment strategy for on-device AI. Previously, developers had to aggressively quantize models or rely on hybrid cloud-edge loops to handle complex tasks, which introduced latency. With the compute density of Thor, generative AI can be embedded directly into the control loop of a medical device or an industrial robot. This removes the latency bottleneck and allows the real-time, local inference that safety-critical physical systems require.

Similarly, Alpamayo represents a departure from the traditional deep learning approach to autonomy. Most autonomous systems rely on pattern matching: if the camera sees X, do Y. However, pattern matching fails in long-tail scenarios, where the data is too sparse to train a reliable pattern. By introducing Chain-of-Thought reasoning into a 10-billion-parameter model, NVIDIA is moving toward a cognitive architecture for machines. The model doesn't just see a flashing light; it reasons that the light indicates an emergency, checks the surrounding traffic, and calculates a safe bypass route. The integration of AlpaSim allows developers to trace the logical steps of a failure, turning a crash in a simulation into a specific logical correction in the CoT pipeline.
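The idea of tracing a failure back to a single reasoning step can be sketched with a small data structure. Everything below is hypothetical and for illustration only: `ReasoningStep` and `first_faulty_step` are invented names, not part of AlpaSim or any Alpamayo API, and the trace is hand-written rather than model output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReasoningStep:
    claim: str    # what the model asserted at this step
    holds: bool   # whether the simulator's ground truth agrees


def first_faulty_step(trace: List[ReasoningStep]) -> Optional[int]:
    """Return the index of the first step contradicted by ground truth, or None."""
    for i, step in enumerate(trace):
        if not step.holds:
            return i
    return None


# Hand-written example trace for the emergency-vehicle scenario.
trace = [
    ReasoningStep("flashing light indicates an emergency vehicle", True),
    ReasoningStep("adjacent lane is clear of traffic", False),
    ReasoningStep("bypass route keeps a safe distance", True),
]

idx = first_faulty_step(trace)
if idx is not None:
    print(f"first faulty step: {idx} ({trace[idx].claim})")
```

The point of the design is that a simulated crash maps to a specific step index rather than an opaque end-to-end failure, so the correction can target that step.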

When these three technologies (the NVL72, Jetson Thor, and Alpamayo) are viewed together, they form a closed loop of physical AI. The NVL72 provides the massive compute needed to train the trillion-parameter foundation models; Alpamayo refines those models for complex, real-world reasoning; and Jetson Thor deploys that reasoning into the physical world. The 6x increase in onboard energy storage for power smoothing and the shift to liquid cooling at 45 degrees Celsius are the unsung heroes of this transition, providing the physical stability required for Agentic AI to operate without interruption.

The era of AI as a digital chatbot is ending, replaced by a world where reasoning models are embedded in the very hardware that moves and interacts with the physical environment.
