Vera Rubin NVL72 Cuts Agentic AI Inference Costs by 10x

The industry is hitting a wall with agentic AI. The promise of autonomous agents that can plan, reason, and execute complex tasks is immense, but the cost of running these loops is staggering. Unlike a simple chatbot interaction, an agentic workflow requires multiple iterative reasoning steps, tool calls, and self-corrections, and each step typically re-reads a growing context, so token consumption climbs far faster than the number of steps. For most enterprises, this token tax has kept agentic AI trapped in the pilot phase, where the operating expense of a single autonomous agent often outweighs the productivity gain it provides. The conversation in the developer community has shifted from how to build these agents to how to afford running them at scale.
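To make the token tax concrete, here is a rough sketch of how tokens accumulate in an agent loop. It assumes a simplified model in which every step re-reads the full history and appends a fixed amount of new output; the function name and the numbers (a 1,000-token prompt, 500 tokens per step) are illustrative, not figures from Dell or NVIDIA.

```python
def agent_loop_tokens(prompt_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Total tokens processed when every step re-reads the whole history.

    Simplified model (an assumption): no prompt caching, fixed output per step.
    """
    total = 0
    context = prompt_tokens
    for _ in range(steps):
        # Each step reads the current context and generates new tokens.
        total += context + tokens_per_step
        # The generated tokens become part of the context for the next step.
        context += tokens_per_step
    return total


single_turn = agent_loop_tokens(1000, 500, steps=1)
ten_steps = agent_loop_tokens(1000, 500, steps=10)
print(single_turn)              # 1500
print(ten_steps)                # 37500
print(ten_steps / single_turn)  # 25.0
```

Under these assumptions, ten steps cost 25 times as many tokens as a single turn, not 10 times, because the context grows with every step. Real deployments with prompt caching or context pruning would see a smaller multiplier.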

The Hardware Architecture of the Dell AI Factory

Dell and NVIDIA have responded to this economic bottleneck by unveiling a next-generation AI infrastructure portfolio centered on the Vera Rubin NVL72 and the Vera CPU. The centerpiece of this rollout is the Dell PowerEdge XE9812, a system designed specifically to slash the cost of inference. According to the technical specifications, the integration of NVIDIA Vera Rubin NVL72 reduces cost per token by up to 10 times compared with the previous Blackwell architecture. This is not a marginal gain; it is a structural shift that lowers the entry barrier for deploying large-scale autonomous agents in production environments.
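As a back-of-the-envelope illustration of what an up-to-10x drop in cost per token could mean for a budget, the sketch below compares monthly spend before and after. Every input is a hypothetical placeholder (10,000 tasks per day, 40,000 tokens per task, a baseline of 2 USD per million tokens); only the 10x factor comes from the claim above, and it is an upper bound.

```python
def monthly_cost_usd(tasks_per_day: int, tokens_per_task: int,
                     usd_per_million_tokens: float, days: int = 30) -> float:
    """Monthly inference spend for a fixed daily task volume."""
    total_tokens = tasks_per_day * days * tokens_per_task
    return total_tokens * usd_per_million_tokens / 1_000_000


BASELINE_PRICE = 2.0      # hypothetical USD per million tokens
REDUCTION_FACTOR = 10.0   # "up to 10 times", taken at its best case

baseline = monthly_cost_usd(10_000, 40_000, BASELINE_PRICE)
reduced = monthly_cost_usd(10_000, 40_000, BASELINE_PRICE / REDUCTION_FACTOR)
print(f"baseline: {baseline:,.0f} USD/month")
print(f"reduced:  {reduced:,.0f} USD/month")
```

The absolute figures mean nothing outside this example; the point is that the saving scales linearly with token volume, which is why it matters most for token-hungry agent loops.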

Beyond the flagship XE9812, Dell is introducing a broader lineup of high-density systems including the PowerEdge XE9880L, XE9885L, and XE9882L, all powered by NVIDIA HGX Rubin NVL8. These systems are engineered for extreme density, supporting up to 144 GPUs per rack. To manage the immense thermal load of such a configuration, Dell has implemented 100% direct liquid cooling across all computing nodes. The performance leap is equally significant, with these systems delivering up to 5.5 times the performance of the HGX B200. This density is critical for agentic workloads, which require massive compute resources to be physically close to minimize latency during the rapid-fire reasoning cycles an agent performs.

To ensure that the network does not become the bottleneck, the infrastructure utilizes the Dell PowerSwitch portfolio, incorporating NVIDIA Quantum-X800 InfiniBand and Spectrum-6 Ethernet. This combination is designed to maximize GPU-to-GPU communication speeds while stripping away latency. Dell has bundled these components into the PowerRack, a full-stack environment that integrates computing, networking, and storage into a single thermal and power-managed unit. By abstracting the complexity of hardware integration, the PowerRack allows enterprises to deploy high-performance computing workloads without the typical trial-and-error overhead associated with assembling disparate components.

The Sequential Bottleneck and the On-Premises Pivot

While raw GPU power usually dominates the AI conversation, the real twist in this release is the role of the Vera CPU. Most AI optimization focuses on parallel processing, but agentic AI is fundamentally sequential. An agent cannot move to step B until it has processed the result of step A. This creates a dependency chain in which the single-thread performance of the CPU can become the primary bottleneck for the entire pipeline. The Vera CPU targets this directly, completing agentic workloads 50% faster than traditional x86 processors. By reducing the idle time between reasoning steps, the Vera CPU shortens the agent's feedback loop.
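How much a faster CPU helps end to end depends on how much of each step is CPU-bound. A minimal sketch using Amdahl's law makes this visible. It assumes "50% faster" means the CPU-bound portion runs at 1.5x speed and that 40% of step time is CPU-bound; both are assumptions for illustration, not measurements from the source.

```python
def pipeline_speedup(cpu_fraction: float, cpu_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the CPU-bound share gets faster.

    cpu_fraction: share of total step time spent on the CPU (0.0 to 1.0).
    cpu_speedup:  factor by which that share is accelerated (e.g. 1.5).
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_speedup <= 0:
        raise ValueError("cpu_speedup must be positive")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / cpu_speedup)


# Hypothetical workload: 40% of each agent step is CPU-bound orchestration.
print(f"{pipeline_speedup(0.4, 1.5):.3f}x overall")
# Fully CPU-bound chain: the whole pipeline inherits the CPU gain.
print(f"{pipeline_speedup(1.0, 1.5):.3f}x overall")
```

The closer an agent's dependency chain is to fully CPU-bound, the more of the CPU gain shows up in total completion time.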

This CPU performance is augmented by a large increase in memory bandwidth. The Vera CPU provides 1.2 TB/s of memory bandwidth, which is essential for agents that must constantly query external databases to update their context. In testing with the Starburst distributed SQL query engine, query throughput increased threefold. This is further accelerated by the integration of NVIDIA CUDA-X libraries, specifically cuDF for GPU-accelerated dataframes and cuVS for vector search. The result is a pipeline in which data is extracted, processed in a sandbox, and fed back into the model with minimal friction.
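To put the 1.2 TB/s figure in perspective, the sketch below computes a lower bound on the time needed to stream a working set through memory once. It is a rough estimate that ignores caches, compute time, and access patterns, and the 60 GB working set is a hypothetical example.

```python
VERA_BANDWIDTH_BYTES_PER_S = 1.2e12  # 1.2 TB/s, the figure quoted above


def min_scan_seconds(working_set_gb: float,
                     bandwidth_bytes_per_s: float = VERA_BANDWIDTH_BYTES_PER_S) -> float:
    """Lower bound on the time to read a working set from memory once."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return working_set_gb * 1e9 / bandwidth_bytes_per_s


# Hypothetical 60 GB context store scanned once per agent step.
print(f"{min_scan_seconds(60) * 1000:.0f} ms per full scan")
```

Because an agent may repeat such a scan at every reasoning step, the per-scan floor multiplies across the whole loop, which is why bandwidth matters for context-heavy agents.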

This technical evolution coincides with a broader strategic shift in where AI actually lives. Dell's data reveals that 67% of AI workloads are already running outside the public cloud, with 88% of surveyed enterprises operating at least one workload on-premises. The initial rush to the cloud was driven by agility, but as models scale, the priorities have shifted toward data sovereignty and cost control. The move toward an on-premises AI Factory is a calculated response to the risks of data leakage and the unpredictability of cloud egress fees. To secure this environment, NVIDIA Confidential Computing is employed to encrypt data even during processing, ensuring that model weights and sensitive corporate IP remain protected from the hardware level up.

Deploying Frontier Models in Private Environments

The final piece of the puzzle is the democratization of frontier-grade models within these private factories. The industry is moving away from total reliance on externally hosted, closed-source APIs toward a hybrid model where high-performance models, including open-weight ones, are hosted internally. This is evidenced by the availability of the Gemini 3.0 preview via Google Distributed Cloud on PowerEdge XE9780 servers, and the deployment of SpaceXAI models within the Dell AI Factory environment. By combining NVIDIA Confidential Computing with these models, enterprises can achieve frontier-level performance without their data ever leaving their own network.

For developers, the barrier to entry for these models is being lowered through the Dell Enterprise Hub on Hugging Face. The hub provides immediate access to a diverse array of models including Gemma 4, Mistral Small 4, and Arcee Trinity-Large-Thinking. The ecosystem is expanding rapidly to include NVIDIA Nemotron, Reflection, MiniMax-M2.7, DeepSeek-V4, GLM 5.1, and Kimi K2.6. Many of these models utilize NVFP4 optimization to maximize inference efficiency, allowing companies to run sophisticated agents on a smaller hardware footprint.

This shift is being codified through the adoption of the Sovereign AI OS reference architecture from Palantir. This is more than just a model deployment tool; it is a blueprint for connecting AI agents to existing corporate data pipelines and workflows. By combining secure hardware, open-weight models, and a sovereign operating system, enterprises are reclaiming control over their AI stack. The transition from API-dependency to infrastructure-ownership means that within the next few months, the way developers allocate resources and version their models will shift from a cloud-console experience to a direct, hardware-level orchestration.
