Every morning, the same trade-off hits developers serving AI models: cut inference costs and sacrifice quality, or push performance and watch GPU rental bills climb past six figures. This week at Google Cloud Next in Las Vegas, a hardware announcement broke that deadlock without asking anyone to compromise.

A5X Instances, Rubin NVL72 — Inference Costs Drop 90%

Google Cloud, in partnership with NVIDIA, launched the A5X bare-metal instance powered by the latest Vera Rubin NVL72 rack-scale system, a configuration that treats an entire server rack as a single, massive GPU. Through deep co-design across silicon, systems, and software, the A5X delivers a 10x reduction in per-token inference cost and a 10x increase in tokens per megawatt compared to the previous generation (A4X Max, based on GB300 NVL72). The instance pairs NVIDIA ConnectX-9 SuperNICs with next-generation Google Virgo networking, scaling to 80,000 Rubin GPUs in a single-site cluster and up to 960,000 Rubin GPUs across multi-site clusters.
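To make the headline numbers concrete: a 10x per-token cost reduction is the same thing as the 90% drop in the heading. Here is a minimal back-of-the-envelope sketch in which every dollar and throughput figure is an assumed placeholder, not published pricing:

```python
# Back-of-the-envelope math for the two headline 10x claims.
# Every input number below is an illustrative assumption, not published pricing.

improvement = 10                  # Google's stated generation-over-generation factor

a4x_cost_per_mtok = 2.00          # assumed prior-gen cost, $ per 1M tokens
a4x_tokens_per_mw = 1.0e9         # assumed prior-gen tokens served per megawatt

a5x_cost_per_mtok = a4x_cost_per_mtok / improvement   # 10x cheaper per token
a5x_tokens_per_mw = a4x_tokens_per_mw * improvement   # 10x more tokens per MW

monthly_tokens = 100e9            # hypothetical 100B-token/month serving workload
old_bill = monthly_tokens / 1e6 * a4x_cost_per_mtok
new_bill = monthly_tokens / 1e6 * a5x_cost_per_mtok

print(f"Prior-gen bill: ${old_bill:,.0f}/month")
print(f"A5X bill:       ${new_bill:,.0f}/month (a 90% cut)")
```

Under these assumed inputs, a $200,000 monthly serving bill falls to $20,000, which is exactly what a 10x per-token improvement means in dollar terms.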

From Blackwell to Rubin — GPU Choices Triple

Not long ago, running NVIDIA GPUs on Google Cloud meant picking between A100 and H100. That menu has expanded dramatically. Google's NVIDIA Blackwell portfolio now spans A4 VMs (HGX B200 systems), A4X VMs (GB200 NVL72), A4X Max VMs (GB300 NVL72), and G4 VMs (RTX PRO 6000 Blackwell Server Edition with fractional GPU support). Customers can link multiple NVL72 racks to scale to tens of thousands of Blackwell GPUs, bind 72 Blackwell GPUs into a single rack-scale domain via fifth-generation NVLink and NVLink Switch, or rent as little as one-eighth of a GPU. This lets teams match GPU capacity precisely to workloads such as mixture-of-experts inference, multimodal inference, data processing, and physical AI simulation for robotics or digital twins, as the sizing sketch below illustrates.
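The sketch below works through a toy sizing exercise at both ends of that range. The throughput and traffic numbers are invented for illustration; they are not statements about actual Blackwell performance:

```python
import math

# Right-sizing capacity across the new range, from 1/8 of a G4 GPU to NVL72 racks.
# All throughput and traffic figures here are assumptions for illustration only.

gpu_tokens_per_sec = 10_000      # assumed per-GPU serving throughput
peak_tokens_per_sec = 2_500      # assumed peak traffic for a small app

raw_gpus = peak_tokens_per_sec / gpu_tokens_per_sec     # 0.25 GPUs
g4_slices = math.ceil(raw_gpus * 8)                     # G4 rents 1/8-GPU slices
print(f"Small app: {raw_gpus:.2f} GPUs -> rent {g4_slices}/8 of one G4 GPU")

# At the other extreme, a large MoE serving deployment on NVL72 racks:
service_tokens_per_sec = 1_000_000
racks = math.ceil(service_tokens_per_sec / gpu_tokens_per_sec / 72)
print(f"Large service: ~{racks} NVL72 rack(s) of 72 Blackwell GPUs")
```

The point of fractional support is the first case: a workload that needs a quarter of a GPU no longer pays for a whole one.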

OpenAI and Thinking Machines Lab Are Already Using It

The shift is tangible: leading AI labs have already adopted the infrastructure. Thinking Machines Lab is scaling its Tinker API on A4X Max VMs (GB300 NVL72) to accelerate training. OpenAI runs its most demanding inference workloads, including ChatGPT, at scale on Google Cloud using GB300 NVL72 (A4X Max VMs) and GB200 NVL72 (A4X VMs) systems. Separately, a preview of Google's Gemini models running on NVIDIA Blackwell and Blackwell Ultra GPUs is now available through Google Distributed Cloud, which deploys directly into customer data centers or edge locations, so sensitive data never leaves the premises while teams still get cutting-edge AI.

Confidential GPU VMs Debut — Regulated Industries Get a Path In

Confidential Computing on the NVIDIA Blackwell platform now encrypts prompts and fine-tuning data even during processing, so infrastructure operators cannot see the contents. In the public cloud, the Confidential G4 VM (with NVIDIA RTX PRO 6000 Blackwell GPU) enters preview, offering the same protection in multi-tenant environments. This is the first confidential computing offering for NVIDIA Blackwell GPUs in the cloud, opening the door for financial services, healthcare, and other regulated industries to adopt AI without trading security for performance.
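Conceptually, the guarantee rests on hardware attestation: a data key is released only to a verified trusted execution environment (TEE), so plaintext ever exists only inside it. The sketch below is a schematic illustration of that flow; every function name is a hypothetical placeholder, not a Google Cloud or NVIDIA API:

```python
# Schematic flow of one confidential GPU inference request.
# All functions are hypothetical placeholders, not real platform APIs;
# real deployments rely on the cloud's attestation and key-management services.

def verify_attestation(evidence: bytes) -> bool:
    """Check hardware-signed evidence that the VM and GPU are genuine TEEs
    running the expected firmware (placeholder check, not real verification)."""
    return evidence.startswith(b"TEE-EVIDENCE")

def release_key_if_trusted(evidence: bytes) -> bytes | None:
    """A key service hands over the data key only after attestation passes,
    so plaintext prompts can exist solely inside the enclave."""
    return b"data-encryption-key" if verify_attestation(evidence) else None

# Client side: the prompt is encrypted under the data key before upload,
# so it crosses the network and the operator's infrastructure as ciphertext.
encrypted_prompt = b"<ciphertext>"

# Inside the TEE: present evidence, obtain the key, decrypt, run inference,
# and re-encrypt the output. Host OS, hypervisor, and operator see nothing.
evidence = b"TEE-EVIDENCE:..."   # produced by the CPU/GPU hardware
key = release_key_if_trusted(evidence)
assert key is not None, "attestation failed: key withheld, request refused"
```

If attestation fails (say, the GPU firmware was tampered with), the key is never released and the request simply cannot be served in plaintext anywhere.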

NeMo Open Models Run Directly on Gemini Agent Platform

Google Cloud's NVIDIA platform supports the full spectrum of models, from Google's Gemini and Gemma families to NVIDIA NeMo open-weight models and the broader open-weight ecosystem. NVIDIA NeMo 3 Super is immediately usable within the Gemini Enterprise Agent Platform, letting developers discover, customize, and deploy inference and multimodal models. A new managed Reinforcement Learning API (RL API) has also landed in Managed Training Clusters. Built on NVIDIA NeMo RL, it automates cluster scaling, fault recovery, and job execution so teams can focus on agent behavior and model quality instead of infrastructure.

Cybersecurity firm CrowdStrike is already putting the NVIDIA NeMo open libraries (NeMo Data Designer, NeMo Automodel, and NeMo Megatron Bridge) to work, generating synthetic data and fine-tuning open LLMs such as NeMo for its domain, all on Google Cloud's Blackwell GPU-based managed training clusters.
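To give a flavor of what the synthetic-data step in such a pipeline produces, here is a toy sketch in plain Python. It deliberately does not use NeMo Data Designer's actual API; the templates, labels, and file name are all invented for illustration:

```python
import json
import random

# Toy stand-in for the synthetic-data step of a security fine-tuning pipeline.
# Plain Python only; NVIDIA NeMo Data Designer's real API is not shown here,
# and every template, label, and value below is invented for illustration.

TEMPLATES = [
    ("Process {parent} spawned {child} with arguments {args}.", "suspicious"),
    ("User {user} logged in from a previously seen device.", "benign"),
]

def synthesize(n: int, seed: int = 0) -> list[dict]:
    """Fill templates with random values to yield labeled (text, label) pairs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        template, label = rng.choice(TEMPLATES)
        text = template.format(
            parent=rng.choice(["winword.exe", "explorer.exe"]),
            child=rng.choice(["powershell.exe", "cmd.exe"]),
            args=rng.choice(["-enc ...", "/c whoami"]),
            user=rng.choice(["alice", "bob"]),
        )
        rows.append({"text": text, "label": label})
    return rows

# Emit JSONL, one labeled example per line, the shape most fine-tuning jobs consume.
with open("synthetic_security.jsonl", "w") as f:
    for row in synthesize(1_000):
        f.write(json.dumps(row) + "\n")
```

A production pipeline would generate far richer examples with an LLM in the loop; the point here is only the shape of the output, labeled domain text ready for a fine-tuning job.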

Building physical AI and industrial-grade systems demands a combination of powerful hardware, open models, libraries, and frameworks. With this week's announcements, Google and NVIDIA moved that combination from a product catalog listing to something labs and enterprises can deploy right now.