NVIDIA and Microsoft Launch Agent Full-Stack to Slash Token Costs

For months, the AI community has operated under a frustrating paradox. Developers have access to frontier models with staggering reasoning capabilities, yet the actual deployment of autonomous agents often stalls at the infrastructure layer. The gap between a successful prototype and a production-ready agent is usually a lack of accessible, high-performance compute that does not bankrupt the operator through cloud egress fees or token costs. This week, NVIDIA and Microsoft moved to close that gap by unveiling a full stack for agents that extends from the local Windows desktop to the Azure cloud.

The Hardware and Software Blueprint

The foundation of this ecosystem is a tiered hardware strategy designed to move agentic workloads closer to the data. For the individual developer and the AI PC market, NVIDIA introduced RTX Spark. This hardware is engineered for personal AI PCs, delivering 1 petaflop of AI performance and up to 128GB of unified memory. Building on three decades of NVIDIA technology, including CUDA, RTX, DLSS, and TensorRT, RTX Spark provides a local environment where agents can run without constant cloud round-trips. It will be available this autumn through partners including Microsoft Surface, ASUS, Dell, HP, Lenovo, and MSI.

For enterprise-grade workflows that require more than a workstation, the DGX Station for Windows transforms the desktop into an AI supercomputer. Powered by the GB300 Grace Blackwell Ultra, this system provides up to 748GB of coherent memory and 20 petaflops of FP4 performance. Scheduled for a fourth-quarter release via ASUS, Dell, Gigabyte, HP, MSI, and Supermicro, the DGX Station is aimed at data-sensitive organizations. The large coherent memory lets the weights of a large model stay resident in one place, which can reduce the data movement that adds inference latency. Most notably, the DGX Station can natively run frontier models with up to 1 trillion parameters within a Windows environment, allowing companies to operate high-performance agents locally without connecting to an external cloud.
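The 1-trillion-parameter figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it counts weight storage alone and ignores KV cache, activations, and quantization metadata, all of which add to the real footprint. The function name and the precisions compared are assumptions for illustration, not figures from the announcement.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for model weights alone, in decimal GB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


COHERENT_MEMORY_GB = 748  # coherent memory figure quoted in the article
NUM_PARAMS = 1e12         # 1 trillion parameters, as quoted in the article

# Compare a few common weight precisions (chosen here for illustration).
for label, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    gb = weight_footprint_gb(NUM_PARAMS, bits)
    fits = gb <= COHERENT_MEMORY_GB
    print(f"{label}: {gb:,.0f} GB of weights, fits in {COHERENT_MEMORY_GB} GB: {fits}")
```

Under these assumptions only the 4-bit case fits, which is consistent with the article quoting the system's performance in FP4.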

To run on this hardware, NVIDIA is releasing Nemotron 3 Ultra this month. This open frontier reasoning model is optimized for coding, research, and corporate workflows, and is designed to maximize inference efficiency across the new Windows hardware lineup. However, hardware and models are only half the battle; the other half is the runtime environment. NVIDIA OpenShell addresses the inherent security risks of granting agents system-level permissions. OpenShell implements a runtime structure where each agent operates within an independent sandbox container. Released under the Apache 2.0 license, OpenShell subjects every outbound call, whether it is a request for a file, a network connection, or a credential, to a policy-based evaluation. These policies are written as code, so they can be version-controlled in repositories and updated in real time. This system integrates directly with GitHub Copilot, enabling agents to perform autonomous tasks in a strictly isolated environment across on-premises, hybrid, and cloud deployments.
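To make the policy-as-code idea concrete, here is a minimal sketch of how an outbound request might be checked against a rule list. This is not OpenShell's actual API or policy format, which the article does not describe. The names `Rule`, `Request`, `POLICY`, and `evaluate`, and the first-match, deny-by-default semantics, are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    kind: str      # "file", "network", or "credential"
    pattern: str   # glob matched against the requested target
    allow: bool


@dataclass(frozen=True)
class Request:
    kind: str
    target: str


# Hypothetical policy. Because it is plain data in source form, it could be
# reviewed and versioned in a repository like any other code.
POLICY = [
    Rule("file", "/workspace/*", True),
    Rule("network", "api.github.com", True),
    Rule("credential", "*", False),
]


def evaluate(request: Request, policy: list[Rule]) -> bool:
    """Return the verdict of the first matching rule; deny if none matches."""
    for rule in policy:
        if rule.kind == request.kind and fnmatchcase(request.target, rule.pattern):
            return rule.allow
    return False


print(evaluate(Request("file", "/workspace/src/main.py"), POLICY))
print(evaluate(Request("network", "example.com"), POLICY))
```

A real sandbox would also need to normalize paths and hostnames before matching, since a target such as `/workspace/../etc/passwd` would pass a naive glob check like this one.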

Beyond digital tasks, the stack extends into the physical world via NVIDIA Cosmos 3. Utilizing a Mixture-of-Transformers (MoT) architecture, Cosmos 3 is an omni-model that integrates vision reasoning, world simulation, and action generation into a single framework. This allows robots, autonomous vehicles, and industrial systems to perceive their surroundings, predict subsequent states, and execute physical movements. Cosmos 3 currently ranks first among open models in key benchmarks for vision reasoning and world generation, providing a bridge for agents to move from digital interfaces to physical industrial equipment.

To keep these agents responsive in the cloud, NVIDIA introduced Dynamo and Grove. NVIDIA Dynamo shortens the cold-start times of models running on Azure Kubernetes Service (AKS), so that agents respond quickly to user triggers. Complementing this, NVIDIA Grove provides Kubernetes-native distributed inference orchestration. Grove allocates GPU resources across multiple nodes, limiting the performance degradation that typically occurs when managing large clusters of autonomous agents.

The Economics of Autonomous Intelligence

The true shift in this announcement is not just the availability of new chips, but the fundamental change in the economics of inference. When NVIDIA and Microsoft applied GPU acceleration to the Microsoft Fabric data warehouse, the results were stark: SQL execution speeds increased by up to 6x compared to CPU-based baselines. In high-concurrency environments where multiple agents request data simultaneously, the system performed up to 7x faster than other major cloud data warehouse providers. This acceleration in the data layer directly shortens the cycle between an agent's data retrieval and its subsequent decision, making real-time autonomous operation viable at scale.

This efficiency extends to power consumption with the NVIDIA Vera Rubin platform. Vera Rubin increases inference throughput per megawatt by up to 10x compared to the previous Blackwell architecture. Because it maintains slot compatibility with Blackwell, it can be integrated into Azure infrastructure immediately. For the same power budget, an operator can therefore process up to ten times as many tokens. Where power is the limiting cost, that translates into token costs up to an order of magnitude lower, which changes how developers design agentic workflows. Previously, developers were forced to minimize reasoning steps to keep costs manageable. With the Vera Rubin platform, complex multi-step reasoning and long-running autonomous workflows become commercially viable.
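The link between throughput per megawatt and token cost can be shown with simple arithmetic. In the sketch below, the baseline throughput and the electricity price are assumed placeholder values, not figures from the announcement; only the 10x multiplier comes from the article. The calculation covers electricity alone and leaves out hardware amortization, cooling overhead, and networking.

```python
def energy_cost_per_million_tokens(tokens_per_sec_per_mw: float,
                                   usd_per_mwh: float) -> float:
    """Electricity cost in USD to generate one million tokens at a 1 MW draw."""
    tokens_per_hour = tokens_per_sec_per_mw * 3600
    # One hour at 1 MW consumes 1 MWh, so usd_per_mwh buys tokens_per_hour tokens.
    return usd_per_mwh / tokens_per_hour * 1e6


BASELINE_TOKENS_PER_SEC_PER_MW = 1_000_000  # assumed, for illustration only
PRICE_USD_PER_MWH = 80.0                    # assumed, for illustration only
SPEEDUP = 10                                # "up to 10x" from the article

baseline = energy_cost_per_million_tokens(
    BASELINE_TOKENS_PER_SEC_PER_MW, PRICE_USD_PER_MWH)
improved = energy_cost_per_million_tokens(
    SPEEDUP * BASELINE_TOKENS_PER_SEC_PER_MW, PRICE_USD_PER_MWH)

print(f"baseline: ${baseline:.4f} per million tokens")
print(f"improved: ${improved:.4f} per million tokens")
print(f"ratio:    {baseline / improved:.1f}x")
```

Whatever placeholder values are used, the energy cost per token falls by the same factor as the throughput gain. The total cost per token falls by less once fixed costs are included.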

This cost reduction enables a new level of deployment flexibility through Foundry Local on Azure Local. For organizations that cannot allow data to leave their premises, Microsoft is deploying the RTX PRO 6000 Blackwell Server Edition platform. This setup supports the vLLM runtime and multi-node deployment, allowing large-scale local inference that exceeds the memory limits of a single server. This is particularly critical for the manufacturing and energy sectors, where millisecond-level response times are tied to safety and productivity. In real-time process control or energy grid load balancing, these industries can reduce inference bottlenecks while keeping data on-premises and latency low.

On a global scale, this infrastructure is taking shape as AI Factories. The Fairwater Wisconsin AI Factory connects hundreds of thousands of NVIDIA Grace Blackwell systems into a single entity, using the Multipath Reliable Connection (MRC) protocol to maximize data transfer efficiency. This facility, linked with another in Georgia, creates a distributed AI system capable of powering the most demanding frontier models. With power, cooling, and NVIDIA Spectrum-X Ethernet optimized together, how tightly the physical infrastructure is integrated now directly determines total inference throughput.

Finally, the Foundry Agent Service provides the flexibility of model choice, supporting Anthropic Claude, OpenAI, and Hermes models. Specifically, Claude models now run natively on GB300 Blackwell Ultra systems within Azure. By leveraging Azure's built-in identity and governance frameworks, enterprises can mix and match models based on the specific needs of their workflow, reducing vendor lock-in while optimizing for performance.

The bottleneck for AI agents has shifted. It is no longer a question of whether a model is smart enough to perform a task, but whether the infrastructure can support that intelligence without prohibitive costs or security risks. By unifying the stack from the Windows desktop to the AI Factory, NVIDIA and Microsoft have shifted the focus from model performance to the optimization of the environment in which models run.

Success in the agent era will now be defined by the ability to choose the right deployment target—on-premises, hybrid, or cloud—based on the precise intersection of token cost, data sovereignty, and inference latency.
