This week, the developer community is buzzing over the launch of Amazon SageMaker AI's G7e instance. Equipped with the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU, the G7e pairs lower per-token cost with a substantial performance uplift, and developers are actively discussing what its pricing and capabilities mean for inference workloads.

G7e Instance Specifications and Performance

The G7e instance offers configurations with 1, 2, 4, or 8 RTX PRO 6000 GPUs, each with 96 GB of GDDR7 memory. This setup can host large open-source models such as GPT-OSS-120B, Nemotron-3-Super-120B-A12B, and Qwen3.5-35B-A3B. Compared to its predecessor, the G6e instance, the G7e delivers up to 2.3 times better inference performance, with each GPU providing 1,597 GB/s of memory bandwidth. Its networking speed reaches 1,600 Gbps, a fourfold increase over the G6e and a sixteenfold improvement over the G5.

The G7e instance provides a total of 768 GB of aggregated GPU memory, allowing a single instance to host models that previously required multiple nodes. This consolidation reduces operational complexity and eliminates inter-node latency. Additionally, the G7e supports FP4 precision and enables high-speed data transfers through NVIDIA GPUDirect RDMA.
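To see why FP4 support matters for fitting large models on these GPUs, here is a back-of-envelope sizing sketch. The helper function and the 120B-parameter example are illustrative assumptions (weights only, ignoring KV cache and activation memory), not official sizing guidance:

```python
# Rough GPU-memory sizing for model weights at different precisions.
# Illustrative assumptions only; real deployments also need room for
# KV cache, activations, and runtime overhead.

def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory needed just for model weights, in GB."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

GPU_MEMORY_GB = 96          # one RTX PRO 6000, per the G7e specs
AGGREGATE_MEMORY_GB = 768   # 8-GPU G7e configuration

# A 120B-parameter model at FP4 (0.5 bytes/param) vs FP16 (2 bytes/param):
fp4_gb = weight_memory_gb(120, 0.5)   # 60 GB  -> weights fit on one 96 GB GPU
fp16_gb = weight_memory_gb(120, 2.0)  # 240 GB -> weights need multiple GPUs

print(f"FP4:  {fp4_gb:.0f} GB (fits one GPU: {fp4_gb <= GPU_MEMORY_GB})")
print(f"FP16: {fp16_gb:.0f} GB (fits one GPU: {fp16_gb <= GPU_MEMORY_GB})")
```

The same arithmetic explains the single-instance consolidation: even at FP16, a 120B-parameter model's weights sit comfortably inside the 768 GB aggregate.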

Cost Efficiency of the G7e Instance

To put the G7e's performance in context, consider benchmark results for the Qwen3-32B model on both generations. The G6e configuration, an ml.g6e.12xlarge with 4x L40S GPUs, costs $13.12 per hour and achieves 37.1 tok/s; the G7e runs on an ml.g7e.2xlarge with a single RTX PRO 6000 GPU for just $4.20 per hour. At C=32 concurrent requests, the G7e generates one million output tokens for $0.79, a 2.6 times cost reduction over the G6e's $2.06.
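The per-token figures above follow directly from hourly price and sustained aggregate throughput. A minimal sketch of that arithmetic (the helper name and the sample throughput are assumptions for illustration, not benchmark data):

```python
def cost_per_million_tokens(hourly_price_usd: float, agg_tokens_per_sec: float) -> float:
    """USD to generate one million output tokens at a sustained aggregate throughput."""
    tokens_per_hour = agg_tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Example: a $3.60/hr instance sustaining 1,000 tok/s aggregate works out
# to $1.00 per million output tokens.
print(cost_per_million_tokens(3.60, 1000))

# Working backwards, the G7e's reported $0.79 per million tokens at
# $4.20/hr implies roughly 4.20 / 0.79 * 1e6 / 3600 ≈ 1,477 tok/s
# aggregate throughput across the C=32 concurrent requests.
implied_throughput = 4.20 / 0.79 * 1_000_000 / 3600
```

Running the same formula with your own instance price and measured throughput gives a quick apples-to-apples cost comparison across instance types.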

The G7e's single-GPU architecture also holds up better under load: moving from C=1 to C=32 concurrent requests, the G6e's latency grows by 62%, while the G7e's grows by only 22%. This makes the G7e the stronger choice for cost-optimized, high-concurrency production workloads.
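The comparison above is relative-increase arithmetic; a tiny sketch with hypothetical absolute latencies chosen only to match the reported growth rates:

```python
def latency_growth_pct(latency_low_load_ms: float, latency_high_load_ms: float) -> float:
    """Percentage latency increase between two load levels."""
    return (latency_high_load_ms / latency_low_load_ms - 1.0) * 100.0

# Hypothetical per-request latencies (ms) at C=1 vs C=32, scaled so the
# growth matches the reported rates: 62% on G6e vs 22% on G7e.
g6e_growth = latency_growth_pct(100.0, 162.0)
g7e_growth = latency_growth_pct(100.0, 122.0)
```

The absolute numbers are placeholders; the point is that the same load increase costs the G7e roughly a third as much added latency.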

The G7e instance can also be combined with EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) for further gains. EAGLE drafts several future tokens from the model's hidden representations and validates them in a single forward pass, so multiple tokens are emitted per step without changing output quality. The combination of G7e and EAGLE3 delivers a 2.4 times increase in throughput and a 75% reduction in cost compared to the previous generation.
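At its core, EAGLE follows the draft-and-verify pattern of speculative decoding. The toy sketch below simulates that loop with stand-in next-token functions; all names are hypothetical, and a real implementation drafts from hidden states and verifies all candidates in one batched forward pass rather than calling the target per token as done here for clarity:

```python
def speculative_generate(target_next, draft_next, prompt, n_tokens, k=4):
    """Greedy-decode n_tokens after prompt, drafting k tokens per step.

    target_next(seq) -> next token under the (expensive) target model
    draft_next(seq)  -> next token under the (cheap) draft model
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        # 1. Draft k candidate tokens cheaply with the draft model.
        drafted = []
        for _ in range(k):
            drafted.append(draft_next(seq + drafted))
        # 2. Verify: accept the longest prefix the target model agrees with.
        for tok in drafted:
            if len(seq) - len(prompt) < n_tokens and target_next(seq) == tok:
                seq.append(tok)
            else:
                break
        # 3. On disagreement (or an empty accepted prefix), fall back to
        #    one token from the target model so progress is guaranteed.
        if len(seq) - len(prompt) < n_tokens:
            seq.append(target_next(seq))
    return seq[len(prompt):]
```

The key property the sketch preserves is that the output is identical to plain target-model greedy decoding regardless of draft quality; a good draft only reduces how often the expensive model must run.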

The G7e instance is billed at standard inference pricing on Amazon SageMaker AI, with no premium for the new hardware. For many teams, that makes hosting larger open-source models on a single instance both practical and affordable.