Imagine asking your shopping AI each morning: "Find me a USB-C charger under $25 with two-day shipping." The AI talks a good game—but too often it recommends a product that doesn't exist, invents an out-of-stock item ID, or adds the wrong variant to your cart. This gap between fluent conversation and reliable execution has been the Achilles' heel of e-commerce agents. Now a new training ground called EcomRLVE-GYM aims to close that gap, one simulated transaction at a time.
## 400 Environments, 13 Difficulty Levels: A Shopping Simulator Built for Reinforcement Learning
The research team behind EcomRLVE-GYM has constructed 400 independent environments, each designed to mirror a real e-commerce scenario: product search, cart assembly, order lookup, return processing, and more. Every environment generates a hidden ground-truth goal, and when a simulated user initiates a conversation, the AI agent must handle the request using six tools: `catalog.search`, `cart.add`, `cart.remove`, `order.lookup`, `policy.query`, and `cart.checkout`. The agent must parse natural language, decide which tool to call, and produce valid JSON outputs—all while staying within a turn budget.
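To make this concrete, here is a minimal sketch of how an environment might validate one agent turn. The six tool names come from the description above, but the JSON call shape, argument schemas, and the `parse_tool_call` helper are assumptions for illustration, not the system's actual interface.

```python
import json

# Tool names from the text; the call format {"tool": ..., "args": {...}}
# is an assumed convention for this sketch.
ALLOWED_TOOLS = {
    "catalog.search", "cart.add", "cart.remove",
    "order.lookup", "policy.query", "cart.checkout",
}

def parse_tool_call(raw: str):
    """Parse the agent's raw output into a (tool, args) pair.

    Returns None for malformed JSON or disallowed tools, which the
    environment scores as an immediate failure.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    tool = call.get("tool")
    if tool not in ALLOWED_TOOLS or not isinstance(call.get("args"), dict):
        return None
    return tool, call["args"]

# A well-formed call to the search tool passes validation:
ok = parse_tool_call(
    '{"tool": "catalog.search", "args": {"query": "usb-c charger", "max_price": 25}}'
)
# An invented tool name is rejected outright:
bad = parse_tool_call('{"tool": "cart.apply_coupon", "args": {}}')
```

Strict validation like this is what makes the reward computable in code: any output that fails to parse can be scored without a human or an LLM in the loop.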
Difficulty is controlled across 13 levels (0 through 12), where each level simultaneously governs twelve independent dimensions, among them the number of products, variant selection, quantity, turn budget, typo noise, context switching, search depth, order history size, policy complexity, and tool budget. At d=0, the agent simply adds a single product to the cart with no variant selection. At d=6, the agent must add three products, each with specific variants (USB-C vs Lightning, matte vs glossy) and quantities greater than one, while handling interruptions and correcting errors. The system tracks success rates per environment and escalates difficulty only when the agent consistently passes the current level.
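One way to picture a single knob fanning out into many task dimensions is a function from the level d to a parameter dictionary. The schedules below are illustrative assumptions chosen only to match the two anchor points in the text (one product at d=0, three products at d=6), not the paper's actual parameterization.

```python
def difficulty_params(d: int) -> dict:
    """Map one difficulty level d (0-12) to several task dimensions at once.

    All formulas here are assumptions for illustration.
    """
    assert 0 <= d <= 12
    return {
        "num_products": 1 + d // 3,         # d=0 -> 1 item, d=6 -> 3 items
        "variant_selection": d >= 2,        # variants appear at higher levels
        "max_quantity": 1 if d < 4 else 3,  # quantities > 1 appear mid-range
        "turn_budget": max(4, 16 - d),      # tighter turn budgets as d grows
        "typo_noise": round(0.02 * d, 2),   # fraction of corrupted user tokens
        "context_switches": max(0, d - 5),  # interruptions kick in at d >= 6
    }
```

The point of coupling every dimension to one scalar is that a scheduler only needs to move a single number to make the whole task harder in a coordinated way.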
## Code-Based Rewards Replace Human Judges and LLM Evaluators
Traditional shopping AI training relied on human annotators or LLM-as-a-judge to evaluate conversations—both expensive and inconsistent. EcomRLVE-GYM takes a different approach: reward is computed entirely in code. Three signals drive the reward function:
1. **F1 score** — exact match accuracy on (product, variant, quantity) tuples between the agent's output and the ground-truth goal.
2. **Efficiency bonus** — additional points for completing the task in fewer turns.
3. **Hallucination check** — every recommended product ID must appear in actual search results from the catalog. Any fabricated ID triggers a penalty.
Invalid JSON output, or a call to a disallowed tool, immediately results in a failure score. Because the evaluation program has access to the hidden goal, no human annotation or LLM judgment is required. Developers can download the full environment and code from the GitHub repository.
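The three signals above can be sketched as a single reward function. The F1-over-tuples, efficiency bonus, and hallucination penalty are as described in the text; the specific weights, the flat -1.0 penalty, and the bonus formula are assumptions.

```python
def compute_reward(predicted, goal, turns_used, turn_budget, seen_ids):
    """Code-based reward sketch.

    predicted, goal: sets of (product_id, variant, quantity) tuples.
    seen_ids: product IDs that actually appeared in catalog.search results.
    Weights and penalty values are illustrative assumptions.
    """
    # Hallucination check: any fabricated product ID triggers a penalty.
    if any(pid not in seen_ids for pid, _, _ in predicted):
        return -1.0

    # F1 score on exact (product, variant, quantity) matches.
    tp = len(predicted & goal)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(goal) if goal else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    # Efficiency bonus: unused turns earn a small additive bonus
    # (0.1 weight is an assumption).
    bonus = 0.1 * max(0, turn_budget - turns_used) / turn_budget
    return f1 + bonus
```

Because every quantity here is computable from the episode trace and the hidden goal, the whole evaluation runs without an annotator or a judge model.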
## The Twist: Adaptive Difficulty Scheduling That Prevents Plateaus
What makes EcomRLVE-GYM different from static benchmarks is its adaptive difficulty scheduling. Each environment independently tracks the agent's success rate. Only when the agent reliably passes the current difficulty level does the system automatically advance to the next. This prevents two common failure modes: tasks so easy the agent learns nothing, and tasks so hard the agent makes no progress.
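The per-environment tracking described above might look something like the following sketch. The rolling-window size and the 80% promotion threshold are assumptions; the text specifies only that escalation happens when the agent "reliably passes" the current level.

```python
from collections import deque

class DifficultyScheduler:
    """Per-environment adaptive difficulty sketch (window/threshold assumed)."""

    def __init__(self, max_level: int = 12, window: int = 20, threshold: float = 0.8):
        self.level = 0
        self.max_level = max_level
        self.threshold = threshold
        self.results = deque(maxlen=window)  # rolling record of pass/fail

    def record(self, success: bool) -> None:
        """Log one episode outcome; escalate only on sustained success."""
        self.results.append(success)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.threshold:
            if self.level < self.max_level:
                self.level += 1
                self.results.clear()  # start a fresh window at the new level
```

Requiring a full window before promoting means a single lucky episode cannot skip the agent ahead, while the rolling record lets the level keep pace once performance genuinely stabilizes.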
A concrete example from the team's experiments: the Qwen 3 8B model completed a d=1 task cleanly in three turns. At d=8, the same model chose "charcoal" instead of "bamboo" for a product color, selected XS instead of XL for size, and then—after two user corrections on an air fryer variant—hallucinated that "that variant doesn't exist." This cascade of errors, each compounding the last, is exactly the pattern adaptive training is designed to correct. The agent must learn to recover from mistakes, not just execute a single perfect trajectory.
Shopping AI is taking its first step from fluent conversation to reliable task completion—one environment at a time.


