Weave Router Slashes LLM Costs by 40-70% With Smart Proxying

Developers today find themselves in a constant tug-of-war between model performance and API bills. The current workflow often involves a manual, tedious cycle: using a frontier model like GPT-4o or Claude 3.5 Sonnet for complex reasoning, then attempting to downgrade to a cheaper model for simpler tasks to save costs. This fragmentation forces teams to either overpay for simplicity or risk quality degradation by guessing which model is sufficient for a specific prompt. The industry has lacked a seamless, automated layer that can make these decisions in real-time without adding significant latency to the developer experience.

The Architecture of Automated Cost Reduction

Weave has introduced a smart model router designed to unify requests across Anthropic, OpenAI, and Gemini through a single endpoint. The primary value proposition is a drastic reduction in operational overhead, with the company claiming that the router can lower total LLM spending by 40-70%. To ensure this does not come at the cost of speed, the routing logic is engineered to execute in under 50ms, making the overhead virtually imperceptible to the end user.

The tool is built for immediate integration into existing AI development ecosystems. It provides native support for high-velocity tools including Claude Code, Codex (the OpenAI CLI), opencode, and Cursor, as well as custom-built internal applications. Deployment is flexible, offering two distinct paths. Users can opt for a hosted version that requires no Docker or Postgres installation, allowing for a single-command startup. Alternatively, developers can choose a self-hosted deployment running on `localhost:8080`, which provides a local router and a dedicated management dashboard. The self-hosted environment requires Node.js version 18 or higher, and those utilizing Claude Code or opencode paths must have the `jq` library installed on their system.

Shifting from Prompt-Based to Embedder-Based Routing

What separates Weave Router from previous attempts at model orchestration is its underlying mechanism. Most existing routers rely on prompt-based routing, where a primary LLM is asked to analyze a request and decide which secondary model should handle it. This approach is inherently paradoxical: it consumes tokens and adds latency just to decide how to save tokens and reduce latency.

Weave Router instead implements a system based on the Avengers-Pro research, detailed in the paper Beyond GPT-5: Making LLMs Cheaper and Better via Performance–Efficiency Optimized Routing. Rather than using a generative model for decision-making, it employs a tiny on-box embedder. This embedder processes the input vector of a request to determine the most efficient model capable of maintaining the required performance level. By moving the decision logic from a generative prompt to a vector-based embedding, the system achieves the sub-50ms latency mentioned previously.

From an API perspective, the router functions as a transparent proxy. It supports the Anthropic Messages API natively, meaning tools like opencode require zero code changes to function. Authentication is handled via a specific HTTP header, `X-Weave-Router-Key`, while existing keys such as `OPENAI_API_KEY` are managed through a plan-based passthrough that forwards them directly to `api.openai.com`. Developers access the model via the `http://localhost:8080/v1` endpoint, while the administrative interface is hosted at `http://localhost:8080/ui/`. This creates a centralized control plane where infrastructure can be managed without modifying individual API calls across a codebase.

Implementation and Real-World Control

Integrating the router into a professional workflow involves modifying configuration files, a process Weave has automated through tool-specific installation commands. For users of Codex (OpenAI CLI), running the following command automatically patches the configuration:

bash

npx @workweave/router --codex

This command modifies `~/.codex/config.toml` or the local project configuration by adding a `[model_providers.weave]` block and updating the `model_provider` value to `weave`. Similarly, for opencode, the router merges a `provider.weave` entry into `~/.config/opencode/opencode.json`.

For Cursor users, the process is manual but straightforward. In the settings menu under Models, users must set the Override OpenAI Base URL to `http://localhost:8080/v1` and provide the router key starting with `rk_...` in the API key field.

Crucially, Weave provides a safety valve for developers who encounter edge cases or performance dips. Routing can be disabled instantly for specific tools using the following commands:

bash

npx @workweave/router off --claude
npx @workweave/router off --codex
npx @workweave/router off --opencode

For those using Claude Code, the router is integrated directly into the CLI experience via slash commands. Developers can use `/router-off`, `/router-on`, and `/router-status` to toggle the proxy or check its health in real-time. This ensures that the developer retains absolute control, allowing them to revert to the default API path without editing config files or restarting their environment.

This shift toward embedder-based routing suggests a future where the specific model choice becomes an implementation detail handled by the infrastructure layer rather than a manual architectural decision.

Weave Router Slashes LLM Costs by 40-70% With Smart Proxying

The Architecture of Automated Cost Reduction

Shifting from Prompt-Based to Embedder-Based Routing

Implementation and Real-World Control

Related Articles