The end of the month usually brings a specific kind of anxiety for AI engineers: the arrival of the OpenAI invoice. For many teams, this bill is a black box. When the total exceeds the budget, the subsequent investigation is a manual nightmare of exporting CSV files, scrubbing through thousands of lines of logs, and attempting to reconstruct token usage in Excel spreadsheets to find the leak. This gap between spending and visibility has turned LLM cost management into a game of guesswork, where developers often realize a prompt loop has gone rogue only after the credit limit is hit.

The Zero-Config Approach to LLM Observability

Spanlens addresses this visibility gap without complex SDK integrations or invasive code changes. Tracking is enabled by changing a single configuration value: the client's baseURL. For the OpenAI SDK, that means pointing the client at `https://api.spanlens.io/proxy/openai/v1`; the Anthropic and Gemini SDKs are redirected to the proxy in the same way. With traffic flowing through it, the platform can report cost and token usage in real time. Because it operates as a passthrough proxy, it remains compatible with streaming, tool calling, and JSON mode, so the observability layer does not break existing application logic.
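As a minimal sketch of that one-line change, the snippet below builds the options object that would be handed to an SDK client. The `ClientOptions` shape and the `withSpanlensProxy` helper are hypothetical names used for illustration; only the URL comes from the text above. Which API key the proxy expects (a provider key or a Spanlens-issued one) is not stated here, so the key is left as a placeholder.

```typescript
// Hypothetical options shape; the official OpenAI Node SDK accepts
// `apiKey` and `baseURL` fields like these in its constructor.
interface ClientOptions {
  apiKey: string;
  baseURL: string;
}

// Proxy endpoint quoted in the article.
const SPANLENS_OPENAI_BASE_URL = "https://api.spanlens.io/proxy/openai/v1";

// Illustrative helper: everything except the base URL stays unchanged.
function withSpanlensProxy(apiKey: string): ClientOptions {
  return { apiKey, baseURL: SPANLENS_OPENAI_BASE_URL };
}

const options = withSpanlensProxy("sk-placeholder");
// With the SDK installed, this would be used as: new OpenAI(options)
```

The rest of the application code keeps calling the SDK exactly as before; only the destination of the HTTP requests changes.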

Beyond simple accounting, the platform tackles the complexity of agentic workflows. Debugging a multi-model agent that takes 30 seconds to complete a task is nearly impossible with standard text logs. Spanlens solves this by integrating a topology view specifically for LangGraph. This visualization maps the agent's execution as a series of nodes and edges, allowing developers to see exactly which step in the graph is causing the most friction. To further accelerate optimization, the Critical Path analysis automatically highlights the specific chain of calls responsible for the highest latency, removing the need for developers to manually correlate timestamps across disparate log files.
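To make the idea of a critical path concrete, here is a small sketch that finds the slowest chain of dependent steps in an execution graph. It is not Spanlens's implementation: the `Step` shape, the step names, and the durations are invented for illustration, and the graph is assumed to be acyclic (a recorded run, with any loops already unrolled).

```typescript
// Hypothetical record of one executed step and the steps that follow it.
interface Step {
  id: string;
  durationMs: number;
  next: string[];
}

interface PathResult {
  path: string[];
  totalMs: number;
}

// Returns the chain starting at `startId` with the largest summed duration.
// Assumes the graph has no cycles.
function criticalPath(steps: Step[], startId: string): PathResult {
  const byId = new Map<string, Step>(
    steps.map((s): [string, Step] => [s.id, s]),
  );
  const memo = new Map<string, PathResult>();

  function visit(id: string): PathResult {
    const cached = memo.get(id);
    if (cached) return cached;
    const step = byId.get(id);
    if (!step) throw new Error(`unknown step: ${id}`);

    // Pick the slowest continuation among the successors.
    let best: PathResult = { path: [], totalMs: 0 };
    for (const nextId of step.next) {
      const candidate = visit(nextId);
      if (candidate.totalMs > best.totalMs) best = candidate;
    }

    const result: PathResult = {
      path: [id, ...best.path],
      totalMs: step.durationMs + best.totalMs,
    };
    memo.set(id, result);
    return result;
  }

  return visit(startId);
}

// Invented example run: two branches fan out from "plan" and join at "write".
const run: Step[] = [
  { id: "plan", durationMs: 200, next: ["search", "lookup"] },
  { id: "search", durationMs: 1500, next: ["write"] },
  { id: "lookup", durationMs: 300, next: ["write"] },
  { id: "write", durationMs: 4000, next: [] },
];
const slowest = criticalPath(run, "plan");
```

In this example the `lookup` branch finishes well before `search`, so speeding it up would not shorten the run; only the steps on the returned path are worth optimizing first.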

To move beyond anecdotal evidence when tuning prompts, Spanlens includes a statistical validation tool. Rather than comparing two raw averages, which says nothing about variance or sample size, the platform applies Welch's t-test to two prompt versions. Welch's test does not assume that the two groups share the same variance, which suits latency and token counts that can spread very differently from one prompt to the next. This lets teams determine whether a reduction in latency or token usage is a statistically significant improvement or merely the result of random variance. The entire project is released as open source under the MIT license, so teams can inspect the code on GitHub or deploy their own instance via Docker.
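The statistic behind a Welch comparison can be sketched in a few lines. This is a generic textbook formulation, not code taken from Spanlens: it computes the t statistic and the Welch-Satterthwaite degrees of freedom for two samples, and leaves the final p-value lookup (which needs a Student's t distribution) to a statistics library. The function and sample names are illustrative.

```typescript
interface WelchResult {
  t: number;  // test statistic
  df: number; // Welch-Satterthwaite degrees of freedom
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

function welchTTest(a: number[], b: number[]): WelchResult {
  if (a.length < 2 || b.length < 2) {
    throw new Error("each sample needs at least two observations");
  }
  // Squared standard error contributed by each sample.
  const va = sampleVariance(a) / a.length;
  const vb = sampleVariance(b) / b.length;

  // Note: if both samples have zero variance, t and df are not defined.
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const df =
    (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}

// Invented latency samples (ms) for two prompt versions.
const promptA = [820, 790, 910, 1005, 860];
const promptB = [640, 700, 655, 720, 690];
const comparison = welchTTest(promptA, promptB);
```

A large absolute value of `t` relative to the critical value for `df` degrees of freedom suggests the difference between the two prompt versions is unlikely to be noise.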

Engineering a Non-Blocking Tracing Architecture

Implementing a proxy that monitors every token without adding latency of its own is a significant engineering challenge. Spanlens builds its proxy layer on Hono, a lightweight web framework. Provider API keys are stored encrypted with AES-256-GCM. To minimize the attack surface, a key is decrypted only in memory, immediately before the call is forwarded to the LLM provider, so plain-text keys are never persisted in the observability layer.
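The decrypt-just-before-forwarding step might look roughly like the following, using Node's built-in `node:crypto` module. This is a sketch under assumptions: the `SealedKey` layout, the helper names, and the way the master key is supplied are all hypothetical, and the article does not say how Spanlens stores the nonce or authentication tag.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical storage layout for one encrypted provider key.
interface SealedKey {
  iv: Buffer;         // 96-bit nonce, unique per encryption
  tag: Buffer;        // GCM authentication tag
  ciphertext: Buffer;
}

// Encrypts a provider key for storage. `masterKey` must be 32 bytes.
function sealApiKey(plainKey: string, masterKey: Buffer): SealedKey {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([
    cipher.update(plainKey, "utf8"),
    cipher.final(),
  ]);
  return { iv, tag: cipher.getAuthTag(), ciphertext };
}

// Decrypts in memory only; throws if the ciphertext or tag was tampered with.
function openApiKey(sealed: SealedKey, masterKey: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", masterKey, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  return Buffer.concat([
    decipher.update(sealed.ciphertext),
    decipher.final(),
  ]).toString("utf8");
}
```

Because GCM is an authenticated mode, a modified ciphertext or tag makes `decipher.final()` throw instead of returning a corrupted key, so the proxy would fail the request rather than forward garbage credentials.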

To ensure that monitoring does not slow down the user experience, the architecture relies on `body.tee()`, the stream method that splits a response body into two identical branches. As the provider's response streams back through the proxy, one branch is passed directly to the user without any processing, while the other is handed to a background parser. This parser calculates token counts and costs asynchronously, so the analysis happens in parallel with response delivery rather than in the critical path of the request.
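The stream-splitting step can be sketched with the standard `Response` and `ReadableStream` APIs. The `splitForLogging` and `readAll` names are hypothetical, and the "parser" here only collects the copied text; the real one would extract token usage and compute cost.

```typescript
// Drains a byte stream into a string.
async function readAll(stream: ReadableStream<Uint8Array>): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let text = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    text += decoder.decode(value, { stream: true });
  }
  return text + decoder.decode();
}

// Splits the upstream body: one branch goes to the client untouched,
// the other is consumed in the background for accounting.
function splitForLogging(upstream: Response): {
  response: Response;
  logged: Promise<string>;
} {
  if (!upstream.body) {
    return { response: upstream, logged: Promise.resolve("") };
  }
  const [clientBranch, logBranch] = upstream.body.tee();
  const response = new Response(clientBranch, {
    status: upstream.status,
    statusText: upstream.statusText,
    headers: upstream.headers,
  });
  // Not awaited here, so the client response is returned immediately.
  const logged = readAll(logBranch);
  return { response, logged };
}
```

One design point worth noting: the two branches of a tee share the same source, so if one side is read much more slowly than the other, the unread chunks are buffered in memory. Keeping the background parser cheap keeps that buffer small.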

For the data layer, the platform uses ClickHouse, a column-oriented database designed for large analytical workloads. This allows Spanlens to ingest and query large volumes of log data with little overhead. To prevent data loss during traffic spikes or system instability, the system keeps a fallback queue in Supabase. If a log fails to write to ClickHouse, it is stored temporarily in the Supabase queue and re-processed later by a cron job. This redundancy helps keep the logs used for billing audits consistent with actual API usage.
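The write-then-fallback flow can be illustrated with a small in-memory model. Everything here is a stand-in: `writePrimary` plays the role of the ClickHouse insert, the array plays the role of the Supabase queue, and `retryQueued` is what a cron job would call. The names and the `LogRow` fields are hypothetical.

```typescript
// Hypothetical shape of one usage log entry.
interface LogRow {
  requestId: string;
  model: string;
  totalTokens: number;
}

type WriteRow = (row: LogRow) => Promise<void>;

function createLogSink(writePrimary: WriteRow) {
  // Stand-in for the durable fallback queue.
  let fallbackQueue: LogRow[] = [];

  // Try the primary store first; park the row if the write fails.
  async function write(row: LogRow): Promise<void> {
    try {
      await writePrimary(row);
    } catch {
      fallbackQueue.push(row);
    }
  }

  // Intended to run on a schedule: replays parked rows and
  // re-queues any that still fail. Returns how many were written.
  async function retryQueued(): Promise<number> {
    const pending = fallbackQueue;
    fallbackQueue = [];
    let written = 0;
    for (const row of pending) {
      try {
        await writePrimary(row);
        written += 1;
      } catch {
        fallbackQueue.push(row);
      }
    }
    return written;
  }

  return { write, retryQueued, queuedCount: () => fallbackQueue.length };
}
```

A real deployment would need the queue to survive a process restart, which is the reason for using a database rather than memory, and would likely want the replay to be idempotent (for example, keyed on `requestId`) so a retried row is not counted twice.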

Performance is further improved by caching model pricing. Since pricing data changes infrequently, Spanlens keeps it in a database and caches it with a 5-minute TTL (time to live). The cache follows a stale-while-revalidate pattern: once an entry expires, the stale price is still served immediately while a refresh runs in the background. This prevents the system from waiting on a database query before forwarding a request to the LLM. The complete stack, comprising Next.js 14, Hono, Supabase Postgres, and ClickHouse, is managed as a TypeScript pnpm monorepo. Detailed instructions for running the system on your own infrastructure are available in the self-hosting guide.
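The stale-while-revalidate behavior of the pricing cache can be sketched as a small generic helper. This is an illustration rather than the project's code: `createSwrCache`, the `load` callback (standing in for the pricing query), and the injectable `now` clock are assumptions made so the example is self-contained.

```typescript
type Loader<T> = () => Promise<T>;

function createSwrCache<T>(
  load: Loader<T>,
  ttlMs: number,
  now: () => number = Date.now,
) {
  let value: T | undefined;
  let hasValue = false;
  let fetchedAt = 0;
  let inflight: Promise<T> | null = null;

  // Starts a reload unless one is already running.
  function refresh(): Promise<T> {
    if (!inflight) {
      inflight = load()
        .then((fresh) => {
          value = fresh;
          hasValue = true;
          fetchedAt = now();
          return fresh;
        })
        .finally(() => {
          inflight = null;
        });
    }
    return inflight;
  }

  async function get(): Promise<T> {
    // Cold start: nothing to serve yet, so the caller has to wait once.
    if (!hasValue) return refresh();
    // Expired: kick off a background refresh but do not wait for it.
    if (now() - fetchedAt >= ttlMs) {
      void refresh().catch(() => {
        /* keep serving the stale value if the reload fails */
      });
    }
    return value as T;
  }

  return { get };
}

// 5-minute TTL, matching the figure in the text.
const FIVE_MINUTES_MS = 5 * 60 * 1000;
```

Only the very first lookup pays for the database round trip. After that, a request never waits on pricing data, at the cost of occasionally using a price that is slightly out of date until the background refresh completes.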

By shifting the focus from raw log analysis to visual topology and statistical validation, Spanlens transforms LLM optimization from a guessing game into a precise engineering discipline.