Kapa's Indexing Captioning Strategy Solves Multimodal RAG Cost Bloat

Developers building multimodal Retrieval-Augmented Generation (RAG) systems are hitting a wall that no amount of context window expansion seems to fix. The workflow is familiar: a user asks a technical question, the system retrieves several high-resolution images from a manual, and those images are fed directly into a vision-capable model to generate an answer. However, this approach creates a punishing token tax. Every single query triggers a massive spike in input costs and consumes a significant portion of the context window, often leaving little room for the actual reasoning process or the retrieved text. The industry has been treating images as dynamic query-time inputs, but this is proving to be an unsustainable economic model for production-scale technical support.

The Shift to Index-Time Vision Processing

Kapa, the AI-powered technical documentation search service, has pivoted away from query-time image processing in favor of a method called indexing captioning. Instead of passing raw images to the model at query time, Kapa moves the vision workload to the indexing phase. The process is straightforward: a vision model analyzes every image in the documentation once, generates a detailed text caption, and stores that caption in the same vector database as the standard text chunks. When a user submits a query, the system searches the text captions rather than the images themselves. Once a match is found, the model receives the caption and a URL pointing to the original image, so it can cite the visual source without processing the raw pixels at inference time.
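The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Kapa's implementation: `Chunk`, `index_documents`, `retrieve`, and `build_prompt` are hypothetical names, `caption_image` stands for whatever vision model call is used at indexing time, and a toy word-overlap score stands in for the vector search.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    text: str                         # the text that is indexed and searched
    image_url: Optional[str] = None   # set only for chunks that are image captions


def index_documents(text_chunks: List[str], image_urls: List[str],
                    caption_image: Callable[[str], str]) -> List[Chunk]:
    """Index-time pass: caption each image once and store it beside the text chunks."""
    index = [Chunk(text=t) for t in text_chunks]
    for url in image_urls:
        # caption_image is a placeholder for a vision model call (assumption).
        index.append(Chunk(text=caption_image(url), image_url=url))
    return index


def retrieve(index: List[Chunk], query: str, k: int = 3) -> List[Chunk]:
    """Toy word-overlap ranking standing in for a real vector search."""
    q = set(query.lower().split())
    ranked = sorted(index,
                    key=lambda c: len(q & set(c.text.lower().split())),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: List[Chunk]) -> str:
    """Query-time pass: only text reaches the model; images appear as citable URLs."""
    lines = []
    for c in hits:
        if c.image_url:
            lines.append(f"[image: {c.image_url}] {c.text}")
        else:
            lines.append(c.text)
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"
```

The point of the split is that the captioning function runs once per image when the index is built, while `retrieve` and `build_prompt` run on every query and only ever handle text.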

This architecture distinguishes between two types of visual data found in technical manuals. The first type is illustrative images, which serve as visual aids that support existing text and help users execute a task. The second, and far more important, type is load-bearing images. These include wiring diagrams, specification tables, certification documents, and color availability matrices. Load-bearing images are the sole source of truth for specific data points that do not exist anywhere else in the text. By converting these values and table structures into precise text captions during indexing, Kapa ensures that the retrieval mechanism can find an image based on the actual data it contains, rather than on vague visual similarity.

The Failure of Multimodal Embeddings and the Cost of Raw Input

Many teams attempt to solve the image retrieval problem using multimodal embeddings, such as CLIP (Contrastive Language-Image Pre-training), which maps images and text into a shared vector space. While this works for general image search, it fails catastrophically in technical domains. CLIP-style embeddings tend to smudge fine-grained details, losing the precise annotations, small-print labels, and specific chart values that make a technical document useful. When a developer asks a highly specific question like "How do I configure X?", the resulting text vector often lacks enough signal to match with a generalized image vector, leading to poor retrieval accuracy.

Moving the vision processing to the indexing stage fundamentally alters the cost structure. Indexing is a one-time expense, whereas query-time processing is a recurring tax. Kapa found that with pre-generated captions, the per-query cost is only 1% to 6% higher than that of a purely text-based RAG system. Despite this minimal cost increase, the quality gains are statistically significant. Using McNemar's test, Kapa confirmed with p < 0.05 that LLMs consistently prefer answers generated with image context over those generated without it. This indicates that answer quality depends on the presence of the visual information, but that delivering this information does not require the raw image to be in the prompt.
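For readers unfamiliar with McNemar's test, it compares paired outcomes and looks only at the pairs where the two conditions disagree. The sketch below computes the exact two-sided version using the standard library. The counts are hypothetical, and the way the pairs are defined here (one judged question under each condition) is an assumption about the setup, not Kapa's published data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant counts.

    b: pairs where only the answer WITH image context was judged acceptable
    c: pairs where only the answer WITHOUT image context was judged acceptable
    Pairs where both conditions agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only.
p_value = mcnemar_exact(b=30, c=12)
print(f"p = {p_value:.4f}")
```

With a split as lopsided as 30 against 12, the p-value falls well under 0.05, whereas an even split such as 5 against 5 gives a p-value of 1.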

The financial cost of raw multimodal input is substantial. Kapa's data shows that passing raw images raised query costs by 27% for GPT 5.1 and by 51% for Claude 4.6 Sonnet. Token consumption also varies between providers: Claude spends approximately 975 tokens on a single image, whereas GPT uses 716. Beyond cost, there is the hard limit on payload size. With Claude's limit at 30MB and OpenAI's at 50MB, a system attempting to send just 25 high-resolution images can hit the transmission ceiling, causing the entire request to fail. This creates a hard bottleneck for any application that needs to synthesize information from multiple diagrams or screenshots at once.
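A back-of-the-envelope check makes these limits concrete. The sketch below uses only the per-image token counts and payload limits quoted above; the function names and the 1.5MB image size are hypothetical, and real limits may be measured differently (for example on the encoded request body), so treat it as rough arithmetic rather than a provider specification.

```python
# Approximate figures as quoted in the article.
TOKENS_PER_IMAGE = {"claude": 975, "gpt": 716}
PAYLOAD_LIMIT_MB = {"claude": 30, "gpt": 50}


def image_tokens(provider: str, n_images: int) -> int:
    """Input tokens spent on raw images alone for a single query."""
    return TOKENS_PER_IMAGE[provider] * n_images


def fits_payload(provider: str, image_sizes_mb: list) -> bool:
    """True if the combined image size stays within the provider's request limit."""
    return sum(image_sizes_mb) <= PAYLOAD_LIMIT_MB[provider]


# Five retrieved images per query.
print(image_tokens("claude", 5))  # 4875
print(image_tokens("gpt", 5))     # 3580

# 25 images at a hypothetical 1.5MB each is 37.5MB in total.
sizes = [1.5] * 25
print(fits_payload("claude", sizes))  # False
print(fits_payload("gpt", sizes))     # True
```

Because the image tokens are paid again on every query, the overhead scales with traffic, while a caption generated at indexing time is paid for once.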

To further refine this pipeline, Kapa implemented a two-stage filtering process to ensure only valuable images are captioned. The first stage uses heuristics to instantly discard unsupported file formats, images that are too small to be useful, or those with extreme aspect ratios that suggest they are not meaningful content. The second stage employs a zero-shot classifier based on multimodal embeddings to verify the validity of the remaining images. This classifier achieved 96.8% accuracy and an F1 score of 0.974, ensuring that the indexing budget is spent only on images that actually contribute to the knowledge base.
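The two stages can be sketched as follows. Every threshold, format list, and function name here is a hypothetical placeholder, since the article does not state Kapa's actual values, and the second stage assumes the image and prompt embeddings have already been produced by some multimodal embedding model.

```python
import math

# Assumed values for illustration; the article gives no concrete thresholds.
SUPPORTED_FORMATS = {"png", "jpg", "jpeg", "webp", "gif"}
MIN_SIDE_PX = 64
MAX_ASPECT_RATIO = 8.0


def passes_heuristics(fmt: str, width: int, height: int) -> bool:
    """Stage 1: cheap checks that need no model call."""
    if fmt.lower() not in SUPPORTED_FORMATS:
        return False                     # unsupported file format
    if min(width, height) < MIN_SIDE_PX:
        return False                     # too small to carry useful content
    if max(width, height) / min(width, height) > MAX_ASPECT_RATIO:
        return False                     # extreme aspect ratio, e.g. a divider or banner
    return True


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_is_useful(image_vec, useful_prompt_vec, junk_prompt_vec) -> bool:
    """Stage 2: keep the image if it sits closer to the 'useful' prompt embedding."""
    return cosine(image_vec, useful_prompt_vec) > cosine(image_vec, junk_prompt_vec)


def should_caption(fmt, width, height, image_vec, useful_vec, junk_vec) -> bool:
    """Run the cheap stage first so the classifier only sees plausible candidates."""
    return (passes_heuristics(fmt, width, height)
            and zero_shot_is_useful(image_vec, useful_vec, junk_vec))
```

Ordering the stages this way means the embedding comparison, and later the captioning call, is only spent on images that survive the free checks.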

Ultimately, the economic viability of a RAG system is not determined by the size of the model, but by the precision of the indexing architecture.
