MEMO Framework Updates LLM Knowledge Without Changing Model Parameters

A score of 0.00%. That is the result the Cartridges model recorded on the BrowseComp-Plus benchmark. It is the AI equivalent of a student who studied the entire syllabus but failed to answer a single question on the exam. This failure is not an isolated incident of a poorly tuned model; rather, it exposes a structural flaw in how large language models (LLMs) acquire and retain new information. Once the pre-training phase ends, an LLM's knowledge is effectively frozen in time. To update this knowledge, developers face a brutal choice: spend millions on full retraining, risk catastrophic forgetting through fine-tuning, or rely on Retrieval-Augmented Generation (RAG), which often collapses under the weight of noisy documents or complex, multi-hop queries.

The Modular Architecture of Memory as a Model

To break this deadlock, a joint research team from the National University of Singapore (NUS), MIT CSAIL, A*STAR, and SMART has introduced MEMO (Memory as a Model), a framework detailed in arXiv 2605.15156. The fundamental premise of MEMO is the total decoupling of memory and reasoning. Instead of forcing a single model to both store facts and reason over them, MEMO splits these roles between a dedicated MEMORY model and an EXECUTIVE model.

In the team's implementation, the MEMORY model is based on Qwen2.5-14B-Instruct and is trained to internalize a given corpus of knowledge in its parameters. The EXECUTIVE model, which handles the logic and synthesis, can be any high-performance model such as Qwen2.5-32B-Instruct or Gemini-3-Flash. Crucially, the EXECUTIVE model can be a black-box API. It needs no access to the MEMORY model's weights or logits, and the two models interact only through standard text input and output. This means the reasoning engine can be upgraded or swapped without retraining the MEMORY model.

The operational flow of MEMO relies on a three-stage, multi-turn protocol. In the first stage, Grounding, the EXECUTIVE model decomposes a complex user query into atomic sub-questions. Each sub-question targets a single identifying constraint, and the MEMORY model answers each one independently. This decomposition is meant to avoid the reasoning errors common in RAG systems that attempt to process large chunks of text at once. The second stage is Entity Identification, where the EXECUTIVE model analyzes the grounding responses and issues targeted follow-up queries to narrow down the specific entity in question. This iterative process continues until the entity is confirmed or the query budget is exhausted.

Finally, the system enters the Answer Seeking and Synthesis stage. Once the entity is locked, the EXECUTIVE model requests specific supporting facts from the MEMORY model. These facts are returned as highly compressed natural language snippets, which the EXECUTIVE model combines into the final answer. Because the MEMORY model has internalized the knowledge, the cost of each lookup stays constant regardless of how large the original corpus was. This stands in contrast to RAG, where inference cost grows with the volume of retrieved documents placed in the context.
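The three stages can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Executive` hooks, the `run_memo` function, and the `max_follow_ups` parameter are hypothetical names, and in a real system each hook would be a prompt sent to the EXECUTIVE LLM while `memory` would be a call to the fine-tuned MEMORY model.

```python
from typing import Callable, List, Optional, Protocol, Tuple

Memory = Callable[[str], str]  # hypothetical interface: text in, text out
Note = Tuple[str, str]         # (question sent to MEMORY, answer received)


class Executive(Protocol):
    """Hypothetical hooks; each would be an LLM prompt in practice."""

    def decompose(self, query: str) -> List[str]: ...
    def identify(self, notes: List[Note]) -> Optional[str]: ...
    def follow_up(self, notes: List[Note]) -> str: ...
    def fact_questions(self, query: str, entity: str) -> List[str]: ...
    def synthesize(self, query: str, notes: List[Note]) -> str: ...


def run_memo(query: str, executive: Executive, memory: Memory,
             max_follow_ups: int = 4) -> str:
    notes: List[Note] = []

    def ask(question: str) -> None:
        # The EXECUTIVE only ever sees the MEMORY model's text output.
        notes.append((question, memory(question)))

    # Stage 1, Grounding: one atomic sub-question per identifying constraint.
    for sub_question in executive.decompose(query):
        ask(sub_question)

    # Stage 2, Entity Identification: follow up until confirmed or out of budget.
    entity = executive.identify(notes)
    follow_ups = 0
    while entity is None and follow_ups < max_follow_ups:
        ask(executive.follow_up(notes))
        follow_ups += 1
        entity = executive.identify(notes)

    # Stage 3, Answer Seeking and Synthesis: fetch supporting facts, then answer.
    if entity is not None:
        for fact_question in executive.fact_questions(query, entity):
            ask(fact_question)
    return executive.synthesize(query, notes)


class ToyExecutive:
    """Rule-based stand-in so the loop can run without any LLM."""

    def decompose(self, query):
        return ["Which author wrote a novel about a white whale?"]

    def identify(self, notes):
        return "Herman Melville" if any("Melville" in a for _, a in notes) else None

    def follow_up(self, notes):
        return "Who wrote Moby-Dick?"

    def fact_questions(self, query, entity):
        return [f"Where was {entity} born?"]

    def synthesize(self, query, notes):
        return notes[-1][1]  # toy rule: the last retrieved fact is the answer


toy_facts = {
    "Which author wrote a novel about a white whale?": "Herman Melville",
    "Where was Herman Melville born?": "New York City",
}
answer = run_memo("Where was the author of the white-whale novel born?",
                  ToyExecutive(), lambda q: toy_facts.get(q, "unknown"))
print(answer)
```

Keeping `memory` as a plain string-to-string callable mirrors the black-box property described earlier: either side can be replaced without the other knowing.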

To populate the MEMORY model without manual labeling, the researchers developed a five-step synthetic data pipeline using Qwen2.5-32B-Instruct as a Generator. This pipeline converts raw corpora into reflective QA datasets, transforming static text into diverse query-answer pairs. The final refinement step matters a great deal: when Step-5 was removed, accuracy on the NarrativeQA benchmark fell from 24.00% to 6.37%. The resulting QA pairs are then used for Supervised Fine-Tuning (SFT) of the MEMORY model, with the loss computed only on the answer tokens.
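Computing the loss only on answer tokens is usually done by masking the question positions out of the labels. The sketch below shows that idea in plain Python. It is an illustration, not the team's training code: the function names are hypothetical, the value -100 is borrowed from the default ignore index of PyTorch's cross-entropy loss, and the logits are assumed to be already aligned with the labels (a real causal-LM setup shifts them by one position).

```python
import math
from typing import List, Sequence

IGNORE = -100  # label for positions that should not contribute to the loss


def build_labels(question_ids: Sequence[int], answer_ids: Sequence[int]) -> List[int]:
    """Mask every question token so only answer tokens carry a training signal."""
    return [IGNORE] * len(question_ids) + list(answer_ids)


def answer_only_loss(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy over unmasked positions; logits[t] scores labels[t]."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE:
            continue  # question token: skipped entirely
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 3 tokens, a 2-token question and a 1-token answer.
labels = build_labels(question_ids=[0, 1], answer_ids=[2])
logits = [[9.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
loss = answer_only_loss(logits, labels)
print(labels, loss)
```

Whatever the model predicts at the question positions has no effect on the result, which is the point of the masking: the model is trained to produce answers, not to reproduce questions.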

Noise Immunity and the Efficiency of TIES Merging

The central tension in current AI architecture is the trade-off between knowledge depth and noise sensitivity. RAG systems are notoriously fragile; if a retrieval engine pulls in a few irrelevant documents, the LLM often hallucinates or loses the thread of the query. MEMO sidesteps this by eliminating the external retrieval step during inference. Because the knowledge is internalized in the MEMORY model's parameters, there is no external document store from which distracting passages can be pulled into the context.

Benchmark results highlight this resilience. Using Gemini-3-Flash as the EXECUTIVE model, MEMO achieved 53.58% accuracy on NarrativeQA, more than double the 23.21% recorded by HippoRAG2. On the MuSiQue benchmark, MEMO reached 60.20%, surpassing HippoRAG2's 57.00%. In the BrowseComp-Plus environment, MEMO maintained a slight lead with 66.67% compared to 66.33%.

When the researchers introduced distracting documents into the BrowseComp-Plus benchmark to test noise tolerance, the difference was clear. HippoRAG2 and NV-Embed-V2 saw performance drops of up to 6.22%. MEMO, however, showed a change of only +0.55%, remaining effectively stable. This suggests that a decoupled memory architecture can maintain high precision even in noisy information environments where traditional search-based systems degrade.

Beyond accuracy, the framework addresses the prohibitive cost of updating knowledge. Rather than full retraining, the team employed TIES (Trim, Elect Sign, and Merge) merging. This technique extracts the parameter differences (task vectors) from models trained on specific knowledge domains, trims each one to its largest-magnitude entries, resolves sign conflicts between them, and merges the result into a single model, here using a density value of 0.3. The computational savings are substantial. In a setting with K=2 corpora, TIES merging reduced GPU time from 72 hours to 48 hours, a 33% saving. When scaled to K=10, the gap widened significantly: full retraining required 1,320 hours, while TIES merging completed the task in just 240 hours, a 5.5x reduction in GPU time.
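To make the merging step concrete, here is a minimal sketch of TIES merging on flat lists of weights. It follows the published trim, elect-sign, and merge steps, but it is a toy illustration rather than the team's code: the function names and the `scale` parameter are assumptions, and real implementations operate on full model tensors.

```python
from typing import List

Vector = List[float]


def trim(task_vector: Vector, density: float) -> Vector:
    """Keep roughly the top `density` fraction of entries by magnitude, zero the rest."""
    keep = max(1, int(round(density * len(task_vector))))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[keep - 1]
    # Entries tied with the threshold are all kept, so slightly more may survive.
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]


def ties_merge(base: Vector, finetuned: List[Vector],
               density: float = 0.3, scale: float = 1.0) -> Vector:
    # Task vector = fine-tuned weights minus the shared base weights.
    task_vectors = [[w - b for w, b in zip(model, base)] for model in finetuned]
    trimmed = [trim(tv, density) for tv in task_vectors]

    merged = []
    for i, base_weight in enumerate(base):
        column = [tv[i] for tv in trimmed]
        total = sum(column)  # elected sign = sign of the summed trimmed values
        if total == 0:
            merged.append(base_weight)  # nothing survived, or exact cancellation
            continue
        # Disjoint merge: average only the entries that agree with the elected sign.
        agreeing = [v for v in column if v != 0.0 and (v > 0) == (total > 0)]
        merged.append(base_weight + scale * sum(agreeing) / len(agreeing))
    return merged


# Two toy "domain" models fine-tuned from the same 10-weight base.
base = [0.0] * 10
model_a = [4.0, -3.0, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
model_b = [2.0, 5.0, 0.1, 3.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
merged = ties_merge(base, [model_a, model_b], density=0.3)
print(merged)
```

In the second weight the two models disagree in sign (-3.0 versus 5.0); the sign election keeps only the positive update rather than averaging the two toward zero, which is the interference TIES is designed to avoid.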

There is a performance trade-off associated with this efficiency. The merged models showed a performance gap compared to fully retrained models: 11.04% when using Qwen2.5-32B-Instruct and 19.11% with Gemini-3-Flash. However, even with this dip, the merged MEMO models consistently outperformed all RAG-based baselines. This suggests that for many industrial applications, the 5.5x reduction in cost is a justifiable trade for performance that still exceeds the retrieval-based baselines tested.
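The GPU-hour figures quoted above can be checked with a few lines of arithmetic. The helper names below are hypothetical; the hours are the ones reported in this article.

```python
def saving_fraction(full_hours: float, merged_hours: float) -> float:
    """Fraction of GPU time saved by merging instead of full retraining."""
    return 1 - merged_hours / full_hours


def speedup(full_hours: float, merged_hours: float) -> float:
    """How many times less GPU time merging needs."""
    return full_hours / merged_hours


# (full retraining hours, TIES merging hours) for each reported setting.
reported = {"K=2": (72, 48), "K=10": (1320, 240)}
for setting, (full, merged_hours) in reported.items():
    print(setting,
          f"saving={saving_fraction(full, merged_hours):.1%}",
          f"speedup={speedup(full, merged_hours):.2f}x")
```

Expressed as a fraction, the 5.5x reduction at K=10 corresponds to saving roughly 82% of the GPU time, compared with 33% at K=2.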

To check that the framework wasn't dependent on a specific model family, the researchers tested various MEMORY model architectures. They compared Qwen2.5-1.5B-Instruct, Gemma3-1B-IT, and LFM2.5-1.2B-Instruct (a hybrid State Space Model and Transformer). The results remained consistent across all three, suggesting that MEMO is largely architecture-agnostic. This flexibility allows developers to choose the most efficient small language model (SLM) for their specific language or infrastructure needs while relying on a powerful, separate API for reasoning.

By treating memory as a modular component rather than a fixed weight in a monolithic network, MEMO transforms the LLM from a static archive into a dynamic system capable of efficient, low-cost updates.
