For months, the standard operating procedure for developers deploying Vision-Language Models (VLMs) has been a simple, output-based audit. If the model generates a description of a medical scan without mentioning the patient's name or the hospital's watermark, the developer assumes the sensitive data is safely contained. This reliance on the final text string as the sole arbiter of privacy creates a dangerous blind spot. It assumes that the process of transforming internal high-dimensional representations into a few words acts as a perfect filter, scrubbing away any information not explicitly requested by the prompt.
Probing the Architecture of Vision-Language Models
A new paper titled "What Do Your Logits Know?" by Masha Fedzechkina, Eleonora Gualdoni, Rita Ramos, and Sinead Williamson challenges this assumption by systematically probing the internal representations of VLMs. The research builds on a growing body of work in AI interpretability that suggests models often "know" far more than they actually express in their final output. To test this, the team focused on two specific bottlenecks where information is compressed as it moves from the model's deep layers toward the final prediction.
The first bottleneck is the tuned lens, a learned probe that maps the residual stream—the primary highway of information flowing through the model's layers—into the vocabulary space at intermediate layers. This allows researchers to see what the model is "thinking" before it commits to a specific word. The second bottleneck is the set of top-k logits. Logits are the raw, unnormalized scores the model assigns to every possible next token in its vocabulary before the softmax function converts them into probabilities. Because many serving APIs expose them, these logits are the most accessible point of the model's internal state for anyone with access to the model's probability distributions.
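To make the second bottleneck concrete, here is a minimal sketch of how logits arise from a hidden state and how the top-k slice that APIs expose is carved out. The dimensions, the hidden state h, and the unembedding matrix W_U are all toy stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 64, 1000                 # toy dimensions (hypothetical)
h = rng.normal(size=d_model)              # a residual-stream hidden state
W_U = rng.normal(size=(d_model, vocab))   # unembedding matrix

# Logits: raw, unnormalized scores over the whole vocabulary.
logits = h @ W_U

# Softmax converts logits into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Many APIs return only the k highest-scoring tokens.
k = 5
top_k = np.argsort(logits)[-k:][::-1]     # indices of the top-k logits
```

The point of the paper is that even this truncated view—a handful of scores out of the full vocabulary—still reflects a great deal of the hidden state h that produced it.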
The researchers compared how much task-irrelevant information from an image survives these two different compression points. They discovered that the top-k logits do not merely contain the information necessary to produce the correct answer. Instead, they often retain a surprising amount of data about the image that has nothing to do with the user's query. In several experimental cases, the information leaked through the logits was nearly equivalent to the information found in the direct projection of the entire residual stream.
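The standard way to measure whether such task-irrelevant information survives a bottleneck is to train a simple probe—often a linear classifier—to recover an attribute the query never asked about. The sketch below illustrates the idea on synthetic data (the leakage is injected by hand for the demo; this is not the authors' exact experimental setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 "images", each with a binary task-irrelevant
# attribute (say, background color) and a k-dimensional top-k logit vector.
n, k = 200, 20
attribute = rng.integers(0, 2, size=n)

logit_features = rng.normal(size=(n, k))
# If the logits leak the attribute, some dimensions correlate with it;
# here we inject that correlation explicitly for illustration.
logit_features[:, 0] += 2.0 * attribute

# Train a linear probe on the first 150 examples, test on the rest.
probe = LogisticRegression().fit(logit_features[:150], attribute[:150])
acc = probe.score(logit_features[150:], attribute[150:])
# Held-out accuracy well above chance means the attribute survived
# the compression into the top-k logits.
```

Running the same probe on the tuned-lens projection and on the top-k logits, and comparing the two accuracies, is the shape of the comparison the researchers performed.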
The Collapse of the Output-Only Safety Paradigm
The critical insight here is the distinction between intentional and natural bottlenecks. The tuned lens is an artificial projection designed for analysis, whereas the logits are a natural byproduct of the model's architecture. For years, the industry has operated under the belief that the transition from the residual stream to the logits acted as a lossy compression that discarded irrelevant details. The logic was that if the model is asked to identify a breed of dog, the logits for "Golden Retriever" would be high, and the information about the color of the carpet in the background would be discarded because it is not needed for the prediction.
However, the findings in "What Do Your Logits Know?" show that this natural bottleneck is far leakier than previously thought. The logits act as a high-fidelity mirror of the residual stream's hidden state. This means that even if the model's final text output is perfectly sanitized, the underlying logit distribution still encodes sensitive, task-irrelevant details of the input image. This creates a massive disparity between what the model says and what the model reveals.
This revelation transforms the logit from a mathematical utility into a potential security vulnerability. Because logits are often exposed via APIs to allow developers to adjust temperature or perform beam search, they provide a direct window into the model's internal knowledge. If an adversary can probe these logits, they can extract information that the model was explicitly trained or prompted to ignore. The bottleneck that developers trusted to protect privacy is, in reality, a wide-open door.
Model providers can no longer treat the final text generation as the only surface area for data leakage. The fact that top-k logits preserve so much of the residual stream's information suggests that the model is carrying a heavy load of unnecessary data all the way to the final output layer. This necessitates a fundamental shift in how VLM safety is implemented, moving away from output filtering and toward internal representation management.
To mitigate this risk, the research suggests that model providers may need to implement active interference at the logit level. This could include adding calibrated noise to the logits to mask irrelevant information or removing specific dimensions of the representation that are prone to leaking sensitive data. Without these interventions, any API that provides raw logit access is effectively leaking a compressed version of the input image's entire feature set.
The era of believing that a clean text response equals a secure model is over. Every single logit may be harboring the secrets of the image it processed.