For millions of people living with diabetes, the promise of AI-powered nutrition tracking is simple: snap a photo of a meal, let the model estimate the carbohydrate content, and dose insulin accordingly. It is a workflow that has moved from experimental novelty to daily necessity. Yet a recent stress test of leading large language models reveals a dangerous reality. When the same food image is submitted to the same model 500 times, the resulting carbohydrate estimates fluctuate wildly, creating a margin of error that could lead to life-threatening insulin miscalculations.

Benchmarking 26,904 Queries Across Four Leading Models

To quantify this instability, researchers conducted a rigorous evaluation of four state-of-the-art models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, and Google Gemini 3.1 Pro Preview. The study used 13 distinct photographs of food, submitting each image to each model more than 500 times for a total of 26,904 data points. To ensure the results were not skewed by stochastic generation, all tests were run at a temperature of 0, the lowest-randomness setting available. The prompts were derived directly from the operational environment of iAPS, an open-source automated insulin delivery system, so the test reflected real-world clinical usage.
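The shape of such a benchmark is straightforward to reproduce. Below is a minimal sketch of a repeated-query harness in Python; the model names, image count, and per-pair query count mirror the study's setup, but the estimate_carbs wrapper is a hypothetical stand-in that would need to be wired to each provider's vision API at temperature 0.

```python
import statistics
from collections import defaultdict

# Hypothetical wrapper: in a real harness this would send the image and the
# iAPS-derived prompt to the given model at temperature=0 and parse the
# grams of carbohydrate out of the response.
def estimate_carbs(model: str, image_path: str, prompt: str) -> float:
    raise NotImplementedError("wire this to your model provider's API")

MODELS = ["gpt-5.4", "claude-sonnet-4.6", "gemini-2.5-pro", "gemini-3.1-pro-preview"]
IMAGES = [f"meal_{i:02d}.jpg" for i in range(13)]  # 13 food photographs
PROMPT = "Estimate the total carbohydrate content of this meal in grams."
N_QUERIES = 500  # repeated queries per (model, image) pair

results: dict[tuple[str, str], list[float]] = defaultdict(list)
for model in MODELS:
    for image in IMAGES:
        for _ in range(N_QUERIES):
            results[(model, image)].append(estimate_carbs(model, image, PROMPT))

# Summarize the spread of estimates for each model/image pair.
for (model, image), grams in results.items():
    print(f"{model} / {image}: min={min(grams):.0f}g max={max(grams):.0f}g "
          f"stdev={statistics.stdev(grams):.1f}g")
```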

Clinical Risk and Model Variance

In practice, users have tended to treat AI-generated nutritional data as a single, objective truth. The test results, however, demonstrate that consistency varies drastically between architectures. When analyzing a photograph of paella, for instance, Google Gemini 2.5 Pro produced estimates ranging from 55g to 484g of carbohydrates. At a standard 1:10 insulin-to-carbohydrate ratio (ICR), where every 10g of carbohydrate translates into one unit of insulin, that 429g spread corresponds to a discrepancy of 42.9 units. For a patient relying on these numbers to manage blood glucose, such a swing is not merely a technical glitch; it is a clinical emergency. In contrast, Anthropic Claude Sonnet 4.6 maintained a significantly tighter distribution of results, demonstrating higher internal consistency than its peers.
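The arithmetic is direct: under a 1:10 ICR, a spread in carbohydrate estimates maps linearly onto a spread in dose. A minimal illustration using the paella extremes reported above:

```python
# Insulin-to-carbohydrate ratio: 1 unit of insulin per 10 g of carbohydrate.
ICR_GRAMS_PER_UNIT = 10.0

def insulin_units(carbs_g: float, icr: float = ICR_GRAMS_PER_UNIT) -> float:
    """Convert a carbohydrate estimate in grams to an insulin dose in units."""
    return carbs_g / icr

# Extremes of the Gemini 2.5 Pro paella estimates, in grams.
low, high = 55.0, 484.0
spread = insulin_units(high) - insulin_units(low)
print(f"Dose discrepancy: {spread:.1f} units")  # -> 42.9 units
```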

The Illusion of Confidence and Systematic Bias

Perhaps the most concerning finding for developers and clinicians alike is the total lack of correlation between a model's internal confidence score and its actual accuracy. Across all four models, the confidence scores, typically ranging from 0 to 1, proved functionally useless. Claude frequently reported high confidence even when its accuracy was low, while the Gemini models insisted on confidence scores above 0.9 in more than 80% of cases, regardless of the actual error rate. This indicates that these models lack any meaningful mechanism for recognizing their own uncertainty in a clinical context.
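One way to make that failure concrete is to test whether self-reported confidence predicts error at all, for example with a rank correlation between confidence and absolute error. The sketch below uses illustrative numbers, not the study's data; a well-calibrated model would show a strongly negative correlation, while a value near zero means the score carries no information.

```python
from scipy.stats import spearmanr

# Illustrative paired observations from a query log: the model's self-reported
# confidence (0-1) and the absolute carbohydrate error (grams) for each query.
confidences = [0.95, 0.92, 0.97, 0.90, 0.94, 0.96, 0.91, 0.93]
abs_errors_g = [12.0, 210.0, 5.0, 88.0, 150.0, 30.0, 9.0, 310.0]

# If confidence carried information, high confidence should track low error.
rho, p_value = spearmanr(confidences, abs_errors_g)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")

# A calibration-style check: how often does the model claim >0.9 confidence?
overconfident = sum(c > 0.9 for c in confidences) / len(confidences)
print(f"Share of queries with confidence > 0.9: {overconfident:.0%}")
```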

Furthermore, the study identified a systematic bias toward overestimation. OpenAI GPT-5.4 consistently overestimated carbohydrate content, producing an average excess of 1.2 units of insulin per meal; over three daily meals, that compounds to a cumulative error of 3.6 units. Anthropic Claude Sonnet 4.6 was the only model that avoided generating results with a clinically dangerous error margin of 5 units or more at any point in the testing cycle.
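Under the same 1:10 ICR, converting a carbohydrate misestimate into a dose error and flagging it against the study's 5-unit danger threshold is a one-line check. A minimal sketch with illustrative inputs (only the bias figures and the threshold come from the study):

```python
ICR_GRAMS_PER_UNIT = 10.0
DANGER_THRESHOLD_UNITS = 5.0  # error margin the study treats as clinically dangerous

def dose_error_units(estimated_g: float, actual_g: float) -> float:
    """Signed insulin-dose error implied by a carbohydrate misestimate."""
    return (estimated_g - actual_g) / ICR_GRAMS_PER_UNIT

# GPT-5.4's average overestimation: +1.2 units per meal, three meals a day.
per_meal_bias_units = 1.2
daily_bias_units = 3 * per_meal_bias_units
print(f"Cumulative daily overdose: {daily_bias_units:.1f} units")  # -> 3.6

# Flagging a single clinically dangerous estimate (illustrative numbers):
error = dose_error_units(estimated_g=140.0, actual_g=80.0)
print(f"Dose error: {error:+.1f} units, "
      f"dangerous: {abs(error) >= DANGER_THRESHOLD_UNITS}")  # -> +6.0, True
```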

AI-driven nutritional analysis remains a probabilistic process that cannot currently guarantee the clinical safety required for autonomous medical decision-making.