The modern AI researcher spends a significant portion of their week staring at spreadsheets. These tables are often monstrous, filled with dozens of columns representing disparate benchmarks, percentage gains, and token-efficiency ratios that offer little intuitive sense of how a model actually feels to use. Tech critic Thibaut Mélen recently captured this collective frustration on X, noting that model progress becomes far easier to grasp when the data is mapped onto a scale we already understand. The industry is shifting away from raw percentage points and toward a more human-centric metric: the Intelligence Quotient.
The Architecture of the AI IQ Bell Curve
This shift has been formalized by engineer and investor Ryan Shea through aiiq.org, a platform that maps over 50 of the latest large language models onto a human-style IQ bell curve. As of mid-May 2026, the data places OpenAI's GPT-5.5 at the absolute peak of the distribution with an estimated IQ of 136. The competition at the top is fierce, with a very narrow margin separating the leaders. Anthropic's Opus 4.7 follows closely with an IQ of approximately 132, while GPT-5.4 and Google's Gemini 3.1 Pro both hover around 131.
To arrive at these numbers, the system does not rely on a single test but aggregates 12 distinct benchmarks divided into four core reasoning domains. Abstract reasoning is measured via ARC-AGI-1 and ARC-AGI-2, which test a model's ability to recognize and extrapolate patterns. Mathematical reasoning is derived from FrontierMath, AIME, and ProofBench, focusing on complex problem-solving. Programming reasoning utilizes Terminal-Bench 2.0, SWE-Bench Verified, and SciCode to evaluate code generation and debugging capabilities. Finally, academic reasoning is gauged through high-difficulty knowledge tests including Humanity's Last Exam, CritPt, and GPQA Diamond.
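To make the aggregation concrete, here is a minimal sketch of how twelve benchmarks in four domains could be rolled up into an IQ-style score. The benchmark names come from the article, but the equal weighting within and across domains, and the scaling to mean 100 with a standard deviation of 15 over the model population, are illustrative assumptions; aiiq.org's exact formula may differ.

```python
from statistics import mean, stdev

# Domain groupings as described in the article. The equal-weight averaging
# and the IQ scaling below are assumptions for illustration only.
DOMAINS = {
    "abstract": ["ARC-AGI-1", "ARC-AGI-2"],
    "math": ["FrontierMath", "AIME", "ProofBench"],
    "programming": ["Terminal-Bench 2.0", "SWE-Bench Verified", "SciCode"],
    "academic": ["Humanity's Last Exam", "CritPt", "GPQA Diamond"],
}

def composite(scores: dict[str, float]) -> float:
    """Average benchmarks within each domain, then average the four domains."""
    return mean(mean(scores[b] for b in benches) for benches in DOMAINS.values())

def to_iq(composites: dict[str, float]) -> dict[str, float]:
    """Map raw composites onto an IQ scale: mean 100, SD 15 over the population."""
    mu, sigma = mean(composites.values()), stdev(composites.values())
    return {m: 100 + 15 * (c - mu) / sigma for m, c in composites.items()}
```

The two-stage average keeps a domain with many benchmarks (math has three) from dominating one with few (abstract reasoning has two), which is one plausible reading of "four core reasoning domains."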
Shea's methodology includes a critical layer of manual adjustment to ensure the scores remain honest. By applying a difficulty curve, the system prevents models from inflating their scores through tests that are prone to data contamination or simply too easy. For models where data is sparse, the system adopts a conservative approach, effectively penalizing the score to avoid overestimation. This rigorous filtering reveals a dense cluster of mid-tier models, primarily from Chinese developers. Models such as Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, and MiniMax-M2.7 are tightly packed between IQ 112 and 118, providing a clear indicator for enterprise users seeking high performance-to-cost ratios.
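Both adjustments can be sketched in a few lines. The difficulty weights and the idea of shrinking sparse scores toward the population mean are assumptions about how such a penalty might be implemented, not a description of aiiq.org's actual code.

```python
def curved_composite(scores: dict[str, float], difficulty: dict[str, float]) -> float:
    """Difficulty-weighted average: easy or contamination-prone tests count less.

    `difficulty` maps each benchmark to a weight (hypothetical values chosen
    by the maintainer); a weight near 0 mostly discounts that test.
    """
    total_w = sum(difficulty[b] for b in scores)
    return sum(scores[b] * difficulty[b] for b in scores) / total_w

def conservative(iq: float, n_reported: int, n_total: int = 12, prior: float = 100.0) -> float:
    """Shrink sparse scores toward the population mean (a deliberate penalty).

    A model with results on all 12 benchmarks keeps its score; one with only
    half the benchmarks reported is pulled halfway toward the prior of 100.
    """
    coverage = n_reported / n_total
    return prior + (iq - prior) * coverage
```

Note that the shrinkage cuts both ways: a sparse high score is pulled down toward 100, which matches the article's point that the system prefers underestimation to overestimation.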
The Tension Between Logic and Empathy
While a single IQ number provides a convenient shorthand, it masks a deeper complexity in how these models function. The real insight emerges when IQ is contrasted with EQ, or emotional intelligence. The AI IQ framework introduces an EQ score calculated by blending EQ-Bench 3 Elo, which measures empathy and emotional understanding, with Arena Elo, a relative ranking based on human preference votes, in a 50-50 ratio.
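Since the two Elo ratings live on different scales, a 50-50 blend only makes sense after normalization. The sketch below standardizes each rating to z-scores across the model population before averaging, then maps the blend onto the same 100/15 scale used for IQ; the normalization step and final scaling are assumptions, while the 50-50 weighting comes from the article.

```python
from statistics import mean, stdev

def zscores(values: dict[str, float]) -> dict[str, float]:
    """Standardize a rating set to mean 0, SD 1 across the population."""
    mu, sd = mean(values.values()), stdev(values.values())
    return {k: (v - mu) / sd for k, v in values.items()}

def eq_scores(eqbench_elo: dict[str, float],
              arena_elo: dict[str, float],
              w: float = 0.5) -> dict[str, float]:
    """Blend EQ-Bench 3 Elo and Arena Elo 50-50 on a common scale.

    The z-score normalization and the 100/15 rescaling are illustrative
    assumptions; only the 50-50 ratio is stated in the source.
    """
    zb, za = zscores(eqbench_elo), zscores(arena_elo)
    return {m: 100 + 15 * (w * zb[m] + (1 - w) * za[m]) for m in zb}
```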
When these two metrics are plotted on a scatter plot, the hierarchy shifts. While GPT-5.5 maintains the lead in raw cognitive power, Anthropic's Opus 4.7 emerges as the most balanced model, occupying the upper-right quadrant of the graph. This suggests that while OpenAI may hold the edge in pure logic, Anthropic has found a more harmonious intersection between intellectual capability and human-like interaction.
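One way to operationalize "most balanced" on such a scatter plot is to rank models by the weaker of their two scores, so a model cannot compensate for low EQ with very high IQ. This maximin criterion, the quadrant midpoints, and the model values in the example are all illustrative assumptions, not aiiq.org's definitions.

```python
def quadrant(iq: float, eq: float, iq_mid: float = 100.0, eq_mid: float = 100.0) -> str:
    """Place a model in one of four quadrants of the IQ-vs-EQ scatter."""
    vertical = "upper" if eq >= eq_mid else "lower"
    horizontal = "right" if iq >= iq_mid else "left"
    return f"{vertical}-{horizontal}"

def most_balanced(models: dict[str, tuple[float, float]]) -> str:
    """Return the model maximizing its weaker score (maximin over IQ and EQ)."""
    return max(models, key=lambda m: min(models[m]))
```

Under this criterion, a model with scores (132, 118) beats one with (136, 105) despite the lower IQ, mirroring the article's observation that the raw-intelligence leader and the most balanced model need not coincide.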
However, this quantification of intelligence is not without its critics. Many developers and researchers argue that AI intelligence is fundamentally spiky rather than general. A model might exhibit genius-level proficiency in Python while failing a basic common-sense reasoning task, making a single averaged IQ score potentially misleading. There is also a brewing controversy regarding the objectivity of the EQ measurements. Because the scoring for EQ-Bench 3 is performed by Claude, an Anthropic model, critics suggest a systematic bias that may unfairly favor the Opus series.
Ultimately, the attempt to define machine intelligence through a single number reflects a human desire for simplicity over a technical reality of fragmentation. We are trying to fit a multidimensional alien intelligence into a one-dimensional human scale.
This movement toward standardized intelligence metrics suggests that the next battleground for AI labs will not be raw power, but the optimization of the balance between logic and empathy.