This week's AI landscape is marked by significant leaps in both mathematical reasoning and business sustainability. In a major technical milestone, an internal OpenAI model overturned a long-standing conjecture attached to an Erdős geometry problem, demonstrating a level of logical precision previously unseen in large models. Simultaneously, new video action models (AI designed to understand and interact with visual environments) are now outperforming established systems like GPT 5.2 and Gemini 2.5 in specific tasks. On the operational side, Anthropic expects to post its first profitable quarter, while Google's Gemini app continues its rapid expansion, reaching 900 million monthly active users.
The industry is also refining how models are deployed and managed. Agentic workflows, in which AI plans and executes multi-step tasks independently, are prompting new practices to address "FOMAT" (Fear of Missing Agent Time), the worry that agents stall or go unsupervised while their developer is away. Other updates include the pairing of Qwen3-VL-8B-Instruct with Kimi K2.5 and Anthropic's new granular token tracking for better cost management. Infrastructure is also scaling up, with SpaceX expanding its AI partnership and Callosum introducing automated model routing to optimize how requests are handled across different AI systems.
01 OpenAI Solves Erdős Geometry Problem
Artificial intelligence is crossing a threshold from simple data retrieval to genuine complex reasoning. This week, OpenAI demonstrated this leap by using an internal model to resolve a mathematical challenge that had remained open for eight decades. The problem, originally posed by the prolific mathematician Paul Erdős, has long served as a benchmark for geometric reasoning, and its resolution suggests that AI is becoming capable of tackling theoretical problems that have resisted human intuition for generations.
The specific challenge centered on a geometry problem involving the placement of points on a two-dimensional plane. For eighty years, the prevailing conjecture (that is, an unproven proposition) held that a square grid arrangement was the most efficient way to place points so as to maximize the number of pairs that are exactly one unit of distance apart. In simpler terms, if a researcher wanted to create as many connections of a specific length as possible between points on a flat surface, the square grid was believed to be the optimal blueprint.
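To make the quantity in question concrete, here is a minimal sketch that counts pairs of points exactly one unit apart. The function names and the unit-spaced integer grid are illustrative assumptions; the sketch only shows what is being counted and does not reproduce the conjectured optimal arrangements or OpenAI's construction.

```python
from itertools import combinations


def count_unit_pairs(points):
    """Count unordered pairs of integer-coordinate points at distance exactly 1."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        # Compare squared distances so the check stays exact on integers.
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == 1
    )


def square_grid(k):
    """Return the k*k points of a unit-spaced square grid (illustrative helper)."""
    return [(x, y) for x in range(k) for y in range(k)]


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k * k, "points:", count_unit_pairs(square_grid(k)), "unit pairs")
```

On a unit-spaced k-by-k grid this count is 2k(k-1), since only horizontal and vertical neighbors are exactly one unit apart.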
OpenAI's internal model disproved this long-standing theory by identifying a more efficient arrangement. Rather than relying on traditional two-dimensional logic, the model utilized multi-dimensional mathematics and then flattened those complex structures down into a two-dimensional plane. This sophisticated approach allowed the AI to produce a configuration that yielded more pairs than the standard grid arrangement, successfully overturning the conjecture.
This breakthrough is significant because it represents a shift in how AI handles abstract logic. By leveraging high-dimensional mathematics to solve a low-dimensional problem, the model displayed a level of creative reasoning typically associated with high-level human mathematicians. This suggests that future AI systems may not only assist in calculations but could actively drive theoretical discoveries in fields like geometry, solving legacy problems that have stood for generations.
02 Video Action Models Outperform GPT 5.2 and Gemini 2.5
AI systems are becoming significantly better at navigating the internet visually, moving beyond simple text-based interactions to actually "seeing" and interacting with web pages. In a recent result on visual web navigation, a new approach outperformed the industry's most powerful models. By combining several different AI models, a research team surpassed GPT 5.2 and Gemini 2.5 on the VideoWebArena benchmark by 18% and 25%, respectively. This suggests that the future of web-based AI may not lie in a single, massive model, but in a coordinated team of specialized ones.
The secret to this performance boost is the move away from a homogeneous approach, where one model handles every part of a task. Instead, the team treated web navigation as a multi-step process. They decomposed the problem into separate phases of visual reasoning—interpreting what is on the screen—and textual reasoning—understanding the written content. Because each of these sub-tasks requires different strengths, the system uses a mixture of both open-source and closed-source video action language models. By assigning the right model to the right step, the system emulates the intelligence of frontier models while achieving higher accuracy.
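The decomposition described above can be sketched as a dispatch table keyed by the kind of reasoning each step needs. This is a minimal illustration only: the `Step` type, the step kinds, and the two stand-in model functions are hypothetical and are not taken from the team's system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    kind: str      # "visual" or "textual" (hypothetical labels)
    payload: str   # what the step has to process


def visual_model(payload: str) -> str:
    """Stand-in for a model suited to interpreting the screen."""
    return f"[visual] {payload}"


def textual_model(payload: str) -> str:
    """Stand-in for a model suited to reading written content."""
    return f"[textual] {payload}"


# Each kind of sub-task is assigned to the model considered best for it.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "visual": visual_model,
    "textual": textual_model,
}


def run_task(steps: List[Step]) -> List[str]:
    """Run each step with the handler registered for its kind."""
    results = []
    for step in steps:
        handler = HANDLERS.get(step.kind)
        if handler is None:
            raise ValueError(f"no model registered for step kind {step.kind!r}")
        results.append(handler(step.payload))
    return results
```

In a real system the stand-in functions would wrap calls to actual open-source or closed-source models; the point of the sketch is only that the assignment happens per step rather than per task.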
This strategy of mixing and matching models is becoming central to how agentic AI systems are built. For example, developers can use efficient, free-to-run models like Gemma 4 for basic components while reserving more advanced, expensive models for the most difficult parts of a workflow. To make this work in practice, a seamless fallback system is required. If a high-end model reaches its capacity or token limit, the management infrastructure can automatically route the request to a different model, such as a "flash" or local version, ensuring the task is completed without interruption.
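A fallback chain of this kind might look like the following sketch. The `ModelUnavailable` exception and the two stub models are assumptions made for illustration; real providers signal capacity and token-limit errors in their own ways.

```python
class ModelUnavailable(Exception):
    """Raised by a model callable that is over capacity or out of token budget."""


def call_with_fallback(prompt, models):
    """Try each (name, callable) in order; return (name, output) from the first success."""
    errors = []
    for name, model in models:
        try:
            return name, model(prompt)
        except ModelUnavailable as exc:
            # Record the failure and move on to the next model in the chain.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models unavailable: " + "; ".join(errors))


def premium_model(prompt):
    """Hypothetical high-end model that has hit its limit."""
    raise ModelUnavailable("token limit reached")


def flash_model(prompt):
    """Hypothetical cheaper model that is still available."""
    return prompt.upper()


if __name__ == "__main__":
    chain = [("premium", premium_model), ("flash", flash_model)]
    print(call_with_fallback("summarize this page", chain))
```

Ordering the chain from most to least capable keeps quality as high as availability allows, at the price of extra latency whenever an earlier model fails.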
Despite these gains, testing these systems remains a significant hurdle. Evaluating how an AI navigates a live or simulated web environment is difficult due to the mechanical overhead of spinning up isolated test environments. Setting up these "sandboxes" in the specific way needed to evaluate a particular set of problems is one of the trickiest parts of the process. However, the ability to decompose complex visual tasks into manageable steps marks a major shift in how AI will eventually handle the open web.
03 Anthropic Reaches First Profitable Quarter
AI development is moving from a phase of burning venture capital to generating sustainable profit. Anthropic expects to achieve its first profitable quarter, which would make it the first AI laboratory to reach this milestone. The projection is subject to certain caveats, including specific revenue recognition methods and discounted compute access through a partnership with SpaceX, but it signals a fundamental shift in how the industry operates. For years, AI companies existed in a "subsidy era," offering flat-rate monthly plans under which lighter users effectively subsidized the most active ones.
This flat-rate model is becoming economically unviable due to the rise of "token-hungry agents," which are AI tools capable of performing complex, autonomous tasks. These agents consume vast quantities of tokens—the basic units of text processed by a model—at a rate that exceeds what a fixed monthly fee can cover. Consequently, the market is transitioning into a "trade-off era" defined by usage-based pricing. To help users manage these new costs, Anthropic recently introduced a tool that provides a breakdown of token consumption, allowing customers to identify which specific agents, plugins, or skills are the biggest "token hogs."
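The economics can be illustrated with a small break-even calculation. All prices and token volumes below are hypothetical and chosen only to show the arithmetic; they are not Anthropic's actual rates.

```python
def usage_cost(tokens: int, price_per_million: float) -> float:
    """Cost of processing `tokens` at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


def break_even_tokens(flat_fee: float, price_per_million: float) -> float:
    """Token volume at which usage-based cost equals the flat monthly fee."""
    return flat_fee / price_per_million * 1_000_000


if __name__ == "__main__":
    flat_fee = 20.0           # hypothetical monthly subscription, in dollars
    price_per_million = 5.0   # hypothetical serving cost per million tokens
    print(break_even_tokens(flat_fee, price_per_million))  # 4000000.0
    # A heavy agent user processing 10 million tokens costs more than the fee covers.
    print(usage_cost(10_000_000, price_per_million))       # 50.0
```

Under these assumed numbers, any user who consumes more than 4 million tokens a month costs more to serve than the flat fee brings in, which is the gap that token-hungry agents widen.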
The pressure to shift pricing is driven by an explosion in demand. Google, for example, saw its monthly token processing volume grow roughly sevenfold in a year, from 480 trillion last May to 3.2 quadrillion. However, the cost of serving these tokens is falling as efficiency improves. Tools like Cursor's Composer 2.5 have shown that performance comparable to state-of-the-art models can be achieved at 10 to 60 times lower cost. To further optimize its operations, Anthropic is recruiting Andrej Karpathy to lead a team focused on recursive self-improvement, a strategy in which the company's own model, Claude, is used to accelerate research and development on its own pre-training process.
04 Agentic Workflows Address FOMAT Inefficiency
Modern software development is shifting from a state of deep, solitary focus to a process of "agent choreography," where developers manage multiple AI agents working in parallel. This shift has introduced a new kind of anxiety known as the "Fear of Missing Agent Time," or FOMAT. Developers worry about missing critical progress or failing to capture sudden insights when they are away from their desks. To alleviate this, they are implementing systems that allow them to access and redirect their agents during breaks, ensuring that the AI continues to make progress regardless of the developer's location or time of day.
To optimize this collaboration, developers are employing hybrid tool strategies and rigorous documentation. For instance, during the creation of an Enterprise Resource Planning (ERP) system, a developer might utilize Claude Code for the initial architectural design and then switch to Codex to handle the specific implementation details. This workflow begins with a structured goal document written in Markdown—a plain-text formatting language—to provide the AI with necessary business context, such as the operational needs of a 16-employee company. To ensure no features are omitted, a JSON checklist serves as a technical guide, forcing the AI to systematically complete every required task without skipping steps.
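A JSON checklist of this kind could be checked mechanically, as in the sketch below. The field names (`id`, `description`, `done`) and the three example features are hypothetical; the source does not specify the checklist's schema.

```python
import json

# Hypothetical checklist; a real one would live in a file next to the goal document.
CHECKLIST_JSON = """
[
  {"id": "inventory", "description": "Inventory tracking screen", "done": true},
  {"id": "payroll", "description": "Payroll export", "done": false},
  {"id": "auth", "description": "Login and roles", "done": false}
]
"""


def pending_tasks(checklist_json: str) -> list:
    """Return the ids of checklist items not yet marked done, in file order."""
    items = json.loads(checklist_json)
    return [item["id"] for item in items if not item.get("done", False)]


if __name__ == "__main__":
    print(pending_tasks(CHECKLIST_JSON))  # ['payroll', 'auth']
```

Feeding the list of pending ids back to the agent after each run is one simple way to keep it from declaring the work finished while items remain open.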
Beyond logic, developers are using specific constraints to maintain visual and operational quality. By combining the shadcn/ui library for structural consistency with a design document referencing the aesthetic of ClickUp, they can guide the AI to produce professional user interfaces. Automation is further enhanced through "hooks," which are scripts that automatically update project documentation whenever an agent completes a task. On the infrastructure side, tools like Docker Compose are used to isolate the PostgreSQL database in a virtual container, while security is tightened by restricting system access to specific company email domains, such as @goldenlab.co.kr, to reduce the risk of unauthorized access during deployment.
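The domain restriction mentioned above can be sketched as a simple allowlist check. Only the `goldenlab.co.kr` domain comes from the text; the function name and the validation rules are assumptions, and a production system would enforce this inside its authentication provider rather than with a standalone helper.

```python
ALLOWED_DOMAINS = {"goldenlab.co.kr"}


def is_allowed(email: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Return True only if the address ends in exactly one of the allowed domains."""
    local, sep, domain = email.strip().rpartition("@")
    if not local or not sep or "@" in local:
        return False  # missing local part, missing "@", or more than one "@"
    # Compare the whole domain, so look-alike suffixes are rejected.
    return domain.lower() in allowed


if __name__ == "__main__":
    print(is_allowed("kim@goldenlab.co.kr"))           # True
    print(is_allowed("kim@goldenlab.co.kr.evil.com"))  # False
```

Matching the full domain rather than using a substring or suffix test is the important design choice here, since `goldenlab.co.kr.evil.com` contains the allowed domain as a prefix.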
05 Qwen3-VL-8B-Instruct and Kimi K2.5 Integration
Running high-end AI can be prohibitively expensive and slow when companies rely on a single, massive "frontier" model to handle every task. A more efficient approach involves using a mixture of specialized models to divide the workload. By integrating Qwen3-VL-8B-Instruct, a compact 8-billion-parameter vision-language model, with Kimi K2.5, developers can approach the capability of the world's most powerful systems while drastically reducing overhead. This strategy moves away from monolithic deployments, where one model does everything, in favor of a modular system where the right tool is used for the right job.
The practical benefits of this integration are most evident in visual web navigation, the process of an AI interacting with a website as a human would. In this specific application, the combination of Qwen3-VL-8B-Instruct and Kimi K2.5 is 1.3 times faster than using Kimi K2.5 alone. Even more striking is the cost difference: this setup is 18 times cheaper to operate than relying solely on GPT 5.2. For businesses, this represents a massive shift in the economics of AI deployment, allowing them to scale complex automation without the crushing costs associated with the largest closed-source models.
This efficiency is possible because visual web navigation is not a single, uniform problem. Instead, it is heterogeneous, meaning it is made of several distinct steps that require different types of intelligence. Some steps demand visual reasoning to understand a page layout, while others require textual reasoning to process written information. By using a mixture of open and closed video action language models, this integrated approach outperformed state-of-the-art models on the VideoWebArena benchmark. Specifically, it beat GPT 5.2 by 18% and Gemini 2.5 by 25%, suggesting that a coordinated team of specialized models can be more capable than a single, monolithic giant.
06 Anthropic Adds Granular Token Tracking
Managing the cost of large-scale AI deployments often feels like guesswork, especially as companies move away from flat-rate pricing toward models where they pay for exactly what they use. To solve this visibility problem, Anthropic recently introduced a new tool that lets developers see exactly which parts of their AI system are consuming the most resources. By using a simple command, users can stop wondering why their bills are spiking and start identifying the specific "token hogs" that are driving up their operational expenses.
The company implemented a specific command, `/usage`, which provides a granular breakdown of token consumption. In the world of AI, tokens are the basic units of text that a model processes; the more tokens used, the higher the cost. This new command allows developers to audit their systems and see a detailed report of consumption across their components. Specifically, the tool tracks usage across different skills, autonomous agents, plugins, and Model Context Protocol (MCP) servers, which connect AI models to external data sources and tools through a standardized protocol. By isolating these different elements, developers can see whether a specific plugin is inefficient or whether a particular agent is looping and wasting resources.
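The kind of breakdown described here amounts to grouping token counts by component and ranking the totals. The sketch below shows that aggregation on made-up records; the record fields (`type`, `name`, `tokens`) are hypothetical and do not represent the actual output format of `/usage`.

```python
from collections import defaultdict


def tokens_by_component(records):
    """Sum tokens per (type, name) component and rank from largest to smallest."""
    totals = defaultdict(int)
    for rec in records:
        totals[(rec["type"], rec["name"])] += rec["tokens"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical usage records for illustration only.
    records = [
        {"type": "agent", "name": "researcher", "tokens": 1200},
        {"type": "plugin", "name": "search", "tokens": 300},
        {"type": "agent", "name": "researcher", "tokens": 800},
    ]
    for (kind, name), total in tokens_by_component(records):
        print(f"{kind:8} {name:12} {total}")
```

The first row of the ranked output is the biggest "token hog", which is usually where optimization effort pays off first.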
This level of transparency is critical as the industry shifts toward usage-based paradigms. When running models at scale, the actual cost of operation can be significantly higher than what early experiments suggest. Without granular tracking, a developer might know their overall spending is too high but have no way to determine if the issue lies in a specific skill or a broader architectural flaw. By providing this visibility, Anthropic is giving its users the data necessary to optimize their workflows, prune expensive but low-value components, and more accurately predict the long-term financial viability of their AI integrations. This shift transforms token management from a vague overhead cost into a manageable engineering metric.
07 SpaceX Expands AI Infrastructure Partnership
SpaceX is moving beyond rockets and satellites to establish itself as a critical player in the AI infrastructure market. By providing the massive computing power required to run advanced artificial intelligence, SpaceX is helping companies like Anthropic manage the immense technical demands of model serving—the process of making an AI model available and responsive to end-users. This shift indicates that the physical hardware and data centers required to power AI are becoming as strategically important as the algorithms themselves, effectively turning a space exploration company into a foundational pillar of the global AI economy.
Anthropic is significantly expanding its capacity to host and run its models through a deepened partnership with SpaceX. Tom Brown, the chief compute officer at Anthropic, recently announced that the company is scaling its operations across two massive facilities known as the Colossus 1 and Colossus 2 data centers. These centers provide the raw processing power, or compute, necessary to handle the heavy workloads associated with modern AI. By leveraging these specific SpaceX-provided resources, Anthropic can maintain its performance and grow its capabilities as the global demand for its AI services continues to climb.
This infrastructure expansion is driven by the harsh reality that the actual cost of running large-scale AI models is far higher than early, flat-rate experiments suggested. For AI developers, the ability to scale efficiently across multiple dedicated data centers is no longer just a technical advantage but a financial necessity. As models become more complex, the processing energy required to keep them running becomes a primary bottleneck for growth. By securing this specialized infrastructure through SpaceX, Anthropic is working to stabilize its operational costs and ensure its services remain stable. This partnership underscores a broader industry trend where the most successful AI companies must secure massive, dedicated hardware footprints to overcome the limitations and costs of traditional cloud computing.
08 Callosum Automates Model Routing
Callosum is increasing the efficiency of AI operations by moving away from the practice of using a single, massive model for every request. Instead, the company has implemented an automated routing layer that dynamically assigns tasks based on their difficulty. This approach ensures that simple tasks are handled by lightweight models to save costs and time, while complex problems are routed to high-performance systems. By automating this decision process, Callosum optimizes both the software model and the underlying hardware used for each specific action.
This system represents a significant evolution from the company's earlier methods. Initially, Callosum used bespoke model mapping, which relied on manual decisions to pair simple subtasks with simple models. They have since replaced these hardcoded rules with an automation layer that can detect task complexity on the fly. This allows the system to treat AI problems as heterogeneous, acknowledging that a single goal is often composed of various distinct steps—such as visual reasoning and textual reasoning—each requiring a different specialized tool to be completed successfully.
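Difficulty-based routing can be sketched as a scoring function plus a threshold. The heuristic below (prompt length and a few keywords) and the model tier names are toy assumptions for illustration; Callosum's actual complexity detection is not described in the source.

```python
# Hypothetical keywords treated as signs of a multi-step or reasoning-heavy task.
HARD_KEYWORDS = {"prove", "plan", "debug", "multi-step"}


def estimate_difficulty(task: str) -> float:
    """Toy score in [0, 1]: longer prompts and reasoning keywords score higher."""
    words = task.lower().split()
    score = min(len(words) / 50, 1.0)
    if HARD_KEYWORDS.intersection(words):
        score = max(score, 0.8)
    return score


def route(task: str, threshold: float = 0.5) -> str:
    """Send hard tasks to a large model and everything else to a small one."""
    return "large-model" if estimate_difficulty(task) >= threshold else "small-model"


if __name__ == "__main__":
    print(route("Translate hello to French"))               # small-model
    print(route("Plan and debug the deployment pipeline"))  # large-model
```

Replacing hardcoded task-to-model mappings with a scoring step like this is what lets new task types be routed without a manual rule for each one, although a production router would presumably use a learned classifier rather than keywords.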
The practical impact of this routing strategy is most visible in visual web navigation. Rather than treating web navigation as a uniform process, Callosum decomposes the task into multiple reasoning steps and uses a mixture of open and closed video action language models. This specialized routing has allowed the company to surpass the performance of top-tier frontier models. In tests on the VideoWebArena benchmark, Callosum's approach beat GPT 5.2 by 18% and Gemini 2.5 by 25%. By intelligently distributing work across a variety of models rather than relying on one general-purpose intelligence, the system achieves higher accuracy and better overall performance.
09 Gemini App Hits 900 Million Users
Google’s artificial intelligence strategy has reached a pivotal turning point, as the Gemini application now commands a massive audience of 900 million monthly active users. This milestone signals a rapid acceleration in how everyday consumers interact with advanced machine learning tools, effectively positioning Google as a dominant force in the global AI landscape. By achieving this scale, the company has successfully erased the competitive distance that once separated its primary AI offering from ChatGPT, the industry’s long-standing benchmark for user engagement. For the average person, this means that the most sophisticated AI capabilities are no longer confined to experimental labs or niche developer circles, but are now embedded into the daily digital routines of nearly a billion people worldwide.
This surge in adoption represents a significant shift in the battle for consumer attention. While Google’s broader messaging regarding its AI ecosystem has occasionally appeared fragmented to outside observers, the sheer volume of users flocking to Gemini provides undeniable proof that the technology is becoming an essential utility. The closing of the gap between Gemini and its primary rival suggests that the market for consumer-facing AI is maturing, moving away from a period of novelty and into an era of widespread, practical application. As these platforms become more integrated into the software that people use to work, communicate, and organize their lives, the ability to maintain such a large, active user base becomes the ultimate metric of success.
Ultimately, this growth highlights the broader trend of AI becoming an inescapable layer of the modern digital experience. As Google continues to refine its consumer surfaces, the focus has shifted from merely demonstrating what the technology can do to ensuring it is accessible and reliable for a global audience. With 900 million users now engaging with the platform every month, the stakes for maintaining performance and user trust have never been higher. This massive scale not only validates Google’s heavy investment in its AI infrastructure but also sets the stage for the next phase of competition, where the winner will likely be determined by who can best integrate these powerful tools into the fabric of daily life.
