Andrej Karpathy Joins Anthropic and Gemini 3.5 Flash Debuts

The AI landscape sees a major talent shift as Andrej Karpathy joins Anthropic, while Google expands its model lineup with the debut of Gemini 3.5 Flash and Omni. The new Flash model has already been used to build an operating system autonomously, signaling a shift toward AI that can manage computer environments independently. In software development, Codex has introduced a structured workflow to improve reliability, and a head-to-head test shows Google's AI-driven coding agents outperforming GPT-5.5. Connectivity is also evolving, with Vapi and Hermes integrating via the Model Context Protocol, a standard that allows different AI tools to share data more easily.

Beyond these headlines, the digest tracks the deployment of trading agents on Hyperliquid and Greg Brockman’s latest arguments for a text-based path to artificial general intelligence. We also look at how AI Overviews are scaling across Google Search and the shift in how models are evaluated, using "LLM-as-a-Judge"—where AI is used to grade other AI—and total cost of ownership metrics. Finally, we cover the introduction of proactive agents in Gemini and the release of Gemma 4, which brings advanced intelligence directly to on-device hardware.

01 Andrej Karpathy joins Anthropic

Andrej Karpathy, a founding member of OpenAI and one of the most influential figures in modern artificial intelligence, has joined Anthropic. This move signals a strategic shift in how the company intends to evolve its AI capabilities, moving beyond traditional development toward a more automated, self-evolving system. Karpathy is specifically tasked with focusing on recursive self-improvement during the pre-training of models. For the general reader, pre-training is the foundational stage where an AI model is first exposed to vast datasets to learn the basic patterns of language and logic. By introducing recursive improvement, Anthropic is attempting to create a feedback loop where the AI helps refine the very process used to build it.

The core of this initiative involves using Anthropic's own Claude model to accelerate the research into its own pre-training. Rather than relying solely on human engineers to tweak the algorithms and data selection, Karpathy aims to leverage Claude's intelligence to identify ways to make the training process faster and more effective. This approach essentially turns the AI into its own architect, allowing the model to analyze its own learning gaps and suggest optimizations for the next generation of training. If successful, this could drastically reduce the time and resources required to reach higher levels of intelligence.

This transition underscores a critical pivot in the industry toward automating the science of AI development. While much of the current competition focuses on simply increasing the scale of models, Karpathy’s work focuses on the intelligence of the training process itself. This technical pursuit is central to the belief that pushing the technical boundaries of AI is the primary way to achieve breakthroughs. However, the broader challenge remains whether a model that is exceptionally brilliant at solving these technical, recursive problems can also avoid blind spots regarding the complexities of the real world, ensuring that the resulting progress is actually meaningful for society.

02 Gemini 3.5 Flash builds autonomous OS

Google has demonstrated that AI can now construct the most fundamental layers of computing software autonomously and at a fraction of traditional costs. Using Gemini 3.5 Flash and a system called Antigravity, Google built a fully functioning operating system from scratch in only 12 hours. Unlike a simple application, this project created the core architecture that allows other software to run, including the file system, memory management, and the scheduler. This was achieved by a coordinated team of 93 parallel sub-agents that handled every stage of development, from the initial code generation to auditing and final testing.

The sheer efficiency of this process highlights a shift in software engineering. The autonomous team processed 2.6 billion tokens and made over 15,000 model requests, yet the entire operation cost less than $1,000 in API credits. To put this in perspective, building an operating system is typically a brutal process that takes human developers many months. This speed is powered by Gemini 3.5 Flash, which generates output four times faster than other leading frontier models, while the Antigravity-optimized version runs 12 times faster than Pro-tier models.
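The figures quoted above can be turned into a few rough unit metrics. The sketch below is plain arithmetic on the numbers in the text. It assumes the 2.6 billion tokens include both input and output, and it treats the cost as exactly $1,000 and the request count as exactly 15,000, so the results are bounds rather than measured values. The function names are illustrative only.

```python
# Figures quoted in the text. The cost is an upper bound ("less than $1,000")
# and the request count is a lower bound ("over 15,000").
TOTAL_TOKENS = 2_600_000_000
MODEL_REQUESTS = 15_000
API_COST_USD = 1_000
WALL_CLOCK_HOURS = 12


def cost_per_million_tokens(cost_usd: float, tokens: int) -> float:
    """Blended price per one million tokens processed."""
    return cost_usd / (tokens / 1_000_000)


def tokens_per_request(tokens: int, requests: int) -> float:
    """Average number of tokens handled by a single model request."""
    return tokens / requests


def tokens_per_second(tokens: int, hours: float) -> float:
    """Aggregate throughput across all parallel sub-agents."""
    return tokens / (hours * 3600)


if __name__ == "__main__":
    print(f"{cost_per_million_tokens(API_COST_USD, TOTAL_TOKENS):.3f} USD per 1M tokens (at most)")
    print(f"{tokens_per_request(TOTAL_TOKENS, MODEL_REQUESTS):,.0f} tokens per request (at most)")
    print(f"{tokens_per_second(TOTAL_TOKENS, WALL_CLOCK_HOURS):,.0f} tokens per second overall")
```

Under these assumptions the blended price works out to under about $0.39 per million tokens, with each request averaging no more than roughly 173,000 tokens.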

This capability is the result of Google’s vertical integration, where the company controls everything from the physical silicon to the final service. By using its 8th generation TPU chips—which split training and the process of generating answers, known as inference, into two specialized chips—Google has drastically lowered the total cost of ownership for AI. This structural advantage is critical for the era of autonomous agents that must run continuously in the background. For large enterprises processing a trillion tokens a day, switching 80% of their workloads to Gemini 3.5 Flash could save up to $1 billion annually. This shift is already reaching consumers through Google Search, which is transitioning from a manual tool into an information agent that monitors data 24/7 on behalf of the user.
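The enterprise savings claim implies a specific price gap between Gemini 3.5 Flash and whatever model it replaces. As a rough check, the sketch below back-solves that gap from the numbers in the text (a trillion tokens a day, 80% of workloads switched, up to $1 billion saved per year). The per-token price difference is not stated in the source, so it is derived here rather than quoted, and the function names are hypothetical.

```python
DAYS_PER_YEAR = 365


def annual_savings_usd(tokens_per_day: float, share_switched: float,
                       price_gap_per_million_usd: float) -> float:
    """Yearly savings when a share of traffic moves to a cheaper model."""
    switched_tokens_per_year = tokens_per_day * share_switched * DAYS_PER_YEAR
    return switched_tokens_per_year / 1_000_000 * price_gap_per_million_usd


def implied_price_gap_per_million_usd(tokens_per_day: float, share_switched: float,
                                      savings_usd: float) -> float:
    """Price difference per 1M tokens needed to reach a given yearly saving."""
    switched_tokens_per_year = tokens_per_day * share_switched * DAYS_PER_YEAR
    return savings_usd / (switched_tokens_per_year / 1_000_000)


if __name__ == "__main__":
    gap = implied_price_gap_per_million_usd(1e12, 0.80, 1e9)
    print(f"Implied price gap: {gap:.2f} USD per 1M tokens")
```

With these inputs the claim corresponds to a gap of roughly $3.40 per million tokens between the two models, which gives a sense of the price difference the estimate depends on.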

03 Codex implements structured AI workflow

Building software with AI often feels like a gamble, where the quality of the result depends more on the luck of the prompt than a strict engineering plan. To solve this, Codex has introduced a structured development workflow that replaces guesswork with a rigorous documentation process. By requiring four foundational documents before any code is written, the system ensures that the final product is polished and complete rather than a fragmented collection of features. This approach moves the AI from a simple chat interface to a disciplined engineering tool, starting with a high-level plan and descending into a detailed design and implementation guide. This ensures that the technical stack, data models, and database structures are clearly defined before the AI begins the actual build.

The workflow follows a strict sequence to maintain consistency: first the plan, then the design document, followed by a style design document, and finally an agent guideline file called agents.md. This last document is particularly critical for long-term maintenance. It acts as a set of permanent system rules that prevent the AI from making the same mistakes repeatedly. By explicitly defining these principles and guidelines, developers can ensure the AI operates more intelligently over time, avoiding the common pitfall where an AI forgets previous constraints or introduces recurring bugs during the development cycle.
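The strict ordering described above can be expressed as a simple gate that refuses to start the build until all four documents exist. This is only an illustrative sketch, not how Codex itself enforces the workflow. Apart from agents.md, the file names are assumptions, since the source names the documents but not their files.

```python
# Document order from the text: plan, design, style design, agent guidelines.
# Only "agents.md" is named in the source; the other file names are hypothetical.
WORKFLOW_DOCS = ["plan.md", "design.md", "style.md", "agents.md"]


def next_required_doc(existing_files):
    """Return the first document in the sequence that has not been written yet."""
    present = set(existing_files)
    for name in WORKFLOW_DOCS:
        if name not in present:
            return name
    return None


def ready_to_build(existing_files):
    """The build may start only once every document in the sequence exists."""
    return next_required_doc(existing_files) is None


if __name__ == "__main__":
    files = ["plan.md", "style.md"]
    print(next_required_doc(files))   # the design document is still missing
    print(ready_to_build(files))
```

Checking the documents in order, rather than as an unordered set, mirrors the idea that each one builds on the one before it.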

Beyond documentation, Codex leverages a "Goal" feature that allows for autonomous, end-to-end implementation. Instead of guiding the AI through every single line of code, a user can set a high-level objective—such as implementing all remaining features and deploying the application to Vercel, a cloud hosting platform. Once the goal is set and necessary API keys are provided, the system independently handles the implementation and verification process. The AI continues to work through the defined workflow until it can report a final successful result to the user, effectively transforming the developer's role from a manual coder to a high-level project manager.

04 Google's agentic coding tops GPT-5.5

Google's latest AI models are proving more capable of building working software than traditional industry tests suggest. While these models might not always lead in raw benchmark scores, they are demonstrating a superior ability to act as "agents"—AI that can independently handle complex, multi-step coding tasks to create a finished product. For example, a recent test showed a Google model building a music-powered interactive adventure game in less than an hour. Remarkably, this version produced fewer bugs than GPT-5.5 when given the exact same assignment, despite Google not positioning the model specifically as a top-tier coding tool.

This gap between test scores and real-world utility suggests that standard benchmarks, such as the Artificial Analysis index used to rank AI intelligence, may be underselling actual performance. In certain "vibe coding" benchmarks, which measure a model's ability to iterate quickly on creative ideas, Gemini 3.5 Flash has scored lower than rivals like GPT-5.5 or Claude Opus 4.7. However, these numbers often fail to capture the model's effectiveness in building functional, interactive applications. In other tests, such as SimpleBench, which focuses on common-sense logic and trick questions, Gemini 3.5 Flash performs very well, indicating that its practical intelligence is more robust than some rankings imply.

Beyond the quality of the code, the speed of delivery is becoming a critical factor for developers. Gemini 3.5 Flash is capable of outputting significantly more tokens per second—the basic units of text AI generates—than other models with similar benchmark performance. This combination of high-speed output and the ability to create complex, bug-free interactive experiences suggests a shift in how AI coding is measured. Because the most cited benchmarks change constantly, the ability to deliver a working, interactive game in under an hour is a more reliable signal of a model's utility than a static score. This means users may find Google's tools more practical for rapid prototyping and creative development than the current rankings suggest.

05 Vapi and Hermes integrate via MCP

Users can now automate complex phone research and outreach using simple English commands, removing the need to manually navigate technical dashboards. This is made possible by the integration of Vapi and Hermes through the Model Context Protocol (MCP), a system that allows AI agents to interact directly with software interfaces. In this partnership, Vapi provides the essential communication infrastructure—handling phone numbers, call transcripts, and voice settings—while Hermes serves as the cognitive brain that manages goals, memory, and proactive decision-making. By using this protocol, Hermes can perform tasks inside the Vapi dashboard on behalf of the user, such as creating new voice assistants or analyzing call logs.

The practical impact is a shift toward fully autonomous workflows. For example, a user can instruct Hermes to research a massage spa in Manhattan, extract its phone number, and immediately place an outbound call via Vapi to check for availability. Beyond simple tasks, Hermes can build specialized assistants optimized for specific business goals, such as cold-calling car detailing shops in New Jersey. Because Hermes understands natural language, users can manage these operations through Discord, WhatsApp, or a terminal, effectively treating the AI as an operator that handles all the technical configuration.

For businesses, this integration allows for the creation of sophisticated outreach systems from a single sentence. Leveraging a library of 82 pre-built skills, Hermes can automatically set up a SQL database and a scheduled task, known as a cron job, to call different leads at set intervals, such as every 10 or 15 minutes. To ensure these calls are effective, Vapi offers granular controls over the assistant's behavior, allowing users to adjust the tone of voice, response speed, and even add background office sounds to reduce interruptions. This synergy transforms the phone from a manual tool into an autonomous agent capable of pitching startups to venture capital firms or managing lead generation with minimal human oversight.
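To make the database-plus-cron pattern concrete, here is a minimal sketch of what such a setup could look like. It does not reflect the code Hermes actually generates or any Vapi API. The table layout, function names, and the idea of one lead per scheduled run are all assumptions for illustration, and the step that would place the call is deliberately left out.

```python
import sqlite3

# Standard five-field cron expression: run at every 15th minute.
CRON_EVERY_15_MIN = "*/15 * * * *"


def create_leads_db(path=":memory:"):
    """Create (or open) a small SQLite database holding the call queue."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS leads (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               phone TEXT NOT NULL,
               called_at TEXT
           )"""
    )
    return conn


def add_lead(conn, name, phone):
    """Queue a lead that has not been called yet."""
    conn.execute("INSERT INTO leads (name, phone) VALUES (?, ?)", (name, phone))
    conn.commit()


def claim_next_lead(conn):
    """Return the oldest uncalled lead and mark it as called, or None if done.

    A scheduled job would invoke this once per run and hand the result to
    whatever component places the outbound call.
    """
    row = conn.execute(
        "SELECT id, name, phone FROM leads WHERE called_at IS NULL ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE leads SET called_at = datetime('now') WHERE id = ?", (row[0],))
    conn.commit()
    return row
```

Marking each lead as called inside the same function keeps successive runs from dialing the same number twice, which matters when the job fires every 10 or 15 minutes.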

06 Gemini 3.5 Flash and Omni debut

Google has introduced new tools that fundamentally change how users create and edit digital media. The debut of Gemini Omni Flash, the first model in the Omni family available across Google's products, allows people to engage in natural language conversations to generate or modify videos. By processing a mix of text, images, video, and audio inputs, the model enables a more intuitive approach to multimodal creation, meaning it can handle several different types of media simultaneously to produce a final video output. This shift simplifies the creative process, moving it away from rigid editing software toward a conversational experience.

Alongside the Omni family, Google released Gemini 3.5 Flash, a model designed to balance high speed with significant intelligence. In performance tests, Gemini 3.5 Flash has outperformed its predecessor, Gemini 3.1 Pro, across nearly every benchmark. The improvements are particularly stark in coding and economically valuable tasks, evidenced by a major leap in the GDPval benchmark. The model is specifically optimized for "agentic coding" (the ability for AI to act as an autonomous agent to handle complex programming tasks) as well as long-horizon projects and practical, real-world workflows. This makes it a powerful asset for developers who need a tool that can manage extended sequences of work without losing track of the objective.

Beyond coding, Gemini 3.5 Flash shows a specialized proficiency in interpreting dense visual data. In the CharXiv reasoning benchmark, which tests a model's ability to analyze and synthesize information from complex charts and tables found in academic papers, the model achieved a score of 84.2%. This result surpassed all other listed models, demonstrating a high capacity for reasoning over intricate data visualizations. For professionals in research or finance, this capability means the AI can effectively distill complex academic findings into clear summaries, bridging the gap between raw data and actionable intelligence.

07 Hyperliquid deploys trading agents

Trading is shifting from manual human intervention to autonomous software execution. Hyperliquid has recently introduced a system that allows users to deploy AI agents capable of managing the entire trading lifecycle, from initial research to the final execution of a trade. This means that instead of a person monitoring charts and clicking buttons, an AI can programmatically handle the process, enabling a level of automation where the software makes the move based on predefined logic or research.

The technical workflow involves setting up a user account and launching an agent that connects to the platform through an API, which is a software interface that allows the AI to communicate directly with the trading engine. Once this connection is established, the agent can execute complex orders instantly. For instance, a user can instruct the AI to take a 10x short position on Nvidia—a bet that the stock price will fall, amplified ten times—and the agent will fire the trade and verify the position update in real-time. This programmatic approach removes the delay of human execution and allows for strategies that can be scheduled to trigger at specific times.
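The phrase "amplified ten times" is easier to see with numbers. The sketch below works through the profit and loss of a leveraged short under simplifying assumptions: it ignores fees, funding payments, and liquidation rules, and it is not tied to Hyperliquid's API or margin model. All names and prices are illustrative.

```python
def short_pnl_usd(margin_usd: float, leverage: float,
                  entry_price: float, exit_price: float) -> float:
    """Profit or loss of a short position, ignoring fees, funding and liquidation.

    The position size (notional) is the posted margin multiplied by leverage.
    A short gains when the price falls and loses when it rises.
    """
    notional = margin_usd * leverage
    price_change = (exit_price - entry_price) / entry_price
    return -notional * price_change


if __name__ == "__main__":
    # $1,000 of margin at 10x controls a $10,000 short position.
    print(short_pnl_usd(1_000, 10, entry_price=100.0, exit_price=95.0))   # price falls 5%
    print(short_pnl_usd(1_000, 10, entry_price=100.0, exit_price=110.0))  # price rises 10%
```

In this simplified model a 5% drop yields a gain of about half the posted margin, while a 10% rise loses the entire margin, which is why leverage cuts both ways for an automated agent.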

Beyond the automation itself, the platform has expanded its reach to include a diverse array of assets. While it was once seen primarily as a tool for cryptocurrency, it now offers access to various markets, including the S&P 500 and Brent oil. It even features specialized assets, such as one tied to OpenAI. These markets are deployed independently, providing a broad playground for AI agents to operate. By allowing users to allocate specific portions of their capital to these automated agents, Hyperliquid is changing the workflow for traders, moving them from active operators to managers of AI-driven portfolios that can navigate multiple asset classes simultaneously.

08 Greg Brockman advocates text-based AGI

The path to creating a machine with human-level intelligence might not require it to "see" or "feel" the physical world. Greg Brockman, the co-founder and president of OpenAI, argues that text-based models are sufficient to reach the milestones required for general intelligence. This perspective suggests that the ability to process and generate language is the primary engine for cognitive breakthroughs, meaning that a model does not necessarily need a simulated experience of physical reality to achieve a high level of reasoning.

This stance stands in direct contrast to the world-model approach to AI development. Proponents of world models believe that for an AI to truly understand the universe, it must grasp intuitive physics, such as how gravity works or how kinetic energy moves objects. Some current systems, such as Veo, Nano Banana, and Genie, attempt to achieve this by creating realistic videos and interactive simulations. These tools aim to give AI a sense of spatial and physical understanding that text alone cannot provide.

However, Brockman maintains that text-only models can achieve the critical self-improvement necessary for general intelligence. From this viewpoint, the complex patterns and logic embedded within language are enough for a system to refine its own capabilities and evolve its intelligence autonomously. The stakes of this theoretical divide are high, as they dictate whether the future of AI depends on expanding linguistic mastery or on building complex, multimodal simulations of the physical world. If Brockman is correct, the most powerful intelligence possible can be unlocked through the mastery of text alone, bypassing the need for the AI to experience the physical laws of nature through visual or interactive media.

09 AI Overviews scale Google Search

Google Search is proving that artificial intelligence is a growth engine rather than a replacement. While critics feared that AI would cannibalize traditional search behavior, the integration of AI Overviews has actually expanded the platform's reach. This feature, which provides AI-generated summaries of search results, has already reached over 2.5 billion monthly active users. Furthermore, specialized AI modes within the search experience have seen rapid adoption, surpassing 1.5 billion monthly active users within just one year of their introduction. This surge suggests that users are not abandoning search for standalone AI bots but are instead embracing a hybrid experience where AI enhances the way they find information.

The primary driver behind this massive scaling is the deployment of Gemini 3.5 Flash. By integrating this specific model into Google Search, the company has optimized the balance between performance and cost. For a service used by billions, the cost of inference—the process where an AI model generates a response to a prompt—can be astronomical. Gemini 3.5 Flash is designed to be highly cost-efficient, ensuring that the high computational demands of AI-generated overviews do not slow down the user experience or become financially unsustainable. This efficiency allows Google to maintain its speed and reliability while delivering complex AI summaries to a global audience.

Google's ability to scale these tools rests on a unique structural advantage: vertical integration. Unlike many competitors who must rent computing power or license models, Google owns the entire stack. This includes the proprietary AI models, the massive datasets used to train them, and the specialized hardware, known as Tensor Processing Units (TPUs), designed specifically to run these models. By controlling both the hardware and software layers, Google can run AI operations far more cheaply than other firms. This comprehensive infrastructure transforms AI from a costly experiment into a sustainable utility, enabling Google to maintain its dominance in the search market by making AI-driven results accessible to billions of people.

10 TCO and LLM-as-a-Judge redefine evaluation

The way companies measure the success of AI is shifting from the cost of a single answer to the Total Cost of Ownership, or TCO. In simple terms, TCO represents the total expense required to complete an entire task from start to finish. This shift is happening because AI is evolving from simple chatbots into agents that can operate independently in the background. Unlike a standard query where a model provides one response, an agent might call multiple tools, encounter errors, and retry several times to reach a goal. In this environment, the price of a single interaction is less important than the total investment needed to ensure the job is actually done.
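A small calculation shows why the price of a single call can mislead. The sketch below assumes each attempt at a task succeeds independently with a fixed probability and that the agent retries until it succeeds, so the expected number of attempts is one divided by the success rate. The prices and success rates are made-up illustrative values, not figures from the source.

```python
def cost_per_attempt_usd(tool_calls: int, cost_per_call_usd: float) -> float:
    """Cost of one full attempt at the task, made up of several model or tool calls."""
    return tool_calls * cost_per_call_usd


def expected_cost_per_completed_task_usd(attempt_cost_usd: float, success_rate: float) -> float:
    """Expected spend to get one successful result when failed attempts are retried.

    Assumes independent attempts with a constant success rate, so the
    expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return attempt_cost_usd / success_rate


if __name__ == "__main__":
    # Hypothetical models: A is cheaper per call but fails half the time.
    model_a = expected_cost_per_completed_task_usd(cost_per_attempt_usd(10, 0.002), 0.5)
    model_b = expected_cost_per_completed_task_usd(cost_per_attempt_usd(10, 0.003), 0.9)
    print(f"Model A: {model_a:.4f} USD per completed task")
    print(f"Model B: {model_b:.4f} USD per completed task")
```

In this example the model that is 50% more expensive per call ends up cheaper per completed task, which is the core of the TCO argument.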

To determine if these agents are actually succeeding, developers are increasingly relying on a method called LLM-as-a-judge, where one AI model is used to evaluate the performance of another. For example, the platform Langfuse uses this approach to test coding agents. Instead of just looking at the final code, the judge AI analyzes the state of the file system and the "diffs"—the specific differences between the files before and after the agent made changes. By comparing these changes and the record of the agent's steps against natural language requirements, the judge can verify if the agent's actions were correct or if it introduced errors during the process.
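The diff-based judging described above can be sketched with Python's standard difflib module. This is a generic illustration of the technique, not Langfuse's implementation or API. The prompt wording, the PASS/FAIL convention, and the call_model parameter (any function that takes a prompt string and returns the judge model's text reply) are assumptions.

```python
import difflib


def unified_diff(before: str, after: str, path: str) -> str:
    """Return a unified diff of one file's contents before and after the agent ran."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


def build_judge_prompt(requirement: str, diffs: list, steps: list) -> str:
    """Assemble the material the judge model sees: requirement, agent steps, diffs."""
    step_log = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(steps))
    return (
        "You are reviewing a coding agent's work.\n"
        f"Requirement:\n{requirement}\n\n"
        f"Agent steps:\n{step_log}\n\n"
        "File changes:\n" + "\n".join(diffs) + "\n\n"
        "Answer PASS or FAIL, followed by a one-sentence reason."
    )


def judge(prompt: str, call_model) -> bool:
    """Ask the judge model for a verdict; call_model is supplied by the caller."""
    verdict = call_model(prompt).strip().upper()
    return verdict.startswith("PASS")
```

Passing the model call in as a parameter keeps the sketch independent of any particular provider and makes it easy to test with a stub, for example `judge(prompt, lambda p: "PASS: matches the requirement")`.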

However, the precision of the agent's goal, known as the target function, is critical. If a specific requirement is not explicitly included in this target function, the agent may view necessary features as irrelevant "garbage" and delete them to reach the primary objective more quickly. For instance, if the goal does not specify that prompt versions must be linked to traces, the agent might remove that functionality entirely. To solve this, Langfuse is working to automate the evaluation lifecycle. The goal is to create judge systems that automatically align with user preferences and can analyze patterns across hundreds of different executions to refine how AI agents perform their tasks.

11 Gemini introduces proactive agents

Google is shifting Gemini from a tool that simply answers questions to one that actively manages a user's daily life. This transition introduces proactive agents designed to handle routine organization and monitoring without needing a direct prompt for every action. A primary example is the "daily brief," a feature that acts as a personalized morning assistant. By synthesizing data from a user's calendar, inbox, and task lists, the agent creates a topic-organized digest. Rather than requiring the user to hunt through multiple apps to plan their day, the daily brief presents a cohesive overview of what needs attention, complete with suggested next steps to streamline the morning workflow.

Beyond daily organization, Google is introducing persistent search agents for those using the Pro and Ultra tiers. Unlike standard AI searches that provide a one-time answer to a specific query, these agents remain permanently active to monitor the web for specific triggers. Users can define particular conditions—such as a drop in the price of a product or an update to a specific industry benchmark—and the agent will track these changes in the background. This removes the need for manual, repetitive searching, as the AI automates the monitoring process and alerts the user only when the specified criteria are met.

These updates represent a fundamental change in how users interact with AI, moving away from a strictly chat-based interface toward a system of background automation. By delegating the synthesis of personal schedules and the tracking of external data to these agents, users can reduce the cognitive load of administrative upkeep. Whether it is preparing for the workday via a synthesized digest or tracking market shifts through persistent monitoring, the goal is to transform the AI into a functional surrogate that executes tasks on the user's behalf, rather than just a knowledge base that responds to individual queries.

12 Gemma 4 enables on-device intelligence

Mobile app developers can now tailor how artificial intelligence functions within their software, choosing between streamlined system tools or highly customized internal models. The integration of Gemma 4 into both Android and iOS applications, facilitated through a system service called AI Core, allows developers to implement local intelligence and specific skills directly on a user's device. This shift means that AI capabilities no longer have to rely solely on distant servers, improving how apps perform and how they handle data locally.

When building these experiences, developers face a strategic choice between system-level and app-specific generative AI. System-level intelligence, such as Gemini Nano provided via AI Core, is pre-installed on the device. Because it is already there and highly optimized, it provides essential functions—such as a summarization API—without increasing the overall size of the application. This is an efficient route for developers who want reliable, high-performance AI without adding bulk to their downloads. However, for those seeking a more boutique or specialized experience, app-specific generative AI is the better option. This approach involves loading a lighter large language model runtime directly within the app or web page, which grants the developer significantly more control and customization, though it requires more intensive development effort.

To achieve this level of customization, developers can utilize what are known as tiny LLMs. These are language models defined by having fewer than a billion parameters, making them compact enough to be embedded directly into an application. By using these smaller models, developers can implement unique features or specialized behaviors that are not currently available through system-level services like AI Core. This flexibility ensures that whether a developer needs a lightweight, system-optimized tool or a bespoke, highly tailored AI agent, the infrastructure exists to support on-device intelligence across different mobile platforms.
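A quick estimate shows why the sub-billion-parameter threshold matters for app size. The sketch below computes only the storage needed for the model weights at a given numeric precision. It ignores runtime memory, tokenizer files, and any packaging overhead, and the parameter counts and bit widths are illustrative rather than taken from a specific model.

```python
def weight_footprint_mb(num_parameters: int, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in megabytes (10^6 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1_000_000


if __name__ == "__main__":
    for params in (270_000_000, 1_000_000_000):
        for bits in (16, 8, 4):
            size = weight_footprint_mb(params, bits)
            print(f"{params:>13,} parameters at {bits:>2}-bit: {size:,.0f} MB")
```

At 4-bit precision a one-billion-parameter model needs about 500 MB for weights alone, so staying well below a billion parameters is what keeps an embedded model within a reasonable download size.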