LangSmith Optimizes AI Agents and Codex Hits Context Limits

Today’s digest tracks a series of shifts in how AI models handle complex tasks and the people leading those changes. We begin with Gemini 3.5 Flash topping industry benchmarks, signaling a push for faster, more efficient high-performance models. In a major industry move, Andrej Karpathy has joined Anthropic, adding significant expertise to the team behind the Claude series.

The focus is heavily on "agentic" workflows: AI systems designed to act as autonomous assistants that execute tasks rather than just chat. We look at how Google Search is integrating generative agents to handle multi-step requests and how Claude Code is expanding its capabilities through specialized skills. On the developer side, LangSmith is introducing new ways to optimize these agents, while Codex is running into "context window" limits, the cap on how much information a model can process at one time.

Beyond the giants, we cover the reasoning improvements in Qwen 3 6B and the emergence of "stop hooks" in AI coding rules, which keep an agent's rules file in step with the code it writes. We also examine how Gemini is expanding its reach through partnerships with wearable device brands and how new "context layers" are being used to fine-tune the way AI agents interact with external data. From high-level executive moves to the granular limits of token processing, this mix highlights the current friction between model ambition and technical constraints.

01 Gemini 3.5 Flash Tops Benchmarks

Google has introduced Gemini 3.5 Flash, a model designed to provide high-level intelligence at speeds that significantly lower the cost of running massive AI workloads. For businesses and developers managing high-volume tasks, this translates to a system that can handle complex requests without the typical lag or expense associated with the most powerful models. This efficiency is becoming essential as Google's AI traffic reaches unprecedented levels. The company now processes more than 3.2 quadrillion tokens—the basic units of text AI reads and writes—per month, representing a seven-fold increase over last year. Just two years ago, that figure was only 9.7 trillion, illustrating a staggering growth in how the world is utilizing these tools for real-world problem solving.

Beyond efficiency, Gemini 3.5 Flash demonstrates a leap in raw capability, beating the Gemini 3.1 Pro model across several key tests for reasoning and coding. In the Terminal Bench 2.1 coding evaluation, Flash scored 76.2% compared to 70.3% for 3.1 Pro. On the GDPval-AA benchmark, it scored 1,656, far ahead of the 1,314 scored by the Pro model. It also hit 83.6% on the MCP Atlas test, which measures tool use through Model Context Protocol servers, while 3.1 Pro managed 78.2%. These numbers suggest that the "Flash" designation does not imply a sacrifice in intelligence.

The most significant competitive edge, however, is the model's output speed. Gemini 3.5 Flash can generate nearly 280 tokens per second, a rate that dwarfs its primary competitors. For comparison, GPT 5.5 and Claude Opus 4.7 average only 60 to 70 tokens per second, making the Google model roughly four times faster. This leap in performance allows for near-instantaneous interactions in applications that require rapid-fire responses. By delivering frontier-level intelligence at this velocity, Google is effectively reducing the operational friction for AI integration, making it feasible to deploy sophisticated reasoning capabilities across millions of simultaneous user sessions.
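To make the speed gap concrete, the short calculation below uses only the throughput figures quoted above. The 2,000-token response length is an assumption chosen for illustration, and real latency also depends on factors such as time to first token, which these figures do not capture.

```python
# Throughput figures quoted in the text (tokens per second).
FLASH_TPS = 280
COMPETITOR_TPS_RANGE = (60, 70)

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 2_000


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` at a steady rate, ignoring startup latency."""
    return tokens / tokens_per_second


flash_time = generation_seconds(RESPONSE_TOKENS, FLASH_TPS)
slow_times = [generation_seconds(RESPONSE_TOKENS, tps) for tps in COMPETITOR_TPS_RANGE]
speedups = [FLASH_TPS / tps for tps in COMPETITOR_TPS_RANGE]

print(f"Flash: {flash_time:.1f} s")
print(f"Competitors: {min(slow_times):.1f} to {max(slow_times):.1f} s")
print(f"Speedup: {min(speedups):.1f}x to {max(speedups):.1f}x")
```

Under these assumptions the "roughly four times faster" claim is the conservative end of the range: the ratio is 4.0x against a 70 tokens-per-second model and closer to 4.7x against a 60 tokens-per-second one.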

02 Codex Hits Context Window Limits

Developers using Codex may find that their AI tools suddenly stop working as they add more capabilities to the system. This happens because Codex has a strict limit on how many skill descriptions (the instructions that tell the AI how to perform a task) it can read at one time. The limit stems from the "context window," the maximum amount of text the model can process in a single pass. When the combined descriptions no longer fit, the AI may cut them off or ignore them entirely. The most critical risk is that the model misses "triggers," the specific keywords or conditions that tell the AI to start a skill, especially when those triggers sit toward the end of the text.
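The failure mode can be sketched in a few lines. This is a toy model, not Codex's actual loading logic: the skill names, the descriptions, and the 120-character budget are all hypothetical, and character count is a crude stand-in for tokens.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str  # includes the trigger phrase, here placed at the end


def load_descriptions(skills: list[Skill], budget_chars: int) -> str:
    """Concatenate descriptions and cut the result off at the budget."""
    combined = "\n".join(f"{s.name}: {s.description}" for s in skills)
    return combined[:budget_chars]


skills = [
    Skill("format-code", "Reformat source files. Trigger: 'tidy this file'."),
    Skill("add-route", "Scaffold a new API route. Trigger: 'add an endpoint'."),
    Skill("write-tests", "Generate unit tests. Trigger: 'cover this module'."),
]

# Hypothetical budget, small enough that the later skills do not fit.
visible = load_descriptions(skills, budget_chars=120)

for skill in skills:
    trigger = skill.description.split("Trigger: ")[1]
    status = "visible" if trigger in visible else "LOST"
    print(f"{skill.name}: trigger {status}")
```

Only the first skill keeps its trigger here; the later ones are cut off. In this toy setup, moving the trigger to the start of each description, or shortening the descriptions, would let more of them survive the cut.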

This limitation highlights a growing divide between AI providers. While both Codex and Claude Code are capable of handling skills, their underlying designs and execution methods differ significantly. Because of these differences, a skill that works perfectly in one environment may fail in another, often forcing developers to rewrite the same instructions for each one. To manage this friction, new tools are emerging to act as universal adapters. The "poly skill" command, for instance, allows users to automatically convert and repackage skills so that they are compatible with different "runtimes," which are the specific software environments where the AI model actually operates.

By using a command such as `/poly skill convert [skill name] to Codex`, a developer can trigger a full repackaging of a skill, ensuring it meets the technical requirements of the Codex platform. The process involves placing the skill and its associated assets into a dedicated folder and running a "poly skill install" command to build and deploy it. This automated approach allows users to tap into the strengths of multiple AI tools without manually rewriting every instruction, providing a necessary workaround for the memory limits and design inconsistencies that currently plague high-capacity automated systems.

03 AI Coding Rules Use Stop Hooks

AI coding assistants are becoming self-correcting by automatically updating their own instruction manuals as they write software. This is achieved through "stop hooks," which are triggers that launch a separate, background AI session every time the primary agent finishes a task. This secondary session reviews the recent changes to the codebase and proposes tweaks to the global rules file, such as CLAUDE.md. By ensuring that the rules evolve in tandem with the code, the system prevents the AI from following outdated guidelines that no longer apply to the current state of the project.
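The control flow described above can be sketched as follows. This is an illustration of the pattern only and does not use any real hook API: `on_stop`, `stub_reviewer`, and the sample rules are hypothetical, and the stub stands in for the secondary AI session with a simple deterministic check.

```python
from pathlib import PurePosixPath
from typing import Callable

Reviewer = Callable[[list[str], str], list[str]]


def stub_reviewer(changed_files: list[str], rules_text: str) -> list[str]:
    """Stand-in for the secondary AI session: flag directories the rules never mention."""
    top_dirs = sorted({PurePosixPath(p).parts[0] for p in changed_files})
    return [f"Add conventions for `{d}/`" for d in top_dirs if f"{d}/" not in rules_text]


def on_stop(changed_files: list[str], rules_text: str, reviewer: Reviewer) -> list[str]:
    """Stop hook: runs once the primary agent finishes and returns proposed rule edits."""
    if not changed_files:
        return []  # nothing changed, so the rules cannot have drifted
    return reviewer(changed_files, rules_text)


rules = "Use type hints everywhere.\nTests live in tests/ and use pytest.\n"
proposals = on_stop(["api/routes.py", "tests/test_routes.py"], rules, stub_reviewer)
print(proposals)
```

The hook only proposes changes; in this sketch, deciding whether to write them into the rules file is left to whoever reviews the proposals.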

Beyond updating rules, AI agents are gaining the ability to navigate massive software projects with the same precision as a human developer. By using specialized bridge servers known as Model Context Protocol (MCP) servers, agents can access advanced search and navigation tools that far exceed basic text-searching commands. Specifically, by implementing Language Server Protocol (LSP) servers, the AI can search for "symbols"—the structural building blocks of the code—rather than just searching for specific strings of characters. This capability is essential for managing repositories with hundreds of thousands of lines of code, where simple text searches would return too many irrelevant results.
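The difference between string search and symbol search is easy to demonstrate on a small scale. The sketch below uses Python's standard `ast` module as a stand-in for a language server; a real LSP server exposes richer queries across many languages, and the sample source is invented for illustration.

```python
import ast

SOURCE = '''
def save(record):
    """Persist a record; callers should save() once per batch."""
    return record

def autosave(record):
    # save drafts periodically
    return save(record)
'''


def text_search(source: str, needle: str) -> list[int]:
    """Line numbers of every line containing the raw string."""
    return [n for n, line in enumerate(source.splitlines(), start=1) if needle in line]


def symbol_search(source: str, name: str) -> list[int]:
    """Line numbers where a function with exactly this name is defined."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef) and node.name == name
    ]


print("text matches:  ", text_search(SOURCE, "save"))
print("symbol matches:", symbol_search(SOURCE, "save"))
```

The text search also hits the docstring, the comment, the `autosave` definition, and the call site, while the symbol search returns only the definition of `save`. Scaled up to hundreds of thousands of lines, that gap is the noise problem the paragraph describes.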

To maintain efficiency, these systems use sub-agents to separate the discovery phase from the actual editing phase. When an agent needs to perform web research or analyze a large portion of the codebase, it dispatches a sub-agent to do the heavy lifting. The sub-agent then returns a concise summary to the primary session, which prevents the AI's short-term memory, or context window, from becoming bloated with unnecessary data. Because setting up this infrastructure is complex, organizations are encouraged to assign a small, dedicated team to champion the buildout of this "AI layer," creating the foundational rules and servers that allow the automation to scale.
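A minimal sketch of that division of labor follows. The file contents and the `discovery_subagent` function are hypothetical; a real sub-agent would be a separate model session, not a keyword scan, but the shape of the exchange is the same: large input goes in, a short summary comes back.

```python
def discovery_subagent(files: dict[str, str], keyword: str) -> str:
    """Stand-in for a sub-agent: read every file, report only where the keyword occurs."""
    hits = sorted(path for path, text in files.items() if keyword in text)
    return f"'{keyword}' appears in {len(hits)} of {len(files)} files: {', '.join(hits)}"


# Invented codebase: three files, each padded out to simulate bulk.
files = {
    "billing/invoice.py": "def total(items): ...\n" * 200,
    "billing/tax.py": "def vat(amount): ...\n" * 200,
    "auth/login.py": "def login(user): ...\n" * 200,
}

primary_context: list[str] = ["Task: rename the vat() helper."]
summary = discovery_subagent(files, "vat")
primary_context.append(summary)  # only the summary enters the primary session

raw_chars = sum(len(text) for text in files.values())
print(summary)
print(f"raw material: {raw_chars} chars, added to primary context: {len(summary)} chars")
```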

04 LangSmith Drives Agent Optimization

AI agents are becoming more effective not by changing the AI model itself, but by refining how those models are managed. LangSmith facilitates this through detailed records of agent actions, known as traces, and performance tests called evaluations. Together, these tools create a "training gradient," a feedback loop that tells developers exactly where to make improvements. Instead of retraining the model, developers can optimize the "harness," which is the code that connects the model to its environment, or the "context," such as the specific instructions and skills provided to the agent. Using this method, LangChain improved its ranking on the Terminal Bench 2 benchmark from the top 30 to the top 5. To further this effort, the company recently announced LangChain Labs, a research group dedicated to this type of continual learning.
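As a rough illustration of how traces and evaluations combine into a feedback signal, consider the sketch below. It does not use the LangSmith API; the trace format, task names, and step names are all hypothetical.

```python
from collections import Counter

# Hypothetical traces: each records the steps an agent took and where, if anywhere, it failed.
TRACES = [
    {"task": "fix-bug-1", "steps": ["plan", "edit", "run_tests"], "failed_at": None},
    {"task": "fix-bug-2", "steps": ["plan", "edit", "run_tests"], "failed_at": "run_tests"},
    {"task": "add-flag", "steps": ["plan", "edit"], "failed_at": "edit"},
    {"task": "refactor", "steps": ["plan", "edit", "run_tests"], "failed_at": "run_tests"},
]


def pass_rate(traces: list[dict]) -> float:
    """Evaluation: fraction of traced tasks that finished without a failure."""
    return sum(t["failed_at"] is None for t in traces) / len(traces)


def worst_step(traces: list[dict]) -> str:
    """Feedback signal: the step that fails most often is the first thing to fix."""
    failures = Counter(t["failed_at"] for t in traces if t["failed_at"] is not None)
    return failures.most_common(1)[0][0]


print(f"pass rate: {pass_rate(TRACES):.0%}")
print(f"fix first: {worst_step(TRACES)}")
```

The evaluation says how well the agent is doing overall; the traces say where to look. A developer would then change the harness or the context around the weakest step and rerun the same evaluation to see whether the pass rate moved.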

This optimization process is now being opened to non-technical staff through LangSmith Fleet, a managed builder that allows users to create functional business agents using natural language. By removing the need to write code, the people most familiar with a business process can build the tools they need. LangChain has already deployed several such agents internally, including a sourcing agent for talent acquisition, an Intel bot for marketing, and OpenSuite for engineering triage. To ensure these agents are useful, Fleet provides a massive ecosystem of over 200 built-in tools, 7,500 additional tools via a partnership with Arcade, and support for the Model Context Protocol to connect proprietary data sources.

The business impact of this approach is significant. One go-to-market agent implemented via Fleet increased lead-to-qualified conversions by 240% and saved sales representatives an average of 40 hours per month by automating account research and email drafting. To prevent errors, the system integrates human-in-the-loop verification as a core feature. This allows an agent to pause and ask a human to review or edit a draft—such as a personalized email to a customer—before the final action is executed. This balance of automated optimization and human oversight allows companies to deploy high-performing agents that are both efficient and reliable.
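The human-in-the-loop pause amounts to a gate between drafting and sending. The sketch below shows one way to express that gate under stated assumptions: the `Draft` type, the reviewer functions, and the example addresses are hypothetical and are not taken from Fleet.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    recipient: str
    body: str


# A reviewer returns the (possibly edited) draft to approve it, or None to reject it.
Reviewer = Callable[[Draft], Optional[Draft]]


def send_with_approval(draft: Draft, review: Reviewer, send: Callable[[Draft], None]) -> bool:
    """Pause for human review; only an approved draft reaches `send`."""
    approved = review(draft)
    if approved is None:
        return False
    send(approved)
    return True


outbox: list[Draft] = []


def editing_reviewer(draft: Draft) -> Draft:
    # Simulates a human fixing the greeting before approving.
    return Draft(draft.recipient, draft.body.replace("Hey", "Hello"))


sent = send_with_approval(Draft("ana@example.com", "Hey Ana, ..."), editing_reviewer, outbox.append)
rejected = send_with_approval(Draft("bo@example.com", "Hey Bo, ..."), lambda d: None, outbox.append)
print(sent, rejected, [d.body for d in outbox])
```

Because the send function is only reachable through the gate, the agent cannot bypass review, and the version that goes out is the one the human approved, not the original draft.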

05 Google Search Launches Generative Agents

Google is transforming the search experience by moving beyond static results and toward a system that actively builds tools and monitors data for its users. This shift involves the introduction of generative coding capabilities and personalized information agents, turning the search engine into a proactive assistant that can create custom interfaces on the fly. Instead of simply directing users to external websites, Search will now be able to construct interactive environments tailored to the specific needs of a single query, fundamentally changing how people interact with information online.

To achieve this, Google is integrating generative coding powered by Gemini 3.5 Flash and a technology called Antigravity. These tools allow the search engine to generate dynamic layouts and interactive visuals in real time. For the user, this means that a complex question will no longer result in a standard list of links, but rather a custom-built experience designed to answer that specific inquiry through a more intuitive, visual interface. This capability effectively allows the search engine to write the code necessary to present information in the most useful format for the individual user, essentially creating a bespoke application for every search.

Alongside these visual updates, Google is launching personalized information agents that operate continuously in the background. These agents are designed to work 24/7, proactively searching for necessary information and alerting the user at the exact moment they need to take action. This moves the search process from a reactive model—where the user must manually ask a question—to a proactive model where the AI anticipates the user's needs based on their background information. These background agents are scheduled to roll out this summer, with initial access granted to subscribers of Google AI Pro and Ultra.

06 Gemini Partners with Wearable Brands

Interacting with an artificial intelligence assistant is about to move from the palm of your hand to your line of sight. Google is shifting the way users engage with Gemini by integrating its AI capabilities directly into wearable hardware through strategic partnerships with eyewear brands Gentle Monster and Warby Parker. This move transforms glasses from simple fashion accessories into functional tools that allow people to manage their digital lives without needing to constantly check a smartphone screen. By embedding AI into frames, Google is attempting to make digital assistance a seamless part of the physical environment.

Launching this fall, these wearable integrations will focus on hands-free utility driven by voice commands. Users will be able to ask Gemini for information about objects or scenes they are currently looking at, effectively giving the AI the ability to help the wearer understand their surroundings in real time. The practical applications are broad, ranging from receiving natural, turn-by-turn navigation directions while walking to managing phone calls and sending text messages without touching a device. Beyond communication, the hardware will allow users to snap photos and videos or access their favorite apps using only their voice. One of the most significant additions is the inclusion of real-time translation, which could remove language barriers during face-to-face conversations.

While the initial rollout focuses on audio and voice-driven interactions, Google is already planning for a more visual future. The company has indicated that display glasses—devices that project information directly into the wearer's field of view—will arrive at a later date. This phased approach allows Google to establish the utility of voice-based AI in wearables before introducing complex visual overlays. By partnering with established eyewear brands, Google is ensuring that these high-tech tools remain wearable and stylish, bridging the gap between advanced AI research and everyday consumer fashion. This strategy expands the reach of Gemini, moving it beyond the browser and the app and into the very fabric of how people perceive and interact with the world around them.

07 textcortex/spritz Orchestrates Agents

Managing a fleet of specialized AI assistants often requires tedious manual setup, but a new open-source tool called textcortex/spritz is changing how these systems are deployed. At its core, this orchestrator—a system that coordinates the activities of multiple AI components—acts as a goal operator. This means it can take a high-level objective and break it down, dispatching specific agents to handle the individual tasks required to finish the job. By automating the way AI workers are assigned to problems, the tool removes the friction of manual management and allows for a more fluid, automated workflow.

A primary use case for this technology is error reporting. In a typical software environment, a production release—the act of pushing a finished version of a program to live users—can often introduce unexpected bugs. Rather than requiring a human developer to manually identify the issue and assign a tool to fix it, textcortex/spritz can automatically dispatch a specialized agent specifically designed to debug these new errors. This ensures that technical failures are addressed immediately by an agent with the right profile and capabilities, streamlining the process of maintaining software stability.
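The goal-operator pattern, applied to the error-reporting case, can be sketched as below. This is not spritz's implementation or API: the agent roles, the fixed plan, and the function names are hypothetical, and a real orchestrator would use a model to produce the plan.

```python
from typing import Callable

Agent = Callable[[str], str]

# Hypothetical agent registry: each role maps to a function that handles one task.
AGENTS: dict[str, Agent] = {
    "debug": lambda task: f"[debugger] investigated: {task}",
    "docs": lambda task: f"[writer] documented: {task}",
}


def plan(goal: str) -> list[tuple[str, str]]:
    """Hypothetical planner: a fixed breakdown stands in for model-driven planning."""
    return [
        ("debug", f"find root cause of {goal}"),
        ("docs", f"write release note for {goal}"),
    ]


def run_goal(goal: str) -> list[str]:
    """Break the goal into tasks and dispatch each to the agent with the matching role."""
    return [AGENTS[role](task) for role, task in plan(goal)]


results = run_goal("checkout crash after release")
for line in results:
    print(line)
```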

The necessity for such a tool stems from the current limitations of chat platforms. Currently, if a developer wants to create multiple distinct AI agents, they often have to perform a series of manual steps for each one. This includes creating separate Slack apps, configuring app manifests, and manually uploading profile pictures to give each agent a unique identity. Most chat applications lack a standardized method for multi-agent provisioning, which is the ability to easily create and deploy multiple distinct AI personalities. By handling this orchestration programmatically, textcortex/spritz eliminates the need for clicking through menus and manually managing manifests, allowing developers to scale their AI workforce without an overwhelming administrative burden.
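To show what "handling this programmatically" might look like, the sketch below generates one app manifest per agent from a list of profiles. The key names follow Slack's app manifest format as best understood here and should be checked against Slack's documentation; the profiles and scopes are illustrative, and nothing in this example is taken from spritz itself.

```python
import json

# Hypothetical agent profiles; each one would otherwise be set up by hand.
AGENT_PROFILES = [
    {"name": "Debug Bot", "description": "Investigates production errors"},
    {"name": "Docs Bot", "description": "Drafts release notes"},
]


def build_manifest(profile: dict[str, str]) -> dict:
    """Build one Slack-style app manifest from an agent profile."""
    return {
        "display_information": {
            "name": profile["name"],
            "description": profile["description"],
        },
        "features": {"bot_user": {"display_name": profile["name"], "always_online": True}},
        # Illustrative scopes only; a real agent would request what it actually needs.
        "oauth_config": {"scopes": {"bot": ["chat:write", "app_mentions:read"]}},
    }


manifests = [build_manifest(p) for p in AGENT_PROFILES]
print(json.dumps(manifests[0], indent=2))
```

Adding a new agent then means adding one entry to the profile list instead of repeating the manual setup for each one.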

08 Qwen 3 6B Enhances Reasoning

AI models are becoming more capable of solving complex problems by learning how to "think" through their logic step-by-step. For the Qwen 3 6B model, this means that its reasoning abilities can be significantly improved through a process called fine-tuning using chain-of-thought datasets. Instead of simply predicting the next word in a sentence, these datasets train the model to produce a sequence of intermediate reasoning steps. This transformation allows the model to handle more sophisticated tasks with greater accuracy, effectively giving a smaller model the logical depth typically reserved for much larger systems.
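What distinguishes a chain-of-thought training example is that the target text contains the intermediate steps, not just the final answer. The sketch below shows one way such a record might be rendered; the `CotExample` type and the plain-text template are assumptions for illustration and are not the format of any particular dataset or of Qwen's chat template.

```python
from dataclasses import dataclass


@dataclass
class CotExample:
    question: str
    steps: list[str]
    answer: str


def to_training_text(example: CotExample) -> str:
    """Render one example so the target includes the reasoning, not just the answer."""
    reasoning = "\n".join(f"Step {i}: {s}" for i, s in enumerate(example.steps, start=1))
    return f"Question: {example.question}\n{reasoning}\nAnswer: {example.answer}"


example = CotExample(
    question="A train travels 120 km in 2 hours. How far does it go in 5 hours at the same speed?",
    steps=[
        "Speed is 120 km / 2 h = 60 km/h.",
        "Distance in 5 h is 60 km/h * 5 h = 300 km.",
    ],
    answer="300 km",
)

text = to_training_text(example)
print(text)
```

A model fine-tuned on many records like this learns to emit the steps before the answer, which is the behavior the paragraph above describes.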

Implementing these improvements has become more accessible due to deep integration with the Hugging Face Hub. Users can now apply these reasoning enhancements using the platform's command-line interface, a text-based tool for managing software, and use the Hub's own graphics processing units (GPUs) for the necessary computing power. This is particularly valuable for developers whose own cloud hardware is slower or poorly suited to the specific requirements of the Qwen 3 6B model. By moving the workload to optimized Hub infrastructure, users can achieve significant speedups and better performance without overhauling their own hardware setups.

To ensure these reasoning skills are actually effective, developers are using an open-source library called upskill. This tool serves as a gateway for adopting affordable, open models by automating the creation of specific skills and the performance tests, or evals, used to measure them. Upskill generates a skill, creates a corresponding test to verify that skill, and then allows users to compare how different models perform on the same task. This rigorous comparison process proves that fine-tuned open models can compete with larger alternatives, allowing companies to maintain high reasoning standards while reducing the costs associated with expensive, closed-source AI providers.
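The generate, test, and compare loop can be illustrated without reference to the library's actual interface. The sketch below does not use the upskill API: the eval cases are invented and the two "models" are stub functions, so it shows only the shape of the comparison.

```python
from typing import Callable

Model = Callable[[str], str]

# One skill's eval: (prompt, expected answer) pairs. Contents are illustrative.
EVAL_CASES = [("2 + 2", "4"), ("10 / 4", "2.5"), ("3 * 7", "21")]


def pass_rate(model: Model, cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's answer matches exactly."""
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)


def small_model(prompt: str) -> str:
    # Stub: answers everything except the division case correctly.
    return {"2 + 2": "4", "3 * 7": "21"}.get(prompt, "unknown")


def large_model(prompt: str) -> str:
    # Stub: answers every case correctly.
    return {"2 + 2": "4", "10 / 4": "2.5", "3 * 7": "21"}[prompt]


scores = {
    name: pass_rate(model, EVAL_CASES)
    for name, model in [("small", small_model), ("large", large_model)]
}
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

Running every model against the same cases is what makes the comparison fair: a gap in scores points to the model, not to a difference in the test.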

09 Context Layers Tune Agentic Systems

Instead of needing a programmer to rewrite the core engine of an AI tool, users can now steer the AI's behavior through external configuration files. This means a person can customize an AI agent—an autonomous system capable of performing tasks—to fit a specific project's needs without touching the underlying harness code, which is the structural software that manages the AI's operations. For instance, tools like Claude Code can be tuned for a particular task by providing an agent.md file or specific skills. This mechanism allows the system to adapt to unique requirements on the fly, ensuring the AI behaves exactly as needed for a specific job without requiring a deep dive into the software's internal architecture.

To make this customization effective, there is a clear divide between general guidelines and specific operational steps. Global rules, often stored in files like CLAUDE.md, act as the overarching conventions and constraints that the AI must follow regardless of the specific task. These are the laws of the environment that maintain consistency across all actions. In contrast, skills function as the actual workflows. A skill is essentially a reusable prompt or a set of defined steps that the AI follows to complete a complex action. While rules define the boundaries, skills provide the roadmap for execution.
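One way to picture the split is as two layers assembled into a single prompt: rules are always present, while a skill's steps appear only when that skill is invoked. The sketch below is a simplified model of that idea, not how Claude Code assembles its context; the rules, the skill name, and the file paths are hypothetical.

```python
from typing import Optional

# Global rules: apply to every task (hypothetical examples).
GLOBAL_RULES = [
    "Use TypeScript strict mode.",
    "Never edit files under vendor/.",
]

# Skills: named, reusable workflows (hypothetical example).
SKILLS = {
    "add-api-route": [
        "Create a handler file under src/routes/.",
        "Register the route in the router index.",
        "Add a request/response test.",
    ],
}


def build_prompt(task: str, skill_name: Optional[str] = None) -> str:
    """Rules are always included; a skill's steps are added only when invoked."""
    sections = ["Rules:"] + [f"- {rule}" for rule in GLOBAL_RULES]
    if skill_name is not None:
        sections.append(f"Skill: {skill_name}")
        sections += [f"{i}. {step}" for i, step in enumerate(SKILLS[skill_name], start=1)]
    sections.append(f"Task: {task}")
    return "\n".join(sections)


print(build_prompt("Add GET /health", "add-api-route"))
```

Calling `build_prompt("Fix a typo")` without a skill yields the rules and the task alone, which keeps rarely used workflows out of the context until they are needed.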

This distinction is particularly valuable in large-scale software projects where complexity can easily overwhelm a general AI. For example, a developer might create a specific skill dedicated to adding API routes within a massive codebase. Rather than explaining the entire process every time, the developer provides the AI with a pre-defined workflow that it can trigger as needed. By separating these operational skills from the global conventions, companies can scale their AI capabilities and refine how the system handles repetitive, complex processes without risking the stability of the core software. This layered approach transforms the AI from a general-purpose tool into a specialized asset tailored to a company's specific domain and operational needs.

10 Andrej Karpathy Joins Anthropic

Andrej Karpathy's move to Anthropic sends a strong signal to the artificial intelligence industry, suggesting a shift in where the world's top talent sees the most viable path forward. As a co-founder of OpenAI and one of the most respected researchers in the field, Karpathy has the professional leverage to work almost anywhere. Because he is likely already a billionaire, his decision to join Anthropic is presumably driven less by financial compensation than by belief in the company's technical direction. This transition implies that Karpathy may view Anthropic as the only organization capable of developing artificial intelligence in a truly safe manner, or perhaps the only one capable of achieving the goal at all.

Beyond the technical prestige, Karpathy's arrival is being interpreted as a formal endorsement of Anthropic's specific and often restrictive stances on AI safety. By joining the team, he is effectively co-signing the company's worldview, which includes a more cautious approach to the industry. This alignment is particularly striking regarding the societal impact of the technology. Karpathy appears to share the company's conviction that the resulting job losses will be both real and deeply impactful, moving away from more optimistic projections of a frictionless transition into an AI-driven economy.

The symbolic weight of this move is amplified by Karpathy's public persona. He has long been viewed as a symbol of optimism and a champion of education within the AI community. For a figure associated with such positivity to align himself with Anthropic's more pessimistic views—especially concerning the risks of open source development and the potential for widespread economic disruption—provides significant validation to the company's cautious philosophy. His presence suggests that even the most optimistic builders in the field believe that a restrictive, safety-first framework is the necessary standard for the future of the industry.

11 Google Scales Token Processing

Google's ability to process astronomical amounts of information is drastically accelerating the pace at which its artificial intelligence evolves. The sheer volume of tokens (the small chunks of text that AI models use to understand and generate language) has experienced exponential growth. Two years ago, Google processed approximately 9.7 trillion tokens per month across its various services. By the time of last year's I/O event, that figure had jumped to 480 trillion. Today, the company is processing more than 3.2 quadrillion tokens every month, a roughly seven-fold increase year-over-year. This scale allows the company to tackle real-world problems with a level of data density that was previously impossible.
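The growth multiples implied by these figures can be checked directly. The calculation below uses only the three monthly volumes quoted above.

```python
# Monthly token volumes quoted in the text.
TWO_YEARS_AGO = 9.7e12  # 9.7 trillion
LAST_YEAR = 480e12      # 480 trillion
TODAY = 3.2e15          # 3.2 quadrillion

year_one_growth = LAST_YEAR / TWO_YEARS_AGO
year_two_growth = TODAY / LAST_YEAR
total_growth = TODAY / TWO_YEARS_AGO

print(f"two years ago -> last year: {year_one_growth:.0f}x")
print(f"last year -> today:         {year_two_growth:.1f}x")
print(f"two years ago -> today:     {total_growth:.0f}x")
```

The "seven-fold" figure is a rounding of about 6.7x. The earlier jump, from 9.7 trillion to 480 trillion, was far steeper at roughly 49x, so the multiple shrank even as the absolute volume kept climbing.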

Much of this growth is fueled by internal usage, which creates a critical advantage in model refinement. In March, Google's internal token processing stood at half a trillion tokens per day, but that number has since climbed to over three trillion. This surge is driven largely by AI developer tools, which establish a powerful feedback loop. As developers use these tools internally, the resulting data allows Google to iteratively improve its models in real time, ensuring that the AI evolves based on actual usage patterns and developer needs.

To sustain this momentum, Google has fundamentally changed how it builds and trains its systems. The company now utilizes TPU8 chips, which are designed to maximize speed and energy efficiency while dramatically reducing latency, the lag time experienced during processing. Moreover, by employing technologies called JAX and Pathways, Google is no longer constrained by the physical limits of a single data center. Training can now be seamlessly distributed across a global network of more than a million TPUs. This creates the largest training cluster in the world, enabling Google to develop larger and more capable models in a few weeks rather than the months it previously required. This shift in infrastructure ensures that the hardware can keep pace with the exploding demand for token processing.

12 Claude Code Expands via Skills

Developers can now move beyond basic AI assistance by teaching Claude Code specific, repeatable ways to handle complex technical tasks. This shift allows the tool to evolve from a general coding assistant into a specialized teammate that understands the unique requirements of a particular project. By utilizing "skills," users can automate the tedious parts of software development, ensuring that the AI follows a precise, proven methodology every time a specific action is required.

At its core, a skill functions as a reusable prompt or a structured set of steps that directs the AI through a specific workflow. Instead of a developer having to explain the same complex process repeatedly, they can trigger a skill to handle the heavy lifting. For example, a developer might use a skill specifically designed for adding API routes—the interfaces that allow different software components to talk to one another. This transforms a manual, multi-step chore into a streamlined process that the AI can execute reliably.

This capability is particularly critical when working within massive codebases where the sheer variety of tasks can become overwhelming. In large-scale professional environments, developers often face dozens or even hundreds of different task types, each requiring its own specific approach. Skills serve as the primary mechanism for extending Claude Code's capabilities, allowing Anthropic to build an AI layer that can scale alongside the complexity of the software it manages. By turning these recurring patterns into modular skills, the system can maintain consistency and efficiency across an entire organization, regardless of how many different types of workflows are being managed simultaneously.