Isomorphic Automated Labs and Braintrust Coding Analysis Debut

This edition explores a wide range of advancements in AI infrastructure and evaluation. We start with Google's Anti-Gravity 2.0, which aims to simplify how developers build AI agents—software capable of performing complex tasks autonomously—and the use of Tmux and VPS to keep these multi-agent workflows running persistently without interruption. In the realm of science, Isomorphic has launched automated labs designed to accelerate discovery by removing manual bottlenecks. However, the industry faces a growing transparency crisis regarding how AI is measured; Braintrust has highlighted significant variance in coding performance tests, meaning a model's success can vary wildly depending on the specific test used. Furthermore, many model publishers remain vague about their evaluation methods, prompting Google DeepMind to crowdsource benchmarks to more accurately measure progress toward Artificial General Intelligence. On the operational and financial side, we look at Anthropic's move to secure massive computing resources via SpaceX's Colossus to fuel future training, alongside a pricing increase for Gemini 3.5 Flash output tokens and significant cost reductions in training for Qwen 3.7-Max. Finally, we examine the organizational friction occurring within large enterprises as they struggle to define clear roles and responsibilities for generative AI teams, often leading to the mismanagement of technical talent.

01 Google Anti-Gravity 2.0 Overhauls Agent Developer Experience

Google is fundamentally changing how software is built by moving away from traditional code editors. Instead of developers spending their time in a code-centric environment similar to VS Code, Anti-Gravity 2.0 introduces an agent-centric hub. In this new workflow, the primary interface is used to direct AI agents rather than manually writing lines of code. This allows for a more high-level approach to development, where a user can simply instruct an agent to generate an entire website within a specific project, shifting the human role from a writer of code to a director of AI.

The power of this orchestration is evident in the system's ability to manage a vast number of specialized sub-agents to handle complex frameworks. In one demonstration, Google utilized 93 sub-agents to build a core operating system framework from scratch. This massive undertaking was completed in just 12 hours and cost less than $1,000, eventually proving its functionality by running the game Doom. This suggests a future where complex software architecture can be rapidly and cheaply prototyped by coordinating a swarm of AI agents rather than relying on a large team of human engineers over several months.

To lower the barrier to entry, Google introduced Gemini Spark, which allows for the one-click deployment of agents—such as OpenClaude or Hermes Agent—directly onto Google Cloud. This removes the need for developers to manually set up virtual machines or local hardware. These agents integrate deeply with the Google ecosystem, enabling them to automate multi-app workflows, such as converting email replies into Google Sheets. Complementing this is Gemini Omni, an "anything-to-anything" model capable of processing and generating text, images, audio, and video. Unlike previous tools that required a multi-step process to turn images into video, Omni can handle direct video-to-video transformations and create spatially aware videos by interpreting map data, further expanding the types of assets agents can generate and manipulate.

02 Braintrust Highlights High Variance in Coding Evals

Measuring how well an AI can write code is surprisingly inconsistent, meaning a company might choose the wrong model simply because it used the wrong testing tool. The framework used to run these tests, known as an evaluation harness, can swing performance results by as much as 22%. For example, a Morph LLM blog post from March 16th noted that although six leading models performed similarly on the SWE-Bench Pro benchmark, the specific testing environment used created a massive discrepancy in their measured success. This suggests that the score an AI receives is often more a reflection of the test's setup than of the AI's actual capability.
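To make the harness effect concrete, here is a minimal sketch of how a team might measure it for a single model. The harness names and pass/fail results are invented for illustration and are not taken from the Morph LLM post or from Braintrust.

```python
# Hypothetical pass (1) / fail (0) results for ONE model on the same five
# coding tasks, run under three made-up evaluation harnesses.
results = {
    "harness_a": [1, 1, 0, 1, 0],
    "harness_b": [1, 0, 0, 1, 0],
    "harness_c": [1, 1, 1, 1, 0],
}


def pass_rate(outcomes):
    """Fraction of tasks the model solved under one harness."""
    return sum(outcomes) / len(outcomes)


def harness_spread(results_by_harness):
    """Return per-harness pass rates and the gap between best and worst."""
    rates = {name: pass_rate(o) for name, o in results_by_harness.items()}
    spread = max(rates.values()) - min(rates.values())
    return rates, spread


rates, spread = harness_spread(results)
for name, rate in sorted(rates.items()):
    print(f"{name}: {rate:.0%}")
print(f"spread attributable to the harness: {spread:.0%}")
```

If the spread between harnesses is larger than the gap between two candidate models, the choice of harness matters more than the choice of model for that comparison.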

Because of this volatility, AI teams cannot rely solely on the internal testing provided by model creators like OpenAI, Anthropic, or Mistral. Once an API is integrated into a specific product, the implementing team must conduct their own performance tests, or evals, to ensure the tool works for their unique use case. This requires moving beyond traditional machine learning metrics, such as precision and recall, which are often too narrow for complex tasks. Instead, developers must evaluate functional performance—essentially asking if the AI agent is actually solving the intended problem in a way that is useful to the end user.

To manage this complexity, the Braintrust platform treats agent quality as a combination of two pillars: experimentation-phase tests and production-phase monitoring, also known as observability. This approach creates a feedback loop in which real-world data from production is fed back into offline test sets, so that automated scores stay aligned with human judgment. Crucially, this process requires non-technical domain experts. These specialists analyze agent traces, the detailed logs of an AI's step-by-step actions, to explain not just whether a task failed, but why it failed. By involving the people closest to the problem, companies can better refine prompts and context to ensure the AI remains relevant and effective in the wild.
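The feedback loop described above can be sketched roughly as follows. This is not Braintrust's API: the `Trace` record, its fields, and both helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Trace:
    """One production run of an agent (hypothetical record shape)."""
    task_id: str
    steps: List[str]                   # the agent's step-by-step actions
    auto_pass: bool                    # verdict from the automated scorer
    human_pass: Optional[bool] = None  # expert verdict, None until reviewed
    note: str = ""                     # the expert's explanation of why


def agreement_rate(traces: List[Trace]) -> Optional[float]:
    """Share of expert-reviewed traces where scorer and expert agree."""
    reviewed = [t for t in traces if t.human_pass is not None]
    if not reviewed:
        return None
    agree = sum(t.auto_pass == t.human_pass for t in reviewed)
    return agree / len(reviewed)


def promote_failures(traces: List[Trace], offline_set: Dict[str, Trace]) -> Dict[str, Trace]:
    """Copy expert-confirmed failures into the offline test set."""
    for t in traces:
        if t.human_pass is False and t.task_id not in offline_set:
            offline_set[t.task_id] = t
    return offline_set


production_traces = [
    Trace("t1", ["search", "answer"], auto_pass=True, human_pass=True),
    Trace("t2", ["answer"], auto_pass=True, human_pass=False,
          note="skipped the lookup step, so the answer was stale"),
    Trace("t3", ["search", "search"], auto_pass=False, human_pass=False,
          note="looped on the same query"),
    Trace("t4", ["search", "answer"], auto_pass=True),  # not yet reviewed
]

offline_set = promote_failures(production_traces, {})
print("scorer/expert agreement:", agreement_rate(production_traces))
print("offline test cases:", sorted(offline_set))
```

A low agreement rate is a signal that the automated scorer needs revising before its numbers are trusted, which is where the expert notes on why a trace failed become useful.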

03 Isomorphic Launches Automated Labs for Scientific Discovery

AI is transforming how we find new medicines and materials, but there is a fundamental physical barrier known as the "world of atoms." In digital domains like software engineering or mathematics, an AI can propose a solution and verify its accuracy almost instantly using a compiler or a logical proof. However, in the natural sciences—specifically physics, chemistry, and biology—a hypothesis cannot be proven on a screen alone. It requires physical verification in a laboratory to see if the theory holds true in reality. This creates a massive bottleneck in the discovery loop, as the time required to physically test a theory is significantly longer than the time it takes for an AI to imagine one.

To break this bottleneck, Isomorphic is establishing an automated laboratory in London. The company is currently managing a massive library of approximately 200,000 designs for new materials, including potential superconductors. These materials could revolutionize how we transport energy or build computers, but they cannot be validated quickly enough using traditional, manual scientific methods. By automating the verification process, Isomorphic aims to accelerate the transition from a digital blueprint to a physical sample, allowing them to sift through hundreds of thousands of candidates at a pace that finally matches the speed of AI generation.

This strategic move represents a shift toward what is known as "closed loop automated discovery." In this model, the AI that generates the hypothesis and the automated lab that verifies it are fused into a single, integrated unit. Rather than relying on a human scientist to manually move a sample from a computer to a test tube, the system can automatically feed the results of a physical experiment back into the AI to refine the next set of designs. This process of recursive self-improvement allows the system to learn from its own physical failures and successes in real-time, potentially turning the slow, traditional process of trial-and-error into a high-speed industrial pipeline for scientific discovery.
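As a rough illustration of the closed loop, the toy sketch below wires a proposer and a verifier together so that each measured result steers the next round of designs. Everything in it is a stand-in: the "lab" is a simple formula rather than a physical experiment, and none of the function names reflect Isomorphic's actual system.

```python
import random


def run_lab_experiment(design: float) -> float:
    """Stand-in for the automated lab: returns a measured score for a design.

    Toy assumption: the hidden property peaks at design == 7.0.
    """
    return -(design - 7.0) ** 2


def propose_designs(best_design: float, rng: random.Random, n: int = 5, step: float = 1.0):
    """Stand-in for the generative model: vary the best design found so far."""
    return [best_design + rng.uniform(-step, step) for _ in range(n)]


def closed_loop(start: float, rounds: int = 20, seed: int = 0):
    """Alternate between proposing designs and verifying them."""
    rng = random.Random(seed)
    best_design, best_score = start, run_lab_experiment(start)
    history = [best_score]
    for _ in range(rounds):
        for design in propose_designs(best_design, rng):
            score = run_lab_experiment(design)   # verification step
            if score > best_score:               # feed the result back
                best_design, best_score = design, score
        history.append(best_score)
    return best_design, history


best, history = closed_loop(start=0.0)
print(f"best design after {len(history) - 1} rounds: {best:.2f}")
print(f"score improved from {history[0]:.2f} to {history[-1]:.2f}")
```

The point of the sketch is the structure: no human sits between the measurement and the next proposal, so the speed of the loop is set by the lab rather than by manual handoffs.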

04 Tmux and VPS Enable Persistent Multi-Agent Workflows

AI agents can now work for days without a human needing to keep their laptop open or plugged in. Professional "agentic engineers"—developers who utilize AI agents to automate software creation—are shifting their workflows from local laptops to Virtual Private Servers (VPS). A VPS is a remote server that remains active regardless of the user's local machine status. This transition eliminates common hardware risks; if a developer's laptop runs out of battery or suffers a hardware failure, the agentic development continues uninterrupted on the server. By offloading these heavy operations to a remote environment, engineers avoid the need for expensive local hardware and ensure that agents can pursue complex goals for hours or even days at a time.

To manage these remote operations, engineers use Tmux, a terminal multiplexer that lets them run multiple terminal sessions side by side within a single window. Using keyboard shortcuts built on the default prefix Ctrl+B, developers can divide their screen into a grid of active terminals: Ctrl+B followed by a quotation mark splits the current pane into top and bottom halves, and Ctrl+B followed by a percent sign splits it into left and right halves. This setup enables a multi-agent workflow where different AI agents, such as Codex, Hermes, or Pi, can work on separate features of the same project or manage entirely different projects concurrently. For instance, a developer might launch several instances of Codex-yolo to tackle various parts of a codebase in parallel, significantly accelerating the development process.

The most critical advantage of this combination is persistence. On a local machine, closing the laptop suspends or ends whatever was running; on a VPS, Tmux keeps the session alive even after the user disconnects. This allows agents to keep working on long-term goals, such as building a 3D shooter game end-to-end, while the engineer sleeps. Because the work happens on a remote server, the developer can reconnect from any device, including a smartphone via apps like Termius. This infrastructure transforms AI coding from a fragile local process into a robust, persistent operation that is no longer constrained by local battery life or hardware stability.
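The workflow above can be scripted instead of typed by hand. The sketch below only builds and prints the Tmux command lines for one detached session with one pane per agent; it does not run them. The session name and the agent commands are placeholders, and the pane targets assume Tmux's default `base-index` and `pane-base-index` of 0.

```python
import shlex


def tmux_plan(session, agent_commands):
    """Return tmux command lines for a detached session, one pane per agent."""
    if not agent_commands:
        raise ValueError("need at least one agent command")
    # -d starts the session detached, so it survives the SSH connection closing.
    plan = [["tmux", "new-session", "-d", "-s", session]]
    # One extra pane per additional agent; -h places panes left and right.
    for _ in agent_commands[1:]:
        plan.append(["tmux", "split-window", "-h", "-t", session])
    # Even out the pane sizes once all panes exist.
    plan.append(["tmux", "select-layout", "-t", session, "tiled"])
    # Type each agent's command into its own pane and press Enter.
    for pane, command in enumerate(agent_commands):
        target = f"{session}:0.{pane}"  # assumes default window and pane numbering
        plan.append(["tmux", "send-keys", "-t", target, command, "Enter"])
    return plan


# Placeholder agent commands; substitute whatever CLI agent is installed.
plan = tmux_plan("agents", ["my-agent --task api", "my-agent --task ui"])
for argv in plan:
    print(shlex.join(argv))
```

After the printed commands are run on the server, `tmux attach -t agents` reconnects to the session from any SSH client, and Ctrl+B followed by `d` detaches again without stopping the agents.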

05 Enterprises Mismanage Generative AI Team Roles

Traditional companies are stalling their AI progress by putting the wrong people in charge. When a CEO or CIO decides to implement generative AI, they often hand the project to existing machine learning or data science teams. This happens because these teams already possess the necessary technical tooling, and the "AI" label makes them seem like a natural fit for the assignment. However, this creates a technical silo that separates the development of AI agents—software designed to perform tasks and achieve goals—from the people who actually understand the product and the end user.

The core error is treating a generative AI agent as if it were a standard predictive model, which is a type of AI used primarily to forecast trends or categorize data. While data scientists are experts at measuring a model's accuracy through technical metrics, building a functional agent is more akin to building a consumer product than a mathematical formula. By isolating the work to machine learning engineers, companies fail to involve the diverse teams and domain expertise necessary for success. They miss out on the essential contributions of product managers, application engineers, and systems engineers, who are required to translate high-level requirements from non-technical experts into a working feature.

For a generative AI project to succeed, it requires a diverse team rather than a single technical department. While data scientists still provide immense value—particularly when fine-tuning an open-source model to fit a specific use case—they cannot build a complete product in isolation. When companies fail to integrate diverse engineering roles, they lose the ability to implement AI requirements directly into the product itself. Shifting the organizational structure from a data science project to a cross-functional product team ensures that the AI is not just a technical experiment, but a tool that is properly integrated into the company's actual software and business workflows.

06 Model Publisher Notes Lack Evaluation Transparency

When AI companies release a new model, they typically accompany the launch with performance charts that claim their technology outperforms previous versions or competitors. For the people choosing which model to integrate into their business or software, these charts are intended to be the gold standard for decision-making. However, these results are often unverifiable, meaning that an outside party cannot independently confirm that the numbers are accurate. This lack of transparency turns what should be a scientific benchmark into a marketing claim, leaving users to trust the publisher's word rather than a reproducible fact.

The core of the issue is that the specific methods used to run these evaluations are kept secret. An evaluation is a standardized test of a model's intelligence or skill, and while the final score is public, the configuration the model was run with is not. Furthermore, publishers frequently withhold the orchestration details, meaning the technical arrangements and processes used to run the benchmark. Because there are a vast number of possible configurations and setup options, the way a test is orchestrated can significantly influence the final outcome. Without knowing these details, it is impossible to tell whether a high score is the result of a superior model or simply a highly optimized test environment.
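One way to picture what is missing is a small manifest that would travel with every published score. The sketch below is purely illustrative: the field names and values are assumptions about the kind of settings that matter, not a format any publisher uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalManifest:
    """Hypothetical record of the settings behind one reported score."""
    benchmark: str
    harness: str
    temperature: float
    max_attempts: int
    tools_enabled: tuple

    def fingerprint(self) -> str:
        """Short hash that changes whenever any setting changes."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


published = EvalManifest("example-bench", "harness_a", 0.0, 1, ("shell",))
replication = EvalManifest("example-bench", "harness_a", 0.0, 1, ("shell",))
tuned = EvalManifest("example-bench", "harness_a", 0.0, 8, ("shell", "browser"))

print("replication matches published setup:",
      published.fingerprint() == replication.fingerprint())
print("tuned run matches published setup:",
      published.fingerprint() == tuned.fingerprint())
```

Two scores are only comparable when their fingerprints match; a chart that omits this information cannot be checked either way.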

This opacity creates a systemic problem for the AI industry. When evaluations are not transparent or accessible, the leaderboards used to track progress become less relevant over time. Instead of fostering a culture of open verification, the current trend encourages a cycle where publishers prioritize releasing impressive-looking charts over providing the data necessary for external validation. For developers and companies, this means the tools they rely on may not perform in the real world as they did in the publisher's hidden test environment, potentially leading to unexpected failures or inefficiencies in production.

07 Google DeepMind Crowdsources AGI Benchmarks

Measuring the actual cognitive abilities of an artificial general intelligence—or AGI—is a complex challenge that cannot be solved by a few engineers in a closed lab. To address this, Google DeepMind is turning to the broader community to help build the standardized tests, known as benchmarks, used to evaluate how these models think. Following the publication of a research paper identifying ten distinct cognitive faculties, or mental capabilities, DeepMind launched a hackathon focused on five of these areas. By crowdsourcing this process, the team aims to incorporate diverse perspectives and unique insights that internal AI labs typically lack, ensuring that the tools used to measure intelligence are comprehensive and not limited by a narrow corporate viewpoint.

A key part of this evaluative effort is the "Game Arena," a specialized environment where AI models are pitted against various games to isolate and analyze specific skills. Rather than using a single test, the arena employs a varied selection of games to probe different mental faculties. For example, the game Werewolf is used specifically to test a model's capacity for deception. Poker serves a dual purpose, allowing researchers to analyze how a model handles both randomness and deceptive strategies. Meanwhile, Chess is utilized for more general machine learning analysis. This approach transforms gaming into a diagnostic tool, allowing researchers to see exactly where a model excels or fails in complex social and strategic interactions.

These tests are already revealing distinct "personalities" among different AI models. In the Poker arena, the model Grok has demonstrated a tendency to be highly aggressive, frequently opting to go "all-in." In contrast, other models are more conservative, and some of the newest-generation models have actually become worse at poker because they are too risk-averse. These behavioral differences provide critical data on how different architectures handle uncertainty and risk. By moving these evaluations into a public and gamified space, DeepMind is creating a more transparent way to understand the cognitive boundaries of AGI before these agents are deployed into real-world tasks.

08 Anthropic Secures SpaceX Colossus Computing Resources

Anthropic is significantly boosting its ability to build more powerful artificial intelligence by entering into a strategic partnership with SpaceX. This move provides the AI company with access to massive amounts of computing power, which serves as the essential infrastructure—or the "digital fuel"—required to train the next generation of large-scale AI models. By securing these vast resources, Anthropic can dramatically accelerate its research and development cycles. This means the company can move more quickly from the initial design phase of a model to final deployment, potentially bringing more advanced capabilities to the public much sooner than previously possible.

The technical foundation of this agreement is Anthropic's newfound access to the Colossus computing clusters, specifically Colossus 1 and Colossus 2. These clusters are not typical servers; they are massive arrays of processors designed to handle the staggering amount of data and mathematical operations involved in training modern AI. The partnership ensures that Anthropic has the necessary hardware to support the intense computational demands of deep learning. This access was expanded throughout June, granting the company the ability to utilize both Colossus 1 and Colossus 2, thereby maximizing the scale of their training environments.

This development underscores a critical trend in the AI industry: the race for "compute," where the winner is often decided by who has the most processing power. For the general user, this shift is important because the quality and reasoning capabilities of an AI model are often directly tied to the scale of the hardware used during its training. By partnering with SpaceX to leverage the Colossus systems, Anthropic is removing a major bottleneck in its production pipeline. This strategic move allows them to experiment with larger, more complex architectures that could lead to AI tools with better accuracy, deeper understanding, and more reliable performance across a wider variety of professional and creative tasks.

09 Gemini 3.5 Flash Increases Output Token Pricing

Using Google's latest lightweight AI model, Gemini 3.5 Flash, has suddenly become significantly more expensive for those generating large amounts of text. While the cost to send information to the model, known as input tokens, remains relatively low at $1.50 per million tokens, the price for the text the AI produces, the output tokens, has jumped sharply. Specifically, output tokens now cost $9 per million, three times the $3 per million charged for the previous Gemini 3 Flash. This shift means that developers and businesses relying on the model for high-volume content generation will see the output portion of their bill triple for every word the AI writes.
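The effect on a bill is easy to work out from the prices quoted above. In the sketch below, the per-million prices come from the text, while the monthly volume of 50 million output tokens is an invented workload used only for illustration.

```python
# Prices in US dollars per million tokens, as quoted in the text.
INPUT_PRICE = 1.50
OUTPUT_PRICE_NEW = 9.00   # Gemini 3.5 Flash
OUTPUT_PRICE_OLD = 3.00   # Gemini 3 Flash


def token_cost(tokens, price_per_million):
    """Cost in dollars for a given number of tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload: 50 million output tokens per month.
monthly_output_tokens = 50_000_000
old_bill = token_cost(monthly_output_tokens, OUTPUT_PRICE_OLD)
new_bill = token_cost(monthly_output_tokens, OUTPUT_PRICE_NEW)

print(f"output cost at the old price: ${old_bill:,.2f}")
print(f"output cost at the new price: ${new_bill:,.2f}")
print(f"ratio: {new_bill / old_bill:.1f}x")
```

Because the article does not state the previous input price, the comparison covers output tokens only.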

This price hike comes despite the model's positioning as a "Flash" or lightweight version, designed for efficiency. In terms of performance, Gemini 3.5 Flash is remarkably capable, offering a potent combination of high intelligence and rapid output speed. In various benchmark tests, which measure a model's ability to solve problems and understand data, it has performed impressively well. In many areas, it nearly matches the scores of the previous top-tier model, Gemini 3.1 Pro, despite being a smaller, faster version. Its ability to handle multimodal understanding—the capacity to process different types of data like text and images—remains highly competitive.

However, the increased cost has sparked criticism regarding the model's overall value. While it is fast and intelligent, it has not reached the absolute peak of industry performance, particularly in specialized tasks like coding, where models such as Claude Opus 4.7 and GPT 5.5 still hold an edge. Because it does not set a new absolute standard for coding excellence, some users argue that the steep price increase for output tokens undermines its appeal as a cost-effective option. Consequently, Gemini 3.5 Flash now occupies a middle ground: it is a high-performing, fast tool, but one that requires a much larger financial investment for its outputs than its predecessor.

10 Qwen 3.7-Max Slashes Tetris Bot Training Costs

The cost of teaching artificial intelligence to master specific tasks is plummeting, making high-performance AI more accessible to a wider range of developers and companies. Recently, the Qwen 3.7-Max model has set a new benchmark for training efficiency, proving that sophisticated capabilities do not necessarily require astronomical budgets. By drastically reducing the financial overhead needed to refine a model for a particular goal, this development shifts the industry focus from who possesses the most raw computing power to who can utilize the most efficient model architecture. This shift lowers the barrier to entry for creating specialized tools that were previously the sole domain of the wealthiest tech giants.

A striking example of this efficiency is seen in the creation of a bot capable of playing the classic game Tetris. While the process of training—essentially the phase where a model learns to make decisions through data and repetition—often involves massive expenditures, Qwen 3.7-Max managed to train a functional Tetris-playing agent for a total cost of just $1.32. This represents a 56% improvement in training costs compared to previous benchmarks. Such a remarkably low price point demonstrates that the model can achieve specialized proficiency without the need for the exhaustive and expensive compute cycles that typically characterize the training of advanced AI systems.
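The two figures above imply a rough baseline, but only under an assumption the article does not confirm: that a "56% improvement" means the new cost is 56% lower than the previous best. Read that way, the arithmetic looks like this.

```python
def implied_baseline(new_cost, reduction):
    """Previous cost implied by a new cost and a fractional reduction.

    Assumes new_cost == baseline * (1 - reduction), which is one possible
    reading of "56% improvement", not something stated in the source.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return new_cost / (1 - reduction)


new_cost = 1.32    # dollars, from the text
reduction = 0.56   # from the text, interpreted as a cost reduction

baseline = implied_baseline(new_cost, reduction)
print(f"implied previous cost: ${baseline:.2f}")
print(f"implied saving per training run: ${baseline - new_cost:.2f}")
```

If the 56% instead refers to something like performance per dollar, the implied baseline would be different, so the result should be treated as a back-of-the-envelope estimate.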

When measured against other leading frontier-level models, the value proposition of Qwen 3.7-Max becomes even more apparent. In direct comparisons with other high-end models such as DeepSeek V4 and Claude Opus 4.6, Qwen 3.7-Max emerged as the most cost-effective option, delivering the best overall performance relative to the actual amount spent on training. This efficiency is critical as the industry moves toward more sustainable AI development. For businesses and researchers, the ability to deploy high-performing bots at a fraction of the usual cost means that specialized AI tools can be developed, tested, and iterated upon rapidly, removing the risk of prohibitive expenses that often stall innovation.