Processing the Earth from space is an exercise in managing astronomical costs. For environmental researchers and government agencies tracking mangrove migration or deforestation in real time, the bottleneck is rarely the availability of data, since the European Space Agency's Sentinel-2 provides a constant stream of imagery, but the sheer computational tax of analyzing it. The lifecycle of satellite AI, from data export and preprocessing to inference and post-processing, consumes massive amounts of GPU memory and compute cycles. This financial barrier often restricts high-resolution planetary monitoring to well-funded labs, leaving smaller NGOs and regional governments unable to deploy these tools at scale.
The Tokenization Tax and the v1.1 Architecture
To break this cost barrier, the team behind OlmoEarth has released v1.1, a model family designed to reduce operational costs by up to 3x compared to its predecessor. Launched as an evolution of the v1 model released in November 2025, v1.1 introduces a tiered lineup consisting of Base, Tiny, and Nano versions. This stratification allows developers to choose a model size that fits their specific hardware budget, whether they are producing a one-time national crop map or maintaining a continuous global monitoring pipeline.
The core of the efficiency gain lies in how the model handles the Sentinel-2 data structure. Sentinel-2 imagery is represented as a tensor with the dimensions `[H, W, T, D=12]`, where H and W are the pixel dimensions along the latitude and longitude axes, T is the time axis, and D=12 is the number of spectral channels. In the original OlmoEarth v1, the model processed these channels by creating separate tokens for different resolutions: one token each for the 10m, 20m, and 60m resolution bands. This meant that for every time step in a spatial patch, the model generated three distinct tokens.
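The effect on sequence length can be illustrated with a small counting sketch. The band grouping below follows the native resolutions of the 12 Sentinel-2 bands; whether OlmoEarth assigns bands to tokens in exactly this way is an assumption, and the input size, patch size, and the `count_tokens` helper are hypothetical values chosen only for illustration.

```python
# Native-resolution groups of the 12 Sentinel-2 bands (band 10 omitted).
# ASSUMPTION: OlmoEarth's actual band-to-token assignment may differ.
BAND_GROUPS = {
    "10m": ["B02", "B03", "B04", "B08"],
    "20m": ["B05", "B06", "B07", "B8A", "B11", "B12"],
    "60m": ["B01", "B09"],
}


def count_tokens(height, width, timesteps, patch_size, tokens_per_patch_step):
    """Count tokens for an [H, W, T, D] input split into square spatial patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must be divisible by patch_size")
    patches = (height // patch_size) * (width // patch_size)
    return patches * timesteps * tokens_per_patch_step


# Hypothetical input: a 64x64 pixel tile, 12 time steps, 8x8 pixel patches.
H, W, T, PATCH = 64, 64, 12, 8

# v1-style: one token per resolution group, per patch, per time step.
v1_tokens = count_tokens(H, W, T, PATCH, tokens_per_patch_step=len(BAND_GROUPS))
# v1.1-style: a single token per patch, per time step.
v11_tokens = count_tokens(H, W, T, PATCH, tokens_per_patch_step=1)

print(f"v1-style tokens:   {v1_tokens}")
print(f"v1.1-style tokens: {v11_tokens}")
```

Under these assumed sizes, the single-token scheme yields exactly one-third as many tokens, regardless of the tile size or number of time steps chosen.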
Because OlmoEarth is built on the Transformer architecture, this approach created a significant computational burden. In Transformers, the cost of the attention mechanism grows quadratically with the length of the token sequence, while the cost of the per-token layers (projections and feed-forward blocks) grows linearly. By generating three tokens per time step, v1 was effectively tripling the sequence length, which tripled the Multiply-Accumulate operations (MACs) in the per-token layers and raised the attention MACs roughly ninefold on every forward pass. OlmoEarth v1.1 solves this by integrating these three resolutions into a single token. By reducing the token count to one-third of the original volume, the model dramatically shortens the sequence, cutting the per-token compute to one-third and the attention compute further still during pre-training, fine-tuning, and inference.
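A rough MAC estimate makes the scaling concrete. The sketch below uses a common textbook approximation for one Transformer encoder layer (four width-by-width projections, a feed-forward block with a 4x expansion, and two sequence-by-sequence attention products); it ignores biases, normalization, and softmax. The model width and sequence lengths are hypothetical and are not taken from the OlmoEarth release.

```python
def layer_macs(seq_len, width, mlp_ratio=4):
    """Approximate MACs for one Transformer encoder layer.

    per_token: Q, K, V and output projections plus the two MLP matmuls,
               all linear in seq_len.
    attention: the QK^T scores and the attention-weighted sum of V,
               both quadratic in seq_len.
    """
    projections = 4 * seq_len * width * width
    mlp = 2 * mlp_ratio * seq_len * width * width
    attention = 2 * seq_len * seq_len * width
    return {"per_token": projections + mlp, "attention": attention}


WIDTH = 768            # hypothetical model width
V11_LEN = 768          # hypothetical sequence length with one token per step
V1_LEN = 3 * V11_LEN   # three tokens per step triples the sequence

v1 = layer_macs(V1_LEN, WIDTH)
v11 = layer_macs(V11_LEN, WIDTH)

per_token_ratio = v1["per_token"] / v11["per_token"]
attention_ratio = v1["attention"] / v11["attention"]
total_ratio = sum(v1.values()) / sum(v11.values())

print(f"per-token layers: {per_token_ratio:.1f}x")
print(f"attention:        {attention_ratio:.1f}x")
print(f"whole layer:      {total_ratio:.2f}x")
```

The whole-layer ratio always falls between 3x and 9x: it sits close to 3x when the sequence is short relative to the model width, and drifts toward 9x as attention begins to dominate. End-to-end savings also depend on costs outside the Transformer layers, such as data loading and preprocessing, which this estimate does not model.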
The Performance Trap and the Pre-training Fix
Reducing token counts is a straightforward path to efficiency, but it usually comes with a steep price in accuracy. When the research team first attempted to simply merge the resolution tokens, they encountered a significant performance collapse. In tests using the m-eurosat kNN benchmark, a standard for evaluating remote sensing models, the simple integration of tokens caused a drop of 10 percentage points (ppt). This revealed a critical tension in satellite AI: separate tokens for the 10m, 20m, and 60m resolutions let the model capture the complex spectral relationships between different bands more effectively. Compressing this information into a single token essentially erased the nuanced spatial and spectral signatures the model needed to make accurate predictions.
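For readers unfamiliar with kNN evaluation, the idea is to freeze the pre-trained encoder, embed labeled images, and classify each test embedding by a majority vote among its nearest training embeddings. The sketch below is a minimal pure-Python version using cosine similarity; the value of k, the similarity measure, and the toy two-dimensional embeddings are all assumptions for illustration and do not reproduce the actual m-eurosat protocol.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(query, train_embs, train_labels, k=3):
    """Label a query embedding by majority vote among its k nearest neighbours."""
    ranked = sorted(
        range(len(train_embs)),
        key=lambda i: cosine(query, train_embs[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_embs, train_labels, test_embs, test_labels, k=3):
    """Fraction of test embeddings whose kNN prediction matches the label."""
    correct = sum(
        knn_predict(q, train_embs, train_labels, k) == y
        for q, y in zip(test_embs, test_labels)
    )
    return correct / len(test_labels)


# Toy embeddings standing in for frozen-encoder outputs (hypothetical data).
train_embs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["forest", "forest", "forest", "water", "water", "water"]
test_embs = [[0.95, 0.05], [0.05, 0.95]]
test_labels = ["forest", "water"]

accuracy = knn_accuracy(train_embs, train_labels, test_embs, test_labels)
print(f"kNN accuracy: {accuracy:.2f}")
```

Because no classifier weights are trained, a kNN score reflects the quality of the embeddings themselves, which is why a drop on this kind of benchmark points at the representation rather than at a fine-tuning recipe.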
This creates a classic trade-off seen in other remote sensing architectures. Models like Galileo and SatMAE (a masked autoencoder for temporal and multi-spectral satellite imagery) prioritize accuracy by maintaining separate tokens for different resolutions, accepting the higher compute cost. On the other end of the spectrum, CROMA (Contrastive Radar-Optical Masked Autoencoders) adopts a strategy of integrating all bands into a single token to maximize efficiency. OlmoEarth v1.1 sought a middle ground: the efficiency of CROMA with the precision of SatMAE.
To overcome the 10 percentage point drop, the team did not revert to the old tokenization method. Instead, they redesigned the pre-training regimen. By adjusting the training settings and the procedure of the pre-training phase, they pushed the model to extract the same level of spectral detail from a single integrated token that it previously extracted from three. To keep the comparison fair, the team kept the dataset identical to that used for v1. This isolation of variables indicates that the performance recovery came from the improved training regimen rather than from additional data. The result is a model that maintains v1-level accuracy while operating at a fraction of the cost. The OlmoEarth v1.1 weights and training code are publicly available.
This structural optimization transforms the economics of satellite analysis. By lowering the MACs required for inference, v1.1 allows for faster processing of millions of image patches. For a developer, this means that the time and money spent on fine-tuning the model for a specific task—such as identifying specific types of forest loss—is reduced by up to 3x, making the iteration cycle significantly faster.
The ultimate goal of this efficiency is the transition from static snapshots to dynamic, planet-scale map refreshing. Historically, global-scale analysis was a one-off research project because the cost of updating a map of an entire continent was prohibitive. With the 3x reduction in inference costs, the industry can move toward real-time global monitoring. This enables the precise tracking of rapidly changing indicators, such as the health of mangrove forests or the immediate causes of sudden deforestation, where the freshness of the data determines the effectiveness of the intervention.
By lowering the entry barrier for GPU infrastructure, OlmoEarth v1.1 moves high-performance satellite AI out of the elite research lab and into the hands of regional environmental agencies and small-scale developers. The shift from separate resolution tokens to an optimized integrated token system shows that efficiency in AI does not have to come at the expense of accuracy, provided the training strategy evolves alongside the architecture.




