Every team training a multimodal large language model has faced the same gut-wrenching question: what ratio of caption data to QA data to OCR data should I throw into the next run? The answer has been, almost universally, a shrug and a guess. This week at the ICLR 2026 workshop on Neural Architecture, Data, and Pretraining for Foundation Models (NADPFM), a team led by Bingbin Yuan and Sirajul Salehin published a framework that replaces that guesswork with a systematic optimization loop, at about 1 percent of the cost of a full training run.

MixAtlas: Decomposing Data by Image Concept and Task Type

The core insight behind MixAtlas is that not all data dimensions matter equally, and the ones that do matter interact in ways that single-axis tuning misses. The researchers propose splitting every training sample along two interpretable axes: the **image concept** (what the image depicts — a chart, a photograph, a diagram, a screenshot) and the **task supervision** (what the model is asked to do — caption, question-answer, classification, grounding). Each combination of concept and task defines a domain. Instead of guessing how much of each domain to include, MixAtlas runs a small proxy model — orders of magnitude smaller than the target model — on each domain and measures performance. It then fits a Gaussian process over those measurements to model uncertainty and predict the optimal mixing ratio across all domains.
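To make the mechanics concrete, here is a minimal sketch of the decomposition and surrogate step. Everything in it is illustrative: the concept and task lists, the toy proxy objective, and the hand-rolled RBF Gaussian process are stand-ins, not the paper's implementation.

```python
from itertools import product

import numpy as np

# Illustrative axes; the paper's actual taxonomy may be larger.
CONCEPTS = ["chart", "photograph", "diagram", "screenshot"]
TASKS = ["caption", "qa", "classification", "grounding"]
DOMAINS = list(product(CONCEPTS, TASKS))  # each (concept, task) pair is a domain
D = len(DOMAINS)                          # 16 domains

def proxy_score(weights):
    """Stand-in for training a small proxy model on mix `weights` and
    measuring benchmark performance (toy objective, peak at a uniform mix)."""
    return float(-np.sum((weights - 1.0 / D) ** 2))

def rbf(A, B, length_scale=0.25):
    """Squared-exponential kernel between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

# Score a handful of candidate mixes (points on the probability simplex).
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(D), size=12)
y = np.array([proxy_score(w) for w in X])

def gp_posterior(candidates):
    """GP posterior mean/std of the proxy score at new candidate mixes."""
    K_inv = np.linalg.inv(rbf(X, X) + 1e-6 * np.eye(len(X)))
    Ks = rbf(X, candidates)
    mean = Ks.T @ K_inv @ y
    var = 1.0 - np.einsum("ij,ik,kj->j", Ks, K_inv, Ks)  # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

mean, std = gp_posterior(rng.dirichlet(np.ones(D), size=4))
```

The posterior standard deviation is what makes the Gaussian process useful here: it tells the optimizer which untried mixing ratios are genuinely uncertain, so the proxy budget goes where the surrogate knows least.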

The full pipeline runs at roughly 1/100th the cost of a single full-scale training run. The proxy model is small enough to iterate quickly, and the Gaussian process surrogate means the optimizer doesn't need to exhaustively sample every possible ratio. The paper reports that the optimal mix found by the proxy transfers directly to the large model — the same ratio that works for a 1B-parameter model also works for a 7B or 13B model, with no additional tuning.
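The loop itself can be read as standard Bayesian optimization over the mixing simplex: propose candidate ratios, score the most promising one with the proxy, refit the surrogate, repeat. Below is a self-contained toy version with three domains; the UCB acquisition rule, the Dirichlet candidate sampler, and the proxy objective are all assumptions for illustration, since the paper's exact acquisition strategy isn't detailed here.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 3  # toy number of concept x task domains

def proxy_eval(w):
    """Stand-in for training the proxy model on mix w and measuring
    benchmark score (hypothetical objective, peak at [0.5, 0.3, 0.2])."""
    return float(-np.sum((w - np.array([0.5, 0.3, 0.2])) ** 2))

def rbf(A, B, length_scale=0.3):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

# Seed the surrogate with a few random mixes.
X = rng.dirichlet(np.ones(D), size=5)
y = np.array([proxy_eval(w) for w in X])

for _ in range(15):
    # GP posterior from all mixes scored so far.
    K_inv = np.linalg.inv(rbf(X, X) + 1e-6 * np.eye(len(X)))
    cand = rng.dirichlet(np.ones(D), size=256)  # candidate ratios on the simplex
    Ks = rbf(X, cand)
    mean = Ks.T @ K_inv @ y
    var = np.maximum(1.0 - np.einsum("ij,ik,kj->j", Ks, K_inv, Ks), 0.0)
    # Upper-confidence-bound acquisition: favor high mean OR high uncertainty.
    w_next = cand[np.argmax(mean + np.sqrt(var))]
    X = np.vstack([X, w_next])
    y = np.append(y, proxy_eval(w_next))

best = X[np.argmax(y)]  # recommended mixing ratio, handed to the full-scale run
```

The transfer claim is what makes this cheap step worthwhile: `best` is found entirely with proxy-scale evaluations, and the paper reports it holds unchanged at 1B, 7B, and 13B scale.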

What's Actually Different: Two-Axis Optimization Over Single-Criterion Tuning

Previous approaches to data mixing typically optimized along a single axis — data format (image-text pairs vs. interleaved sequences) or task type (captioning vs. VQA). MixAtlas tracks the fine-grained contribution of each concept-task intersection. For example, the system can tell you exactly how much the "chart image" × "question answering" domain contributes to final benchmark performance, separate from the contribution of "chart image" × "captioning" or "photograph" × "QA." This granularity lets the optimizer allocate data budget where it actually matters, rather than treating all chart data or all QA data as interchangeable.
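One way to picture that per-intersection accounting is as a ledger keyed by (concept, task) pairs. The sketch below uses a toy linear score model and randomly drawn utilities purely for illustration; MixAtlas estimates these contributions from proxy runs, not from a known formula.

```python
from itertools import product

import numpy as np

CONCEPTS = ["chart", "photograph", "diagram", "screenshot"]
TASKS = ["caption", "qa", "classification", "grounding"]
DOMAINS = list(product(CONCEPTS, TASKS))

# Hypothetical per-domain utilities (stand-ins for values a fitted
# surrogate would supply; drawn at random here for illustration).
rng = np.random.default_rng(2)
utility = dict(zip(DOMAINS, rng.uniform(0.0, 1.0, len(DOMAINS))))

def predicted_score(weights):
    """Toy benchmark predictor: weighted sum of per-domain utilities."""
    return sum(weights[d] * utility[d] for d in DOMAINS)

mix = {d: 1.0 / len(DOMAINS) for d in DOMAINS}  # start from a uniform mix

def marginal_contribution(domain):
    """Score drop from removing one domain and renormalizing the rest."""
    ablated = {d: (0.0 if d == domain else mix[d]) for d in DOMAINS}
    z = sum(ablated.values())
    ablated = {d: w / z for d, w in ablated.items()}
    return predicted_score(mix) - predicted_score(ablated)

contrib = {d: marginal_contribution(d) for d in DOMAINS}
# contrib[("chart", "qa")] isolates the chart x QA domain's effect,
# separate from ("chart", "caption") or ("photograph", "qa").
```

A single-axis method would only ever see row or column sums of this ledger, which is exactly the information loss the two-axis decomposition avoids.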

The paper demonstrates that this two-axis decomposition captures interactions that single-axis methods miss. In one ablation, optimizing only by task type improved ChartQA by 2 percent but degraded TextVQA by 1 percent. Optimizing only by image concept improved both but left 4–6 percent of potential gain on the table. The full two-axis optimization captured those gains simultaneously.

What Developers Get: 3× Faster Convergence, 10–13% Benchmark Gains

The practical results are concrete enough to act on. Applying the MixAtlas-optimized mixing ratio to a standard MLLM training pipeline produced:

- **Up to 3× faster convergence** to the same loss value compared to heuristic mixing ratios.

- **Consistent 2–5% improvements** across a suite of multimodal benchmarks.

- **ChartQA: +10%** (chart understanding accuracy).

- **TextVQA: +13%** (visual question answering on images containing text).

The gains are largest on text-heavy benchmarks, which makes intuitive sense: the optimizer can allocate more data to the concept-task domains that build OCR and text-reading capability, rather than diluting them with irrelevant image types.

For a team with a fixed GPU budget, these numbers translate directly into either a better model at the same cost or the same model at lower cost. The proxy model step adds overhead, but at 1/100th of the full training cost, it pays for itself on the first run.
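Back-of-the-envelope, with hypothetical numbers (the GPU-hour figures below are invented for illustration; only the 1/100th overhead and the up-to-3× speedup come from the paper):

```python
# Hypothetical budget for one full-scale MLLM training run.
full_run = 20_000                     # GPU-hours (illustrative, not from paper)
proxy_pipeline = 0.01 * full_run      # ~1/100th overhead -> 200 GPU-hours
speedup = 3.0                         # up to 3x faster convergence to same loss
optimized_run = full_run / speedup    # GPU-hours to reach the same loss
saved = full_run - (optimized_run + proxy_pipeline)
print(f"net savings: {saved:,.0f} GPU-hours")
```

Even in the worst case where the speedup fails to materialize, the downside is capped at the 1 percent proxy overhead.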

MixAtlas makes multimodal data mixing practical, interpretable, and cheap enough to run as a standard preprocessing step. The era of tuning data ratios by gut feeling is over.