Every morning, developers tasked with fine-tuning AI models for specialized domains hit the same invisible wall. They realize that the high-quality, domain-specific data required for professional-grade reasoning simply does not exist on the open web. Whether it is cyber threat intelligence, complex legal reasoning, or nuanced medical diagnostics, the gap between general-purpose training sets and professional requirements is vast. In many cases, the data exists but is locked behind strict privacy walls or corporate silos, leaving engineers to rely on fragile, manually crafted prompts to simulate expertise.

The Architecture of Simula and the Three Axes of Control

To solve this systemic shortage, a research team from Google and EPFL (École Polytechnique Fédérale de Lausanne) has unveiled Simula, a synthetic data generation framework that treats data creation as a mechanism design problem rather than a prompting exercise. Unlike traditional synthetic data pipelines, Simula does not rely on seed data, manual prompt engineering, or evolutionary algorithms. Instead, it constructs data from the ground up based on three controllable axes: quality, diversity, and complexity.

Quality is defined by whether a data point meets specific semantic and syntactic requirements. Diversity is split into two layers: global coverage, which ensures the entire conceptual space of a domain is spanned, and local variation, which ensures multiple interpretations of a single concept. Complexity measures how rare, sophisticated, or confusing an example is. By decoupling these three factors, Simula allows developers to scale the difficulty of a dataset without sacrificing its breadth.
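To make the decoupling concrete, here is a minimal sketch of how the three axes could be exposed as independent knobs. The class and field names are illustrative assumptions, not Simula's actual API.

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    # Global diversity: how deep the concept taxonomy is expanded.
    taxonomy_depth: int = 3
    # Local diversity: fraction of generated meta-prompts retained.
    meta_prompt_subsample: float = 0.5
    # Complexity: share of prompts that receive a difficulty boost (the ratio c).
    complexity_ratio: float = 0.25
    # Quality: verify correctness from both directions before accepting data.
    dual_critic: bool = True
```

Because each field maps to exactly one axis, a developer can, for example, raise `complexity_ratio` without touching the breadth controlled by `taxonomy_depth`.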

The framework executes this through a rigorous four-stage pipeline. The first stage focuses on global diversity using hierarchical taxonomies. When given a high-level goal, such as creating a dataset for cyber threat intelligence, a multimodal model (M3) identifies the primary variables of the domain, such as attack types, threat actors, and vulnerability classes. These variables are expanded into a taxonomy tree using breadth-first search. To prevent the omission of critical sub-categories, the team implemented a Best-of-N proposal strategy combined with a critic improvement phase. The model proposes N candidate child nodes and then critiques them for completeness, validity, and specificity. This structured approach ensures that when the system extracts up to 512,000 training examples, it captures the long tail of the domain rather than just the most common patterns.
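The first stage can be sketched as a breadth-first tree expansion with a propose-then-critique loop. In this sketch, `propose_children` and `critique` are stand-ins for the M3 model's Best-of-N proposal and critic phases; the hard-coded catalog merely simulates model output for illustration.

```python
from collections import deque

def propose_children(node, n):
    # Stand-in for the M3 model proposing N candidate sub-categories.
    catalog = {
        "cyber threat intelligence": [
            "attack types", "threat actors", "vulnerability classes"],
        "attack types": [
            "phishing", "ransomware", "supply-chain compromise"],
    }
    return catalog.get(node, [])[:n]

def critique(candidates):
    # Stand-in critic pass: keep only non-empty, reasonably specific labels.
    return [c for c in candidates if c and len(c.split()) <= 4]

def build_taxonomy(root, max_depth=2, n=3):
    """Breadth-first taxonomy expansion with Best-of-N proposals and a critic."""
    tree = {root: []}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # stop expanding beyond the configured depth
        children = critique(propose_children(node, n))
        tree[node] = children
        for child in children:
            tree.setdefault(child, [])
            queue.append((child, depth + 1))
    return tree
```

The breadth-first order guarantees that every branch of the domain is expanded to the same depth before any one branch dominates, which is what lets the pipeline reach the long tail rather than oversampling common patterns.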

The second stage addresses local diversity through meta-prompting. The system takes combinations of nodes from the taxonomy and feeds them to the M3 model to generate a meta-prompt. For instance, a combination of {house cat, poetry, travel enthusiast} might result in a prompt to write a haiku about a house cat on an adventure. To prevent mode collapse, where the model begins producing repetitive outputs, Simula generates multiple meta-prompts simultaneously and sub-samples them at a specific ratio.
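A minimal sketch of this stage: enumerate node combinations from the taxonomy, turn each into a meta-prompt, and sub-sample the result. `make_meta_prompt` is a hypothetical stand-in for the M3 model call; the sub-sampling ratio mirrors the mode-collapse countermeasure described above.

```python
import itertools
import random

def make_meta_prompt(combo):
    # Stand-in for the M3 model generating a meta-prompt from a node combination.
    return "Write a task that combines: " + ", ".join(combo)

def sample_meta_prompts(leaves, k=3, subsample_ratio=0.5, seed=0):
    """Generate meta-prompts from k-node combinations, then sub-sample."""
    rng = random.Random(seed)
    combos = itertools.combinations(leaves, k)
    prompts = [make_meta_prompt(c) for c in combos]
    # Keeping only a fraction of the prompts limits repetition from any
    # single meta-prompt dominating the dataset.
    keep = max(1, int(len(prompts) * subsample_ratio))
    return rng.sample(prompts, keep)
```

With the article's example nodes, a call like `sample_meta_prompts(["house cat", "poetry", "travel enthusiast", "haiku", "adventure"])` yields a diverse pool of task descriptions, half of the possible combinations at the default ratio.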

The third stage is complexification. The user defines a ratio, denoted as c, which determines what percentage of meta-prompts undergo a complexity boost. The M3 model is instructed to increase the difficulty of the prompt and the resulting output while maintaining the original core requirements. This separation allows developers to raise the ceiling of difficulty without narrowing the scope of the data.
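The ratio c can be applied as a simple random split over the meta-prompt pool, as in this sketch. `complexify` is a hypothetical stand-in for the M3 model's "increase difficulty" instruction.

```python
import random

def complexify(prompt):
    # Stand-in for the M3 model raising difficulty while keeping core requirements.
    return prompt + " Add an extra constraint that makes the task harder."

def apply_complexification(prompts, c, seed=0):
    """Boost a fraction c of meta-prompts; the rest pass through unchanged."""
    rng = random.Random(seed)
    n_boost = round(len(prompts) * c)
    boosted = set(rng.sample(range(len(prompts)), n_boost))
    return [complexify(p) if i in boosted else p
            for i, p in enumerate(prompts)]
```

Because the boost is applied after diversification, raising c makes the dataset harder without shrinking the set of concepts it covers.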

The final stage employs a dual-critic approach to ensure quality. Rather than asking a model a single question about whether a generated answer is correct, Simula poses two independent queries: is the answer correct, and is it incorrect? This dual-verification design is specifically engineered to mitigate sycophancy bias, the tendency of LLMs to agree with a plausible-looking output even when it is wrong. This is critical for tasks with objective ground truths, such as mathematics or multiple-choice questions.
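The dual-critic idea can be sketched as follows. Here `judge` is a hypothetical yes/no model call passed in by the caller; an example is accepted only when the two independent verdicts agree that it is correct.

```python
def dual_critic(example, judge):
    """Query correctness from both directions to counter sycophancy bias.

    `judge(question)` is a stand-in for a critic-model call returning
    "yes" or "no".
    """
    says_correct = judge(f"Is this answer correct? {example}") == "yes"
    says_incorrect = judge(f"Is this answer incorrect? {example}") == "yes"
    if says_correct and not says_incorrect:
        return "accept"
    # Conflicting or doubly negative verdicts are treated as unreliable.
    return "reject"
```

A sycophantic critic tends to answer "yes" to both framings; because agreement with both questions is contradictory, such examples are rejected rather than silently admitted into the training set.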

The Complexity Paradox and the Teacher Model Limit

To validate the framework, the researchers used Gemini 2.5 Flash (in non-reasoning mode) as the teacher model and Gemma 3 4B as the student model. The team conducted LoRA fine-tuning across five distinct domains, repeating the process 10 times with different seeds to report average accuracy within a 95% confidence interval. The datasets, totaling up to 512,000 points, included CTI-MCQ for cybersecurity standards, CTI-RCM for mapping CVE descriptions to CWE categories, LEXam for Swiss, EU, and international law exams in English and German, GSM8k for elementary mathematics, and Global MMLU for science and math in English, Korean, and Nepali.

The results confirmed that the full Simula system—combining global diversity, local diversity, complexification, and dual-criticism—consistently outperformed simple baselines across all scales. The synergy between global and local diversification proved essential; using only one of the two often led to sub-optimal results depending on the dataset size.

However, the experiments revealed a critical tension regarding complexity. In the GSM8k mathematics dataset, the high-complexity split resulted in a 10% increase in accuracy compared to the low-complexity split when using 64,000 data points. This suggests that for certain domains, harder data directly translates to a more capable student model. This trend reversed sharply in the LEXam legal dataset. In this case, the teacher model's accuracy was only 57%, and the high-complexity data actually degraded the student's performance.

This creates a complexity paradox: synthetic data only improves performance when the teacher model is reliable enough to generate accurate labels for that level of difficulty. The evidence was found in the critic rejection rates. For GSM8k, the rejection rate was a mere 2%, meaning the teacher model was highly confident and accurate. For LEXam, the rejection rate soared to 61%. This high rejection rate serves as a diagnostic signal, proving that Gemini 2.5 Flash was not sufficiently powerful to act as a reliable teacher for high-complexity legal reasoning.
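Turning the rejection rate into an explicit diagnostic is straightforward, as in this sketch. The 50% threshold is an illustrative assumption, not a value from the paper.

```python
def rejection_rate(verdicts):
    """Fraction of generated examples the dual critic rejected."""
    rejected = sum(1 for v in verdicts if v == "reject")
    return rejected / len(verdicts)

def teacher_is_reliable(verdicts, threshold=0.5):
    # A high rejection rate signals the teacher is out of its depth at this
    # complexity level; the threshold here is an arbitrary illustrative cutoff.
    return rejection_rate(verdicts) < threshold
```

Run against the article's numbers, a 2% rejection rate (GSM8k) passes while a 61% rate (LEXam) flags the teacher as unreliable for that complexity level.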

Historically, synthetic data generation relied on simple prompts, such as asking an LLM to create 100 cybersecurity questions. While the results looked plausible, they failed to cover the long tail of the domain and relied entirely on the intuition of the prompt engineer. Simula shifts this paradigm from prompting to design. By embedding global diversity in taxonomies, local diversity in meta-prompt sampling, complexity in the ratio c, and quality in dual-criticism, the process becomes predictable and transparent.

For developers, this means they can visualize the taxonomy tree and immediately spot missing concepts. It gives them precise control over the difficulty curve by adjusting the complexification ratio. Most importantly, it turns the data generation pipeline into a tool for auditing the teacher model itself: a 61% rejection rate tells a developer they have reached the cognitive limit of their teacher model and must either upgrade the model or simplify the domain requirements.

The research team has made the Simula code available for the community on GitHub: https://github.com/google-deepmind/simula.

Simula effectively transforms the art of synthetic data generation into a rigorous engineering discipline.