Ask any frontier LLM how a typical South Korean would react to a new policy, and you will likely receive a polished, sanitized answer that represents a person who does not actually exist. For developers and researchers, this is the average trap. Most large language models generate a fictional composite of a Korean citizen, ignoring the jagged edges of regional disparities, complex class hierarchies, and the actual distribution of wealth. The result is a simulation that feels plausible but lacks any grounding in sociological reality. This week, the developer community began discussing ManyPerson, a statistics-based AI persona simulator designed to replace these hallucinations with a virtual population anchored in official government data.
The Architecture of 41,000 Data-Driven Personas
ManyPerson does not rely on the model's internal knowledge of demographics. Instead, it uses raw CSV data from the 2025 Household Financial Welfare Survey, distributed through Statistics Korea's MDIS microdata service. The system parses 34,880 household master records and joins them with 69,929 individual household member records. This process results in a structured population of approximately 41,000 unique Korean AI personas. Each persona preserves key real-world attributes, including gender, age, income quintile, housing type, and whether the individual resides in the Seoul metropolitan area.
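The household-to-member join described above can be sketched in a few lines of Python. This is a minimal illustration, not ManyPerson's actual pipeline: the column names (`household_id`, `income_quintile`, and so on) and the sample rows are hypothetical, since the real MDIS files use their own variable codes, and the sketch does not attempt to reproduce how the project arrives at roughly 41,000 personas.

```python
import csv
import io

# Hypothetical sample data; the real survey files use different column codes.
HOUSEHOLDS_CSV = """household_id,income_quintile,housing_type,metro_area
H1,5,apartment,1
H2,2,detached,0
"""

MEMBERS_CSV = """household_id,member_no,gender,age
H1,1,M,52
H1,2,F,49
H2,1,F,71
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_members_to_households(households, members):
    """Attach household-level attributes to every member row."""
    by_id = {h["household_id"]: h for h in households}
    joined = []
    for member in members:
        household = by_id.get(member["household_id"])
        if household is None:
            continue  # skip members whose household record is missing
        joined.append({**household, **member})
    return joined


personas = join_members_to_households(
    load_rows(HOUSEHOLDS_CSV), load_rows(MEMBERS_CSV)
)
for p in personas:
    print(p["household_id"], p["member_no"], p["age"], p["housing_type"])
```

Each resulting row carries both the individual's attributes and the household's, which is the shape a persona record needs before any qualitative layer is added.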
To transform these statistical rows into living personas, ManyPerson employs Gemini to generate qualitative layers. The AI assigns specific job titles, MBTI types, personality traits, hobbies, and hometowns, as well as first-person self-introductions. These generations are not random. The system enforces strict constraints to ensure narrative consistency. For example, individuals with an annual income exceeding 100 million won are assigned roles as executives or specialized professionals, while those listed as unemployed or retired are given narratives that reflect their actual economic status.
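One way to picture these constraints is as a validation pass over each generated persona. The sketch below is an assumption about how such a check could look, not the project's code: the field names, the role categories, and the `check_persona` function are all hypothetical. Only the 100 million won threshold and the unemployed or retired rule come from the description above.

```python
HIGH_INCOME_THRESHOLD_WON = 100_000_000
# Hypothetical category labels for illustration only.
HIGH_INCOME_ROLES = {"executive", "specialized professional"}
NON_WORKING_STATUSES = {"unemployed", "retired"}


def check_persona(stat_row, generated):
    """Return a list of consistency violations; an empty list means the persona passes."""
    problems = []
    status = stat_row["employment_status"]
    role = generated["role_category"]

    if status in NON_WORKING_STATUSES:
        # The narrative must reflect the recorded economic status.
        if role != status:
            problems.append(f"status is {status!r} but role is {role!r}")
    elif stat_row["annual_income_won"] > HIGH_INCOME_THRESHOLD_WON:
        # High earners should be given a matching role.
        if role not in HIGH_INCOME_ROLES:
            problems.append(f"income above threshold but role is {role!r}")
    return problems


row = {"annual_income_won": 120_000_000, "employment_status": "employed"}
print(check_persona(row, {"role_category": "clerk"}))
print(check_persona(row, {"role_category": "executive"}))
```

A generation that fails the check could be regenerated or corrected before it is stored, so the statistical row stays the source of truth.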
Under the hood, the platform is built for scale and flexibility. The backend runs on Node.js with the Express framework, while PostgreSQL stores the persona data, using JSONB columns for extensible attributes. For queue and cache management, the system integrates Valkey, a Redis-compatible in-memory data store. The entire infrastructure is deployed on GKE Autopilot, so compute resources scale automatically with the simulation load.
Moving Beyond the Average Trap
Traditional AI simulations often suffer from a centering bias. When a user asks an LLM to simulate 100 citizens, the model tends to gravitate toward a narrow set of stereotypes or a homogenized middle class. ManyPerson shifts the paradigm from generating a sample to querying a population. Instead of asking the AI to imagine a person, the system selects a persona from the 41,000-member database based on the user's criteria and forces the LLM to adopt that specific, data-backed identity.
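The difference between generating a sample and querying a population can be shown with a small Python sketch. The persona fields, the in-memory `POPULATION` list, and both function names are assumptions made for illustration; the real service queries its database and its prompt wording is not documented here.

```python
# Hypothetical persona records standing in for rows from the persona database.
POPULATION = [
    {"id": 1, "age_band": "20s", "metro_area": True,
     "occupation": "office worker", "income_quintile": 3},
    {"id": 2, "age_band": "60s", "metro_area": False,
     "occupation": "retired", "income_quintile": 1},
    {"id": 3, "age_band": "30s", "metro_area": True,
     "occupation": "office worker", "income_quintile": 4},
]


def select_personas(population, predicate):
    """Return the stored personas that satisfy the user's criteria."""
    return [p for p in population if predicate(p)]


def build_system_prompt(persona):
    """Turn a stored persona into an identity the LLM is instructed to adopt."""
    region = "Seoul metropolitan area" if persona["metro_area"] else "non-metropolitan area"
    return (
        "Adopt the following identity and answer in the first person.\n"
        f"Age band: {persona['age_band']}\n"
        f"Occupation: {persona['occupation']}\n"
        f"Region: {region}\n"
        f"Household income quintile: {persona['income_quintile']}\n"
        "Do not contradict any of these facts."
    )


targets = select_personas(
    POPULATION,
    lambda p: p["metro_area"]
    and p["age_band"] in {"20s", "30s"}
    and p["occupation"] == "office worker",
)
prompts = [build_system_prompt(p) for p in targets]
print(prompts[0])
```

The model never chooses who it is simulating. The selection happens in the data layer, and the prompt only carries the chosen identity.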
This approach differs significantly from projects like Nemotron-Personas-Korea, which focused primarily on providing synthetic persona datasets. ManyPerson is a full-stack web service that allows users to interact with these personas in real time using natural language. A key innovation here is the use of heuristics to distribute household income among members. Rather than assigning a flat average, the system allocates wealth based on the relationship between the head of the household, the spouse, and the children. This creates realistic social contexts, such as a university student supported by wealthy parents or a stay-at-home spouse married to a high-earning professional.
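The article does not publish the actual allocation rules, so the following Python sketch only illustrates the general idea of a relationship-based split. The role labels and the 6:3:1 weights are invented for the example and should not be read as ManyPerson's heuristics.

```python
# Hypothetical relative weights per household role; not the project's real values.
ROLE_WEIGHTS = {"head": 6, "spouse": 3, "child": 1}


def allocate_income(household_income_won, roles):
    """Split household income across members in proportion to their role weights."""
    if not roles:
        return []
    weights = [ROLE_WEIGHTS.get(role, 0) for role in roles]
    total = sum(weights)
    if total == 0:
        # No known roles: fall back to an even split.
        return [household_income_won / len(roles)] * len(roles)
    return [household_income_won * w / total for w in weights]


shares = allocate_income(100_000_000, ["head", "spouse", "child"])
for role, share in zip(["head", "spouse", "child"], shares):
    print(f"{role}: {share:,.0f} won")
```

Even a crude split like this separates a household's earners from its dependents, which a flat per-person average cannot do.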
For developers, this can shorten survey design and product validation. A team can gauge how a specific demographic, such as office workers in their 20s and 30s living in the Seoul metropolitan area, might react to a new service idea. The workflow begins with a natural language query, which is refined through Gemini and search grounding to improve accuracy. The system then filters the target personas and generates responses. These results are not just text; they are quantified into positive, neutral, or negative sentiments and multiple-choice distributions. Most importantly, the final output applies the official weights from Statistics Korea, providing a weighted result and cross-analysis across various demographic axes.
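The weighting step matters because each survey record stands for a different number of real households. As a rough sketch under assumed field names (`sentiment`, `weight`, `age_band`) and made-up numbers, a weighted sentiment distribution and a simple cross-tabulation could be computed like this in Python:

```python
from collections import defaultdict


def weighted_distribution(responses, key="sentiment"):
    """Share of each answer, with every response counted by its survey weight."""
    totals = defaultdict(float)
    for r in responses:
        totals[r[key]] += r["weight"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {answer: w / grand_total for answer, w in totals.items()}


def cross_tab(responses, axis):
    """Weighted sentiment distribution within each group along one demographic axis."""
    groups = defaultdict(list)
    for r in responses:
        groups[r[axis]].append(r)
    return {group: weighted_distribution(rs) for group, rs in groups.items()}


# Hypothetical responses; weights are illustrative, not real survey weights.
responses = [
    {"sentiment": "positive", "weight": 300.0, "age_band": "20s"},
    {"sentiment": "negative", "weight": 100.0, "age_band": "20s"},
    {"sentiment": "neutral", "weight": 400.0, "age_band": "30s"},
]

print(weighted_distribution(responses))
print(cross_tab(responses, "age_band"))
```

In this toy data an unweighted count would report one third positive, while the weighted figure is 37.5 percent, which shows how far the two can diverge.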
With tools like ManyPerson, AI simulation is moving from the realm of plausible imagination into the domain of data-driven virtual sampling.



