Marius Husnes, the IT platform lead at the National Library of Norway, stood before an audience at the Huawei ID Forum 2026 and made a claim that stopped the room: no private company possesses the data assets the library has secured. This was not a boast about the size of its digital archives, but a reference to a series of exclusive legal agreements with newspaper publishers. These contracts allow the library to feed copyrighted journalistic data into a large language model (LLM), a privilege that commercial AI giants cannot simply buy or scrape from the open web. The moment marks a shift in the AI arms race, where the battle is moving away from raw compute and toward the legal and cultural sovereignty of data.
The Architecture of Cultural Sovereignty
The National Library of Norway is currently spearheading the development of a sovereign LLM specifically designed for the Norwegian language. This initiative stems from a growing realization that commercial AI providers have largely neglected the nuances of smaller linguistic groups, leaving a void in models that truly reflect local history, news, and cultural identity. To fill this gap, the Norwegian Ministry of Culture tasked the National Library with the project, leveraging the institution's position as the custodian of the nation's largest digital collection, including books, newspapers, and web pages.
The strategy relies on a massive data moat. Since 2005, the library has pursued an aggressive digitization campaign, resulting in 20 PB of unique, high-quality data. Once backups and redundant copies are included, managed under a 3-2-1 storage strategy that keeps three copies of the data, the total footprint reaches 60 PB. The critical advantage is the legal framework: by securing the rights to use copyrighted content for training, Norway has created a proprietary dataset that gives it a competitive edge over any private entity attempting to build a Norwegian-centric model. The project serves as a blueprint for other non-English-speaking nations struggling to project their own history and values into the AI era, shifting the role of AI from a mere tool-building exercise to the custodianship of national heritage.
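The jump from 20 PB to 60 PB follows from the "three copies" part of the 3-2-1 rule. A quick check of that arithmetic, assuming each copy is a full-size, uncompressed replica of the unique data (an assumption; the article gives only the two totals), might look like this. The function name is illustrative.

```python
# Figures from the article.
UNIQUE_DATA_PB = 20
# The "3" in 3-2-1: three copies of the data in total.
COPIES_IN_3_2_1 = 3


def total_footprint_pb(unique_pb: int, copies: int = COPIES_IN_3_2_1) -> int:
    """Total stored volume if every copy is a full-size replica (assumption)."""
    return unique_pb * copies


print(total_footprint_pb(UNIQUE_DATA_PB))  # 60
```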
Bridging the Gap Between Archive and Pipeline
The primary technical hurdle was not the volume of data, but the physics of moving it. The library faced a fundamental conflict between two different storage requirements: the needs of a preservation archive and the needs of an AI training pipeline. The existing 60 PB preservation system is optimized for durability and cost, meaning it is designed for long-term safety rather than speed. In such systems, read latency is high, making them unsuitable for the high-throughput, low-latency parallel I/O required by modern AI workloads. Attempting to train a model directly from a cold archive would create a massive bottleneck, leaving expensive GPUs idling while waiting for data to arrive.
To resolve this, the library implemented a tiered infrastructure. It deployed 2 PB of Huawei OceanStor Dorado all-flash storage to act as the high-speed engine for the data pipeline. This all-flash array handles the most demanding preprocessing stages, including data collection, cleaning, deduplication, normalization, and validation. By placing the OceanStor Dorado at the front of the pipeline, the library keeps data flowing quickly enough that compute is not left waiting on storage. This storage layer is integrated with a 384-core CPU cluster and NVIDIA DGX H200 systems, maximizing the efficiency of the preprocessing phase before the data ever reaches the training cluster.
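The preprocessing stages named above can be sketched in a few lines of Python. This is a minimal illustration only, not the library's actual pipeline: the function names, the exact-match SHA-256 deduplication, the NFC normalization, and the minimum-length validation rule are all assumptions chosen to make each stage concrete.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator


def clean(text: str) -> str:
    """Drop control characters (keeping newline and tab) and trim the ends."""
    kept = (ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    return "".join(kept).strip()


def normalize(text: str) -> str:
    """Apply Unicode NFC and collapse runs of spaces and tabs."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[ \t]+", " ", text)


def is_valid(text: str, min_chars: int = 20) -> bool:
    """Hypothetical validation rule: reject very short fragments."""
    return len(text) >= min_chars


def preprocess(docs: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Clean, deduplicate (exact match), normalize, then validate each document."""
    seen = set()
    for raw in docs:
        text = clean(raw)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier cleaned document
        seen.add(digest)
        text = normalize(text)
        if is_valid(text, min_chars):
            yield text


sample = [
    "Nasjonalbiblioteket  digitaliserer aviser.\x00",
    "Nasjonalbiblioteket  digitaliserer aviser.",
    "kort",
]
print(list(preprocess(sample)))
```

Because the stages are written as a generator, documents stream through one at a time rather than being held in memory, which is the general shape a storage-bound pipeline would want. A production system would likely use near-duplicate detection rather than exact hashes.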
The final stage of the process involves a hand-off to the Sigma2 Olivia system, Norway's national supercomputer. Based on the HPE Cray Supercomputing EX architecture, Olivia provides the raw horsepower for the actual LLM training, utilizing 448 GPUs and 64,512 CPU cores. The training environment uses a 5.3 PB Cray ClusterStor E1000 system to manage the refined datasets. In this dual-structure design, the Huawei all-flash storage acts as the refinery, cleaning and preparing the raw 60 PB archive into a high-performance stream that the HPE Cray system can ingest without latency issues.
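To make the proportions of the tiers concrete, the sketch below records the capacities quoted in the article and estimates how many flash-sized batches the 20 PB of unique data would need. It assumes the whole unique corpus is staged through the flash tier in sequential batches, which the article does not state; the `Tier` class, `batches_needed`, and the `usable_pct` parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    role: str
    capacity_tb: int  # decimal units: 1 PB = 1000 TB


# Capacities as quoted in the article.
ARCHIVE = Tier("preservation archive", "durable long-term storage", 60_000)
FLASH = Tier("OceanStor Dorado all-flash", "preprocessing", 2_000)
TRAINING = Tier("Cray ClusterStor E1000", "refined training datasets", 5_300)

UNIQUE_DATA_TB = 20_000  # 20 PB of unique data


def batches_needed(data_tb: int, tier: Tier, usable_pct: int = 100) -> int:
    """Batches required to pass data_tb through a tier, rounding up.

    usable_pct is a hypothetical knob for reserving part of the tier
    for scratch space or intermediate output.
    """
    usable_tb = tier.capacity_tb * usable_pct // 100
    if usable_tb <= 0:
        raise ValueError("tier has no usable capacity")
    return -(-data_tb // usable_tb)  # ceiling division on integers


print(batches_needed(UNIQUE_DATA_TB, FLASH))                  # 10
print(batches_needed(UNIQUE_DATA_TB, FLASH, usable_pct=80))   # 13
```

Under these assumptions the flash tier holds a tenth of the unique corpus at a time, which is why it behaves as a refinery that data passes through rather than a second home for the archive.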
This infrastructure proves that sovereign AI is as much a hardware challenge as it is a linguistic one, requiring a precise orchestration of flash storage and supercomputing to turn a national archive into a living intelligence.




