Data in the real world is rarely polite. In a perfect laboratory setting, datasets are shuffled and distributed evenly, but in production, data is fragmented, biased, and stubbornly heterogeneous. A medical imaging model trained across three different hospitals will encounter different patient demographics, varying equipment calibrations, and skewed disease prevalence at every site. This statistical heterogeneity, known as non-IID data (data that is not independent and identically distributed), is the primary wall that federated learning developers hit when moving from a research paper to a deployed system. The challenge is no longer just about privacy or communication overhead, but about whether a global model can actually converge when its contributors are seeing entirely different versions of reality.

The Architecture of NVIDIA FLARE and Non-IID Simulation

NVIDIA FLARE (NVFlare), the Federated Learning Application Runtime Environment, provides the scaffolding for studying this instability by decoupling the orchestration of the learning process from the local execution. The framework draws a strict division of labor between the Job API and the Client API. The Job API is where the developer describes the federated job in Python: which workflow the server runs, how model weights are aggregated, and which training script each client executes. At run time, the server-side workflow acts as the central conductor, managing the lifecycle of the training task and aggregating the weights. Meanwhile, the Client API is called from inside the training script, where it receives the global model, lets the script train on the site's private data, and sends the update back to the server. This separation is critical for scalability, as it allows developers to manage communication and control logic at the API level without needing to rewrite the underlying training scripts for every new client site.

To test how this architecture handles data imbalance, the environment utilizes the CIFAR-10 dataset, which consists of images across ten distinct classes. Rather than a random split, the experiment employs a Dirichlet distribution to partition the data. This method allows researchers to simulate varying degrees of label imbalance by adjusting the alpha value. In this setup, a lower alpha value increases the heterogeneity of the data, meaning some clients might receive a vast majority of images from only one or two classes while others receive none. This creates a high-stress environment for the global model, as the local gradients produced by each client point in wildly different directions.
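The Dirichlet split described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the partitioning code used in the experiment: the function names `dirichlet_partition` and `label_histogram`, the three-client setup, and the synthetic label array standing in for CIFAR-10 are all hypothetical. For each class, a proportion vector is drawn from a symmetric Dirichlet with concentration alpha, and that class's samples are divided among the clients accordingly.

```python
import numpy as np


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet-skewed label mixes.

    For each class, a proportion vector is drawn from Dir(alpha, ..., alpha)
    and that class's samples are divided among the clients accordingly.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]

    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        # Share of this class that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert shares into cut points along the shuffled index array.
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client_id, shard in enumerate(np.split(cls_idx, cuts)):
            client_indices[client_id].extend(shard.tolist())

    return client_indices


def label_histogram(labels, indices, num_classes):
    """Count how many samples of each class a client holds."""
    picked = np.asarray(labels)[np.asarray(indices, dtype=int)]
    return np.bincount(picked, minlength=num_classes)


# Stand-in labels: 10 classes x 500 samples (an assumption, not real CIFAR-10).
labels = np.repeat(np.arange(10), 500)
skewed = dirichlet_partition(labels, num_clients=3, alpha=0.1)
mild = dirichlet_partition(labels, num_clients=3, alpha=100.0)

for name, parts in (("alpha=0.1", skewed), ("alpha=100", mild)):
    print(name)
    for client_id, idx in enumerate(parts):
        print(f"  client {client_id}: {label_histogram(labels, idx, 10)}")
```

Printing the per-client histograms for a small and a large alpha side by side is a quick sanity check that the partition is as skewed as intended before any training starts.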

The technical workflow involves a Convolutional Neural Network (CNN) deployed across multiple simulated clients. Each client receives the current global weights from the server through the Client API, performs local training on its own data shard, and returns the updated weights. The server then aggregates these updates to form a new global model. Evaluating each new global model on a shared test set lets the system track accuracy round by round, providing a clear metric of how quickly the model converges despite the skewed data distribution.
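The aggregation step in this loop can be illustrated with a framework-agnostic sketch. It assumes the common FedAvg convention of weighting each client by its number of training samples; the function name `fedavg_aggregate` and the parameter names in the example are hypothetical, and this is not NVFlare's own aggregator.

```python
import numpy as np


def fedavg_aggregate(client_updates):
    """Sample-weighted average of client weights.

    client_updates: list of (weights, num_samples) pairs, where weights is a
    dict mapping parameter names to NumPy arrays of matching shapes.
    """
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("at least one client must report samples")
    names = client_updates[0][0].keys()
    return {
        name: sum(w[name] * (n / total) for w, n in client_updates)
        for name in names
    }


# Two hypothetical clients; the second holds three times as much data.
updates = [
    ({"conv.weight": np.array([1.0, 1.0]), "fc.bias": np.array([0.0])}, 100),
    ({"conv.weight": np.array([3.0, 5.0]), "fc.bias": np.array([4.0])}, 300),
]
new_global = fedavg_aggregate(updates)
print(new_global)
```

Because the average is weighted by sample count, a large client with a skewed shard moves the global model more than a small one, which is one reason label imbalance matters at this step.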

The Collision of FedAvg and FedProx

When FedAvg and FedProx face the same non-IID data, the results reveal a fundamental weakness in the standard approach to federated learning. Federated Averaging, or FedAvg, is the baseline for most systems. It operates on a simple premise: let every client train locally and then average their weights. While FedAvg is computationally efficient and performs well in IID environments where every client sees a similar label distribution, it degrades sharply under label imbalance. This failure manifests as client drift, where local models optimize so aggressively toward their own skewed data that they move away from the global optimum. In a non-IID CIFAR-10 environment, a client with only "truck" images will pull the global model's weights toward a truck-specific local minimum, causing the global accuracy to fluctuate violently or plateau prematurely.

FedProx introduces a mathematical correction to stop this drift. It modifies the local loss function by adding a proximal term, governed by a hyperparameter denoted as $\mu$ (mu). This term acts as a penalty for local updates that stray too far from the current global model. Essentially, $\mu$ serves as a leash, ensuring that while the client learns from its local data, it remains anchored to the global consensus. If a local model begins to overfit to its specific skewed distribution, the proximal term increases the loss, forcing the model to find a solution that satisfies both the local data and the global state.
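In the notation of the original FedProx formulation, client $k$ minimizes $h_k(w; w^t) = F_k(w) + \frac{\mu}{2}\lVert w - w^t \rVert^2$ instead of its plain local loss $F_k(w)$, where $w^t$ is the global model received at round $t$. The sketch below applies that objective with ordinary gradient descent on a toy quadratic loss. The function `fedprox_local_update`, the learning rate, the step count, and the toy client are illustrative assumptions, not code from the experiment.

```python
import numpy as np


def fedprox_local_update(global_w, grad_fn, mu, lr=0.1, steps=50):
    """Run local gradient descent on F_k(w) + (mu / 2) * ||w - global_w||^2.

    grad_fn(w) returns the gradient of the client's own loss F_k at w.
    With mu = 0 this reduces to plain FedAvg-style local training.
    """
    w = global_w.copy()
    for _ in range(steps):
        # The gradient of the proximal term is mu * (w - global_w).
        g = grad_fn(w) + mu * (w - global_w)
        w = w - lr * g
    return w


# Hypothetical skewed client whose local optimum sits at [4, -2].
local_opt = np.array([4.0, -2.0])


def grad_fn(w):
    """Gradient of the toy local loss 0.5 * ||w - local_opt||^2."""
    return w - local_opt


global_w = np.zeros(2)
w_free = fedprox_local_update(global_w, grad_fn, mu=0.0)
w_prox = fedprox_local_update(global_w, grad_fn, mu=1.0)
print("distance from global model, mu=0:", np.linalg.norm(w_free - global_w))
print("distance from global model, mu=1:", np.linalg.norm(w_prox - global_w))
```

On this toy loss the unpenalized run travels almost all the way to the client's own optimum, while the run with $\mu = 1$ settles roughly halfway between the global model and that optimum, which is the leash effect in miniature.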

The difference in performance becomes obvious when visualizing the accuracy curves over successive communication rounds. FedAvg typically shows a jagged trajectory, with accuracy spiking and dipping as the server struggles to reconcile conflicting updates from divergent clients. FedProx, conversely, produces a smoother, more stable ascent. By suppressing the volatility of local updates, FedProx lets the global model converge more reliably and, in this setup, reach a higher final accuracy. For practitioners, the value of FedProx is not just in the final percentage point of accuracy, but in the predictability of the training process. The ability to tune $\mu$ allows developers to balance the autonomy of local learning with the necessity of global consistency: setting $\mu = 0$ recovers plain FedAvg-style local training, while a very large $\mu$ barely lets clients move away from the global model.
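The trade-off in tuning $\mu$ can be explored on a toy problem before spending GPU hours on CIFAR-10. The sketch below is a deliberately simplified stand-in, not a reproduction of the accuracy curves discussed here: three hypothetical clients each hold a one-dimensional quadratic loss with its own curvature and optimum, every round runs a fixed number of local gradient steps, and the server takes an equal-weight average. All constants and names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical heterogeneous clients: F_k(w) = 0.5 * a_k * (w - c_k)^2.
CURVATURES = np.array([0.5, 1.0, 3.0])     # a_k
LOCAL_OPTIMA = np.array([-4.0, 0.0, 6.0])  # c_k
# Minimizer of the average objective, used as ground truth.
GLOBAL_OPT = np.sum(CURVATURES * LOCAL_OPTIMA) / np.sum(CURVATURES)


def run_rounds(mu, rounds, local_steps=20, lr=0.1):
    """Return |w - GLOBAL_OPT| after each round of averaged local training."""
    w_global = 0.0
    errors = []
    for _ in range(rounds):
        # Every client starts from the same global weight (vectorized).
        w_local = np.full_like(LOCAL_OPTIMA, w_global)
        for _ in range(local_steps):
            grad = CURVATURES * (w_local - LOCAL_OPTIMA)
            grad += mu * (w_local - w_global)  # proximal pull; zero when mu = 0
            w_local = w_local - lr * grad
        w_global = float(np.mean(w_local))  # equal-weight average
        errors.append(abs(w_global - GLOBAL_OPT))
    return errors


for mu in (0.0, 1.0, 10.0):
    errs = run_rounds(mu, rounds=100)
    print(f"mu={mu:>4}: error after 3 rounds={errs[2]:.3f}, "
          f"after 100 rounds={errs[-1]:.3f}")
```

In this toy setting, a larger $\mu$ leaves the averaged model closer to the true optimum once training has settled, but it also shrinks the progress made in each round, so the largest value is still behind after only a few rounds. Real models will not behave this cleanly, so a sweep like this is a starting point for choosing candidate values rather than a substitute for validation on the actual task.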

Streamlining the Federated Pipeline

Beyond the algorithmic battle, the operational shift provided by NVIDIA FLARE is what makes these experiments viable in a production context. Historically, building a federated learning environment required manual configuration of network protocols, socket management, and complex synchronization logic. NVFlare abstracts this complexity into the Job API, turning infrastructure management into a configuration task. This allows the developer to stop worrying about how the weights are moving across the wire and start focusing on why the model is diverging.

One of the most practical features of the NVFlare workspace is the automated handling of global model checkpoints. Because a federated run can span hundreds of communication rounds, the chance that something interrupts it is high. The ability to automatically save and reuse checkpoints means that if a training session is interrupted or if a specific hyperparameter setting for $\mu$ leads to divergence, the researcher can roll back to a previous state without restarting the entire process. This drastically shortens the iteration cycle, enabling faster hypothesis testing and more aggressive tuning of the proximal term.

By providing a standardized Client API, NVFlare also reduces the "environment drift" that often plagues distributed systems. Whether a client is running on a high-end GPU server or a constrained edge device, the interface for receiving weights and returning updates remains the same. This standardization makes it more likely that the performance differences observed between FedAvg and FedProx are a result of the algorithms themselves, not artifacts of inconsistent client-side integration code.

The transition from FedAvg to FedProx within the NVFlare framework represents a broader shift in AI development. The industry is moving away from the assumption of clean, centralized data and toward a reality where models must learn from a chaotic, fragmented edge. The ability to quantify and mitigate client drift through proximal terms and robust orchestration is no longer an academic exercise; it is the prerequisite for deploying intelligence in the real world.