Authors: Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou†, Can Wang

Openreview: https://openreview.net/forum?id=zrFnwRHuQo

We are finalizing the camera-ready version; in the meantime, this blog presents the major results.

TL;DR

We often expect that frequent communication is necessary for distributed learning. Our research challenges this view and reveals a counterintuitive truth: distributed agents can train in a near-disconnected way, and a single global merging at the very end of training triggers a "grokking-like" phase transition: a sudden, steep recovery of performance that matches fully synchronized training. To explain this phenomenon, we propose a new theoretical framework showing how "inconsistency" among model parameters, combined with the high-order geometry of the loss landscape, can actually accelerate distributed training.

Figure: A Doraemon-style comic strip illustrating the paper's core intuition.

Table of Contents

Figure: Research roadmap of this paper.

1. Motivation: Optimizing The Marginal Utility of Communication

In distributed training, bandwidth is a scarce resource. This scarcity is sharply intensified by the demands of modern AI: training foundation models with billions of parameters turns communication into a heavy burden, requiring the constant transmission of massive model states. Imagine you have limited bandwidth and multiple training jobs running concurrently in a cluster. Which job deserves the bandwidth?

The conventional approach is "fairness": everyone synchronizes all the time. But is this economically optimal? Given the sheer cost of synchronizing such large models, treating communication as a "constant necessity" is a luxury we may not be able to afford.

While existing research has exhaustively optimized who we communicate with (Spatial Allocation via communication topology design), it often overlooks the dimension of time (Temporal Allocation). If bandwidth is a limited budget, we must determine the most effective moment to spend it. This leads us to our core question:

Research Question: How should communication be scheduled over time?

2. A Counterintuitive Phenomenon: The Surprising Effectiveness of a Single Global Merging

We usually expect that if decentralized agents don't communicate frequently, their models will drift hopelessly apart, especially when their data is highly heterogeneous. Our experiment challenged this assumption with a radical test: What if we allow agents to train in near-silence and only synchronize once?
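The "synchronize once" protocol above boils down to simple parameter averaging at the end of training. As a minimal sketch (not the paper's implementation; the function name, parameter-dict layout, and toy values are hypothetical), one-shot merging of independently trained agents looks like this:

```python
import numpy as np

def one_shot_merge(agent_params):
    """Average each parameter tensor across agents (one-shot merging).

    agent_params: list of dicts mapping parameter name -> np.ndarray,
    one dict per agent. All agents share the same architecture, so the
    dicts have identical keys and shapes.
    """
    merged = {}
    for name in agent_params[0]:
        merged[name] = np.mean([p[name] for p in agent_params], axis=0)
    return merged

# Toy usage: 3 agents, each holding one 2x2 weight matrix filled with 0, 1, 2
agents = [{"w": np.full((2, 2), float(i))} for i in range(3)]
merged = one_shot_merge(agents)
print(merged["w"])  # every entry is 1.0, the mean of 0, 1, and 2
```

In the experiments, each agent trains locally for the full run; this averaging step is the only global communication.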

Figure: Global test accuracy (see Definition 1) of CLIP ViT-B/32 (a) and ResNet-18 (b) trained on Tiny ImageNet using FedAvg (blue), decentralized SGD (orange), and one-shot FedAvg (green), distributed across 32 agents with high data heterogeneity (Dirichlet α = 0.1).
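The "high data heterogeneity (Dirichlet α = 0.1)" setting in the caption refers to the standard Dirichlet label-skew partition used in federated learning benchmarks. A minimal sketch of such a partitioner (the function name and toy labels are ours, not from the paper) is:

```python
import numpy as np

def dirichlet_partition(labels, n_agents, alpha, seed=0):
    """Split sample indices across agents with Dirichlet(alpha) label skew.

    For each class, draw per-agent proportions from Dirichlet(alpha);
    smaller alpha concentrates a class on fewer agents, i.e. higher
    data heterogeneity (alpha = 0.1 is strongly non-IID).
    """
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    shards = [[] for _ in range(n_agents)]
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_agents))
        # cumulative proportions -> split points into the class's indices
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for agent, part in enumerate(np.split(idx, cuts)):
            shards[agent].extend(part.tolist())
    return shards

labels = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
shards = dirichlet_partition(labels, n_agents=4, alpha=0.1)
assert sum(len(s) for s in shards) == len(labels)  # every sample assigned once
```

Each agent then trains only on its own shard, which is what makes a single end-of-training merge so surprising.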