Authors: Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou†, Can Wang

Openreview: https://openreview.net/forum?id=zrFnwRHuQo

Code: https://github.com/Raiden-Zhu/ICLR-2026-Grokking-in-Decentralized-Learning

We release the source code to reproduce the experiments in this paper. The repo also includes a lightweight simulator for gossip-based decentralized learning that runs on limited compute.
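As a rough illustration of what one round of gossip-based learning does (a minimal numpy sketch of our own, not code from the repo): each agent replaces its parameters with a weighted average of its neighbors' parameters, defined by a doubly stochastic mixing matrix. Repeating this drives all agents toward the global average.

```python
import numpy as np

def gossip_step(params, W):
    """One synchronous gossip round: each agent's new parameters are a
    W-weighted average of its neighbors' parameters.
    params: (n_agents, dim) array; W: (n_agents, n_agents) doubly
    stochastic mixing matrix."""
    return W @ params

def ring_mixing_matrix(n, self_weight=0.5):
    """Doubly stochastic mixing matrix for a ring topology: each agent
    keeps `self_weight` of its own parameters and splits the remainder
    between its two ring neighbors."""
    W = np.zeros((n, n))
    off = (1.0 - self_weight) / 2.0
    for i in range(n):
        W[i, i] = self_weight
        W[i, (i - 1) % n] = off
        W[i, (i + 1) % n] = off
    return W

# Repeated gossip rounds shrink parameter disagreement geometrically:
# every row converges to the mean of the initial rows.
params = np.random.default_rng(0).normal(size=(8, 4))
W = ring_mixing_matrix(8)
mixed = params.copy()
for _ in range(200):
    mixed = gossip_step(mixed, W)
```

The contraction rate is governed by the second-largest eigenvalue of `W`, which is why sparser topologies mix more slowly.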

TL;DR

Frequent communication is widely assumed to be necessary for distributed learning. Our research challenges this view and reveals a counterintuitive finding: distributed agents can train in a nearly disconnected way, and a single global merging at the very end of training produces a "grokking-like" phase transition: a sudden, steep recovery of performance that matches fully synchronized training. To explain this phenomenon, we develop a new theoretical framework showing how "inconsistency" of model parameters, combined with the high-order geometry of the loss landscape, can actually accelerate distributed training.

Figure 1: A Doraemon-style comic strip illustrating the paper's core intuition.

Table of Contents

Figure 2: Research roadmap of this paper.

1. Motivation: Optimizing The Marginal Utility of Communication

In distributed training, bandwidth is a scarce resource. This scarcity is intensified by the demands of modern AI: training foundation models with billions of parameters turns communication into a heavy burden, since massive model states must be transmitted constantly. Imagine you have limited bandwidth and multiple training jobs running concurrently in a cluster. Which task deserves the bandwidth?

The conventional approach is "fairness": everyone synchronizes all the time. But is this economically optimal? Given the sheer cost of synchronizing such large models, treating communication as a "constant necessity" is a luxury we may not be able to afford.

While existing research has exhaustively optimized who we communicate with (Spatial Allocation via communication topology design), it often overlooks the dimension of time (Temporal Allocation). If bandwidth is a limited budget, we must determine the most effective moment to spend it. This leads us to our core question:

Research Question: How should communication be scheduled over time?

2. A Counterintuitive Phenomenon: The Surprising Effectiveness of a Single Global Merging

We usually expect that if decentralized agents do not communicate frequently, their models will drift hopelessly apart, especially when their data is highly heterogeneous. Our experiments challenge this assumption with a radical test: what if we allow agents to train in near-silence and only synchronize once?
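To make the protocol concrete, here is a toy sketch (our own illustration, not the paper's actual experimental setup): agents run SGD on heterogeneous local data shards with zero communication, and their parameters are averaged exactly once at the very end.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy task: a shared linear regression whose data is split
# heterogeneously across agents (each shard's inputs have a shifted mean).
d, n_agents, n_local = 5, 4, 200
w_true = rng.normal(size=d)
shards = []
for i in range(n_agents):
    X = rng.normal(loc=i, size=(n_local, d))  # mean shift = heterogeneity
    y = X @ w_true + 0.1 * rng.normal(size=n_local)
    shards.append((X, y))

def local_sgd(X, y, steps=500, lr=0.01):
    """Plain SGD on one agent's shard, with no communication at all."""
    w = np.zeros(d)
    for _ in range(steps):
        j = rng.integers(len(y))
        grad = (X[j] @ w - y[j]) * X[j]  # gradient of 0.5*(x.w - y)^2
        w -= lr * grad
    return w

# Each agent trains in complete isolation...
local_models = [local_sgd(X, y) for X, y in shards]
# ...and we perform a single global merging at the very end.
merged = np.mean(local_models, axis=0)
```

In this convex toy the averaged model lands close to the shared optimum despite zero communication during training; the paper's point is that an analogous single late merge can recover synchronized-training performance even in the non-convex deep learning setting.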