Distributed Training
From data parallelism to multi-dimensional hybrid parallelism: understanding the core parallelization strategies for large-model training
Overview
When a model is too large to fit on a single GPU, its computation and memory have to be spread across multiple GPUs, and those GPUs have to communicate to stay in sync. This series starts with the simplest strategy, data parallelism, then introduces ZeRO, FSDP, tensor parallelism, and pipeline parallelism in turn, and finally explains how they combine into multi-dimensional hybrid parallelism.
This series assumes you are already familiar with the basic PyTorch training workflow. Readers interested in GPU hardware and the memory hierarchy are encouraged to first study GPU Programming Fundamentals.
Chapter Contents
Communication-volume convention: throughout this series, communication volume means the amount of data each rank sends plus receives per training step, expressed in units of the parameter count. The default analysis assumes that the full parameters gathered during the forward pass are cached and reused in the backward pass. If the implementation has to re-gather parameters during the backward pass (for example, when it does not cache them), the communication volume is higher. Each chapter states which scenario it analyzes.
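To make "in units of the parameter count" concrete, here is a minimal accounting sketch. It rests on several assumptions that the convention above does not state: the collectives use a ring algorithm, the scheme shards parameters fully (so the forward pass needs an all-gather and the backward pass a reduce-scatter), and only the elements each rank sends are counted (in a ring, a rank receives the same amount). The function names and the `CACHED` / `REGATHER` lists are hypothetical, and the results are textbook ring costs, not necessarily the figures each chapter reports.

```python
def ring_collective_cost(op: str, world_size: int) -> float:
    """Elements each rank sends, as a multiple of the parameter count.

    Assumes ring algorithms: all-gather and reduce-scatter each move
    (N - 1) / N of the tensor per rank; all-reduce is the two combined.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    one_pass = (world_size - 1) / world_size
    costs = {
        "all_gather": one_pass,
        "reduce_scatter": one_pass,
        "all_reduce": 2 * one_pass,
    }
    return costs[op]


def step_cost(ops: list[str], world_size: int) -> float:
    """Total per-rank cost of one training step, in parameter-count units."""
    return sum(ring_collective_cost(op, world_size) for op in ops)


# Hypothetical per-step collective sequences for a fully sharded scheme.
CACHED = ["all_gather", "reduce_scatter"]                  # forward gather reused in backward
REGATHER = ["all_gather", "all_gather", "reduce_scatter"]  # gathered again in backward

if __name__ == "__main__":
    for n in (2, 8, 64):
        print(n, step_cost(CACHED, n), step_cost(REGATHER, n))
```

Under these assumptions the re-gather scenario costs one extra all-gather per step, which is why caching matters for the analysis.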
Data Parallelism
Understanding communication primitives and DDP's gradient synchronization mechanism
ZeRO Optimizer
Progressive de-redundancy: three-stage sharding from optimizer states to parameters
Fully Sharded Data Parallel
Understanding FSDP's intra-tensor sharding and its All-Gather/Reduce-Scatter communication pattern
Tensor Parallelism
The symmetric design of Column Parallel and Row Parallel
Pipeline Parallelism
The principles and bubble analysis of GPipe and 1F1B scheduling strategies
Multi-Dimensional Hybrid Parallelism
The ParallelContext coordinate system and production-grade combinations of TP+DP+PP
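As a preview of the coordinate-system idea in the last chapter, the sketch below maps a global rank to a (PP, DP, TP) coordinate. It is only an illustration under stated assumptions: `ParallelCoords`, `rank_to_coords`, `coords_to_rank`, and `tp_group` are hypothetical names, not the series' actual `ParallelContext` API, and the layout (TP varies fastest, PP slowest) is one common choice that keeps tensor-parallel peers on adjacent ranks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelCoords:
    pp: int  # pipeline-parallel stage index
    dp: int  # data-parallel replica index
    tp: int  # tensor-parallel shard index


def rank_to_coords(rank: int, tp_size: int, dp_size: int, pp_size: int) -> ParallelCoords:
    """Map a global rank to coordinates, assuming TP is the fastest-varying axis."""
    world_size = tp_size * dp_size * pp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    tp = rank % tp_size
    dp = (rank // tp_size) % dp_size
    pp = rank // (tp_size * dp_size)
    return ParallelCoords(pp=pp, dp=dp, tp=tp)


def coords_to_rank(coords: ParallelCoords, tp_size: int, dp_size: int) -> int:
    """Inverse of rank_to_coords under the same layout assumption."""
    return (coords.pp * dp_size + coords.dp) * tp_size + coords.tp


def tp_group(rank: int, tp_size: int) -> list[int]:
    """Global ranks that share this rank's PP and DP coordinates."""
    start = rank - rank % tp_size
    return list(range(start, start + tp_size))


if __name__ == "__main__":
    # 8 GPUs arranged as TP=2, DP=2, PP=2.
    for r in range(8):
        print(r, rank_to_coords(r, tp_size=2, dp_size=2, pp_size=2), tp_group(r, 2))
```

With this layout, ranks that differ only in their TP index are contiguous, so a tensor-parallel group can be placed inside a single node where interconnect bandwidth is highest.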
Why Learn These?
| What you want to do | Knowledge required |
|---|---|
| Train large models with 7B+ parameters | DDP, ZeRO, FSDP |
| Understand how PyTorch FSDP works | The sharding differences between ZeRO-3 and FSDP |
| A single layer's parameters are too large for one GPU | Tensor parallelism (Column/Row Parallel) |
| Reduce pipeline bubbles to improve GPU utilization | GPipe, 1F1B scheduling strategies |
| Understand Megatron-LM's parallel strategy | Multi-dimensional hybrid parallelism, ParallelContext |
CookLLM Docs