Distributed Training

From data parallelism to multi-dimensional hybrid parallelism — understanding the core parallel strategies of large model training

Overview

When a model is too large to fit on a single GPU, we need to distribute computation, memory, and communication across multiple GPUs. This series starts from the simplest data parallelism, then progressively introduces ZeRO, FSDP, tensor parallelism, and pipeline parallelism, and ultimately explains how they combine into multi-dimensional hybrid parallelism.

This series assumes you are already familiar with the basic PyTorch training workflow. Readers interested in GPU hardware and the memory hierarchy are encouraged to first study GPU Programming Fundamentals.

Chapter Contents

The series has six chapters: Data Parallelism, ZeRO Optimizer, Fully Sharded Data Parallel, Tensor Parallelism, Pipeline Parallelism, and Multi-Dimensional Hybrid Parallelism.

Communication-volume convention: throughout this series, communication volume is the amount of data each rank sends plus receives per training step, expressed in units of the parameter count $\Phi$. The default analysis assumes that the full parameters gathered during the forward pass are cached and reused in the backward pass, giving $2\Phi$. If the implementation has to re-gather the parameters during the backward pass (for example, when they are not cached), the communication volume rises to $3\Phi$. Each chapter notes which scenario it analyzes.
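To make the convention concrete, the sketch below converts the $2\Phi$ and $3\Phi$ figures into bytes. The function name `comm_volume_bytes`, the 7B parameter count, and the 2-byte (fp16/bf16) element size are illustrative assumptions rather than part of the series; the result is only the idealized per-rank volume implied by the convention, not a measurement of any particular implementation.

```python
def comm_volume_bytes(num_params: int,
                      regather_in_backward: bool = False,
                      bytes_per_element: int = 2) -> int:
    """Idealized per-rank communication volume for one training step.

    Follows the series convention: 2 * Phi when the parameters gathered in
    the forward pass are cached for the backward pass, 3 * Phi when they
    must be gathered again. bytes_per_element=2 assumes fp16/bf16 values
    (an assumption made for this example).
    """
    multiplier = 3 if regather_in_backward else 2
    return multiplier * num_params * bytes_per_element


if __name__ == "__main__":
    phi = 7_000_000_000  # hypothetical 7B-parameter model
    for regather in (False, True):
        gib = comm_volume_bytes(phi, regather_in_backward=regather) / 2**30
        label = "3*Phi (re-gather in backward)" if regather else "2*Phi (cached parameters)"
        print(f"{label}: {gib:.1f} GiB per rank per step")
```

Keeping the analysis in units of $\Phi$ means a comparison between strategies does not depend on the model size or the numeric precision; the byte conversion is only needed when estimating time on a specific interconnect.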

Why Learn These?

| What you want to do | Knowledge required |
| --- | --- |
| Train large models with 7B+ parameters | DDP, ZeRO, FSDP |
| Understand how PyTorch FSDP works | The sharding differences between ZeRO-3 and FSDP |
| A single layer's parameters are too large for one GPU | Tensor parallelism (Column/Row Parallel) |
| Reduce pipeline bubbles to improve GPU utilization | GPipe, 1F1B scheduling strategies |
| Understand Megatron-LM's parallel strategy | Multi-dimensional hybrid parallelism, ParallelContext |
