CookLLM Docs

Distributed Training

From data parallelism to multi-dimensional hybrid parallelism: understanding the core parallel strategies of large-model training

Overview

When a model is too large to fit on a single GPU, we need to distribute computation, memory, and communication across multiple GPUs. This series starts from the simplest data parallelism, then progressively introduces ZeRO, FSDP, tensor parallelism, and pipeline parallelism, and ultimately explains how they combine into multi-dimensional hybrid parallelism.

This series assumes you are already familiar with the basic PyTorch training workflow. Readers interested in GPU hardware and the memory hierarchy are encouraged to study GPU Programming Basics first.
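
To make "too large to fit on a single GPU" concrete, the sketch below estimates per-GPU memory for model states only (parameters, gradients, optimizer states). It assumes mixed-precision Adam with the 2 + 2 + 12 bytes-per-parameter breakdown used in the ZeRO paper, and it ignores activations and temporary buffers. The function name `model_state_bytes` and its arguments are illustrative and do not come from any library.

```python
def model_state_bytes(num_params: int, dp_degree: int = 1, zero_stage: int = 0) -> float:
    """Approximate per-GPU bytes for model states under mixed-precision Adam.

    Assumed breakdown per parameter (following the ZeRO paper):
      2 bytes fp16 parameters, 2 bytes fp16 gradients,
      12 bytes fp32 optimizer states (master weights, momentum, variance).
    Activations and communication buffers are NOT included.
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if zero_stage >= 1:   # stage 1 shards optimizer states across data-parallel ranks
        optim /= dp_degree
    if zero_stage >= 2:   # stage 2 additionally shards gradients
        grads /= dp_degree
    if zero_stage >= 3:   # stage 3 additionally shards parameters
        params /= dp_degree
    return num_params * (params + grads + optim)


if __name__ == "__main__":
    seven_b = 7_000_000_000
    # No sharding: every GPU holds a full replica of all model states.
    print(f"{model_state_bytes(seven_b) / 1e9:.0f} GB")
    # Fully sharded across 8 data-parallel ranks.
    print(f"{model_state_bytes(seven_b, dp_degree=8, zero_stage=3) / 1e9:.0f} GB")
```

Under these assumptions a 7B-parameter model needs about 112 GB for model states alone without sharding, which is why the later chapters focus on removing redundancy across ranks.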

Chapter Contents

Communication-volume convention: throughout this series, we consistently measure communication volume as the amount of data each rank sends + receives per training step, expressed in units of the parameter count $\Phi$. The default analysis assumes that the full parameters cached during the forward pass are reused in the backward pass ($2\Phi$). If the implementation needs to re-aggregate parameters during the backward pass (e.g., when not caching), the communication volume increases to $3\Phi$. Each chapter annotates its specific scenario.
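
As a rough illustration of this convention, the sketch below tallies the per-step volume for a setup where parameters are sharded across ranks. It assumes each all-gather or reduce-scatter over the full parameter set costs about one unit of Φ per rank (approximating the (N-1)/N factor as 1, as the ZeRO paper does). The function and constant names are hypothetical helpers for this page, not a library API.

```python
# Assumption: one collective over all parameters costs ~1 unit of Phi per rank.
COST_PER_COLLECTIVE = 1


def comm_volume_in_phi(cache_params_for_backward: bool) -> int:
    """Per-rank, per-step communication volume in units of Phi (sharded parameters)."""
    # Forward pass: all-gather the full parameters from their shards.
    forward_all_gather = COST_PER_COLLECTIVE
    # Backward pass: gather again only if the forward copy was not kept.
    backward_all_gather = 0 if cache_params_for_backward else COST_PER_COLLECTIVE
    # Gradients: reduce-scatter so each rank ends up with its own gradient shard.
    grad_reduce_scatter = COST_PER_COLLECTIVE
    return forward_all_gather + backward_all_gather + grad_reduce_scatter


def comm_bytes(volume_in_phi: int, num_params: int, bytes_per_element: int = 2) -> int:
    """Convert a volume in units of Phi to bytes (fp16/bf16 elements by default)."""
    return volume_in_phi * num_params * bytes_per_element


if __name__ == "__main__":
    print(comm_volume_in_phi(cache_params_for_backward=True))   # cached case
    print(comm_volume_in_phi(cache_params_for_backward=False))  # re-gather case
```

Expressing the result in units of Φ keeps the comparison between strategies independent of model size and dtype; `comm_bytes` converts to bytes only when a concrete estimate is needed.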

Data Parallelism

Understanding communication primitives and DDP's gradient synchronization mechanism

ZeRO Optimizer

Progressive de-redundancy: three-stage sharding from optimizer states to parameters

Fully Sharded Data Parallel

Understanding FSDP's Intra-Tensor sharding and All-Gather/Reduce-Scatter communication patterns

Tensor Parallelism

The symmetric design of Column Parallel and Row Parallel

Pipeline Parallelism

The principles and bubble analysis of GPipe and 1F1B scheduling strategies

Multi-Dimensional Hybrid Parallelism

The ParallelContext coordinate system and industrial-grade combination of TP+DP+PP

Why Learn These?

What you want to do | Knowledge required
Train large models with 7B+ parameters | DDP, ZeRO, FSDP
Understand how PyTorch FSDP works | The sharding differences between ZeRO-3 and FSDP
A single layer's parameters are too large for one GPU | Tensor parallelism (Column/Row Parallel)
Reduce pipeline bubbles to improve GPU utilization | GPipe, 1F1B scheduling strategies
Understand Megatron-LM's parallel strategy | Multi-dimensional hybrid parallelism, ParallelContext

References

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  • GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
  • PyTorch FSDP Documentation
