CookLLM Docs

Data Parallelism

Understand communication primitives and DDP's gradient synchronization mechanism.

Table of Contents

Memory Composition of Single-GPU Training
Memory Requirements of Mixed-Precision Training
Communication Primitives
Broadcast
All-Reduce
Reduce-Scatter
All-Gather
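DataParallel: The Simplest Multi-GPU Approach
How DDP Works
Ring All-Reduce: Efficient Gradient Synchronization
Gradient Synchronization Mechanism
Limitations of DDP
Summary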