CookLLM Docs

ZeRO Optimizer

Progressive redundancy elimination: three stages of sharding, from optimizer states to parameters

Table of Contents

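Redundancy Analysis of Training State
ZeRO Stage 1: Sharding Optimizer States
Parameter Assignment Strategy
Gradient Synchronization
Training Loop Comparison
ZeRO Stage 2: Sharding Gradients
ZeRO Stage 3: Sharding Parameters
Parameter Sharding
Communication Patterns
Communication Overhead Comparison
ZeRO-3 Sharding Scheme: Inter-Tensor
Summary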