CookLLM Docs

ZeRO Optimizer

Progressive de-redundancy: three-stage sharding from optimizer states to parameters

Table of Contents

Redundancy Analysis of Training State
ZeRO Stage 1: Sharding Optimizer States
Parameter Assignment Strategy
Gradient Synchronization
Training Loop Comparison
ZeRO Stage 2: Sharding Gradients
ZeRO Stage 3: Sharding Parameters
Parameter Sharding
Communication Pattern
Communication Overhead Comparison
ZeRO-3's Sharding Method: Inter-Tensor
Summary