Fully Sharded Data Parallel

Understanding FSDP's Intra-Tensor sharding and All-Gather/Reduce-Scatter communication patterns

Table of Contents

Two Sharding Methods
Parameter Sharding
Forward Pass: All-Gather
Backward Pass: Reduce-Scatter
Communication-Volume Comparison
When to Use ZeRO-3, When to Use FSDP
Summary