CookLLM Docs
Pipeline Parallelism

Premium

Principles of the GPipe and 1F1B scheduling strategies, with an analysis of pipeline bubbles

Content is cooking...

We're preparing high-quality content for you. Stay tuned!

Table of Contents

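Layer-wise Partitioning
Naive Pipeline: The Bubble Problem
GPipe: Micro-batch Parallelism
1F1B: Interleaving Forward and Backward
Communication Characteristics of PP
GPipe vs 1F1B
Summary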