CookLLM Docs

Tensor Parallelism

Premium

The symmetric design of Column Parallel and Row Parallel

Content is cooking...

We're preparing high-quality content for you. Stay tuned!

Fully Sharded Data Parallel

Understanding FSDP's Intra-Tensor sharding and All-Gather/Reduce-Scatter communication patterns

Pipeline Parallelism

Principles and bubble analysis of the GPipe and 1F1B scheduling strategies

Table of Contents

Why Tensor Parallelism Is Needed
Column Parallel Linear
Row Parallel Linear
Combining Column and Row
TP vs DP
Summary