CookLLM Docs
Attention Mechanisms

Trace attention from MHA, causal masking, and GQA through to Attention Sink and Gated Attention, and understand its design, flaws, and evolution.

Embedding and LM Head

Understand how token ids enter a continuous vector space, and how hidden states are projected back to vocabulary logits

From Self-Attention to GQA

Starting from Self-Attention, unpack the design trade-offs of Multi-Head, Causal Masking, and GQA / MQA in turn
