CookLLM Docs

From Self-Attention to GQA

Starting from self-attention, this article unpacks in turn the design trade-offs of multi-head attention, causal masking, and GQA / MQA.

Table of Contents

What Is the Attention Mechanism
  The Standard Self-Attention Computation
  PyTorch Reference Implementation
Multi-Head Attention (MHA)
  Why Multiple Heads?
  The Structure of MHA
  PyTorch Implementation
  Advantages and Challenges of MHA
Causal Attention
  What Is Causal Attention?
  Mathematical Representation
  Why Do We Need Causal Masking?
  PyTorch Implementation
  The Performance Opportunity in Causal Masking
  Application Scenarios
Grouped Query Attention (GQA)
  The Evolution from MHA to GQA
  The Memory Problem of MHA
  Multi-Query Attention (MQA)
  Grouped Query Attention (GQA)
  The Mathematics of GQA
  PyTorch Implementation
  Comparison of the Three Mechanisms
Summary