CookLLM Docs

Attention Sink

Why the first token absorbs most of the attention, along with the mechanism behind this phenomenon, its cost, and the paths to eliminating it

Table of Contents

What Remains After GQA
The Phenomenon: What the First Token Absorbs
A Direct Observation
Unpacking One More Layer: Does Sink Come from Magnitude or Angle?
Unpacking More Thoroughly: Where Does Massive Activation Land?
Why Sink Appears
The "Books Must Balance" Constraint of Softmax
The "Globally Visible" Privilege of the First Token
A Minimal Example
The Cost: What Sink Brings to the System
The KV Cache Cannot Drop the First Token
Long-Context Decay
Quantization Precision Loss
Is Sink a Bug or a Feature
Paths to Eliminate Sink
Toward Gated Attention
Summary