Attention Sink

Why the first token absorbs most of the attention: the mechanism behind this phenomenon, its cost, and the paths to removing it

Table of Contents

What Remains After GQA

The Phenomenon: What the First Token Absorbs

A Direct Observation

Unpacking One More Layer: Does Sink Come from Magnitude or Angle?

Unpacking More Thoroughly: Where Does Massive Activation Land?

Why Sink Appears

The "Books Must Balance" Constraint of Softmax

The "Globally Visible" Privilege of the First Token

A Minimal Example

The Cost: What Sink Brings to the System

The KV Cache Cannot Drop the First Token

Long-Context Decay

Quantization Precision Loss

Is Sink a Bug or a Feature

Paths to Eliminate Sink

Toward Gated Attention