
From Self-Attention to GQA


Starting from self-attention, this article unpacks in turn the design trade-offs of multi-head attention (MHA), causal masking, and GQA / MQA.



Table of Contents

What Is the Attention Mechanism
  The Standard Self-Attention Computation
  PyTorch Reference Implementation

Multi-Head Attention (MHA)
  Why Multiple Heads?
  The Structure of MHA
  PyTorch Implementation
  Advantages and Challenges of MHA

Causal Attention
  What Is Causal Attention?
  Mathematical Representation
  Why Do We Need Causal Masking?
  PyTorch Implementation
  The Performance Opportunity in Causal Masking
  Application Scenarios

Grouped Query Attention (GQA)
  The Evolution from MHA to GQA
  The Memory Problem of MHA
  Multi-Query Attention (MQA)
  Grouped Query Attention (GQA)
  The Mathematics of GQA
  PyTorch Implementation
  Comparison of the Three Mechanisms