Attention Mechanisms

Deeply understand Attention in Transformers, including MHA, Causal Attention, GQA, and MQA.

Table of Contents

What Is Attention
Standard Self-Attention Flow
PyTorch Reference Implementation
Multi-Head Attention (MHA)
Why Multiple Heads?
MHA Structure
PyTorch Implementation
MHA Advantages and Challenges
Causal Attention
What Is Causal Attention?
Mathematical Form
Why Causal Masking?
PyTorch Implementation
Performance Opportunity in Causal Masking
Use Cases
Grouped Query Attention (GQA)
From MHA to GQA
MHA Memory Problem
Multi-Query Attention (MQA)
Grouped Query Attention (GQA)
GQA Math
PyTorch Implementation
Comparing the Three Mechanisms
Summary
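
The "Standard Self-Attention Flow", "Why Causal Masking?", and PyTorch implementation entries above all build on standard scaled dot-product attention. As a minimal sketch of that core computation with an optional causal mask (not the article's own reference code; the function name `self_attention`, the tensor shapes, and the toy dimensions are assumptions for illustration):

```python
# Minimal sketch of scaled dot-product self-attention with an optional
# causal mask. Illustrative only; names and shapes are assumptions.
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v, causal=False):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q = x @ w_q                       # (batch, seq_len, d_head)
    k = x @ w_k
    v = x @ w_v
    # Similarity scores scaled by sqrt(d_head) for stable softmax gradients.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        # Mask positions j > i so each token attends only to earlier tokens.
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # attention weights, rows sum to 1
    return weights @ v                   # (batch, seq_len, d_head)

# Toy usage with random projections.
torch.manual_seed(0)
x = torch.randn(2, 5, 16)
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v, causal=True)
print(out.shape)  # torch.Size([2, 5, 8])
```

MHA, GQA, and MQA reuse this same core computation; they differ in how many independent key/value projections are kept and how those are shared across query heads.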