CookLLM Docs

Flash Attention Principles

Use interactive visuals to understand Flash Attention’s core ideas: memory bottlenecks, online softmax, and tiled matmul.

Table of Contents

The Memory Bottleneck in Standard Attention
GPU Memory Hierarchy: SRAM vs HBM
The Logical Trap in the Standard Implementation
Bandwidth Gap: SRAM vs HBM
IO-bound Bottleneck
SRAM Capacity Limits
Cost and Density
Capacity Limits
Core Idea: Optimize IO Complexity
Avoid Storing Intermediates
Online Softmax
Limits of Offline Softmax
Online Algorithm and Dynamic Correction
Correction Formula
Numeric Example: [3, 2, 5, 1]
Why This Fits Flash Attention
Tiled Matrix Multiplication (Tiling)
Why Tiling?
Visual Demo: Tiled Compute
What to Observe
Tiling + Attention
Loop Strategy: V1 vs V2
Visualization Guide
Softmax Correction in Tiled Attention
Naive Local Softmax Fails
Solution: Online Rescaling
Initialization
One Iteration Over a K Block
Final Normalization
Full Pseudocode
Algorithm in Perspective
Summary
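
As a quick preview of the online softmax idea listed above, here is a minimal Python sketch (not the article's code) that runs the one-pass recurrence on the scores [3, 2, 5, 1] from the numeric example in the table of contents, keeping a running maximum and a rescaled running sum, and checks the result against the ordinary two-pass softmax.

```python
import math

# Minimal sketch of online softmax on the scores [3, 2, 5, 1] from the
# numeric example above. One pass keeps a running maximum m and a running
# sum l of exp(score - m); whenever m grows, the old sum is corrected by
# the factor exp(m_old - m_new). Illustration only, not the article's
# implementation.

scores = [3.0, 2.0, 5.0, 1.0]

m = float("-inf")   # running maximum of the scores seen so far
l = 0.0             # running sum of exp(score - m), kept consistent with m

for x in scores:
    m_new = max(m, x)
    l = l * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum, add new term
    m = m_new

online = [math.exp(x - m) / l for x in scores]

# Ordinary two-pass ("offline") softmax for comparison.
m_off = max(scores)
denom = sum(math.exp(x - m_off) for x in scores)
offline = [math.exp(x - m_off) / denom for x in scores]

print(online)   # ≈ [0.112, 0.041, 0.831, 0.015]
print(offline)  # matches up to floating-point error
```

Flash Attention applies the same correction per key block, which is why the softmax over a full attention row never has to be materialized in HBM at once.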