# Flash Attention

Deeply understand Flash Attention principles and their Triton implementation.
## Overview
Flash Attention is an IO-aware attention implementation that uses tiling and online softmax to avoid materializing the full N×N attention matrix, reducing extra memory from O(N²) to O(N) in sequence length and cutting HBM traffic, which greatly speeds up Transformer training and inference.
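The online softmax trick mentioned above can be illustrated with a minimal NumPy sketch (purely illustrative, not the Triton kernel): scores are consumed block by block while a running max `m` and running normalizer `l` are maintained, so the full score vector never needs to be normalized in one pass.

```python
import numpy as np

def online_softmax(scores, n_blocks=4):
    # Running max (m) and running sum of exponentials (l).
    m, l = -np.inf, 0.0
    for block in np.array_split(scores, n_blocks):
        m_new = max(m, block.max())
        # Rescale the old normalizer to the new max, then add this block.
        l = l * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    # Normalize with the globally accumulated statistics.
    return np.exp(scores - m) / l

x = np.random.randn(16)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```

The same rescale-and-accumulate idea is what lets Flash Attention compute softmax over K/V tiles without ever holding the whole attention row in memory.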
This series assumes basic GPU programming knowledge. If concepts such as SIMT and shared memory are unfamiliar, start with GPU Programming Basics.
## Chapters
- **Flash Attention Principles**: Interactive visualizations to understand memory bottlenecks, online softmax, and tiled matmul
- **From Naive to Auto-Tuning**: Write your first Flash Attention kernel and optimize it with auto-tuning
- **Block Pointers and Multi-Dim Support**: Scale from a single sequence to batch/head parallelism and simplify pointer management
- **Causal Masking Optimization**: Implement causal attention and skip upper-triangular compute for a ~2x speedup
- **Grouped Query Attention**: Add GQA/MQA support by sharing KV across query heads to reduce KV cache memory
- **Backward Pass**: Implement Flash Attention gradients using recomputation for memory-efficient training
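The causal-masking chapter's ~2x speedup comes from skipping K/V tiles that lie entirely above the diagonal. A toy sketch (illustrative only, with made-up sizes) counts how many tiles can be skipped:

```python
# Toy illustration: with causal masking, a K/V block whose first index
# lies past the Q block's last index is fully masked and can be skipped.
seq_len, block = 8, 2
n_blocks = seq_len // block
computed = skipped = 0
for qi in range(n_blocks):
    q_end = qi * block + block - 1       # last query index in this Q tile
    for ki in range(n_blocks):
        if ki * block > q_end:
            skipped += 1                 # whole tile above the diagonal
        else:
            computed += 1                # tile has unmasked entries
print(computed, skipped)  # 10 computed, 6 skipped
```

Roughly half of all tiles fall above the diagonal as the sequence grows, which is where the ~2x saving comes from.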
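The KV sharing behind GQA can also be sketched in a few lines of NumPy (a hand-rolled illustration with hypothetical sizes, not the kernel's layout): each query head `h` attends against KV head `h // group`, so 8 query heads here share only 2 K/V tensors.

```python
import numpy as np

# Hypothetical sizes: 8 query heads share 2 KV heads (group size 4).
n_q_heads, n_kv_heads, seq, d = 8, 2, 10, 16
group = n_q_heads // n_kv_heads

q = np.random.randn(n_q_heads, seq, d)
k = np.random.randn(n_kv_heads, seq, d)
v = np.random.randn(n_kv_heads, seq, d)

out = np.empty_like(q)
for h in range(n_q_heads):
    kv = h // group                      # query head h reads KV head h // group
    s = q[h] @ k[kv].T / np.sqrt(d)      # scaled dot-product scores
    p = np.exp(s - s.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)        # row-wise softmax
    out[h] = p @ v[kv]
```

With `n_kv_heads == 1` this degenerates to MQA; the KV cache shrinks by the group factor either way.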
## Why Learn This?
| What You Want to Do | What You Need |
|---|---|
| Understand why standard attention is slow | HBM vs SRAM, IO-bound concept |
| Build your own attention kernel | Online softmax, tiling |
| Optimize kernel performance | Autotune, pipeline, block pointers |
| Support long-sequence inference | Memory optimization techniques |