CookLLM Docs
Flash Attention

Deeply understand Flash Attention principles and Triton implementation

Overview

Flash Attention is an efficient attention implementation that uses tiling and online softmax to reduce attention's memory IO complexity from O(N^2) to O(N), greatly speeding up Transformer training and inference.

This series assumes you know GPU programming basics. If SIMT, shared memory, etc. are unfamiliar, start with GPU Programming Basics.
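The online softmax mentioned above is the key trick that makes tiling possible: instead of materializing the full row of attention scores, you scan it block by block, carrying only a running max and a running normalizer. A minimal NumPy sketch (illustrative only, not the fused Triton kernel; the block split is arbitrary):

```python
import numpy as np

def online_softmax(blocks):
    """Streaming softmax over a row split into blocks.

    Keeps only a running max (m) and normalizer (d); each new block
    rescales the old normalizer to the new max. Flash Attention fuses
    the same rescaling into the output accumulation.
    """
    m = -np.inf  # running max
    d = 0.0      # running sum of exp(x - m)
    for block in blocks:
        m_new = max(m, block.max())
        # rescale the old normalizer to the new max, then add this block
        d = d * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    # emit probabilities in a second pass over the blocks
    return np.concatenate([np.exp(b - m) for b in blocks]) / d

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
blocks = np.split(x, [2, 4])  # process in tiles of size 2, 2, 1
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(blocks), ref)
```

Because the running state is just two scalars per row, the tiles never need to be revisited, which is what lets the kernel keep everything in SRAM.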

Chapters

Flash Attention Principles

Interactive visualizations to understand memory bottlenecks, online softmax, and tiled matmul

From Naive to Auto-Tuning

Write your first Flash Attention kernel and optimize with auto-tune

Block Pointers and Multi-Dim Support

Scale from single sequence to batch/head parallelism and simplify pointer management

Causal Masking Optimization

Implement causal attention and skip upper-triangular compute for ~2x speedup

Grouped Query Attention

Add GQA/MQA support by sharing KV across query heads to reduce KV cache memory

Backward Pass

Implement Flash Attention gradients using recomputation for memory-efficient training
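The KV-sharing idea behind the GQA chapter can be illustrated with a dense NumPy reference: H query heads share a smaller set of KV heads, each KV head serving a contiguous group of query heads. A sketch under assumed shapes (not the Triton kernel from the chapter; names like `n_kv_heads` are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Grouped Query Attention reference (dense, no tiling).

    q: (H, N, d) query heads; k, v: (n_kv_heads, N, d) shared KV heads.
    H must be a multiple of n_kv_heads; n_kv_heads == 1 gives MQA,
    n_kv_heads == H gives standard multi-head attention.
    """
    H, N, d = q.shape
    group = H // n_kv_heads
    # broadcast each KV head to its group of query heads
    k = np.repeat(k, group, axis=0)  # (H, N, d)
    v = np.repeat(v, group, axis=0)  # (H, N, d)
    s = q @ k.transpose(0, 2, 1) / np.sqrt(d)      # (H, N, N) scores
    p = np.exp(s - s.max(axis=-1, keepdims=True))  # stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v                                    # (H, N, d)

# e.g. 8 query heads sharing 2 KV heads (groups of 4)
rng = np.random.default_rng(0)
out = gqa_attention(rng.standard_normal((8, 4, 16)),
                    rng.standard_normal((2, 4, 16)),
                    rng.standard_normal((2, 4, 16)), n_kv_heads=2)
assert out.shape == (8, 4, 16)
```

The memory win is that the KV cache stores only `n_kv_heads` heads; in a real kernel the "repeat" is done by index arithmetic when loading K/V tiles, not by materializing copies.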

Why Learn This?

| What You Want to Do | What You Need |
| --- | --- |
| Understand why standard attention is slow | HBM vs SRAM, the IO-bound concept |
| Build your own attention kernel | Online softmax, tiling |
| Optimize kernel performance | Autotune, pipelining, block pointers |
| Support long-sequence inference | The O(N^2) → O(N) memory optimization |

References

  • FlashAttention paper
  • FlashAttention-2 paper
  • Triton Documentation

