From Naive to Auto-Tuning

Write your first Flash Attention kernel and optimize it with auto-tune.

Table of Contents

Core Loop Structure
Why tl.constexpr Is Required
Pointer Arithmetic Intuition
Interaction Guide
Verify Numerical Correctness
Auto-Tuning for Best Config
Using @triton.autotune
Pipeline Parallelism
Key Parameters
Summary
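
As a rough preview of the "Using @triton.autotune" and "Key Parameters" entries above, here is a minimal, self-contained sketch of wrapping a Triton kernel with @triton.autotune. It uses a simple vector-add kernel rather than the article's Flash Attention kernel, and the BLOCK_SIZE / num_warps / num_stages candidates are illustrative assumptions, not the article's tuned values.

```python
import torch
import triton
import triton.language as tl


# Candidate configs are illustrative; @triton.autotune benchmarks each one
# and caches the fastest for every distinct value of the `key` arguments.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4, num_stages=2),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8, num_stages=3),
        triton.Config({"BLOCK_SIZE": 2048}, num_warps=8, num_stages=4),
    ],
    key=["n_elements"],  # re-tune when the problem size changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # BLOCK_SIZE is supplied by the winning config, so the grid reads it
    # from `meta` and the call site does not pass it explicitly.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements)
    return out
```

On the first launch for a given n_elements, Triton benchmarks every listed config and reuses the fastest one afterwards; because the decorator injects BLOCK_SIZE itself, the launch site only passes the runtime arguments.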