Causal Masking Optimization

Implement causal attention for autoregressive models and skip the upper-triangular compute for a ~2x speedup.
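
The intuition, in a hedged sketch below (plain PyTorch rather than the Triton kernel this article builds up): in a blocked attention loop, key blocks that lie entirely above the diagonal are never visited because the loop bound stops at the query block (the coarse skip), and only the block that touches the diagonal needs an element-wise causal mask (the fine-grained mask). The function name, block size, and single-head layout are illustrative assumptions, not the article's companion code.

```python
import torch

def blocked_causal_attention(q, k, v, block=64):
    """Reference sketch of blocked causal attention for one head.

    q, k, v: (seq_len, head_dim). Key blocks strictly above the diagonal
    are skipped via the loop bound; only the block on the diagonal gets
    an element-wise causal mask.
    """
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.empty_like(q)

    for q_start in range(0, seq_len, block):
        q_end = min(q_start + block, seq_len)
        q_blk = q[q_start:q_end]

        scores = []
        # Coarse skip: iterate only over key blocks at or below the diagonal.
        for k_start in range(0, q_end, block):
            k_end = min(k_start + block, seq_len)
            s = (q_blk @ k[k_start:k_end].T) * scale

            # Fine-grained mask: only the block touching the diagonal can
            # contain positions where the key index exceeds the query index.
            if k_end > q_start + 1:
                q_idx = torch.arange(q_start, q_end).unsqueeze(1)
                k_idx = torch.arange(k_start, k_end).unsqueeze(0)
                s = s.masked_fill(k_idx > q_idx, float("-inf"))
            scores.append(s)

        p = torch.softmax(torch.cat(scores, dim=1), dim=1)
        out[q_start:q_end] = p @ v[:q_end]
    return out
```

A quick sanity check of the idea: compare the output against a full seq_len x seq_len attention with a causal mask on random inputs; the results should match to floating-point tolerance, while the blocked version only ever computes roughly half of the score matrix.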

Table of Contents

Quick Review: Causal Attention
Performance Opportunity
Half the Compute
Visual: Skipped Blocks
Implementation
Change 1: Coarse Skip via Loop Bound
Change 2: Fine-Grained Mask Inside Blocks
Why Both Masks?
Performance and Correctness
Correctness
Performance
Speedup vs Sequence Length
Implementation Tips
Compile-Time Causal Flag
Summary