Grouped Query Attention

Add GQA/MQA support so that multiple query heads share a single key/value head per group, reducing KV-cache memory.
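As a minimal illustration of what that sharing means (my sketch, not the article's companion code; the function name `gqa_reference` and all shapes are assumptions): with `Hq` query heads and `Hkv` KV heads, each group of `Hq // Hkv` query heads attends to the same K/V head. The naive way to express this in PyTorch is to copy K/V across the group, which is exactly the memory cost an efficient kernel avoids.

```python
import torch
import torch.nn.functional as F

def gqa_reference(q, k, v):
    """Naive GQA reference. q: (B, Hq, S, D); k, v: (B, Hkv, S, D).

    Each group of Hq // Hkv query heads shares one KV head. This version
    materializes the shared KV by copying (repeat_interleave), which is
    the memory-hungry approach an efficient kernel sidesteps.
    """
    b, hq, s, d = q.shape
    hkv = k.shape[1]
    assert hq % hkv == 0, "query heads must be a multiple of KV heads"
    group = hq // hkv
    # Expand KV so every query head sees its group's KV head (data copy!).
    k = k.repeat_interleave(group, dim=1)  # (B, Hq, S, D)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v)

# MHA: Hq == Hkv (group size 1); MQA: Hkv == 1; GQA: anything in between.
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 2, 16, 64)   # 8 query heads share 2 KV heads (groups of 4)
v = torch.randn(2, 2, 16, 64)
out = gqa_reference(q, k, v)    # (2, 8, 16, 64)
```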

Companion Code

Table of Contents

Quick GQA Recap
Problem with Standard PyTorch GQA
Efficient Flash Attention Implementation
Core Idea: Pointer Indexing, Not Data Copy
Concrete Example
Unified Support: MHA/GQA/MQA
Full Implementation
Performance Validation
Correctness
Performance
Autotune Best Config
Tradeoffs and Recommendations
Quality vs Memory
Key Implementation Summary
Summary
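The full walkthrough is behind the login, but the entry "Core Idea: Pointer Indexing, Not Data Copy" suggests the standard trick: map each query head to its KV head with integer division and offset the K/V base pointers accordingly, rather than materializing expanded tensors. A hedged sketch of that index math (every name here, including `kv_head_for` and the stride names in the comment, is hypothetical):

```python
# Hypothetical sketch of "pointer indexing, not data copy": map each query
# head to its KV head with integer division instead of copying K/V tensors.

def kv_head_for(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size   # MHA: group_size == 1 -> identity
                                  # MQA: num_kv_heads == 1 -> always head 0

# Inside a fused kernel this index becomes a pointer offset, e.g.
#   k_ptr = K_base + batch * stride_kb + kv_head_for(h, Hq, Hkv) * stride_kh
# so all query heads in a group read the same K/V memory with zero copies.

assert kv_head_for(5, num_q_heads=8, num_kv_heads=2) == 1  # heads 4-7 -> KV head 1
assert kv_head_for(3, num_q_heads=8, num_kv_heads=8) == 3  # MHA: identity
assert kv_head_for(7, num_q_heads=8, num_kv_heads=1) == 0  # MQA: all share head 0
```

One formula covers all three layouts, which is presumably how a single kernel can provide the unified MHA/GQA/MQA support the table of contents advertises.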