CookLLM Docs

Causal Masking Optimization

Implement causal attention for autoregressive models, skipping the fully masked upper-triangular blocks of the score matrix for a speedup of up to ~2x.
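The idea can be illustrated outside of a GPU kernel. The following is a minimal NumPy sketch, not the article's Triton kernel: the function names `naive_causal_attention` and `blocked_causal_attention` and the tile sizes `block_m` and `block_n` are assumptions made for illustration. It combines a coarse skip (the key-block loop stops at the end of the current query block) with a fine-grained mask inside each tile, and it counts how many tiles are actually computed.

```python
import math
import numpy as np


def naive_causal_attention(q, k, v):
    """Reference: build the full score matrix, then mask the upper triangle."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=1, keepdims=True)) @ v


def blocked_causal_attention(q, k, v, block_m=4, block_n=4):
    """Tiled causal attention with online softmax (illustrative sketch).

    Returns (output, tiles_visited, tiles_total).
    """
    n, d = q.shape
    out = np.zeros_like(v, dtype=np.float64)
    visited = 0
    for m0 in range(0, n, block_m):
        m1 = min(m0 + block_m, n)
        rows = np.arange(m0, m1)
        m_i = np.full(m1 - m0, -np.inf)       # running row maximum
        l_i = np.zeros(m1 - m0)               # running softmax denominator
        acc = np.zeros((m1 - m0, v.shape[1])) # running weighted sum of V
        # Coarse skip: key blocks that start at or after m1 are fully
        # masked for this query block, so the loop bound excludes them.
        for n0 in range(0, m1, block_n):
            n1 = min(n0 + block_n, n)
            cols = np.arange(n0, n1)
            s = (q[m0:m1] @ k[n0:n1].T) / np.sqrt(d)
            # Fine-grained mask: tiles that straddle the diagonal still
            # contain entries where the key index exceeds the query index.
            s = np.where(rows[:, None] >= cols[None, :], s, -np.inf)
            m_new = np.maximum(m_i, s.max(axis=1))
            alpha = np.exp(m_i - m_new)       # rescale previous partial sums
            p = np.exp(s - m_new[:, None])
            l_i = alpha * l_i + p.sum(axis=1)
            acc = alpha[:, None] * acc + p @ v[n0:n1]
            m_i = m_new
            visited += 1
        out[m0:m1] = acc / l_i[:, None]
    total = math.ceil(n / block_m) * math.ceil(n / block_n)
    return out, visited, total


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out, visited, total = blocked_causal_attention(q, k, v)
print(np.allclose(out, naive_causal_attention(q, k, v)), visited, total)
```

With square tiles and T tiles per side, the loop visits T(T+1)/2 of the T² tiles, a fraction of (T+1)/(2T). That is 10 of 16 tiles for the 16-token example above, and the fraction approaches one half as the sequence grows, which is why the speedup is expected to approach, but not reach, 2x.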


Table of Contents

Quick Review: Causal Attention
Performance Opportunity
  Half the Compute
  Visual: Skipped Blocks
Implementation
  Change 1: Coarse Skip via Loop Bound
  Change 2: Fine-Grained Mask Inside Blocks
  Why Both Masks?
Performance and Correctness
  Correctness
  Performance
  Speedup vs Sequence Length
Implementation Tips
  Compile-Time Causal Flag
Summary