CookLLM Docs

RoPE Math Derivation

From complex rotations to the higher-dimensional generalization: the core math behind rotary position embeddings (RoPE).

Table of Contents

Start From the Requirement
2D Case: Complex View
Treat a Vector as a Complex Number
Encode Position With Rotation
Expand to Real Form
Generalize to Higher Dimensions
Grouped Rotations
Frequency Design
Efficient Implementation of Rotation
Avoid Explicit Matrix Multiplication
Verify the Relative Position Property
Comparison With Other Position Encodings
Summary