Length Extrapolation

NTK-aware Scaling, YaRN, and other methods to let RoPE handle longer sequences
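
As a quick preview of the NTK-aware idea before diving in, here is a minimal, illustrative sketch (not this article's companion code): instead of shrinking position indices, the RoPE base is enlarged by s^(d/(d-2)) for a context-extension factor s, so the low-frequency dimensions stretch while the highest frequency is left nearly untouched. The function name and defaults below are assumptions for illustration only.

```python
import torch

def ntk_scaled_inv_freq(dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    """NTK-aware scaling sketch: enlarge the RoPE base instead of shrinking
    position indices, stretching low frequencies while leaving the highest
    frequency almost unchanged. (Illustrative; names are assumptions.)"""
    ntk_base = base * scale ** (dim / (dim - 2))  # base' = base * s^(d / (d - 2))
    return 1.0 / (ntk_base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

# Example: a 128-dim head, extending a 4k-trained model to 16k context (s = 4)
inv_freq = ntk_scaled_inv_freq(dim=128, scale=4.0)
```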

Table of Contents

The Rotation View: Understanding Extrapolation
Rotating on the Unit Circle
High Frequency vs Low Frequency: Coverage Differences
From Rotation to Solutions
Position Interpolation (PI)
NTK-aware Scaling
Core Idea
Derivation
Implementation
Advantages of NTK-aware
Dynamic NTK
YaRN
Motivation
YaRN’s Three Components
1. NTK-by-parts (piecewise interpolation)
2. Attention Scaling
YaRN Results
Summary Comparison
Summary
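
To complement the outline, here is a rough sketch of the two YaRN components listed above: NTK-by-parts, which interpolates only the low-frequency dimensions, and attention scaling, which applies a temperature factor to the attention logits (in practice often folded into the cos/sin cache). The α = 1, β = 32 defaults and the function name are assumptions for illustration, not the article's companion implementation.

```python
import math
import torch

def yarn_inv_freq_and_mscale(dim: int, scale: float, base: float = 10000.0,
                             orig_ctx: int = 4096, alpha: float = 1.0, beta: float = 32.0):
    """YaRN sketch: NTK-by-parts frequency interpolation plus the attention
    scaling ("temperature") factor. (Illustrative; defaults are assumptions.)"""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    # how many full rotations each dimension completes within the training context
    rotations = orig_ctx * inv_freq / (2 * math.pi)
    # ramp = 0 -> fully interpolated (divide by scale); ramp = 1 -> left unchanged
    ramp = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    yarn_inv_freq = inv_freq * ramp + (inv_freq / scale) * (1.0 - ramp)
    # attention scaling factor: sqrt(1/t) = 0.1 * ln(s) + 1
    mscale = 0.1 * math.log(scale) + 1.0
    return yarn_inv_freq, mscale
```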