BPE Training Engineering

From toy data to real corpora: memory optimization, parallel pre-tokenization, incremental updates, and time-space tradeoffs
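
To make the frequency-weighted representation concrete, here is a minimal sketch, not the article's code: it assumes the common corpus-scale setup in which each unique pre-token is stored once with its corpus count, and every byte-pair count is weighted by that frequency. The names `pair_counts` and `word_freqs` and the toy corpus are illustrative.

```python
from collections import Counter

def pair_counts(word_freqs: dict[tuple[bytes, ...], int]) -> Counter:
    """Count adjacent symbol pairs, weighting each pair by word frequency."""
    counts: Counter = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            # One unique pre-token stands in for all of its corpus occurrences.
            counts[(left, right)] += freq
    return counts

# Toy data: {pre-token as a tuple of byte symbols: corpus frequency}
word_freqs = {
    (b"l", b"o", b"w"): 5,
    (b"l", b"o", b"w", b"e", b"r"): 2,
}
print(pair_counts(word_freqs).most_common(2))
# [((b'l', b'o'), 7), ((b'o', b'w'), 7)]
```

Because each unique pre-token is stored once, memory scales with the number of distinct pre-tokens rather than with raw corpus length, which is one side of the time-space tradeoffs named above.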

Table of Contents

Training on Real Data
  1.1 Baseline Implementation
    Data Structure Changes
    Frequency Weighting
    Baseline Training Function
  1.2 Baseline Performance Test
  1.3 The OOM Problem
    Memory Bottleneck
    Solution Directions
  2.1 Pre-tokenization and Chunk Boundaries
    Why Chunking
    Boundary Choice: Do Not Cut Arbitrarily
    Parallel Pre-tokenization
  2.2 Low-Frequency Pruning
    Why Prune
    Pruning Strategy
    Impact of Pruning
  2.3 Incremental Updates vs Full Recompute
    Problem: Updating Counts After Each Merge
    Option 1: Full Recompute
    Option 2: Incremental Updates
    What These Indices Do
    Incremental Update Steps
    Example Data Changes
    How to Choose
  2.4 Checkpointing
  2.5 Performance Comparison
    Key Optimization Impact
Summary
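
Section 2.3 of the outline weighs recomputing all pair counts after each merge against patching them in place. The following is a hedged sketch of the incremental side only, under stated assumptions: `apply_merge` is an illustrative name, not the article's API, and it assumes `counts` already holds this word's pairs weighted by its frequency.

```python
from collections import Counter

def apply_merge(symbols: tuple[bytes, ...], pair: tuple[bytes, bytes],
                freq: int, counts: Counter) -> tuple[bytes, ...]:
    """Merge `pair` throughout one word and patch the global pair counts
    in place, touching only the pairs the merge creates or destroys."""
    merged = pair[0] + pair[1]
    out: list[bytes] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            if out:  # the pair with the left neighbour changes
                counts[(out[-1], symbols[i])] -= freq
                counts[(out[-1], merged)] += freq
            counts[pair] -= freq  # the merged pair itself disappears
            if i + 2 < len(symbols):  # the pair with the right neighbour changes
                counts[(symbols[i + 1], symbols[i + 2])] -= freq
                counts[(merged, symbols[i + 2])] += freq
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

# Merge (a, b) inside "abab" (frequency 1) and watch only local counts move.
counts = Counter({(b"a", b"b"): 2, (b"b", b"a"): 1})
word = apply_merge((b"a", b"b", b"a", b"b"), (b"a", b"b"), 1, counts)
print(word)      # (b'ab', b'ab')
print(+counts)   # Counter({(b'ab', b'ab'): 1}) -- zeroed entries dropped
```

The payoff typically comes from the bookkeeping the outline's "What These Indices Do" entry refers to: with an index from each pair to the words containing it, a merge only visits the affected words, whereas full recompute rescans the entire corpus on every step.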