Tokenization

Deeply understand LLM tokenization, from BPE to GPT implementations

Overview

Tokenization is a foundational component of LLMs: it converts text into the numerical sequences a model can process. Although it looks like simple preprocessing, tokenization design directly affects model performance, efficiency, and behavior. Many seemingly odd LLM behaviors, such as poor spelling or weak support for some languages, can often be traced back to tokenization choices.
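
To make the text-to-numbers step concrete, here is a minimal byte-level sketch (not the GPT tokenizer itself; BPE-based tokenizers start from these bytes and merge frequent pairs, as later chapters show):

```python
# Byte-level "tokenization": map text to integer IDs via its UTF-8 bytes.
text = "Hello, LLM!"
ids = list(text.encode("utf-8"))  # one ID in 0..255 per byte
print(ids)

# Decoding reverses the mapping losslessly.
assert bytes(ids).decode("utf-8") == text
```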

This series takes you from first principles to Byte Pair Encoding (BPE), and then builds a GPT-style tokenizer hands-on.

This series assumes familiarity with LLM basics. If you want to understand how tokenization shapes model behavior, or to implement your own tokenizer, it is for you.

Chapters

Tokenization Basics

Why tokenization? From character-level to subword-level, with Unicode and UTF-8
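
A quick way to see why byte-level handling matters: UTF-8 is a variable-length encoding, so a single character can span one to four bytes:

```python
# UTF-8 encodes ASCII in 1 byte; other characters take 2-4 bytes.
for ch in ["A", "é", "中", "🦙"]:
    b = ch.encode("utf-8")
    print(f"{ch!r}: {len(b)} byte(s) -> {list(b)}")
```

A byte-level tokenizer therefore needs only 256 base IDs to represent any Unicode text.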

BPE Algorithm

Deep dive into Byte Pair Encoding: manual training, encoding, and decoding
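
The core of BPE fits in a few lines: repeatedly find the most frequent adjacent pair of IDs and replace it with a new ID. One merge step, sketched on the classic `aaabdaaabac` example:

```python
from collections import Counter

def most_common_pair(ids):
    """Return the most frequent adjacent (left, right) ID pair."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
pair = most_common_pair(ids)   # ('a', 'a') as bytes: (97, 97)
ids = merge(ids, pair, 256)    # first new token ID beyond the 256 bytes
print(pair, ids)               # sequence shrinks from 11 to 9 IDs
```

Training repeats this loop until the target vocabulary size is reached.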

GPT Tokenizers

GPT-2/GPT-4 tokenization, regex pre-tokenization, and the tiktoken library
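
GPT-2's pre-tokenizer splits text with a regex before any BPE merges run. The real pattern uses Unicode classes (`\p{L}`, `\p{N}`) from the third-party `regex` module; a simplified ASCII-only approximation with the stdlib `re` gives the flavor:

```python
import re

# Simplified ASCII approximation of GPT-2's pre-tokenization regex
# (assumption: the real pattern handles full Unicode via `regex`).
PAT = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"  # common English contractions
    r"| ?[A-Za-z]+"             # words, with an optional leading space
    r"| ?[0-9]+"                # numbers
    r"| ?[^\sA-Za-z0-9]+"       # punctuation runs
    r"|\s+"                     # remaining whitespace
)
print(PAT.findall("Hello world, it's 2024!"))
```

BPE merges then run inside each chunk only, so tokens never cross word or punctuation boundaries.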

BPE Training Engineering

Engineering optimizations for large-scale BPE training: parallel pre-tokenization, incremental updates, and low-frequency pruning, for an overall 20x speedup
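
One of those ideas can be sketched briefly: pre-tokenized chunks are independent, so pair counting fans out across worker processes and the partial counts are summed (a simplified sketch; chunking, incremental updates, and pruning are the chapter's subject):

```python
from collections import Counter
from multiprocessing import Pool

def count_pairs(chunk):
    """Count adjacent byte-pair frequencies in one pre-tokenized chunk."""
    ids = list(chunk.encode("utf-8"))
    return Counter(zip(ids, ids[1:]))

if __name__ == "__main__":
    chunks = ["hello world", "hello there", "tokenizers are fun"]
    with Pool(processes=2) as pool:
        per_chunk = pool.map(count_pairs, chunks)
    total = sum(per_chunk, Counter())  # merge the partial counts
    print(total.most_common(3))
```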

Learning Path

| What You Want to Do | What You Need |
| --- | --- |
| Understand LLM input processing | Tokenization basics, Unicode/UTF-8 |
| Implement your own tokenizer | BPE algorithm, training pipeline |
| Use GPT models | tiktoken library, special tokens |
| Train tokenizers at scale | Parallel processing, incremental updates, memory optimizations |

References

  • minbpe Repository
  • tiktoken Library
  • GPT-2 Paper
