Tokenization
Deeply understand LLM tokenization, from BPE to GPT implementations
Overview
Tokenization is a foundational component of LLMs. It converts text into numerical sequences the model can process. Although it looks like simple preprocessing, tokenization design directly affects model performance, efficiency, and behavior. Many “weird” LLM behaviors—poor spelling, weak support for some languages, etc.—can be traced back to tokenization choices.
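As a minimal illustration of "text into numerical sequences", consider the byte level that GPT-style BPE tokenizers start from: every string is first turned into UTF-8 byte IDs, which is also why non-ASCII characters cost more tokens. This is a standalone sketch, not code from the series:

```python
# Text -> UTF-8 byte IDs, the raw numerical sequence byte-level BPE builds on.
text = "héllo"
byte_ids = list(text.encode("utf-8"))

# "é" encodes to two bytes (0xC3, 0xA9), so 5 characters yield 6 IDs.
print(byte_ids)  # [104, 195, 169, 108, 108, 111]
```

The mismatch between characters and IDs visible here is one source of the "weird" behaviors mentioned above: the model never sees characters, only these (or merged) IDs.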
This series takes you from first principles to Byte Pair Encoding (BPE), and then builds a GPT-style tokenizer hands-on.
The series assumes basic familiarity with LLMs. If you want to understand how tokenization shapes model behavior, or to implement your own tokenizer, start here.
Chapters
Tokenization Basics
Why tokenization? From character-level to subword-level, with Unicode and UTF-8
BPE Algorithm
Deep dive into Byte Pair Encoding: manual training, encoding, and decoding
GPT Tokenizers
GPT-2/GPT-4 tokenization, regex pre-tokenization, and the tiktoken library
BPE Training Engineering
Engineering optimizations for large-scale BPE training: parallel pre-tokenization, incremental updates, low-frequency pruning, and a 20x speedup
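To preview the core idea behind the BPE chapter, here is a minimal sketch of one training step, operating on UTF-8 byte IDs: count adjacent pairs, then merge the most frequent pair into a new token ID. The helper names (`most_frequent_pair`, `merge`) are illustrative, not from the series:

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count every adjacent (left, right) pair and return the most common one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace each occurrence of `pair` with `new_id`, scanning left to right.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)   # (97, 97), i.e. the byte pair "aa"
ids = merge(ids, pair, 256)      # 256 = first ID after the 256 raw bytes
print(ids)                       # [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

Training repeats this step until the vocabulary reaches its target size; the engineering chapter is about making exactly this loop fast at scale.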
CookLLM Docs