Tokenization
Deeply understand LLM tokenization, from BPE to GPT implementations
Overview
Tokenization is a foundational component of LLMs. It converts text into numerical sequences the model can process. Although it looks like simple preprocessing, tokenization design directly affects model performance, efficiency, and behavior. Many “weird” LLM behaviors—poor spelling, weak support for some languages, etc.—can be traced back to tokenization choices.
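As a minimal illustration of "text into numerical sequences", consider the byte level that GPT-style BPE tokenizers start from: every string is first turned into UTF-8 byte IDs, which is also why non-ASCII characters cost more tokens. This is a standalone sketch, not code from the series:

```python
# Text -> UTF-8 byte IDs, the raw numerical sequence byte-level BPE builds on.
text = "héllo"
byte_ids = list(text.encode("utf-8"))

# "é" encodes to two bytes (0xC3, 0xA9), so 5 characters yield 6 IDs.
print(byte_ids)  # [104, 195, 169, 108, 108, 111]
```

The mismatch between characters and IDs visible here is one source of the "weird" behaviors mentioned above: the model never sees characters, only these (or merged) IDs.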
This series takes you from first principles to Byte Pair Encoding (BPE), and then builds a GPT-style tokenizer hands-on.
The series assumes basic familiarity with LLMs. If you want to understand how tokenization shapes model behavior, or to implement your own tokenizer, start here.
Chapters
Tokenization Basics
Why tokenization? From character-level to subword-level, with Unicode and UTF-8
BPE Algorithm
Deep dive into Byte Pair Encoding: manual training, encoding, and decoding
GPT Tokenizers
GPT-2/GPT-4 tokenization, regex pre-tokenization, and the tiktoken library
BPE Training Engineering
Engineering optimizations for large-scale BPE training: parallel pre-tokenization, incremental updates, low-frequency pruning, and a 20x speedup
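To preview the core idea behind the BPE chapter, here is a minimal sketch of one training step, operating on UTF-8 byte IDs: count adjacent pairs, then merge the most frequent pair into a new token ID. The helper names (`most_frequent_pair`, `merge`) are illustrative, not from the series:

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count every adjacent (left, right) pair and return the most common one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace each occurrence of `pair` with `new_id`, scanning left to right.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)   # (97, 97), i.e. the byte pair "aa"
ids = merge(ids, pair, 256)      # 256 = first ID after the 256 raw bytes
print(ids)                       # [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

Training repeats this step until the vocabulary reaches its target size; the engineering chapter is about making exactly this loop fast at scale.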
CookLLM Docs