Tokenization
Understand LLM tokenization in depth, from BPE to GPT implementations
Overview
Tokenization is a foundational component of LLMs: it converts text into the sequences of integers that the model actually processes. Although it looks like simple preprocessing, tokenizer design directly affects model performance, efficiency, and behavior. Many "weird" LLM behaviors, such as poor spelling and weak support for some languages, can be traced back to tokenization choices.
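To make "text into numerical sequences" concrete, here is a minimal sketch of the simplest possible scheme: treating each UTF-8 byte as one token ID. The helper names `encode_bytes` and `decode_bytes` are made up for this illustration, and this is not how any particular GPT tokenizer works. It is only the raw byte-level starting point that BPE-style tokenizers typically build on.

```python
def encode_bytes(text: str) -> list[int]:
    """Map text to integer IDs, one per UTF-8 byte (each in 0..255)."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes: turn byte-level IDs back into text."""
    return bytes(ids).decode("utf-8")


english = "hello"
korean = "안녕"  # two characters, each 3 bytes in UTF-8

english_ids = encode_bytes(english)
korean_ids = encode_bytes(korean)

print(english_ids)                    # [104, 101, 108, 108, 111]
print(len(korean), len(korean_ids))   # 2 6
print(decode_bytes(korean_ids) == korean)  # True
```

Even this toy scheme hints at why tokenization matters: the two-character Korean string costs six IDs while the five-character English string costs five, so the same amount of text can consume very different amounts of sequence length depending on the language.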
This series takes you from first principles to Byte Pair Encoding (BPE), then walks through building a GPT-style tokenizer hands-on.
It assumes you already understand LLM basics. If you want to understand how tokenization affects model behavior, or to implement your own tokenizer, this series is for you.