CookLLM Docs

Tokenization

Understand LLM tokenization in depth, from BPE to GPT implementations

Overview

Tokenization is a foundational component of LLMs. It converts text into numerical sequences the model can process. Although it looks like simple preprocessing, tokenization design directly affects model performance, efficiency, and behavior. Many "weird" LLM behaviors, such as poor spelling and weak support for some languages, can be traced back to tokenization choices.
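To make "text into numerical sequences" concrete, here is a minimal sketch of the simplest possible scheme: mapping text to its raw UTF-8 byte values. The helper names `encode_bytes` and `decode_bytes` are hypothetical and used only for illustration; real LLM tokenizers build a larger vocabulary on top of a starting point like this.

```python
# Illustrative sketch only: a byte-level "tokenizer" with a fixed vocabulary of 256 ids.
# The function names are hypothetical, not part of any library.

def encode_bytes(text: str) -> list[int]:
    """Map text to a list of integers in the range 0..255 (its UTF-8 bytes)."""
    return list(text.encode("utf-8"))

def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")

english = encode_bytes("hello")   # 5 characters -> 5 ids
chinese = encode_bytes("你好")     # 2 characters -> 6 ids (3 bytes each)

print(english)
print(len(chinese))
print(decode_bytes(chinese))
```

Note that the two-character Chinese string costs more ids than the five-character English one. Differences like this in sequence length are one way tokenization choices can affect how well a model handles a given language.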

This series takes you from first principles to Byte Pair Encoding (BPE), and then builds a GPT-style tokenizer hands-on.

The series assumes you already understand LLM basics. If you want to understand how tokenization affects model behavior, or to implement your own tokenizer, this series is for you.

Chapters

Tokenization Basics

Why tokenization? From character-level to subword-level, with Unicode and UTF-8

BPE Algorithm

Deep dive into Byte Pair Encoding: manual training, encoding, and decoding

GPT Tokenizers

GPT-2/GPT-4 tokenization, regex pre-tokenization, and the tiktoken library

BPE Training Engineering

Engineering optimizations for large-scale BPE training: parallel pre-tokenization, incremental updates, and low-frequency pruning, for a 20x speedup
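As a preview of what the BPE chapters above cover, the following is a minimal sketch of byte-level BPE training and decoding. It assumes the common formulation (start from UTF-8 bytes, repeatedly merge the most frequent adjacent pair into a new id); the function names are hypothetical, and the sketch omits regex pre-tokenization and all of the engineering optimizations mentioned above.

```python
from collections import Counter

def get_pair_counts(ids: list[int]) -> Counter:
    """Count how often each adjacent pair of ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text: str, num_merges: int):
    """Learn up to `num_merges` merges; new ids start at 256."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, kept in the order the merges were learned
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two ids left, nothing to merge
        pair = max(counts, key=counts.get)
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def decode(ids: list[int], merges: dict) -> str:
    """Expand merged ids back to bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order matches merge order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

text = "aaabdaaabac"
ids, merges = train_bpe(text, num_merges=3)
print(len(text.encode("utf-8")), "->", len(ids), "ids")
print(decode(ids, merges) == text)
```

Each merge shortens the sequence at the cost of one extra vocabulary entry, which is the basic trade-off the later chapters build on.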

Learning Path

What You Want to Do | What You Need
Understand LLM input processing | Tokenization basics, Unicode/UTF-8
Implement your own tokenizer | BPE algorithm, training pipeline
Use GPT models | tiktoken library, special tokens
Train tokenizers at scale | Parallel processing, incremental updates, memory optimizations

References

  • minbpe Repository
  • tiktoken Library
  • GPT-2 Paper
