CookLLM Docs

Tokenization Basics

Why tokenization? From character-level to subword-level, with Unicode and UTF-8
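
As a warm-up for the character-level sections listed below, here is a minimal sketch of a character-level tokenizer in Python. The toy corpus and the `encode`/`decode` helper names are illustrative assumptions, not code from the article:

```python
# Minimal character-level tokenizer (illustrative sketch, not the article's code).
text = "hello"

# The vocabulary is just the set of distinct characters in the corpus.
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> character

def encode(s: str) -> list[int]:
    """Map each character to its integer id."""
    return [stoi[ch] for ch in s]

def decode(ids: list[int]) -> str:
    """Map ids back to characters and join them."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # [1, 0, 2, 2, 3] with vocab ['e', 'h', 'l', 'o']
print(decode(ids))  # "hello"
```

One limitation shows up immediately: any character outside the training corpus (here, anything other than 'e', 'h', 'l', 'o') has no id at all, which is part of the motivation for the Unicode- and byte-aware schemes covered later.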

Table of Contents

Why Tokenization
Start With Character-Level Tokenization
  Character-Level Workflow
  Limitations of Character-Level Tokenization
Unicode and UTF-8: Multi-language Support
  What Is Unicode
  UTF-8 Encoding
  Why Not Use UTF-8 Bytes Directly?
Summary
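
The table of contents closes on "Why Not Use UTF-8 Bytes Directly?", and a short Python sketch makes the core trade-off concrete. The example string here is an assumption chosen for illustration:

```python
# Raw UTF-8 bytes as tokens (illustrative sketch).
# Upside: a fixed vocabulary of only 256 byte values covers every language.
# Downside: multi-byte characters inflate sequence length.
text = "héllo 你好"

byte_ids = list(text.encode("utf-8"))
print(len(text), "characters ->", len(byte_ids), "bytes")
# 8 characters -> 13 bytes: 'é' needs 2 bytes, each CJK character needs 3.

# The round trip is lossless: decoding the bytes recovers the original string.
print(bytes(byte_ids).decode("utf-8"))  # héllo 你好
```

Longer sequences mean more model steps per sentence, which is one reason subword schemes such as byte-level BPE build merges on top of UTF-8 bytes rather than feeding the raw bytes to the model directly.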