GPT Tokenizers

GPT-2/GPT-4 tokenization, regex pre-tokenization, and the tiktoken library

Table of Contents

GPT-2 Tokenization
Problems With Naive BPE
Regex Pre-tokenization Solution
Workflow
Interactive Demo: BPE Training
GPT-4 Improvements
Vocabulary Size Comparison
Using tiktoken
Install and Basic Usage
Compare Tokenizers
Inspect Token Byte Representation
Special Tokens
Common Special Tokens
Handling Special Tokens
Token Counting and Cost Estimation
Summary
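
The "Regex Pre-tokenization Solution" entry refers to GPT-2's fix for naive BPE: text is first split into chunks with a regular expression, and BPE merges are applied only within each chunk, so a merge can never cross a word, number, punctuation, or whitespace boundary. A minimal sketch of that split, using the pattern published in OpenAI's GPT-2 encoder (the example sentence and printed output are illustrative, not taken from the article):

```python
# A sketch of GPT-2-style regex pre-tokenization.
# The pattern is the one from OpenAI's GPT-2 encoder.py; it requires the
# third-party `regex` package (pip install regex) for \p{L} / \p{N} classes.
import regex as re

GPT2_SPLIT_PATTERN = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

def pre_tokenize(text: str) -> list[str]:
    # BPE merges are later applied inside each chunk independently,
    # so a merge can never span two chunks.
    return GPT2_SPLIT_PATTERN.findall(text)

print(pre_tokenize("Hello world! It's 2024."))
# ['Hello', ' world', '!', ' It', "'s", ' 2024', '.']
```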
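
For the "Using tiktoken" entries ("Install and Basic Usage", "Compare Tokenizers", "Inspect Token Byte Representation") and the "Vocabulary Size Comparison", here is a hedged sketch built on tiktoken's public API; the sample text and printed counts are illustrative rather than quoted from the article:

```python
# pip install tiktoken
import tiktoken

# GPT-2 and GPT-4 ship different BPE vocabularies.
gpt2_enc = tiktoken.get_encoding("gpt2")         # GPT-2 vocabulary
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5-turbo vocabulary

print(gpt2_enc.n_vocab, gpt4_enc.n_vocab)        # roughly 50k vs 100k entries

text = "Tokenization is the first step of every LLM pipeline."
gpt2_ids = gpt2_enc.encode(text)
gpt4_ids = gpt4_enc.encode(text)
print(len(gpt2_ids), len(gpt4_ids))              # the larger vocab usually needs fewer tokens

# Round trip: decoding the ids recovers the original string.
assert gpt4_enc.decode(gpt4_ids) == text

# Each token id maps to a concrete byte sequence.
for tok_id in gpt4_ids[:5]:
    print(tok_id, gpt4_enc.decode_single_token_bytes(tok_id))
```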
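
The "Special Tokens" entries likely cover markers such as <|endoftext|> and how tiktoken guards against them appearing in untrusted input. A sketch of the two opt-in paths, assuming the cl100k_base encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "<|endoftext|> marks a document boundary."

# By default, encode() raises if the text contains a special token,
# so markers cannot be injected accidentally from user input.
try:
    enc.encode(text)
except ValueError as err:
    print("rejected:", err)

# Opt in when the special token is intentional: it becomes a single id.
with_special = enc.encode(text, allowed_special={"<|endoftext|>"})

# Or disable the check and tokenize it as ordinary text instead.
as_plain_text = enc.encode(text, disallowed_special=())

print(with_special[:1], as_plain_text[:5])
```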
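
"Token Counting and Cost Estimation" presumably multiplies a token count by a per-token price. A sketch under that assumption; `tiktoken.encoding_for_model` is real API, but the price constant below is a placeholder, not a current rate:

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens with the tokenizer the target model actually uses."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

prompt = "Summarize the following report in three bullet points: ..."
n_tokens = count_tokens(prompt)

# Hypothetical price per 1K input tokens; check your provider's pricing page.
USD_PER_1K_INPUT_TOKENS = 0.01
print(f"{n_tokens} tokens -> ${n_tokens / 1000 * USD_PER_1K_INPUT_TOKENS:.4f}")
```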