CookLLM Docs

Tensor Layout


Understand physical memory layout, strides, view vs reshape, and gradient tracking.



Table of Contents

What Is a Tensor?
Key Concept: Strides
Vector Example (1D)
Matrix Example (2D)
Contiguity Explained
What Breaks Contiguity?
What Happened?
Why Non-contiguous?
View vs Reshape: A Performance Pivot
view(): Zero-copy, But Restricted
reshape(): Smarter, Safer
Gradient Tracking: Clone, Detach, and Their Combination
clone(): Copy Data, Keep Grad History
detach(): Cut Grad, Share Memory
detach().clone(): Common Pattern
Debugging Tips
Layout Types: Row-major vs Column-major
Test Your Understanding
Summary
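
The stride and contiguity topics in the outline can be illustrated with a minimal pure-Python sketch. It assumes a row-major layout with strides counted in elements rather than bytes. The helper names `row_major_strides`, `flat_offset`, and `is_contiguous` are hypothetical and do not come from the article or from any library, and the contiguity check is deliberately simplified (it does not special-case size-1 dimensions the way a real tensor library may).

```python
def row_major_strides(shape):
    """Strides (in elements) of a densely packed row-major tensor."""
    strides = []
    step = 1
    # The last dimension moves fastest, so walk the shape right to left.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_offset(index, strides):
    """Map a multi-dimensional index to an offset in the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))


def is_contiguous(shape, strides):
    """Simplified check: strides must equal the row-major strides of shape."""
    return tuple(strides) == row_major_strides(shape)


shape = (2, 3)
strides = row_major_strides(shape)            # (3, 1)
offset = flat_offset((1, 2), strides)         # 1*3 + 2*1 = 5

# A transpose can be modeled as swapping shape and strides without moving data.
t_shape, t_strides = (3, 2), (1, 3)
transposed_ok = is_contiguous(t_shape, t_strides)  # False
```

Under these assumptions, the transposed view reads the same buffer but no longer matches the row-major strides of its own shape, which is the sense in which a transpose breaks contiguity.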