CookLLM Docs
GPU Programming Basics

Learn CUDA and Triton, and write efficient GPU kernels

Overview

Before diving into advanced optimizations like Flash Attention, you need the fundamentals of GPU programming. This module takes you from scratch to an understanding of how GPUs work and how to write efficient kernels with Triton.

This module is a prerequisite for the Systems track. We recommend completing it before Flash Attention.

Chapters

GPU Architecture Basics

Understand SIMT, memory hierarchy, and hardware limits

Tensor Layout

Go deep into memory: stride, contiguous, and view mechanics
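
The stride/contiguous/view mechanics this chapter covers can be previewed with a small NumPy sketch (NumPy exposes the same concepts as PyTorch tensors, with strides measured in bytes; this is an illustration, not code from the chapter):

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)
# Strides are in bytes: stepping one row skips 4 float32s = 16 bytes,
# stepping one column skips 1 float32 = 4 bytes.
assert x.strides == (16, 4)

# A transpose is a view: same underlying buffer, swapped strides, no copy.
xt = x.T
assert xt.strides == (4, 16)
assert np.shares_memory(x, xt)

# The transposed view is no longer C-contiguous; materializing it copies.
assert not xt.flags['C_CONTIGUOUS']
xc = np.ascontiguousarray(xt)
assert xc.flags['C_CONTIGUOUS']
```

The key intuition: shape tells you what the tensor looks like, strides tell you how it is laid out in memory, and "contiguous" means the strides match a plain row-major layout.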

Triton Basics: Vector Add

Write your first Triton kernel from scratch
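
The core pattern of that first kernel, a grid of independent "programs" each processing one block of elements behind a bounds mask, can be emulated in plain NumPy. This is a conceptual sketch of the Triton execution model, not actual Triton code:

```python
import numpy as np

def vector_add(x, y, block_size=128):
    """Emulate the blocked, masked pattern of a Triton vector-add kernel."""
    n = x.shape[0]
    out = np.empty_like(x)
    # Grid size: ceiling division, one "program" per block.
    num_programs = (n + block_size - 1) // block_size
    for pid in range(num_programs):  # on a GPU these run in parallel
        # Each program computes the global offsets of its block...
        offsets = pid * block_size + np.arange(block_size)
        # ...and masks out-of-range lanes so the ragged last block is safe.
        mask = offsets < n
        idx = offsets[mask]
        out[idx] = x[idx] + y[idx]
    return out
```

In real Triton, the `for pid` loop disappears: the launch grid spawns one program instance per `pid`, and the offsets/mask logic is written with `tl.arange`, `tl.load`, and `tl.store`.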

Why Learn This?

| What You Want to Do | What You Need |
| --- | --- |
| Understand the Flash Attention implementation | Shared memory, tiling |
| Write your own attention kernel | Triton programming |
| Optimize inference speed | Memory layout, coalescing |
| Implement custom quantization kernels | CUDA/Triton fundamentals |

References

  • CUDA C++ Programming Guide
  • Triton Documentation
  • GPU Architecture Explained

