Flash Attention
An in-depth look at the principles of Flash Attention and its Triton implementation
Overview
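Flash Attention is an efficient implementation of exact attention. It uses tiling and online softmax to avoid materializing the full $N \times N$ attention matrix in GPU memory, reducing the memory cost from $O(N^2)$ to $O(N)$ in the sequence length $N$ and greatly speeding up Transformer training and inference.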
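To make the tiling and online softmax idea concrete, here is a minimal pure-Python sketch for a single query row. It is an illustration under simplifying assumptions, not the Triton kernel: the function names (`naive_attention_row`, `tiled_attention_row`) and the `block_size` parameter are hypothetical, only the key/value dimension is tiled, and nothing here models GPU memory. The point is that the running maximum, running sum, and output accumulator are enough to process keys block by block without ever storing the full row of scores.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def naive_attention_row(q, keys, values):
    """Reference: computes all scores first, then a standard stable softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention_row(q, keys, values, block_size=2):
    """Online softmax: visit keys/values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized output, relative to running_max

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this is exp(-inf) = 0.0, which is harmless
        # because running_sum and acc are still zero.
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any `block_size`, which reflects that this blockwise scheme computes exact attention rather than an approximation.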
CookLLM Docs