FlashAttention: Optimization Details for Efficient Exact Attention
FlashAttention is a groundbreaking optimization for the self-attention mechanism in Transformer models, introduced in 2022 by Tri Dao et al. It computes exact attention (no approximations) while dramatically reducing memory usage and runtime, particularly for long sequences. By leveraging GPU memory hierarchies—specifically minimizing data movement between slow High Bandwidth Memory (HBM) and fast on-chip SRAM (Shared Memory)—it addresses the quadratic O(N²) memory bottleneck of standard attention, where N is sequence length. This makes it ideal for training and inference in large language models (LLMs) like GPT, enabling longer contexts (e.g., up to 64K tokens) without quality loss. Below, I'll break down the core optimizations, algorithm, and benefits, drawing from the original paper and subsequent improvements.
Why Standard Attention is Inefficient: The IO Bottleneck
Standard (eager) attention computes:
1. Scores: S = (Q Kᵀ) / √d (N × N matrix)
2. Softmax: P = softmax(S) (N × N, row-wise)
3. Output: O = P V (N × d matrix)
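For reference, here is a minimal NumPy sketch of these three steps (illustrative, single head, no mask); FlashAttention computes exactly this output, just without ever materializing S or P:

import numpy as np

def naive_attention(Q, K, V):
    # Q, K, V: [N, d]. Materializes the full N x N score and probability matrices.
    S = Q @ K.T / np.sqrt(Q.shape[-1])               # scores, O(N^2) memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))    # numerically stable softmax...
    P = P / P.sum(axis=-1, keepdims=True)            # ...applied row-wise
    return P @ V                                     # output, [N, d]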
Problems:
- Memory: Stores S and P in HBM (O(N²) space), which dominates activation memory and forces shorter contexts or smaller batches as N grows into the thousands.
- IO Overhead: Writes the intermediates S and P to HBM and reads them back for the next matmul, so most of the runtime goes to moving N × N matrices rather than computing on them.
- Runtime: Bound by memory bandwidth rather than FLOPs; the arithmetic intensity of these steps is too low to keep the GPU's compute units busy.
FlashAttention flips this by being IO-aware: it tiles the computation so the working set stays in SRAM, cutting HBM accesses from Θ(N d + N²) for standard attention to Θ(N² d² / M), where d is the head dimension and M is the SRAM size. Since d² is typically much smaller than M, this is a large reduction.
Core Optimizations in FlashAttention (Original)
1. Tiling Strategy
- Block the Matrices: Divide Q, K, V into smaller tiles (blocks) of size B_r × d (for rows) and B_c × d (for columns), chosen to fit in SRAM (e.g., B_r = B_c ≈ 64-128 on A100 GPUs).
- Forward Pass Tiling:
- Loop over row tiles of Q, one fixed Q tile per iteration. (The original kernel puts K/V in the outer loop; FlashAttention-2 switches to the Q-outer ordering shown here and in the pseudocode below.)
- For each row tile, loop over column tiles of K/V.
- Compute local S_tile = Q_tile @ K_tileᵀ / √d in SRAM.
- Apply causal mask (lower triangular) if needed.
- Compute row-wise softmax on S_tile, updating running statistics (max and sum per row) to avoid materializing full P.
- Compute local O_tile += P_tile @ V_tile.
- No Full Materialization: Never stores the full N × N S or P in HBM—only tiles and O (linear in N).
- Backward Pass: Recomputes S and P from Q, K, V tiles (using stored O and softmax stats) to compute gradients, keeping memory O(N).
2. Online Softmax with Recomputation
- Online Softmax: Computes the softmax incrementally per tile, maintaining a per-row running max m_i and running sum l_i for numerical stability. With m_tile = rowmax(S_tile) and l_tile = rowsum(exp(S_tile − m_tile)):
\[
m_{i,\mathrm{new}} = \max(m_i, m_{\mathrm{tile}}), \qquad
l_{i,\mathrm{new}} = e^{\,m_i - m_{i,\mathrm{new}}}\, l_i + e^{\,m_{\mathrm{tile}} - m_{i,\mathrm{new}}}\, l_{\mathrm{tile}}
\]
P_row = exp(S_row − m_i) / l_i is never stored; it is reconstructed on the fly, and the partially accumulated output is rescaled by exp(m_i − m_{i,new}) whenever the running max changes.
- Recomputation: In the backward pass, S and P are recomputed tile by tile from Q, K, V and these O(N) statistics instead of reading O(N²) stored intermediates to form the gradients. This costs roughly 25% more compute but removes far more time in HBM traffic.
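A minimal sketch of the running update on a single row, streamed in chunks (names are illustrative); the final (m, l) pair is exactly what a two-pass softmax would produce:

import numpy as np

def online_softmax_stats(x, chunk=4):
    # Stream over chunks of one score row, keeping only (max, sum) as state.
    m, l = -np.inf, 0.0
    for start in range(0, len(x), chunk):
        c = x[start:start + chunk]
        m_new = max(m, c.max())
        l = l * np.exp(m - m_new) + np.exp(c - m_new).sum()
        m = m_new
    return m, l   # softmax(x)_i == exp(x_i - m) / l

x = np.random.randn(10)
m, l = online_softmax_stats(x)
assert np.allclose(np.exp(x - m) / l, np.exp(x - x.max()) / np.exp(x - x.max()).sum())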
3. Kernel Fusion and Parallelism
- Fuses the score matmul, masking, softmax, optional dropout, and the P V matmul into a single CUDA kernel, avoiding separate kernel launches and HBM round trips between them (the QKV and output projections remain ordinary GEMMs outside the kernel).
- Uses SRAM for all tile ops; HBM only for input/output.
- Supports causal masking via tiled triangular logic.
IO Complexity: Standard attention performs Θ(N d + N²) HBM accesses; FlashAttention performs Θ(N² d² / M), where M is the SRAM size. The paper also proves a matching lower bound, so no exact attention algorithm can asymptotically do better across all SRAM sizes.
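A rough back-of-envelope under illustrative assumptions (N = 4096, head dimension d = 64, and M ≈ 100 KB of SRAM, i.e. about 50K fp16 elements):

N, d = 4096, 64
M = (100 * 1024) // 2          # SRAM budget in fp16 elements (assumption, ~50K)
standard = N * d + N * N       # Theta(N d + N^2) HBM accesses
flash = (N * N * d * d) // M   # Theta(N^2 d^2 / M) HBM accesses
print(standard / flash)        # ~13x fewer HBM accesses in this setting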
Pseudocode: Forward Pass (Simplified)
import numpy as np

def flash_attention_forward(Q, K, V, Br=64, Bc=64, causal=False):
    # Q, K, V: [N, d]; Br / Bc are the row / column tile sizes (chosen to fit in SRAM).
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running row-wise sum of exp(score - max)
    for i in range(0, N, Br):                  # loop over Q row tiles
        Q_tile = Q[i:i+Br]                     # would live in SRAM
        O_tile = np.zeros_like(Q_tile)         # unnormalized output accumulator (SRAM)
        m_i, l_i = m[i:i+Br], l[i:i+Br]
        for j in range(0, N, Bc):              # loop over K/V column tiles
            if causal and j >= i + Br:         # tile lies entirely above the diagonal
                break
            K_tile, V_tile = K[j:j+Bc], V[j:j+Bc]   # would live in SRAM
            S_tile = (Q_tile @ K_tile.T) / np.sqrt(d)
            if causal:                         # mask entries where key index > query index
                q_idx = np.arange(i, i + Q_tile.shape[0])[:, None]
                k_idx = np.arange(j, j + K_tile.shape[0])[None, :]
                S_tile = np.where(k_idx > q_idx, -np.inf, S_tile)
            # Online softmax: update running max/sum and rescale the accumulator
            m_new = np.maximum(m_i, S_tile.max(axis=1))
            P_tile = np.exp(S_tile - m_new[:, None])    # unnormalized probabilities
            alpha = np.exp(m_i - m_new)                 # correction for the old max
            l_i = alpha * l_i + P_tile.sum(axis=1)
            O_tile = alpha[:, None] * O_tile + P_tile @ V_tile
            m_i = m_new
        O[i:i+Br] = O_tile / l_i[:, None]      # normalize once per row tile; write back only O
        m[i:i+Br], l[i:i+Br] = m_i, l_i
    return O, (m, l)                           # softmax stats reused by the backward pass
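A quick correctness check against a naive reference (assuming the naive_attention sketch from earlier in this section; FlashAttention is exact, so the two outputs agree up to floating-point error):

import numpy as np

np.random.seed(0)
N, d = 256, 64
Q, K, V = np.random.randn(3, N, d)
O_flash, (m, l) = flash_attention_forward(Q, K, V, Br=64, Bc=32)
O_naive = naive_attention(Q, K, V)
assert np.allclose(O_flash, O_naive)   # exact attention, not an approximation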
(The backward pass recomputes S and P tile by tile from Q, K, V and the saved (m, l) statistics to form dQ, dK, and dV; a sketch follows below.)
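To make the recomputation concrete, here is an untiled sketch that rebuilds P from the saved (m, l) statistics instead of loading a stored copy; the real kernel applies the same identities block by block, so the full P still never touches HBM. Names are illustrative, the inputs follow the forward sketch above, and for causal attention the recomputed S must be masked the same way.

import numpy as np

def attention_backward_recompute(Q, K, V, O, dO, m, l):
    # Recompute P from (m, l) rather than reading a stored N x N matrix from HBM.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - m[:, None]) / l[:, None]       # identical to the forward probabilities
    dV = P.T @ dO
    dP = dO @ V.T
    D = (dO * O).sum(axis=-1, keepdims=True)      # row-wise term sum_j P_ij * dP_ij
    dS = P * (dP - D)                             # softmax Jacobian applied row-wise
    dQ = dS @ K / np.sqrt(d)
    dK = dS.T @ Q / np.sqrt(d)
    return dQ, dK, dV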
Advancements: FlashAttention-2 (2023)
Addresses remaining overheads in the original (e.g., suboptimal parallelism):
- Better Work Partitioning: Puts the Q tiles in the outer loop (as in the pseudocode above) and repartitions work between warps within a thread block, so warps no longer exchange partial results through shared memory for the output accumulation.
- Reduced Non-Matmul FLOPs: Defers output rescaling and the division by l to the end of each row block, trimming work that cannot run on Tensor Cores.
- Sequence-Length Parallelism: Launches thread blocks across the sequence dimension (in addition to batch and heads), keeping the GPU occupied for long sequences at small batch sizes.
- Speedups: Up to 2× faster than FlashAttention-1; end-to-end 1.3× on GPT training (72% MFU on A100). On H100: 1.6-2× over baselines.
FlashAttention-3 (2024): Hopper GPU Optimizations
For NVIDIA Hopper (H100+):
- Asynchrony: Uses warp specialization and the Tensor Memory Accelerator (TMA) to overlap data movement and softmax with Tensor Core matmuls.
- Low-Precision: FP8 support (1.2 PFLOPS peak).
- Speedups: 1.5-2× over FlashAttention-2 in FP16; FP8 pushes throughput further toward the 1.2 PFLOPS figure above.
Performance Benefits
| Metric | Standard Attention | FlashAttention | Improvement |
|---|---|---|---|
| Memory | O(N²) | O(N) | 10× at N=2K, 20× at N=4K |
| Speed (BERT-large, N=512) | Baseline | 15% faster end-to-end | 15% |
| Speed (GPT-2, N=1K) | Baseline | 3× faster | 3× |
| Speed (Long-Range Arena, N=4K) | Baseline | 2.4× faster | 2.4× |
| Model Quality | GPT-2 perplexity with 1K context | 0.7 better perplexity with 4K context (still faster to train) | Longer context, same exactness |
| New Capabilities | Chance-level (~50%) on Path-256 (N=64K) | 63.1% accuracy | Enables ultra-long sequences |
- Energy Savings: ~15% less power for training.
- Adoption: Integrated in PyTorch 2.0+, Hugging Face, vLLM; used in Llama, Mistral.
Implementation and Usage
- GitHub: Dao-AILab/flash-attention (CUDA kernels).
- PyTorch: torch.nn.functional.scaled_dot_product_attention dispatches to a FlashAttention kernel automatically when the inputs qualify (fp16/bf16, supported head dimensions, compatible masks); the backend can also be pinned explicitly, as shown below.
- Limitations: NVIDIA-GPU-specific (Ampere or newer for FlashAttention-2, Hopper for FlashAttention-3); head-dimension limits apply; arbitrary sparsity and masking patterns are not all supported yet.
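A minimal usage sketch, assuming PyTorch 2.3+ on a supported NVIDIA GPU (older releases expose a similar torch.backends.cuda.sdp_kernel context manager); the explicit backend pin is optional, since SDPA picks FlashAttention automatically when the inputs qualify:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# [batch, heads, seq_len, head_dim], half precision as required by the flash kernel
q, k, v = (torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Restrict dispatch to the FlashAttention backend; raises if the inputs are unsupported.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)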
Further Reading
- Original Paper: FlashAttention (arXiv:2205.14135)
- FlashAttention-2 Blog: CRFM Stanford
- FlashAttention-3: Tri Dao's Blog
- Annotated Explainer: ELI5 on Medium
FlashAttention exemplifies MLSys innovation: algorithm + hardware co-design. For code demos or variants (e.g., Jagged FlashAttention for recsys), let me know!