FlashAttention: Optimization Details for Efficient Exact Attention
FlashAttention is a groundbreaking optimization for the self-attention mechanism in Transformer models, introduced in 2022 by Tri Dao et al. It computes exact attention (no approximations) while dramatically reducing memory usage and runtime, particularly for long sequences. By leveraging GPU memory hierarchies—specifically minimizing data movement between slow High Bandwidth Memory (HBM) and fast on-chip SRAM (Shared Memory)—it addresses the quadratic O(N²) memory bottleneck of standard attention, where N is sequence length. This makes it ideal for training and inference in large language models (LLMs) like GPT, enabling longer contexts (e.g., up to 64K tokens) without quality loss. Below, I'll break down the core optimizations, algorithm, and benefits, drawing from the original paper and subsequent improvements.
Why Standard Attention is Inefficient: The IO Bottleneck
Standard (eager) attention computes:
1. Scores: S = (Q Kᵀ) / √d (N × N matrix)
2. Softmax: P = softmax(S) (N × N, row-wise)
3. Output: O = P V (N × d matrix)
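For reference, here is a minimal NumPy sketch of these three steps (illustrative, single head, no mask); FlashAttention computes exactly this output, just without ever materializing S or P:

import numpy as np

def naive_attention(Q, K, V):
    # Q, K, V: [N, d]. Materializes the full N x N score and probability matrices.
    S = Q @ K.T / np.sqrt(Q.shape[-1])               # scores, O(N^2) memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))    # numerically stable softmax...
    P = P / P.sum(axis=-1, keepdims=True)            # ...applied row-wise
    return P @ V                                     # output, [N, d]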
Problems:
- Memory: Stores S and P in HBM (O(N²) space), which dominates activation memory and forces shorter contexts or smaller batches as N grows into the thousands.
- IO Overhead: Writes the intermediates S and P to HBM and reads them back for the next matmul, so most of the runtime goes to moving N × N matrices rather than computing on them.
- Runtime: Bound by memory bandwidth rather than FLOPs; the arithmetic intensity of these steps is too low to keep the GPU's compute units busy.
FlashAttention flips this by being IO-aware: it tiles the computation so the working set stays in SRAM, cutting HBM accesses from Θ(N d + N²) for standard attention to Θ(N² d² / M), where d is the head dimension and M is the SRAM size. Since d² is typically much smaller than M, this is a large reduction.
Core Optimizations in FlashAttention (Original)
1. Tiling Strategy
- Block the Matrices: Divide Q, K, V into smaller tiles (blocks) of size B_r × d (for rows) and B_c × d (for columns), chosen to fit in SRAM (e.g., B_r = B_c ≈ 64-128 on A100 GPUs).
- Forward Pass Tiling:
- Loop over row tiles of Q, one fixed Q tile per iteration. (The original kernel puts K/V in the outer loop; FlashAttention-2 switches to the Q-outer ordering shown here and in the pseudocode below.)
- For each row tile, loop over column tiles of K/V.
- Compute local S_tile = Q_tile @ K_tileᵀ / √d in SRAM.
- Apply causal mask (lower triangular) if needed.
- Compute row-wise softmax on S_tile, updating running statistics (max and sum per row) to avoid materializing full P.
- Compute local O_tile += P_tile @ V_tile.
- No Full Materialization: Never stores the full N × N S or P in HBM—only tiles and O (linear in N).
- Backward Pass: Recomputes S and P from Q, K, V tiles (using stored O and softmax stats) to compute gradients, keeping memory O(N).
2. Online Softmax with Recomputation
- Online Softmax: Computes the softmax incrementally per tile, maintaining a per-row running max m_i and running sum l_i for numerical stability. With m_tile = rowmax(S_tile) and l_tile = rowsum(exp(S_tile − m_tile)):
\[
m_{i,\mathrm{new}} = \max(m_i, m_{\mathrm{tile}}), \qquad
l_{i,\mathrm{new}} = e^{\,m_i - m_{i,\mathrm{new}}}\, l_i + e^{\,m_{\mathrm{tile}} - m_{i,\mathrm{new}}}\, l_{\mathrm{tile}}
\]
P_row = exp(S_row − m_i) / l_i is never stored; it is reconstructed on the fly, and the partially accumulated output is rescaled by exp(m_i − m_{i,new}) whenever the running max changes.
- Recomputation: In the backward pass, S and P are recomputed tile by tile from Q, K, V and these O(N) statistics instead of reading O(N²) stored intermediates to form the gradients. This costs roughly 25% more compute but removes far more time in HBM traffic.
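A minimal sketch of the running update on a single row, streamed in chunks (names are illustrative); the final (m, l) pair is exactly what a two-pass softmax would produce:

import numpy as np

def online_softmax_stats(x, chunk=4):
    # Stream over chunks of one score row, keeping only (max, sum) as state.
    m, l = -np.inf, 0.0
    for start in range(0, len(x), chunk):
        c = x[start:start + chunk]
        m_new = max(m, c.max())
        l = l * np.exp(m - m_new) + np.exp(c - m_new).sum()
        m = m_new
    return m, l   # softmax(x)_i == exp(x_i - m) / l

x = np.random.randn(10)
m, l = online_softmax_stats(x)
assert np.allclose(np.exp(x - m) / l, np.exp(x - x.max()) / np.exp(x - x.max()).sum())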
3. Kernel Fusion and Parallelism
- Fuses the score matmul, masking, softmax, optional dropout, and the P V matmul into a single CUDA kernel, avoiding separate kernel launches and HBM round trips between them (the QKV and output projections remain ordinary GEMMs outside the kernel).
- Uses SRAM for all tile ops; HBM only for input/output.
- Supports causal masking via tiled triangular logic.
IO Complexity: Standard attention performs Θ(N d + N²) HBM accesses; FlashAttention performs Θ(N² d² / M), where M is the SRAM size. The paper also proves a matching lower bound, so no exact attention algorithm can asymptotically do better across all SRAM sizes.
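A rough back-of-envelope under illustrative assumptions (N = 4096, head dimension d = 64, and M ≈ 100 KB of SRAM, i.e. about 50K fp16 elements):

N, d = 4096, 64
M = (100 * 1024) // 2          # SRAM budget in fp16 elements (assumption, ~50K)
standard = N * d + N * N       # Theta(N d + N^2) HBM accesses
flash = (N * N * d * d) // M   # Theta(N^2 d^2 / M) HBM accesses
print(standard / flash)        # ~13x fewer HBM accesses in this setting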
Pseudocode: Forward Pass (Simplified)
import numpy as np

def flash_attention_forward(Q, K, V, Br=64, Bc=64, causal=False):
    # Q, K, V: [N, d]; Br / Bc are the row / column tile sizes (chosen to fit in SRAM).
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running row-wise sum of exp(score - max)
    for i in range(0, N, Br):                  # loop over Q row tiles
        Q_tile = Q[i:i+Br]                     # would live in SRAM
        O_tile = np.zeros_like(Q_tile)         # unnormalized output accumulator (SRAM)
        m_i, l_i = m[i:i+Br], l[i:i+Br]
        for j in range(0, N, Bc):              # loop over K/V column tiles
            if causal and j >= i + Br:         # tile lies entirely above the diagonal
                break
            K_tile, V_tile = K[j:j+Bc], V[j:j+Bc]   # would live in SRAM
            S_tile = (Q_tile @ K_tile.T) / np.sqrt(d)
            if causal:                         # mask entries where key index > query index
                q_idx = np.arange(i, i + Q_tile.shape[0])[:, None]
                k_idx = np.arange(j, j + K_tile.shape[0])[None, :]
                S_tile = np.where(k_idx > q_idx, -np.inf, S_tile)
            # Online softmax: update running max/sum and rescale the accumulator
            m_new = np.maximum(m_i, S_tile.max(axis=1))
            P_tile = np.exp(S_tile - m_new[:, None])    # unnormalized probabilities
            alpha = np.exp(m_i - m_new)                 # correction for the old max
            l_i = alpha * l_i + P_tile.sum(axis=1)
            O_tile = alpha[:, None] * O_tile + P_tile @ V_tile
            m_i = m_new
        O[i:i+Br] = O_tile / l_i[:, None]      # normalize once per row tile; write back only O
        m[i:i+Br], l[i:i+Br] = m_i, l_i
    return O, (m, l)                           # softmax stats reused by the backward pass
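A quick correctness check against a naive reference (assuming the naive_attention sketch from earlier in this section; FlashAttention is exact, so the two outputs agree up to floating-point error):

import numpy as np

np.random.seed(0)
N, d = 256, 64
Q, K, V = np.random.randn(3, N, d)
O_flash, (m, l) = flash_attention_forward(Q, K, V, Br=64, Bc=32)
O_naive = naive_attention(Q, K, V)
assert np.allclose(O_flash, O_naive)   # exact attention, not an approximation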
(The backward pass recomputes S and P tile by tile from Q, K, V and the saved (m, l) statistics to form dQ, dK, and dV; a sketch follows below.)
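To make the recomputation concrete, here is an untiled sketch that rebuilds P from the saved (m, l) statistics instead of loading a stored copy; the real kernel applies the same identities block by block, so the full P still never touches HBM. Names are illustrative, the inputs follow the forward sketch above, and for causal attention the recomputed S must be masked the same way.

import numpy as np

def attention_backward_recompute(Q, K, V, O, dO, m, l):
    # Recompute P from (m, l) rather than reading a stored N x N matrix from HBM.
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - m[:, None]) / l[:, None]       # identical to the forward probabilities
    dV = P.T @ dO
    dP = dO @ V.T
    D = (dO * O).sum(axis=-1, keepdims=True)      # row-wise term sum_j P_ij * dP_ij
    dS = P * (dP - D)                             # softmax Jacobian applied row-wise
    dQ = dS @ K / np.sqrt(d)
    dK = dS.T @ Q / np.sqrt(d)
    return dQ, dK, dV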
Advancements: FlashAttention-2 (2023)
Addresses remaining overheads in the original (e.g., suboptimal parallelism):
- Better Work Partitioning: Puts the Q tiles in the outer loop (as in the pseudocode above) and repartitions work between warps within a thread block, so warps no longer exchange partial results through shared memory for the output accumulation.
- Reduced Non-Matmul FLOPs: Defers output rescaling and the division by l to the end of each row block, trimming work that cannot run on Tensor Cores.
- Sequence-Length Parallelism: Launches thread blocks across the sequence dimension (in addition to batch and heads), keeping the GPU occupied for long sequences at small batch sizes.
- Speedups: Up to 2× faster than FlashAttention-1; end-to-end 1.3× on GPT training (72% MFU on A100). On H100: 1.6-2× over baselines.
FlashAttention-3 (2024): Hopper GPU Optimizations
For NVIDIA Hopper (H100+):
- Asynchrony: Uses warp specialization and the Tensor Memory Accelerator (TMA) to overlap data movement and softmax with Tensor Core matmuls.
- Low-Precision: FP8 support (1.2 PFLOPS peak).
- Speedups: 1.5-2× over FlashAttention-2 in FP16; FP8 pushes throughput further toward the 1.2 PFLOPS figure above.
Performance Benefits
| Metric | Standard Attention | FlashAttention | Improvement |
|---|---|---|---|
| Memory | O(N²) | O(N) | 10× at N=2K, 20× at N=4K |
| Speed (BERT-large, N=512) | Baseline | 15% faster end-to-end | 15% |
| Speed (GPT-2, N=1K) | Baseline | 3× faster | 3× |
| Speed (Long-Range Arena, N=4K) | Baseline | 2.4× faster | 2.4× |
| Model Quality | GPT-2 perplexity with 1K context | 0.7 better perplexity with 4K context (still faster to train) | Longer context, same exactness |
| New Capabilities | Chance-level (~50%) on Path-256 (N=64K) | 63.1% accuracy | Enables ultra-long sequences |
- Energy Savings: ~15% less power for training.
- Adoption: Integrated in PyTorch 2.0+, Hugging Face, vLLM; used in Llama, Mistral.
Implementation and Usage
- GitHub: Dao-AILab/flash-attention (CUDA kernels).
- PyTorch: torch.nn.functional.scaled_dot_product_attention dispatches to a FlashAttention kernel automatically when the inputs qualify (fp16/bf16, supported head dimensions, compatible masks); the backend can also be pinned explicitly, as shown below.
- Limitations: NVIDIA-GPU-specific (Ampere or newer for FlashAttention-2, Hopper for FlashAttention-3); head-dimension limits apply; arbitrary sparsity and masking patterns are not all supported yet.
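A minimal usage sketch, assuming PyTorch 2.3+ on a supported NVIDIA GPU (older releases expose a similar torch.backends.cuda.sdp_kernel context manager); the explicit backend pin is optional, since SDPA picks FlashAttention automatically when the inputs qualify:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# [batch, heads, seq_len, head_dim], half precision as required by the flash kernel
q, k, v = (torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Restrict dispatch to the FlashAttention backend; raises if the inputs are unsupported.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)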
Further Reading
- Original Paper: FlashAttention (arXiv:2205.14135)
- FlashAttention-2 Blog: CRFM Stanford
- FlashAttention-3: Tri Dao's Blog
- Annotated Explainer: ELI5 on Medium
FlashAttention exemplifies MLSys innovation: algorithm + hardware co-design. For code demos or variants (e.g., Jagged FlashAttention for recsys), let me know!