Swin Transformer Window Attention – Deep, Intuitive & Mathematical Explanation

Why it exists, how it works, and why it destroyed the quadratic bottleneck of ViT

The Core Problem Swin Solves

Model        | Self-Attention Complexity | Can handle 1024×1024? | Memory (224×224) | Memory (512×512)
-------------|---------------------------|-----------------------|------------------|------------------
Original ViT | O((HW)²) = O(N²)          | No, explodes          | ~1 GB            | ~20+ GB (dead)
Swin         | O(HW) ≈ linear            | Yes, easily           | ~200 MB          | ~800 MB

ViT computes attention between all pairs of patches. At 224×224 with 16×16 patches that is only (224/16)² = 196 tokens, but scaling the same recipe to high resolution (e.g., a 1920×1920 input) gives ~14,400 tokens → over 200 million attention scores per layer → dead on high-res images.

Swin’s genius idea:
“Don’t do global attention. Do attention only inside small local windows.”
→ Complexity drops from O(N²) to O(N)

How Swin Window Attention Works – Step by Step

Step 1: Divide Image into Non-Overlapping Windows

  • Default window size M = 7 → each window is 7×7 = 49 patches
  • Example: 224×224 image, patch_size=4 → feature map 56×56
  • → 8×8 = 64 windows of size 7×7 each
Image → Patches → H×W feature map
      ↓
Divide into M×M windows (non-overlapping)
      ↓
Each window does self-attention independently
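
Below is a minimal sketch of the partition step. The shapes follow the 56×56, M = 7 example above; the helper mirrors the window_partition function referenced in the code snippet later on, but the exact signature here is illustrative.

import torch

def window_partition(x, window_size):
    # x: (B, H, W, C) feature map; H and W are assumed divisible by window_size
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    # regroup so each window becomes its own leading batch entry
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

x = torch.randn(1, 56, 56, 96)     # 56×56 feature map, C = 96 (Swin-T stage 1)
windows = window_partition(x, 7)
print(windows.shape)               # torch.Size([64, 7, 7, 96]) → 64 windows of 7×7 tokens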

Step 2: Regular Window Attention (Like Mini-ViT per Window)

Inside each 7×7 window:
- 49 patches → 49 tokens
- Compute Q, K, V → attention scores (49×49 matrix)
- Apply relative position bias (very important!)
- Output same 49 tokens

Total attention cost per layer:
64 windows × 49² = 64 × 2,401 = 153,664 attention scores
vs global attention over the same 56×56 map: (56×56)² = 3,136² ≈ 9.8 million scores
→ ~64× cheaper!
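
A rough sketch of what “attention only inside each window” means in code: the windows are simply treated as a batch of 49-token sequences, so no score matrix is ever larger than 49×49. The projection below is untrained and single-headed, purely for illustration.

import torch

# 64 windows of 7×7 = 49 tokens each, C = 96 channels (e.g., the partitioned windows from Step 1, flattened)
tokens = torch.randn(64, 49, 96)

qkv = torch.nn.Linear(96, 3 * 96)              # illustrative, untrained QKV projection
q, k, v = qkv(tokens).chunk(3, dim=-1)         # each (64, 49, 96)

scores = (q @ k.transpose(-2, -1)) / 96**0.5   # (64, 49, 49): one small score matrix per window
out = scores.softmax(dim=-1) @ v               # (64, 49, 96): windows never attend to each other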

Step 3: The Magic – Shifted Windows in Next Block

Problem: Regular windows have no communication between windows → no global context!

Swin’s breakthrough: In every second block, shift the window grid by (⌊M/2⌋, ⌊M/2⌋) tokens (3 for M = 7)
→ Now the shifted windows straddle the old window boundaries → information flows between windows!

Layer 1: Regular windows
┌─────┬─────┬─────┐
│  A  │  B  │  C  │
├─────┼─────┼─────┤
│  D  │  E  │  F  │
└─────┴─────┴─────┘

Layer 2: Shifted windows (grid offset by ⌊M/2⌋ = 3 tokens)
┌───┬───────┬───┐
│ A │ A · B │ B │ …
├───┼───────┼───┤
│A·D│A·B·D·E│B·E│ …
└───┴───────┴───┘
Each shifted window covers parts of up to 4 original windows.

Now a patch that was in window A can attend to patches from windows B, D, and E through the shifted block, and after a few alternating blocks information spreads across the whole feature map.

Step 4: Cyclic Shift Trick (Efficient Implementation)

Instead of actually re-partitioning into extra, smaller windows at the borders (expensive), Swin cyclically rolls the feature map:

# Before attention in a shifted block (x has shape B, H, W, C)
x_shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# ... window attention (with the mask from Step 5) runs on x_shifted ...

# After attention, roll back
x = torch.roll(x_shifted, shifts=(shift_size, shift_size), dims=(1, 2))

→ Zero overhead, perfect shift!
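
A toy example of what the cyclic shift does — values wrap around the border instead of being padded:

import torch

grid = torch.arange(16).view(1, 4, 4)                      # tiny 4×4 "feature map"
shifted = torch.roll(grid, shifts=(-2, -2), dims=(1, 2))   # shift by 2 in both directions
print(shifted[0])
# tensor([[10, 11,  8,  9],
#         [14, 15, 12, 13],
#         [ 2,  3,  0,  1],
#         [ 6,  7,  4,  5]])   ← the bottom-right content has wrapped around to the top-left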

Step 5: Masking in Shifted Windows

After shifting, a single window can contain patches that came from up to 4 different original windows (because of the wrap-around)
→ Without a mask, tokens that were never spatial neighbours would attend to each other.

Solution: Create an attention mask
- Pairs of patches from different original regions → mask value = −100 (added to the logits)
- Pairs from the same region → 0

→ After softmax → zero attention across original window boundaries
→ Preserves locality!
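
Here is a sketch of how such a mask can be built, following the region-labelling trick used by the official implementation (H, W and shift_size match the running example; window_partition is the helper sketched in Step 1):

import torch

H = W = 56
window_size, shift_size = 7, 3

# Label every position with the id of the region it belonged to before the cyclic shift
img_mask = torch.zeros(1, H, W, 1)
regions = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
cnt = 0
for h in regions:
    for w in regions:
        img_mask[:, h, w, :] = cnt
        cnt += 1

# Partition the label map exactly like the features, then compare labels pairwise
mask_windows = window_partition(img_mask, window_size).view(-1, window_size * window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)               # (64, 49, 49)
attn_mask = attn_mask.masked_fill(attn_mask != 0, -100.0).masked_fill(attn_mask == 0, 0.0)
# attn_mask is added to the attention logits: −100 kills cross-region attention after softmax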

Mathematical Complexity Proof

Method              | Attention Complexity per Layer | Scaling with image size
--------------------|--------------------------------|------------------------
Global (ViT)        | O((HW)²)                       | O(N²)
Swin (window M = 7) | O(HW × M²) = O(HW × 49)        | ~O(N)
Swin (with shift)   | still O(HW × M²)               | still linear!

Since M is fixed (7 or 12), complexity is linear in image size → scales to 4K images!
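
For reference, the Swin paper gives the per-layer cost as (h×w tokens, channel dimension C, window size M):

Ω(MSA)   = 4·h·w·C² + 2·(h·w)²·C      ← global attention: quadratic in h·w
Ω(W-MSA) = 4·h·w·C² + 2·M²·h·w·C      ← window attention: linear in h·w for fixed M

The first term is the QKV and output projections; the second is the attention computation itself.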

Relative Position Bias (The Secret Sauce)

Swin doesn’t add absolute positional embeddings to the patch tokens.

Instead: Learn a small bias table B of size (2M−1)×(2M−1) × num_heads
Example: M = 7 → 13×13 = 169 biases per head

For any relative offset (Δy, Δx) between two tokens in a window, add B[Δy+M−1, Δx+M−1] to the attention logit
→ Depends only on relative position (translation-invariant within a window) + very few parameters!

This is why Swin generalizes so well across resolutions.
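
A minimal sketch of the bias lookup (M and num_heads are illustrative; the index bookkeeping just maps each (Δy, Δx) pair into the flattened (2M−1)×(2M−1) table described above):

import torch

M, num_heads = 7, 3
bias_table = torch.zeros((2 * M - 1) ** 2, num_heads)       # 169 biases per head (learned in the real model; zeros here)

# Relative offsets (Δy, Δx) between every pair of the M*M tokens in a window
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
coords = coords.flatten(1)                                   # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]                # (2, M*M, M*M), values in [−(M−1), M−1]
rel = rel.permute(1, 2, 0) + (M - 1)                         # shift to [0, 2M−2]
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]            # (M*M, M*M) flat index into the table

bias = bias_table[index.view(-1)].view(M * M, M * M, num_heads).permute(2, 0, 1)
# bias: (num_heads, 49, 49), added to the attention logits: softmax(QKᵀ/√d + B) V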

Visual Summary – How Information Flows

Layer 1 (Regular Windows)     → Local only
Layer 2 (Shifted Windows)     → Connects adjacent windows
Layer 3 (Regular)             → Local again
Layer 4 (Shifted)             → Connects further
...
After a few alternating regular/shifted blocks (plus downsampling between stages) → global receptive field!

Just like CNNs build hierarchy, but with attention!

Comparison Table (Memorize This!)

Feature                     | ViT (Global)            | Swin (Window + Shifted)
----------------------------|-------------------------|----------------------------------
Attention scope             | Global                  | Local → global via hierarchy
Complexity                  | Quadratic O(N²)         | Linear O(N)
Max practical resolution    | ~384–512 px             | 2048 px and beyond
Translation invariance      | Learned                 | Built in (relative bias + shift)
Inductive bias              | None                    | Locality + hierarchy
Best for                    | Large-scale pretraining | Detection, segmentation, video
ImageNet-1K top-1           | 88.5% (ViT-L)           | 87.3% (Swin-L), much faster

Code Snippet – The Heart (Just 10 lines!)

# In a (possibly shifted) Swin block; x has shape (B, H, W, C)
if self.shift_size > 0:
    x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))

# Partition into windows → attention inside each window → reverse
x_windows = window_partition(x, self.window_size)                        # → (num_windows*B, M, M, C)
x_windows = x_windows.view(-1, self.window_size * self.window_size, C)   # → (num_windows*B, M*M, C)
attn_windows = self.attn(x_windows, mask=attn_mask)                      # ← attention only inside each window
# ... merge windows back into the (B, H, W, C) feature map ...

# Unshift
if self.shift_size > 0:
    x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))

This 10-line trick made transformers practical for vision.

Why Swin Won Everything After 2021

  • 2021: Matched or beat strong CNNs and ViT on ImageNet-1K, and set new COCO detection and ADE20K segmentation records as the backbone of Mask R-CNN / Cascade R-CNN
  • 2022: Swin-V2 scaled the recipe to 3B parameters and higher resolutions with ImageNet-22K pretraining
  • Since then: the window + hierarchy idea lives on in successors such as Hiera (the image encoder of SAM-2) and DaViT (the vision backbone of Florence-2)

Final Summary – Why Window Attention is Genius

Problem                     | ViT Solution        | Swin Solution
----------------------------|---------------------|-----------------------------------------------
Quadratic complexity        | Accept it           | Fixed windows → linear
No locality bias            | Add pos. embeddings | Windows + relative bias → strong locality prior
Poor at high resolution     | Downsample early    | Hierarchical stages
Slow cross-window info flow | None                | Shifted windows → fast flow

Swin Transformer proved that you can have the best of both worlds:
Transformer flexibility + CNN efficiency and inductive bias.

This is why, even in 2025, Swin and its descendants (Swin-V2, Swin-MoE, FocalNet, and other hierarchical window-attention backbones) remain among the most widely used vision backbones.

You now fully understand why Swin’s window attention is one of the most important ideas in deep learning since ReLU.

Last updated: Nov 30, 2025
