Swin Transformer Window Attention – Deep, Intuitive & Mathematical Explanation
Why it exists, how it works, and why it destroyed the quadratic bottleneck of ViT
The Core Problem Swin Solves
| Model | Self-Attention Complexity | Can handle 1024×1024 image? | Memory (224×224) | Memory (512×512) |
|---|---|---|---|---|
| Original ViT | O((HW)²) = O(N²) | No, explodes | ~1 GB | ~20+ GB (dead) |
| Swin | O(HW) ≈ linear | Yes, easily | ~200 MB | ~800 MB |
ViT computes attention between all pairs of patches. At 224×224 with patch size 16 that is only (224/16)² = 196 patches, but at 1920×1920 it becomes 120² = 14,400 patches → over 200 million attention scores per head → dead on high-res images.
Swin’s genius idea:
“Don’t do global attention. Do attention only inside small local windows.”
→ Complexity drops from O(N²) to O(N)
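To make the scaling concrete, here is a quick back-of-the-envelope script (an illustration, not from the paper: it assumes patch size 16 for both models purely to compare how the attention-matrix size grows, whereas real Swin uses patch size 4 plus downsampling stages):

```python
# Rough count of attention-score entries per layer as resolution grows.
# Assumption (illustration only): patch size 16 for both models; window M = 7.
for side in (224, 512, 1024):
    n = (side // 16) ** 2          # number of tokens
    global_scores = n * n          # global attention: every token attends to every token
    window_scores = n * 7 * 7      # window attention: every token attends to at most 49 tokens
    print(f"{side}x{side}: {n:5d} tokens | global {global_scores:>13,} | windowed {window_scores:>10,}")
```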
How Swin Window Attention Works – Step by Step
Step 1: Divide Image into Non-Overlapping Windows
- Default window size M = 7 → each window is 7×7 = 49 patches
- Example: 224×224 image, patch_size=4 → feature map 56×56
- → 8×8 = 64 windows of size 7×7 each
Image → Patches → H×W feature map
↓
Divide into M×M windows (non-overlapping)
↓
Each window does self-attention independently
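Here is a minimal sketch of the partition/merge helpers, modelled on the reference implementation; it assumes the feature map is laid out as (B, H, W, C) and that H and W are divisible by the window size:

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.
    Returns (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: stitch windows back into a (B, H, W, C) map."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)

# Example: 56×56 feature map with C = 96, window M = 7 → 64 windows of 7×7 tokens
x = torch.randn(1, 56, 56, 96)
windows = window_partition(x, 7)                      # shape: (64, 7, 7, 96)
assert torch.equal(window_reverse(windows, 7, 56, 56), x)
```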
Step 2: Regular Window Attention (Like Mini-ViT per Window)
Inside each 7×7 window:
- 49 patches → 49 tokens
- Compute Q, K, V → attention scores (49×49 matrix)
- Apply relative position bias (very important!)
- Output same 49 tokens
Total attention-matrix entries per layer:
64 windows × 49² = 64 × 2,401 = 153,664
vs global attention over the same 56×56 map: (56×56)² ≈ 9.8 million
→ ~64× cheaper!
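A stripped-down window attention module might look like the sketch below (a hypothetical simplification: it omits the relative position bias and the shift mask that the real Swin layer adds, both covered in the following steps):

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Multi-head self-attention computed independently inside each window.
    Stripped-down sketch: no relative position bias, no shift mask."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (num_windows * B, M*M, C)
        Bn, N, C = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (Bn, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (Bn, heads, 49, 49) for M = 7
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        return self.proj(out)

# 64 windows of 49 tokens each, C = 96: the attention matrices are only 49×49
layer = WindowSelfAttention(dim=96, num_heads=3)
y = layer(torch.randn(64, 49, 96))             # → (64, 49, 96)
```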
Step 3: The Magic – Shifted Windows in Next Block
Problem: Regular windows have no communication between windows → no global context!
Swin’s breakthrough: In every second block, shift the window grid by (⌊M/2⌋, ⌊M/2⌋) patches (3 for M = 7)
→ The shifted windows straddle the previous block’s window boundaries → information flows between neighbouring windows!
Layer 1: Regular windows
┌─────┬─────┬─────┐
│  A  │  B  │  C  │
├─────┼─────┼─────┤
│  D  │  E  │  F  │
└─────┴─────┴─────┘
Layer 2: Shifted windows (window grid displaced by ⌊M/2⌋ = 3 patches)
┌───┬───────┬───────┬───┐
│ A │  A B  │  B C  │ C │
├───┼───────┼───────┼───┤
│ A │  A B  │  B C  │ C │
│ D │  D E  │  E F  │ F │
├───┼───────┼───────┼───┤
│ D │  D E  │  E F  │ F │
└───┴───────┴───────┴───┘
Each cell of the shifted grid is one new window; the interior ones contain patches from up to 4 of the original windows A–F.
So a patch that started in window A can now attend to patches that came from B, D, or E → stacking a regular block and a shifted block lets information flow across the old window boundaries!
Step 4: Cyclic Shift Trick (Efficient Implementation)
Instead of re-partitioning the shifted map into more (and smaller) windows at the borders (expensive), Swin cyclically rolls the feature map:
# Before attention in a shifted block: roll the (B, H, W, C) map up and to the left
x_shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
# ... window attention (with the mask from Step 5) runs on x_shifted ...
# After attention: roll the result back to undo the shift
x = torch.roll(x_shifted, shifts=(shift_size, shift_size), dims=(1, 2))
→ Same number of windows as a regular block, no padding; just a cheap tensor roll before and after attention.
Step 5: Masking in Shifted Windows
After the cyclic shift, a single window can contain patches from up to 4 regions that were not adjacent in the original feature map (some wrapped around from the opposite edge)
→ If we don’t mask, they would attend to each other as if they were spatial neighbours.
Solution: Create an attention mask per window
- Pairs of patches from different original regions → mask value = −100 added to the logit
- Pairs from the same region → 0
→ After softmax → effectively zero attention across the original region boundaries
→ Preserves locality!
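Here is a sketch of how that mask can be built, following the logic of the reference implementation and reusing the window_partition helper from Step 1: label every patch with the ID of the region it belonged to before the cyclic shift, then penalize every pair of patches whose IDs differ within the same window:

```python
import torch

def build_shift_attention_mask(H, W, window_size, shift_size):
    """Returns a (num_windows, M*M, M*M) mask: 0 for same-region pairs, -100 otherwise."""
    img_mask = torch.zeros(1, H, W, 1)   # one region ID per spatial position
    region = 0
    for h in (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None)):
        for w in (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None)):
            img_mask[:, h, w, :] = region
            region += 1
    mask_windows = window_partition(img_mask, window_size)           # (num_windows, M, M, 1)
    mask_windows = mask_windows.view(-1, window_size * window_size)  # (num_windows, M*M)
    diff = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)     # nonzero ⇔ different regions
    return diff.masked_fill(diff != 0, -100.0).masked_fill(diff == 0, 0.0)

# 56×56 map, M = 7, shift = 3: the mask is added to each window's attention logits
attn_mask = build_shift_attention_mask(56, 56, 7, 3)                 # (64, 49, 49)
```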
Mathematical Complexity Proof
| Method | Attention Complexity per Layer | Scaling with image size |
|---|---|---|
| Global (ViT) | O((HW)²) | Quadratic in N = HW |
| Swin window attention (M = 7) | O(HW × M²) = O(HW × 49) | Linear in N |
| Swin shifted-window attention | Still O(HW × M²) | Linear in N |
Since M is fixed (7 by default, 12 at higher input resolutions), the cost grows linearly with the number of tokens → scales gracefully to very large images.
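For reference, these are the per-layer FLOP counts given in the Swin paper for an h×w token map with channel dimension C and window size M (the first term is the shared QKV/output projections, the second the attention itself):

```latex
\Omega(\mathrm{MSA})          = 4\,hwC^{2} + 2\,(hw)^{2}C   % global attention: quadratic in hw
\Omega(\mathrm{W\text{-}MSA}) = 4\,hwC^{2} + 2\,M^{2}hwC    % window attention: linear in hw, since M is fixed
```

Shifting the windows changes which tokens share a window, not how many, so SW-MSA has the same cost as W-MSA.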
Relative Position Bias (The Secret Sauce)
Swin doesn’t use absolute or learned positional embeddings per patch.
Instead: Learn a small bias table B of size (2M−1)×(2M−1) × num_heads
Example: M=7 → 13×13 = 169 biases per head
For any relative position (Δx, Δy), add B[Δx, Δy] to attention logit
→ Translation invariant + very few parameters!
This is why Swin generalizes so well across resolutions.
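A minimal sketch of how that bias table is indexed, following the approach of the reference implementation (the zero initialization of the table is a simplification; the real model uses a truncated-normal init):

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learnable (2M-1)^2 × num_heads table, indexed by the relative offset of each token pair."""
    def __init__(self, window_size, num_heads):
        super().__init__()
        M = window_size
        self.table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))   # 169 × heads for M = 7

        # For every pair of positions in an M×M window, precompute the table index
        # corresponding to their relative offset (Δy, Δx) ∈ [-(M-1), M-1]².
        coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
        coords = coords.flatten(1)                                 # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]              # (2, M*M, M*M)
        rel = rel.permute(1, 2, 0) + (M - 1)                       # shift offsets to [0, 2M-2]
        index = rel[..., 0] * (2 * M - 1) + rel[..., 1]            # flatten (Δy, Δx) to a single index
        self.register_buffer("index", index)                       # (M*M, M*M)

    def forward(self):
        n = self.index.shape[0]
        bias = self.table[self.index.reshape(-1)].reshape(n, n, -1)   # (M*M, M*M, heads)
        return bias.permute(2, 0, 1).contiguous()                     # (heads, M*M, M*M), added to logits

# M = 7, 3 heads → a (3, 49, 49) bias that is simply added to the attention scores
bias = RelativePositionBias(window_size=7, num_heads=3)()
```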
Visual Summary – How Information Flows
Layer 1 (Regular Windows) → Local only
Layer 2 (Shifted Windows) → Connects adjacent windows
Layer 3 (Regular) → Local again
Layer 4 (Shifted) → Connects further
...
After a few alternating block pairs (plus patch-merging downsampling between stages) → effectively global receptive field!
Just like CNNs build hierarchy, but with attention!
Comparison Table (Memorize This!)
| Feature | ViT (Global) | Swin (Window + Shifted) |
|---|---|---|
| Attention Scope | Global | Local → Global via hierarchy |
| Complexity | Quadratic O(N²) | Linear O(N) |
| Max Resolution (practical) | 384–512 px | 1536 px+ (Swin-V2) |
| Translation Invariance | Learned | Built-in (relative bias + shift) |
| Inductive Bias | None | Locality + hierarchy |
| Best For | Large data | Detection, segmentation, video |
| ImageNet-1K Top-1 | ~88.5% (ViT-H, JFT-300M pre-training) | 87.3% (Swin-L, ImageNet-22K pre-training), at far lower cost |
Code Snippet – The Heart (Just 10 lines!)
# In a Swin block: x has shape (B, H, W, C); shift_size is 0 in regular blocks and M//2 in shifted ones
if self.shift_size > 0:
    x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))

# Partition into windows → attention inside each window → merge back
x_windows = window_partition(x, self.window_size)                       # (num_windows*B, M, M, C)
x_windows = x_windows.view(-1, self.window_size * self.window_size, C)  # (num_windows*B, M*M, C)
attn_windows = self.attn(x_windows, mask=attn_mask)                     # attention never crosses window edges
# ... window_reverse: stitch the windows back into the (B, H, W, C) map ...

# Unshift
if self.shift_size > 0:
    x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
This 10-line trick made transformers practical for vision.
Why Swin Won Everything After 2021
- 2021: Matched or beat strong CNNs and ViT on ImageNet, COCO, and ADE20K (ICCV 2021 best paper)
- 2021–2022: Backbone for Mask R-CNN, Cascade R-CNN, and HTC++ → state of the art on COCO detection and segmentation
- 2022: Swin-V2 scales to 3B parameters and 1536×1536 inputs → new records on several benchmarks
- Since then: still a standard backbone in detection and segmentation toolboxes (e.g., MMDetection), and its window-attention idea runs through many newer hierarchical vision backbones
Final Summary – Why Window Attention is Genius
| Problem | ViT Solution | Swin Solution |
|---|---|---|
| Quadratic complexity | Accept it | Fixed windows → linear |
| No locality bias | Add positional embeddings | Windows + relative bias → strong locality prior |
| Poor at high resolution | Downsample early | Hierarchical stages |
| Cross-window information flow | Global attention (expensive) | Shifted windows → cheap, fast flow |
Swin Transformer proved that you can have the best of both worlds:
Transformer flexibility + CNN efficiency and inductive bias.
This is why, in 2025, Swin (and its descendants: Swin-V2, Swin-MoE, and related designs such as FocalNet) remains one of the most widely used vision backbones.
You now fully understand why Swin’s window attention is one of the most important ideas in deep learning since ReLU.