Swin Transformer Window Attention – Deep, Intuitive & Mathematical Explanation
Why it exists, how it works, and why it destroyed the quadratic bottleneck of ViT
The Core Problem Swin Solves
| Model | Self-Attention Complexity | Can handle 1024×1024 image? | Memory (224×224) | Memory (512×512) |
|---|---|---|---|---|
| Original ViT | O((HW)²) = O(N²) | No, explodes | ~1 GB | ~20+ GB (dead) |
| Swin | O(HW) ≈ linear | Yes, easily | ~200 MB | ~800 MB |
ViT computes attention between all pairs of patches. At 224×224 with patch size 16 that is only (224/16)² = 196 patches, but at 1920×1920 it becomes 120² = 14,400 patches → over 200 million attention scores per head → dead on high-res images.
Swin’s genius idea:
“Don’t do global attention. Do attention only inside small local windows.”
→ Complexity drops from O(N²) to O(N)
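To make the scaling concrete, here is a quick back-of-the-envelope script (an illustration, not from the paper: it assumes patch size 16 for both models purely to compare how the attention-matrix size grows, whereas real Swin uses patch size 4 plus downsampling stages):

```python
# Rough count of attention-score entries per layer as resolution grows.
# Assumption (illustration only): patch size 16 for both models; window M = 7.
for side in (224, 512, 1024):
    n = (side // 16) ** 2          # number of tokens
    global_scores = n * n          # global attention: every token attends to every token
    window_scores = n * 7 * 7      # window attention: every token attends to at most 49 tokens
    print(f"{side}x{side}: {n:5d} tokens | global {global_scores:>13,} | windowed {window_scores:>10,}")
```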
How Swin Window Attention Works – Step by Step
Step 1: Divide Image into Non-Overlapping Windows
- Default window size M = 7 → each window is 7×7 = 49 patches
- Example: 224×224 image, patch_size=4 → feature map 56×56
- → 8×8 = 64 windows of size 7×7 each
Image → Patches → H×W feature map
↓
Divide into M×M windows (non-overlapping)
↓
Each window does self-attention independently
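Here is a minimal sketch of the partition/merge helpers, modelled on the reference implementation; it assumes the feature map is laid out as (B, H, W, C) and that H and W are divisible by the window size:

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.
    Returns (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: stitch windows back into a (B, H, W, C) map."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)

# Example: 56×56 feature map with C = 96, window M = 7 → 64 windows of 7×7 tokens
x = torch.randn(1, 56, 56, 96)
windows = window_partition(x, 7)                      # shape: (64, 7, 7, 96)
assert torch.equal(window_reverse(windows, 7, 56, 56), x)
```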
Step 2: Regular Window Attention (Like Mini-ViT per Window)
Inside each 7×7 window:
- 49 patches → 49 tokens
- Compute Q, K, V → attention scores (49×49 matrix)
- Apply relative position bias (very important!)
- Output same 49 tokens
Total attention-matrix entries per layer:
64 windows × 49² = 64 × 2,401 = 153,664
vs global attention over the same 56×56 map: (56×56)² ≈ 9.8 million
→ ~64× cheaper!
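A stripped-down window attention module might look like the sketch below (a hypothetical simplification: it omits the relative position bias and the shift mask that the real Swin layer adds, both covered in the following steps):

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Multi-head self-attention computed independently inside each window.
    Stripped-down sketch: no relative position bias, no shift mask."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (num_windows * B, M*M, C)
        Bn, N, C = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (Bn, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (Bn, heads, 49, 49) for M = 7
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        return self.proj(out)

# 64 windows of 49 tokens each, C = 96: the attention matrices are only 49×49
layer = WindowSelfAttention(dim=96, num_heads=3)
y = layer(torch.randn(64, 49, 96))             # → (64, 49, 96)
```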
Step 3: The Magic – Shifted Windows in Next Block
Problem: Regular windows have no communication between windows → no global context!
Swin’s breakthrough: In every second block, shift the window grid by (⌊M/2⌋, ⌊M/2⌋) patches (3 for M = 7)
→ The shifted windows straddle the previous block’s window boundaries → information flows between neighbouring windows!
Layer 1: Regular windows
┌─────┬─────┬─────┐
│  A  │  B  │  C  │
├─────┼─────┼─────┤
│  D  │  E  │  F  │
└─────┴─────┴─────┘
Layer 2: Shifted windows (window grid displaced by ⌊M/2⌋ = 3 patches)
┌───┬───────┬───────┬───┐
│ A │  A B  │  B C  │ C │
├───┼───────┼───────┼───┤
│ A │  A B  │  B C  │ C │
│ D │  D E  │  E F  │ F │
├───┼───────┼───────┼───┤
│ D │  D E  │  E F  │ F │
└───┴───────┴───────┴───┘
Each cell of the shifted grid is one new window; the interior ones contain patches from up to 4 of the original windows A–F.
So a patch that started in window A can now attend to patches that came from B, D, or E → stacking a regular block and a shifted block lets information flow across the old window boundaries!
Step 4: Cyclic Shift Trick (Efficient Implementation)
Instead of re-partitioning the shifted map into more (and smaller) windows at the borders (expensive), Swin cyclically rolls the feature map:
# Before attention in a shifted block: roll the (B, H, W, C) map up and to the left
x_shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
# ... window attention (with the mask from Step 5) runs on x_shifted ...
# After attention: roll the result back to undo the shift
x = torch.roll(x_shifted, shifts=(shift_size, shift_size), dims=(1, 2))
→ Same number of windows as a regular block, no padding; just a cheap tensor roll before and after attention.
Step 5: Masking in Shifted Windows
After the cyclic shift, a single window can contain patches from up to 4 regions that were not adjacent in the original feature map (some wrapped around from the opposite edge)
→ If we don’t mask, they would attend to each other as if they were spatial neighbours.
Solution: Create an attention mask per window
- Pairs of patches from different original regions → mask value = −100 added to the logit
- Pairs from the same region → 0
→ After softmax → effectively zero attention across the original region boundaries
→ Preserves locality!
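Here is a sketch of how that mask can be built, following the logic of the reference implementation and reusing the window_partition helper from Step 1: label every patch with the ID of the region it belonged to before the cyclic shift, then penalize every pair of patches whose IDs differ within the same window:

```python
import torch

def build_shift_attention_mask(H, W, window_size, shift_size):
    """Returns a (num_windows, M*M, M*M) mask: 0 for same-region pairs, -100 otherwise."""
    img_mask = torch.zeros(1, H, W, 1)   # one region ID per spatial position
    region = 0
    for h in (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None)):
        for w in (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None)):
            img_mask[:, h, w, :] = region
            region += 1
    mask_windows = window_partition(img_mask, window_size)           # (num_windows, M, M, 1)
    mask_windows = mask_windows.view(-1, window_size * window_size)  # (num_windows, M*M)
    diff = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)     # nonzero ⇔ different regions
    return diff.masked_fill(diff != 0, -100.0).masked_fill(diff == 0, 0.0)

# 56×56 map, M = 7, shift = 3: the mask is added to each window's attention logits
attn_mask = build_shift_attention_mask(56, 56, 7, 3)                 # (64, 49, 49)
```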
Mathematical Complexity Proof
| Method | Attention Complexity per Layer | Scaling with image size |
|---|---|---|
| Global (ViT) | O((HW)²) | Quadratic in N = HW |
| Swin window attention (M = 7) | O(HW × M²) = O(HW × 49) | Linear in N |
| Swin shifted-window attention | Still O(HW × M²) | Linear in N |
Since M is fixed (7 by default, 12 at higher input resolutions), the cost grows linearly with the number of tokens → scales gracefully to very large images.
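For reference, these are the per-layer FLOP counts given in the Swin paper for an h×w token map with channel dimension C and window size M (the first term is the shared QKV/output projections, the second the attention itself):

```latex
\Omega(\mathrm{MSA})          = 4\,hwC^{2} + 2\,(hw)^{2}C   % global attention: quadratic in hw
\Omega(\mathrm{W\text{-}MSA}) = 4\,hwC^{2} + 2\,M^{2}hwC    % window attention: linear in hw, since M is fixed
```

Shifting the windows changes which tokens share a window, not how many, so SW-MSA has the same cost as W-MSA.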
Relative Position Bias (The Secret Sauce)
Swin doesn’t use absolute or learned positional embeddings per patch.
Instead: Learn a small bias table B of size (2M−1)×(2M−1) × num_heads
Example: M=7 → 13×13 = 169 biases per head
For any relative position (Δx, Δy), add B[Δx, Δy] to attention logit
→ Translation invariant + very few parameters!
This is why Swin generalizes so well across resolutions.
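A minimal sketch of how that bias table is indexed, following the approach of the reference implementation (the zero initialization of the table is a simplification; the real model uses a truncated-normal init):

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learnable (2M-1)^2 × num_heads table, indexed by the relative offset of each token pair."""
    def __init__(self, window_size, num_heads):
        super().__init__()
        M = window_size
        self.table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))   # 169 × heads for M = 7

        # For every pair of positions in an M×M window, precompute the table index
        # corresponding to their relative offset (Δy, Δx) ∈ [-(M-1), M-1]².
        coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
        coords = coords.flatten(1)                                 # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]              # (2, M*M, M*M)
        rel = rel.permute(1, 2, 0) + (M - 1)                       # shift offsets to [0, 2M-2]
        index = rel[..., 0] * (2 * M - 1) + rel[..., 1]            # flatten (Δy, Δx) to a single index
        self.register_buffer("index", index)                       # (M*M, M*M)

    def forward(self):
        n = self.index.shape[0]
        bias = self.table[self.index.reshape(-1)].reshape(n, n, -1)   # (M*M, M*M, heads)
        return bias.permute(2, 0, 1).contiguous()                     # (heads, M*M, M*M), added to logits

# M = 7, 3 heads → a (3, 49, 49) bias that is simply added to the attention scores
bias = RelativePositionBias(window_size=7, num_heads=3)()
```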
Visual Summary – How Information Flows
Layer 1 (Regular Windows) → Local only
Layer 2 (Shifted Windows) → Connects adjacent windows
Layer 3 (Regular) → Local again
Layer 4 (Shifted) → Connects further
...
After a few alternating block pairs (plus patch-merging downsampling between stages) → effectively global receptive field!
Just like CNNs build hierarchy, but with attention!
Comparison Table (Memorize This!)
| Feature | ViT (Global) | Swin (Window + Shifted) |
|---|---|---|
| Attention Scope | Global | Local → Global via hierarchy |
| Complexity | Quadratic O(N²) | Linear O(N) |
| Max Resolution (practical) | 384–512 px | 1536 px+ (Swin-V2) |
| Translation Invariance | Learned | Built-in (relative bias + shift) |
| Inductive Bias | None | Locality + hierarchy |
| Best For | Large data | Detection, segmentation, video |
| ImageNet-1K Top-1 | ~88.5% (ViT-H, JFT-300M pre-training) | 87.3% (Swin-L, ImageNet-22K pre-training), at far lower cost |
Code Snippet – The Heart (Just 10 lines!)
# In a Swin block: x has shape (B, H, W, C); shift_size is 0 in regular blocks and M//2 in shifted ones
if self.shift_size > 0:
    x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))

# Partition into windows → attention inside each window → merge back
x_windows = window_partition(x, self.window_size)                       # (num_windows*B, M, M, C)
x_windows = x_windows.view(-1, self.window_size * self.window_size, C)  # (num_windows*B, M*M, C)
attn_windows = self.attn(x_windows, mask=attn_mask)                     # attention never crosses window edges
# ... window_reverse: stitch the windows back into the (B, H, W, C) map ...

# Unshift
if self.shift_size > 0:
    x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
This 10-line trick made transformers practical for vision.
Why Swin Won Everything After 2021
- 2021: Matched or beat strong CNNs and ViT on ImageNet, COCO, and ADE20K (ICCV 2021 best paper)
- 2021–2022: Backbone for Mask R-CNN, Cascade R-CNN, and HTC++ → state of the art on COCO detection and segmentation
- 2022: Swin-V2 scales to 3B parameters and 1536×1536 inputs → new records on several benchmarks
- Since then: still a standard backbone in detection and segmentation toolboxes (e.g., MMDetection), and its window-attention idea runs through many newer hierarchical vision backbones
Final Summary – Why Window Attention is Genius
| Problem | ViT Solution | Swin Solution |
|---|---|---|
| Quadratic complexity | Accept it | Fixed windows → linear |
| No locality bias | Add positional embeddings | Windows + relative bias → strong locality prior |
| Poor at high resolution | Downsample early | Hierarchical stages |
| Cross-window information flow | Global attention (expensive) | Shifted windows → cheap, fast flow |
Swin Transformer proved that you can have the best of both worlds:
Transformer flexibility + CNN efficiency and inductive bias.
This is why, in 2025, Swin (and its descendants: Swin-V2, Swin-MoE, and related designs such as FocalNet) remains one of the most widely used vision backbones.
You now fully understand why Swin’s window attention is one of the most important ideas in deep learning since ReLU.