"Attention is All You Need" — Feedforward & Residuals
Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual
"Attention is All You Need" — Feedforward & Residuals
Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual
"Attention is All You Need" — Feedforward & Residuals
Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual
Module Objective
Master the Transformer’s Feedforward and Residual pathway — with Dynamic Programming & Memoization intuition, LayerNorm mechanics, and full PyTorch implementation.
1. The Transformer Block: Two Sub-Layers
Input → [Multi-Head Self-Attention] → (+) → [LayerNorm] → x1
x1 → [Feedforward Network] → (+) → [LayerNorm] → Output
(Post-Norm ordering, as in the original paper; the Pre-Norm variant is covered in Section 6.)
Two paths:
1. Attention → context
2. Feedforward + Residual → transformation & stability
2. Residual Connections: Highway for Gradients
Problem: Vanishing/Exploding Gradients in Deep Nets
```
# Without residual connections
y = f3(f2(f1(x)))
∂L/∂x = ∂L/∂y × (∂f3/∂f2) × (∂f2/∂f1) × (∂f1/∂x)
# → product of many Jacobians → vanishes toward 0 or explodes toward ∞
```
Solution: Residual (Skip) Connection
```
y = x + f(x)          # Residual
∂y/∂x = I + ∂f/∂x     # Identity path!
```
The gradient flows directly through the identity term → 100+ layers become trainable.
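The identity path is easy to verify numerically. Below is a minimal sketch (a toy experiment, not from the paper): the same stack of toy sub-layers is run with and without the skip connection, and the gradient that reaches the input is compared.

```python
import torch

def input_grad_norm(depth: int, residual: bool) -> float:
    """Mean |gradient| at the input of a toy stack of `depth` layers."""
    torch.manual_seed(0)
    x0 = torch.randn(1, 64, requires_grad=True)
    x = x0
    for _ in range(depth):
        f = 0.5 * torch.tanh(x)          # toy sub-layer f(x)
        x = x + f if residual else f     # with / without the skip connection
    g, = torch.autograd.grad(x.sum(), x0)
    return g.abs().mean().item()

print("plain   :", input_grad_norm(50, residual=False))  # shrinks toward 0
print("residual:", input_grad_norm(50, residual=True))   # stays O(1)
```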
3. Dynamic Programming Analogy
| Neural Net | Dynamic Programming |
|---|---|
| `x_{l+1} = x_l + f(x_l)` | `dp[i] = dp[i-1] + cost(i)` |
| Memoization | Reuse previous state |
| Additive update | Incremental improvement |
```
# DP: Longest Increasing Subsequence
dp[i] = 1 + max((dp[j] for j in range(i) if a[j] < a[i]), default=0)

# Residual update:
x = x + Dropout(GELU(Linear(x)))
```
Both build their solution incrementally by reusing past state; a memoized version is sketched below.
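To make the memoization side of the analogy concrete, here is a minimal top-down LIS, a sketch in plain Python (the `lis_length` helper is illustrative, not part of the original module):

```python
from functools import lru_cache

def lis_length(a):
    """Longest increasing subsequence length via top-down DP (memoization)."""
    @lru_cache(maxsize=None)
    def dp(i):                                   # dp(i) = LIS length ending at index i
        best = 1
        for j in range(i):
            if a[j] < a[i]:
                best = max(best, dp(j) + 1)      # reuse the cached sub-solution
        return best
    return max(dp(i) for i in range(len(a))) if a else 0

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))      # 4  (e.g. 1, 4, 5, 9)
```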
4. Feedforward Network (FFN)
```
FFN(x) = max(0, xW1 + b1)W2 + b2   # Original (ReLU)
FFN(x) = GELU(xW1 + b1)W2 + b2     # Modern (BERT, GPT use GELU)
```
Expansion → Compression
d_model → d_ff (4×) → d_model
512 → 2048 → 512
Bottleneck? No — expansion allows richer features
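A quick sanity check of the shapes and parameter count for the standard 512 → 2048 → 512 setting (a sketch; biases included, dropout omitted):

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 10, d_model)                       # (batch, seq_len, d_model)
print(ffn(x).shape)                                   # torch.Size([2, 10, 512]): shape preserved
print(sum(p.numel() for p in ffn.parameters()))       # 2,099,712 ≈ 2.1M parameters per FFN
```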
5. Layer Normalization (LayerNorm)
Why not BatchNorm?
- BatchNorm: statistics over the batch dimension → depend on batch size and padding, awkward for variable-length sequences
- LayerNorm: stats over features → batch-independent
Formula
$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$
- $\mu, \sigma^2$: mean/variance per token, computed over the $d_{\text{model}}$ dimension
- $ \gamma, \beta $: learnable scale & bias
6. Pre-Norm vs Post-Norm
| Post-Norm (Original) | Pre-Norm (Modern) |
|---|---|
| `LayerNorm(x + Attn(x))` | `x + Attn(LayerNorm(x))` |
| Unstable at deep layers | Better training stability |
| Used in early Transformers | Used in GPT, T5, LLaMA |
Pre-Norm wins in practice
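The difference is only where LayerNorm sits relative to the residual add. A minimal sketch of the two wrappers (the class names and the `sublayer` callable are illustrative):

```python
import torch.nn as nn

class PostNormResidual(nn.Module):
    """Original Transformer: normalize after the residual add."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))

class PreNormResidual(nn.Module):
    """GPT/T5/LLaMA-style placement: normalize before the sub-layer, add afterwards."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return x + sublayer(self.norm(x))
```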
7. Full Implementation: Pre-Norm Residual Block
```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # MultiHeadAttention comes from the attention module of this series;
        # it is assumed to take (query, key, value, mask) and return (output, weights).
        self.attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff = FeedForward(d_model, d_ff, dropout)

    def forward(self, x, mask=None):
        # === Pre-Norm Residual ===
        # 1. Attention path
        attn_in = self.norm1(x)
        attn_out, attn_weights = self.attn(attn_in, attn_in, attn_in, mask)
        x = x + attn_out          # Residual
        # 2. Feedforward path
        ff_in = self.norm2(x)
        ff_out = self.ff(ff_in)
        x = x + ff_out            # Residual
        return x, attn_weights
```
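Stacking these blocks gives a full encoder. A minimal sketch (the `Encoder` class and the trailing LayerNorm, which Pre-Norm models typically add, are illustrative additions):

```python
class Encoder(nn.Module):
    """Stack of Pre-Norm TransformerBlocks with a final LayerNorm."""
    def __init__(self, num_layers=6, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)]
        )
        self.final_norm = nn.LayerNorm(d_model)   # Pre-Norm stacks usually end with one extra norm

    def forward(self, x, mask=None):
        for layer in self.layers:
            x, _ = layer(x, mask)                 # each block returns (output, attention weights)
        return self.final_norm(x)
```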
8. Memoization Intuition: "Remember & Refine"
```python
# Like caching intermediate results
cache = {}

def fib(n):
    if n in cache:
        return cache[n]                  # memoization: reuse the stored result
    if n <= 1:
        return n
    cache[n] = fib(n - 1) + fib(n - 2)
    return cache[n]
```

```
# Residual = "remember x, refine with f(x)"
x = x + f(x)   # x is "cached", f(x) is the "update"
```
Each layer refines the representation, never forgets
9. Visualization: Gradient Flow
```python
import torch
import matplotlib.pyplot as plt

# Simulate a 100-layer residual network and track how much gradient
# reaches the *input* as depth grows.
layers = 100
x0 = torch.randn(1, 32, 512, requires_grad=True)
x = x0
grads = []
for i in range(layers):
    x = x + 0.1 * torch.tanh(x)                              # toy residual update f(x)
    g, = torch.autograd.grad(x.sum(), x0, retain_graph=True) # gradient at the input
    grads.append(g.abs().mean().item())

plt.plot(grads)
plt.title("Gradient Magnitude at the Input vs. Depth (Residual)")
plt.xlabel("Layer")
plt.ylabel("|∇|")
plt.yscale('log')
plt.show()
```
Gradients stay stable → deep training possible
10. LayerNorm Internals
```python
def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)   # biased variance, as in nn.LayerNorm
    x_norm = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_norm + beta
```
Per-token normalization
Token 1: [0.1, 2.3, -1.2] → μ=0.4, σ=1.5 → normalized
Token 2: [5.0, 5.1, 4.9] → μ=5.0, σ=0.1 → normalized
Each token has its own stats
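A quick way to sanity-check the hand-rolled `layer_norm` above against PyTorch (a sketch; `nn.LayerNorm`'s defaults are γ=1, β=0, eps=1e-5):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 8)                      # (batch, seq_len, d_model)
gamma, beta = torch.ones(8), torch.zeros(8)
ref = nn.LayerNorm(8)                          # default weight=1, bias=0 matches gamma/beta above
print(torch.allclose(layer_norm(x, gamma, beta), ref(x), atol=1e-5))  # True
```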
11. Full Training Loop (Copy Task)
```python
# Model: embedding → one Transformer block → final LayerNorm → vocab projection
model = nn.Sequential(
    nn.Embedding(10, 512),
    TransformerBlock(d_model=512, num_heads=8),
    nn.LayerNorm(512),
    nn.Linear(512, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    src = torch.randint(0, 5, (32, 20))      # random token sequences
    tgt = src.clone()                        # copy task: predict the input itself

    logits = model[0](src)                   # embedding
    for block in model[1:-2]:                # every TransformerBlock (just one here)
        logits, _ = block(logits)
    logits = model[-2](logits)               # final LayerNorm
    logits = model[-1](logits)               # project to vocabulary

    loss = criterion(logits.view(-1, 10), tgt.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
```
12. Summary Cheat Sheet
| Component | Purpose | Key Property |
|---|---|---|
| Residual | `x + f(x)` | Gradient highway |
| LayerNorm | Normalize per token | Training stability |
| Pre-Norm | `x + f(LN(x))` | Better deep training |
| FFN | `GELU(xW1)W2` | Non-linear transform |
| Memoization | Reuse `x` | Incremental learning |
13. Practice Exercises
- Ablate Residual: Remove the `+ x` → training fails at depth > 6.
- Ablate LayerNorm: Replace it with the identity → unstable.
- Post-Norm vs Pre-Norm: Train a 12-layer model each way → compare loss curves.
- Dynamic Programming: Implement `edit_distance` with DP → map it to the residual update.
- Visualize: Plot `x`, `f(x)`, `x + f(x)` for one layer.
14. Key Takeaways
|   | Insight |
|---|---|
| ✓ | Residual = Identity + Update = DP Memoization |
| ✓ | LayerNorm = per-token standardization |
| ✓ | Pre-Norm > Post-Norm for deep models |
| ✓ | FFN = expansion for capacity |
| ✓ | Together: stable, deep, expressive |
Full Copy-Paste Code
```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h, dropout=0.1):
        super().__init__()
        self.d_k = d_model // h
        self.h = h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        Q = self.W_q(x).view(x.size(0), -1, self.h, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(x.size(0), -1, self.h, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(x.size(0), -1, self.h, self.d_k).transpose(1, 2)
        scores = (Q @ K.transpose(-2, -1)) / (self.d_k ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = self.dropout(torch.softmax(scores, dim=-1))
        out = (attn @ V).transpose(1, 2).contiguous().view(x.size(0), -1, x.size(-1))
        return self.W_o(out), attn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, h, dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

    def forward(self, x, mask=None):
        # Pre-Norm residual for both sub-layers
        attn_out, attn_weights = self.attn(self.norm1(x), mask)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x, attn_weights
```
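A quick smoke test of the copy-paste block (a sketch):

```python
block = TransformerBlock(d_model=512, h=8)
x = torch.randn(2, 16, 512)       # (batch, seq_len, d_model)
out, attn = block(x)
print(out.shape)                  # torch.Size([2, 16, 512])
print(attn.shape)                 # torch.Size([2, 8, 16, 16]): one map per head
```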
Final Words
Residual + LayerNorm = The reason Transformers scale to 175B parameters.
You now understand:
- Why gradients don’t die
- How each layer refines
- Why Pre-Norm is king
- The DP connection
End of Module
You just built the stable backbone of every modern LLM.
Stack 100 layers. Train for a week. Change the world.
"Attention is All You Need" — Feedforward & Residuals
Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual
"Attention is All You Need" — Feedforward & Residuals
Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual
"Attention is All You Need" — Feedforward & Residuals
Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual
Module Objective
Master the Transformer’s Feedforward and Residual pathway — with Dynamic Programming & Memoization intuition, LayerNorm mechanics, and full PyTorch implementation.
1. The Transformer Block: Two Sub-Layers
Input → [Multi-Head Self-Attention] → (+) → [LayerNorm] → x1
x1 → [Feedforward Network] → (+) → [LayerNorm] → Output
Two paths:
1. Attention → context
2. Feedforward + Residual → transformation & stability
2. Residual Connections: Highway for Gradients
Problem: Vanishing/Exploding Gradients in Deep Nets
# Without residual
y = f3(f2(f1(x)))
∂L/∂x = (∂f3/∂f2) × (∂f2/∂f1) × (∂f1/∂x)
# → Product of many terms → 0 or ∞
Solution: Residual (Skip) Connection
y = x + f(x) # Residual
∂L/∂x = I + ∂f/∂x # Identity path!
Gradient flows directly → train 100+ layers
3. Dynamic Programming Analogy
| Neural Net | Dynamic Programming |
|---|---|
x_{l+1} = x_l + f(x_l) |
dp[i] = dp[i-1] + cost(i) |
| Memoization | Reuse previous state |
| Additive update | Incremental improvement |
# DP: Longest Increasing Subsequence
dp[i] = max(dp[j] for j < i if a[j] < a[i]) + 1
# Residual:
x = x + Dropout(GELU(Linear(x)))
Both build solution incrementally, reusing past state
4. Feedforward Network (FFN)
FFN(x) = max(0, xW1 + b1)W2 + b2 # Original (ReLU)
FFN(x) = GELU(xW1)W2 # Modern (GPT, BERT)
Expansion → Compression
d_model → d_ff (4×) → d_model
512 → 2048 → 512
Bottleneck? No — expansion allows richer features
5. Layer Normalization (LayerNorm)
Why not BatchNorm?
- BatchNorm: stats over batch → bad for RNNs/Transformers
- LayerNorm: stats over features → batch-independent
Formula
$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$
- $ \mu, \sigma^2 $: mean/variance per token, over $ d_model $
- $ \gamma, \beta $: learnable scale & bias
6. Pre-Norm vs Post-Norm
| Post-Norm (Original) | Pre-Norm (Modern) |
|---|---|
LayerNorm(x + Attn(x)) |
x + Attn(LayerNorm(x)) |
| Unstable at deep layers | Better training stability |
| Used in early Transformers | Used in GPT, T5, LLaMA |
Pre-Norm wins in practice
7. Full Implementation: Pre-Norm Residual Block
import torch
import torch.nn as nn
class FeedForward(nn.Module):
def __init__(self, d_model, d_ff, dropout=0.1):
super().__init__()
self.net = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(d_ff, d_model),
nn.Dropout(dropout)
)
def forward(self, x):
return self.net(x)
class TransformerBlock(nn.Module):
def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
super().__init__()
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.attn = MultiHeadAttention(d_model, num_heads, dropout)
self.ff = FeedForward(d_model, d_ff, dropout)
def forward(self, x, mask=None):
# === Pre-Norm Residual ===
# 1. Attention path
attn_in = self.norm1(x)
attn_out, attn_weights = self.attn(attn_in, attn_in, attn_in, mask)
x = x + attn_out # Residual
# 2. Feedforward path
ff_in = self.norm2(x)
ff_out = self.ff(ff_in)
x = x + ff_out # Residual
return x, attn_weights
8. Memoization Intuition: "Remember & Refine"
# Like caching intermediate results
cache = {}
def fib(n):
if n in cache: return cache[n] # Memoization
if n <= 1: return n
cache[n] = fib(n-1) + fib(n-2)
return cache[n]
# Residual = "remember x, refine with f(x)"
x = x + f(x) # x is "cached", f(x) is "update"
Each layer refines the representation, never forgets
9. Visualization: Gradient Flow
import torch
import matplotlib.pyplot as plt
# Simulate 100-layer network
layers = 100
x = torch.randn(1, 32, 512, requires_grad=True)
grads = []
for i in range(layers):
x = x + torch.randn_like(x) * 0.1 # Residual update
x.backward(torch.ones_like(x), retain_graph=True)
grads.append(x.grad.abs().mean().item())
x.grad.zero_()
plt.plot(grads)
plt.title("Gradient Magnitude per Layer (Residual)")
plt.xlabel("Layer")
plt.ylabel("|∇|")
plt.yscale('log')
plt.show()
Gradients stay stable → deep training possible
10. LayerNorm Internals
def layer_norm(x, gamma, beta, eps=1e-5):
mean = x.mean(-1, keepdim=True)
var = x.var(-1, keepdim=True, unbiased=False)
x_norm = (x - mean) / torch.sqrt(var + eps)
return gamma * x_norm + beta
Per-token normalization
Token 1: [0.1, 2.3, -1.2] → μ=0.4, σ=1.5 → normalized
Token 2: [5.0, 5.1, 4.9] → μ=5.0, σ=0.1 → normalized
Each token has its own stats
11. Full Training Loop (Copy Task)
# Model
model = nn.Sequential(
nn.Embedding(10, 512),
TransformerBlock(d_model=512, num_heads=8),
nn.LayerNorm(512),
nn.Linear(512, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
for epoch in range(100):
src = torch.randint(0, 5, (32, 20))
tgt = src.clone()
logits = model[0](src)
for block in model[1:-2]: # if stacked
logits, _ = block(logits)
logits = model[-2](logits)
logits = model[-1](logits)
loss = criterion(logits.view(-1, 10), tgt.view(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
if epoch % 20 == 0:
print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
12. Summary Cheat Sheet
| Component | Purpose | Key Property |
|---|---|---|
| Residual | x + f(x) |
Gradient highway |
| LayerNorm | Normalize per token | Training stability |
| Pre-Norm | x + f(LN(x)) |
Better deep training |
| FFN | GELU(xW1)W2 |
Non-linear transform |
| Memoization | Reuse x |
Incremental learning |
13. Practice Exercises
- Ablate Residual: Remove
+ x→ training fails at depth > 6. - Ablate LayerNorm: Replace with identity → unstable.
- Post-Norm vs Pre-Norm: Train 12-layer model → compare loss curves.
- Dynamic Programming: Implement
edit_distancewith DP → map to residual. - Visualize: Plot
x,f(x),x + f(x)for one layer.
14. Key Takeaways
| Check | Insight |
|---|---|
| Check | Residual = Identity + Update = DP Memoization |
| Check | LayerNorm = per-token standardization |
| Check | Pre-Norm > Post-Norm for deep models |
| Check | FFN = expansion for capacity |
| Check | Together: stable, deep, expressive |
Full Copy-Paste Code
import torch
import torch.nn as nn
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, h, dropout=0.1):
super().__init__()
self.d_k = d_model // h
self.h貧 = h
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
Q = self.W_q(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
K = self.W_k(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
V = self.W_v(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
scores = (Q @ K.transpose(-2,-1)) / (self.d_k**0.5)
if mask is not None: scores = scores.masked_fill(mask==0, -1e9)
attn = self.dropout(torch.softmax(scores, dim=-1))
out = (attn @ V).transpose(1,2).contiguous().view(x.size(0), -1, x.size(-1))
return self.W_o(out), attn
class TransformerBlock(nn.Module):
def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
super().__init__()
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.attn = MultiHeadAttention(d_model, h, dropout)
self.ff = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(d_ff, d_model),
nn.Dropout(dropout)
)
def forward(self, x, mask=None):
x = x + self.attn(self.norm1(x), mask)[0]
x = x + self.ff(self.norm2(x))
return x, None
Final Words
Residual + LayerNorm = The reason Transformers scale to 175B parameters.
You now understand:
- Why gradients don’t die
- How each layer refines
- Why Pre-Norm is king
- The DP connection
End of Module
You just built the stable backbone of every modern LLM.
Stack 100 layers. Train for a week. Change the world.