"Attention is All You Need" — Feedforward & Residuals

Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual

"Attention is All You Need" — Feedforward & Residuals

Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual

"Attention is All You Need" — Feedforward & Residuals

Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual


Module Objective

Master the Transformer’s Feedforward and Residual pathway — with Dynamic Programming & Memoization intuition, LayerNorm mechanics, and full PyTorch implementation.


1. The Transformer Block: Two Sub-Layers

Input → [Multi-Head Self-Attention] → (+) → [LayerNorm] → x1
        x1 → [Feedforward Network]   → (+) → [LayerNorm] → Output

Two paths:
1. Attention → context
2. Feedforward + Residual → transformation & stability


2. Residual Connections: Highway for Gradients

Problem: Vanishing/Exploding Gradients in Deep Nets

# Without residual
y = f3(f2(f1(x)))
∂L/∂x = (∂L/∂f3) × (∂f3/∂f2) × (∂f2/∂f1) × (∂f1/∂x)
# → Product of many terms → 0 or ∞

Solution: Residual (Skip) Connection

y = x + f(x)   # Residual
∂y/∂x = I + ∂f/∂x   # Identity path!

Gradient flows directly through the identity path → can train 100+ layers
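
To see the identity path concretely, here is a tiny autograd sketch (illustrative only): with the residual, the gradient reaching x picks up an identity contribution on top of whatever f provides.

import torch

# Compare the gradient at the input with and without a residual connection
x = torch.randn(8, requires_grad=True)
f = torch.nn.Linear(8, 8)

# Plain path: y = f(x)
g_plain, = torch.autograd.grad(f(x).sum(), x)

# Residual path: y = x + f(x)  ->  gradient = 1 + (gradient through f)
g_res, = torch.autograd.grad((x + f(x)).sum(), x)

print(g_plain.abs().mean().item(), g_res.abs().mean().item())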


3. Dynamic Programming Analogy

| Neural Net | Dynamic Programming |
| --- | --- |
| x_{l+1} = x_l + f(x_l) | dp[i] = dp[i-1] + cost(i) |
| Memoization | Reuse previous state |
| Additive update | Incremental improvement |

# DP: Longest Increasing Subsequence
dp[i] = 1 + max((dp[j] for j in range(i) if a[j] < a[i]), default=0)

# Residual: 
x = x + Dropout(GELU(Linear(x)))

Both build solution incrementally, reusing past state
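
The one-line recurrence above, made runnable as a small self-contained sketch; each dp[i] is built by reusing the already-computed dp[j], just as each layer reuses x_l.

def longest_increasing_subsequence(a):
    # dp[i] = length of the longest increasing subsequence ending at a[i]
    dp = [1] * len(a)
    for i in range(len(a)):
        for j in range(i):
            if a[j] < a[i]:
                dp[i] = max(dp[i], dp[j] + 1)   # reuse the cached subproblem dp[j]
    return max(dp, default=0)

print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # 4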


4. Feedforward Network (FFN)

FFN(x) = max(0, xW1 + b1)W2 + b2     # Original (ReLU)
FFN(x) = GELU(xW1)W2                 # Modern (GPT, BERT)

Expansion → Compression

d_model → d_ff (4×) → d_model
  512  →  2048   →  512

Bottleneck? No — expansion allows richer features
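
A quick shape and parameter check of the expand-then-project pattern (a sketch using the base-model sizes above):

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 10, d_model)                      # (batch, seq_len, d_model)
print(ffn(x).shape)                                  # torch.Size([2, 10, 512]): same shape in and out
print(sum(p.numel() for p in ffn.parameters()))      # 2,099,712 parameters per FFN (~2.1M)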


5. Layer Normalization (LayerNorm)

Why not BatchNorm?

  • BatchNorm: stats over the batch → unreliable with variable-length sequences and small batches
  • LayerNorm: stats over features → batch-independent, identical at train and inference

Formula

$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$

  • $ \mu, \sigma^2 $: mean/variance per token, computed over the $ d_{\text{model}} $ features
  • $ \gamma, \beta $: learnable scale & bias

6. Pre-Norm vs Post-Norm

| Post-Norm (Original) | Pre-Norm (Modern) |
| --- | --- |
| LayerNorm(x + Attn(x)) | x + Attn(LayerNorm(x)) |
| Unstable at deep layers | Better training stability |
| Used in early Transformers | Used in GPT, T5, LLaMA |

Pre-Norm wins in practice
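
The difference is only where the norm sits relative to the residual add; a minimal sketch of both orderings, using a Linear layer as a stand-in for attention or the FFN:

import torch
import torch.nn as nn

norm = nn.LayerNorm(512)
sublayer = nn.Linear(512, 512)      # stand-in for attention or the FFN
x = torch.randn(2, 10, 512)

post = norm(x + sublayer(x))        # Post-Norm: normalize after the residual add
pre = x + sublayer(norm(x))         # Pre-Norm: normalize the sublayer input; the residual path stays a pure identity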


7. Full Implementation: Pre-Norm Residual Block

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff = FeedForward(d_model, d_ff, dropout)

    def forward(self, x, mask=None):
        # === Pre-Norm Residual ===
        # 1. Attention path
        attn_in = self.norm1(x)
        attn_out, attn_weights = self.attn(attn_in, mask)  # self-attention (MultiHeadAttention defined in the full code below)
        x = x + attn_out  # Residual

        # 2. Feedforward path
        ff_in = self.norm2(x)
        ff_out = self.ff(ff_in)
        x = x + ff_out    # Residual

        return x, attn_weights

8. Memoization Intuition: "Remember & Refine"

# Like caching intermediate results
cache = {}
def fib(n):
    if n in cache: return cache[n]  # Memoization
    if n <= 1: return n
    cache[n] = fib(n-1) + fib(n-2)
    return cache[n]
# Residual = "remember x, refine with f(x)"
x = x + f(x)  # x is "cached", f(x) is "update"

Each layer refines the representation without ever discarding it
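
The same caching pattern is built into Python's standard library; a one-line memoization sketch with functools.lru_cache:

from functools import lru_cache

@lru_cache(maxsize=None)            # memoize: each n is computed once, then reused
def fib(n):
    return n if n <= 1 else fib(n - 1) + fib(n - 2)

print(fib(100))                     # instant, thanks to the cache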


9. Visualization: Gradient Flow

import torch
import matplotlib.pyplot as plt

# Simulate a 100-layer residual stack and track how much gradient
# reaches the input as depth grows
layers = 100
x0 = torch.randn(1, 32, 512, requires_grad=True)
x = x0
grads = []

for i in range(layers):
    W = torch.randn(512, 512) * 0.02           # random "layer" weights
    x = x + torch.tanh(x @ W)                  # residual update: x + f(x)
    g, = torch.autograd.grad(x.sum(), x0, retain_graph=True)
    grads.append(g.abs().mean().item())

plt.plot(grads)
plt.title("Gradient Magnitude per Layer (Residual)")
plt.xlabel("Layer")
plt.ylabel("|∇|")
plt.yscale('log')
plt.show()

Gradients stay stable → deep training possible
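
For contrast, a sketch of the same experiment without the residual: with small random weights the gradient reaching the input shrinks geometrically with depth.

import torch
import matplotlib.pyplot as plt

layers = 100
x0 = torch.randn(1, 32, 512, requires_grad=True)
x = x0
grads_plain = []

for i in range(layers):
    W = torch.randn(512, 512) * 0.02
    x = torch.tanh(x @ W)                      # no residual: the gradient must pass through every W
    g, = torch.autograd.grad(x.sum(), x0, retain_graph=True)
    grads_plain.append(g.abs().mean().item())

plt.plot(grads_plain)
plt.title("Gradient Magnitude per Layer (No Residual)")
plt.xlabel("Layer")
plt.ylabel("|∇|")
plt.yscale('log')
plt.show()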


10. LayerNorm Internals

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    x_norm = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_norm + beta

Per-token normalization

Token 1: [0.1, 2.3, -1.2] → μ=0.4, σ≈1.4 → normalized
Token 2: [5.0, 5.1, 4.9] → μ=5.0, σ=0.1 → normalized

Each token has its own stats
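
A quick check (continuing from the function above) that the hand-rolled layer_norm matches PyTorch's built-in nn.LayerNorm:

import torch
import torch.nn as nn

x = torch.randn(2, 10, 512)
ln = nn.LayerNorm(512)

manual = layer_norm(x, ln.weight, ln.bias)       # function defined above
print(torch.allclose(manual, ln(x), atol=1e-5))  # True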


11. Full Training Loop (Copy Task)

# Model
model = nn.Sequential(
    nn.Embedding(10, 512),
    TransformerBlock(d_model=512, num_heads=8),
    nn.LayerNorm(512),
    nn.Linear(512, 10)
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    src = torch.randint(0, 5, (32, 20))
    tgt = src.clone()

    logits = model[0](src)                      # embedding
    for block in model[1:-2]:                   # transformer blocks (just one here; same loop if stacked)
        logits, _ = block(logits)
    logits = model[-2](logits)                  # final LayerNorm
    logits = model[-1](logits)                  # projection to vocabulary logits

    loss = criterion(logits.view(-1, 10), tgt.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

12. Summary Cheat Sheet

| Component | Purpose | Key Property |
| --- | --- | --- |
| Residual | x + f(x) | Gradient highway |
| LayerNorm | Normalize per token | Training stability |
| Pre-Norm | x + f(LN(x)) | Better deep training |
| FFN | GELU(xW1)W2 | Non-linear transform |
| Memoization | Reuse x | Incremental learning |

13. Practice Exercises

  1. Ablate Residual: Remove + x → training fails at depth > 6.
  2. Ablate LayerNorm: Replace with identity → unstable.
  3. Post-Norm vs Pre-Norm: Train 12-layer model → compare loss curves.
  4. Dynamic Programming: Implement edit_distance with DP (see the sketch after this list) → map it to the residual update.
  5. Visualize: Plot x, f(x), x + f(x) for one layer.
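
For exercise 4, a standard edit-distance DP as a starting point (a sketch; mapping it to the residual update is the exercise). Each cell dp[i][j] is built from already-computed neighbors, just as x_{l+1} is built from x_l.

def edit_distance(a, b):
    # dp[i][j] = minimum edits to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

print(edit_distance("kitten", "sitting"))  # 3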

14. Key Takeaways

  ✓ Residual = Identity + Update = DP memoization
  ✓ LayerNorm = per-token standardization
  ✓ Pre-Norm > Post-Norm for deep models
  ✓ FFN = expansion for capacity
  ✓ Together: stable, deep, expressive

Full Copy-Paste Code

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h, dropout=0.1):
        super().__init__()
        self.d_k = d_model // h
        self.h = h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, mask=None):
        Q = self.W_q(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
        K = self.W_k(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
        V = self.W_v(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
        scores = (Q @ K.transpose(-2,-1)) / (self.d_k**0.5)
        if mask is not None: scores = scores.masked_fill(mask==0, -1e9)
        attn = self.dropout(torch.softmax(scores, dim=-1))
        out = (attn @ V).transpose(1,2).contiguous().view(x.size(0), -1, x.size(-1))
        return self.W_o(out), attn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, h, dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
    def forward(self, x, mask=None):
        attn_out, attn_weights = self.attn(self.norm1(x), mask)
        x = x + attn_out                        # residual around attention
        x = x + self.ff(self.norm2(x))          # residual around the FFN
        return x, attn_weights
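
A minimal smoke test for the block above (reusing the imports from the copy-paste code): stack a few blocks and check that shapes are preserved.

blocks = nn.ModuleList([TransformerBlock() for _ in range(4)])
x = torch.randn(2, 16, 512)         # (batch, seq_len, d_model)
for block in blocks:
    x, attn = block(x)
print(x.shape, attn.shape)          # torch.Size([2, 16, 512]) torch.Size([2, 8, 16, 16])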

Final Words

Residual + LayerNorm = The reason Transformers scale to 175B parameters.

You now understand:
- Why gradients don’t die
- How each layer refines
- Why Pre-Norm is king
- The DP connection


End of Module
You just built the stable backbone of every modern LLM.
Stack 100 layers. Train for a week. Change the world.

Last updated: Nov 13, 2025

"Attention is All You Need" — Feedforward & Residuals

Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual

"Attention is All You Need" — Feedforward & Residuals

Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual

"Attention is All You Need" — Feedforward & Residuals

Complete Module: Dynamic Programming, Memoization, LayerNorm + Residual


Module Objective

Master the Transformer’s Feedforward and Residual pathway — with Dynamic Programming & Memoization intuition, LayerNorm mechanics, and full PyTorch implementation.


1. The Transformer Block: Two Sub-Layers

Input → [Multi-Head Self-Attention] → (+) → [LayerNorm] → x1
        x1 → [Feedforward Network]   → (+) → [LayerNorm] → Output

Two paths:
1. Attention → context
2. Feedforward + Residualtransformation & stability


2. Residual Connections: Highway for Gradients

Problem: Vanishing/Exploding Gradients in Deep Nets

# Without residual
y = f3(f2(f1(x)))
L/x = (f3/f2) × (f2/f1) × (f1/x)
# → Product of many terms → 0 or ∞

Solution: Residual (Skip) Connection

y = x + f(x)   # Residual
L/x = I + f/x   # Identity path!

Gradient flows directlytrain 100+ layers


3. Dynamic Programming Analogy

Neural Net Dynamic Programming
x_{l+1} = x_l + f(x_l) dp[i] = dp[i-1] + cost(i)
Memoization Reuse previous state
Additive update Incremental improvement
# DP: Longest Increasing Subsequence
dp[i] = max(dp[j] for j < i if a[j] < a[i]) + 1

# Residual: 
x = x + Dropout(GELU(Linear(x)))

Both build solution incrementally, reusing past state


4. Feedforward Network (FFN)

FFN(x) = max(0, xW1 + b1)W2 + b2     # Original (ReLU)
FFN(x) = GELU(xW1)W2                 # Modern (GPT, BERT)

Expansion → Compression

d_model → d_ff (4×) → d_model
  512  →  2048   →  512

Bottleneck? No — expansion allows richer features


5. Layer Normalization (LayerNorm)

Why not BatchNorm?

  • BatchNorm: stats over batch → bad for RNNs/Transformers
  • LayerNorm: stats over featuresbatch-independent

Formula

$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$

  • $ \mu, \sigma^2 $: mean/variance per token, over $ d_model $
  • $ \gamma, \beta $: learnable scale & bias

6. Pre-Norm vs Post-Norm

Post-Norm (Original) Pre-Norm (Modern)
LayerNorm(x + Attn(x)) x + Attn(LayerNorm(x))
Unstable at deep layers Better training stability
Used in early Transformers Used in GPT, T5, LLaMA

Pre-Norm wins in practice


7. Full Implementation: Pre-Norm Residual Block

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
    def forward(self, x):
        return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ff = FeedForward(d_model, d_ff, dropout)

    def forward(self, x, mask=None):
        # === Pre-Norm Residual ===
        # 1. Attention path
        attn_in = self.norm1(x)
        attn_out, attn_weights = self.attn(attn_in, attn_in, attn_in, mask)
        x = x + attn_out  # Residual

        # 2. Feedforward path
        ff_in = self.norm2(x)
        ff_out = self.ff(ff_in)
        x = x + ff_out    # Residual

        return x, attn_weights

8. Memoization Intuition: "Remember & Refine"

# Like caching intermediate results
cache = {}
def fib(n):
    if n in cache: return cache[n]  # Memoization
    if n <= 1: return n
    cache[n] = fib(n-1) + fib(n-2)
    return cache[n]
# Residual = "remember x, refine with f(x)"
x = x + f(x)  # x is "cached", f(x) is "update"

Each layer refines the representation, never forgets


9. Visualization: Gradient Flow

import torch
import matplotlib.pyplot as plt

# Simulate 100-layer network
layers = 100
x = torch.randn(1, 32, 512, requires_grad=True)
grads = []

for i in range(layers):
    x = x + torch.randn_like(x) * 0.1  # Residual update
    x.backward(torch.ones_like(x), retain_graph=True)
    grads.append(x.grad.abs().mean().item())
    x.grad.zero_()

plt.plot(grads)
plt.title("Gradient Magnitude per Layer (Residual)")
plt.xlabel("Layer")
plt.ylabel("|∇|")
plt.yscale('log')
plt.show()

Gradients stay stabledeep training possible


10. LayerNorm Internals

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    x_norm = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_norm + beta

Per-token normalization

Token 1: [0.1, 2.3, -1.2] → μ=0.4, σ=1.5 → normalized
Token 2: [5.0, 5.1, 4.9] → μ=5.0, σ=0.1 → normalized

Each token has its own stats


11. Full Training Loop (Copy Task)

# Model
model = nn.Sequential(
    nn.Embedding(10, 512),
    TransformerBlock(d_model=512, num_heads=8),
    nn.LayerNorm(512),
    nn.Linear(512, 10)
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    src = torch.randint(0, 5, (32, 20))
    tgt = src.clone()

    logits = model[0](src)
    for block in model[1:-2]:  # if stacked
        logits, _ = block(logits)
    logits = model[-2](logits)
    logits = model[-1](logits)

    loss = criterion(logits.view(-1, 10), tgt.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

12. Summary Cheat Sheet

Component Purpose Key Property
Residual x + f(x) Gradient highway
LayerNorm Normalize per token Training stability
Pre-Norm x + f(LN(x)) Better deep training
FFN GELU(xW1)W2 Non-linear transform
Memoization Reuse x Incremental learning

13. Practice Exercises

  1. Ablate Residual: Remove + x → training fails at depth > 6.
  2. Ablate LayerNorm: Replace with identity → unstable.
  3. Post-Norm vs Pre-Norm: Train 12-layer model → compare loss curves.
  4. Dynamic Programming: Implement edit_distance with DP → map to residual.
  5. Visualize: Plot x, f(x), x + f(x) for one layer.

14. Key Takeaways

Check Insight
Check Residual = Identity + Update = DP Memoization
Check LayerNorm = per-token standardization
Check Pre-Norm > Post-Norm for deep models
Check FFN = expansion for capacity
Check Together: stable, deep, expressive

Full Copy-Paste Code

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h, dropout=0.1):
        super().__init__()
        self.d_k = d_model // h
        self.h貧 = h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x, mask=None):
        Q = self.W_q(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
        K = self.W_k(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
        V = self.W_v(x).view(x.size(0), -1, self.h, self.d_k).transpose(1,2)
        scores = (Q @ K.transpose(-2,-1)) / (self.d_k**0.5)
        if mask is not None: scores = scores.masked_fill(mask==0, -1e9)
        attn = self.dropout(torch.softmax(scores, dim=-1))
        out = (attn @ V).transpose(1,2).contiguous().view(x.size(0), -1, x.size(-1))
        return self.W_o(out), attn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, h, dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
    def forward(self, x, mask=None):
        x = x + self.attn(self.norm1(x), mask)[0]
        x = x + self.ff(self.norm2(x))
        return x, None

Final Words

Residual + LayerNorm = The reason Transformers scale to 175B parameters.

You now understand:
- Why gradients don’t die
- How each layer refines
- Why Pre-Norm is king
- The DP connection


End of Module
You just built the stable backbone of every modern LLM.
Stack 100 layers. Train for a week. Change the world.

Last updated: Nov 13, 2025