"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
Module Objective
Deep dive into Positional Encoding — signal processing, hashing, Fourier theory, and Sinusoidal vs Learned — with math, code, visualization, and ablation.
1. The Problem: Attention is Permutation-Invariant
X = ["the", "cat", "sat"]
Attention(X) == Attention(["sat", "cat", "the"])
No order → no meaning
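A minimal numeric check of this claim (a sketch using a toy, projection-free single-head attention; real attention adds learned W_Q, W_K, W_V, but the permutation argument is unchanged):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(3, 8)            # 3 token embeddings ("the", "cat", "sat"), d_model = 8
perm = torch.tensor([2, 1, 0])   # reorder to "sat", "cat", "the"

def self_attention(x):
    # Toy single-head self-attention with identity projections
    scores = x @ x.T / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

out = self_attention(x)
out_perm = self_attention(x[perm])

# Permuting the input only permutes the output rows — the vectors themselves are identical
print(torch.allclose(out[perm], out_perm, atol=1e-6))   # True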
2. Two Solutions
| Type | Mechanism | Learnable? | Max Length |
|---|---|---|---|
| Sinusoidal (Fixed) | Wave functions | No | Infinite |
| Learned (Trainable) | Embedding table | Yes | Fixed |
3. Sinusoidal PE — Signal Processing View
Formula (Original Paper)
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
$$
Each pair of dimensions (2i, 2i+1) = one sine/cosine wave with its own frequency $\omega_i = 10000^{-2i/d}$
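A tiny numeric sketch of the formula (toy values: d = 4, so two frequency pairs, evaluated at pos = 3) just to make the indexing concrete:

import math

d = 4       # toy model dimension → two frequency pairs (i = 0, 1)
pos = 3     # example position

pe = []
for i in range(d // 2):
    omega = 1.0 / (10000 ** (2 * i / d))    # omega_i = 10000^(-2i/d)
    pe.append(math.sin(pos * omega))        # dimension 2i
    pe.append(math.cos(pos * omega))        # dimension 2i + 1

print([round(v, 4) for v in pe])
# [sin(3), cos(3), sin(0.03), cos(0.03)] ≈ [0.1411, -0.99, 0.03, 0.9996]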
4. Signal Processing Interpretation
import torch
import matplotlib.pyplot as plt
import numpy as np
def plot_sinusoidal_pe(d_model=16, max_pos=20):
    pos = torch.arange(max_pos).unsqueeze(1)   # (max_pos, 1)
    i = torch.arange(0, d_model, 2)            # even dimension indices 0, 2, ..., d_model-2
    div_term = torch.exp(i * -torch.log(torch.tensor(10000.0)) / d_model)   # 10000^(-i/d_model)
    pe = torch.zeros(max_pos, d_model)
    pe[:, 0::2] = torch.sin(pos * div_term)
    pe[:, 1::2] = torch.cos(pos * div_term)
    plt.figure(figsize=(12, 6))
    for dim in range(0, d_model, 2):
        plt.plot(pos.squeeze(), pe[:, dim], label=f"dim {dim}")
    plt.legend()
    plt.xlabel("Position")
    plt.ylabel("PE Value")
    plt.title("Sinusoidal PE: Different Frequencies per Dimension")
    plt.grid(True, alpha=0.3)
    plt.show()
    return pe   # reused in Sections 5 and 6

pe = plot_sinusoidal_pe()
Low dims (small i) → fast waves → fine-grained local patterns
High dims (large i) → slow waves → long-range patterns
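To see the scale spread explicitly, a short sketch (same d_model = 16 as above) printing each pair's wavelength $2\pi \cdot 10000^{2i/d}$:

import math

d_model = 16
for i in range(d_model // 2):
    wavelength = 2 * math.pi * (10000 ** (2 * i / d_model))
    print(f"dims ({2 * i}, {2 * i + 1}): wavelength ≈ {wavelength:,.1f} positions")
# ranges from ≈ 6.3 positions (dims 0, 1) up to ≈ 19,869.2 positions (dims 14, 15)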
5. Fourier Basis: Why It Works
Any sufficiently well-behaved function can be written as a sum of sines and cosines (a Fourier series)
→ the PE dimensions span a rich, multi-scale frequency space
# Relative distance encoding: the dot product depends only on the offset
pos_i = 5
max_pos = pe.size(0)
correlations = []
for offset in range(-10, 11):
    if 0 <= pos_i + offset < max_pos:
        corr = torch.dot(pe[pos_i], pe[pos_i + offset])
        correlations.append((offset, corr.item()))

offsets, corrs = zip(*correlations)
plt.plot(offsets, corrs, 'o-')
plt.title("PE Correlation vs Relative Position")
plt.xlabel("Position Offset")
plt.ylabel("Dot Product")
plt.show()
Model can compute relative position via dot product!
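Why this works, in one step: with $\omega_i = 10000^{-2i/d}$, the angle-difference identity gives, for each frequency pair,
$$
\sin(p\,\omega_i)\sin(q\,\omega_i) + \cos(p\,\omega_i)\cos(q\,\omega_i) = \cos\big((p-q)\,\omega_i\big)
$$
so
$$
PE_p \cdot PE_q = \sum_i \cos\big((p-q)\,\omega_i\big)
$$
— a function of the offset $p-q$ only, not of the absolute positions.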
6. Hashing Perspective: Sinusoidal PE as Locality-Sensitive Hash
Idea: Similar positions → similar PE vectors
from sklearn.metrics.pairwise import cosine_similarity

# Build a longer PE table so positions 100 and 105 exist (Section 4 only built 20 positions);
# this also redraws the plot, now over 200 positions
pe_long = plot_sinusoidal_pe(d_model=16, max_pos=200)

pos1, pos2 = 100, 105
pe1 = pe_long[pos1].unsqueeze(0)
pe2 = pe_long[pos2].unsqueeze(0)
sim = cosine_similarity(pe1.numpy(), pe2.numpy())[0][0]
print(f"Cosine sim(pos=100, 105) = {sim:.3f}")   # high — nearby positions hash to similar vectors
LSH property:
$ \text{sim}(PE_i, PE_j) $ is maximal at $ i = j $ and decays as $ \lvert i-j \rvert $ grows (see the dot-product identity above)
→ Attention can infer distance without explicit position IDs
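A quick sweep (reusing pe_long and cosine_similarity from the snippet above) that makes the decay visible; the individual cosines oscillate, so the fall-off is not perfectly monotonic, but nearby positions are clearly the most similar:

# Cosine similarity from a fixed anchor position to increasingly distant positions
anchor = 100
for dist in [1, 2, 5, 10, 20, 50]:
    a = pe_long[anchor].unsqueeze(0).numpy()
    b = pe_long[anchor + dist].unsqueeze(0).numpy()
    print(f"|i-j| = {dist:>2}: cosine sim = {cosine_similarity(a, b)[0][0]:.3f}")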
7. Learned Positional Encoding
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        # One trainable d_model-dimensional vector per position, up to max_len
        self.pe = nn.Embedding(max_len, d_model)
        nn.init.normal_(self.pe.weight, std=0.02)

    def forward(self, x):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, device=x.device)
        return x + self.pe(pos)   # (seq_len, d_model) broadcasts over the batch dimension
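A minimal usage sketch (made-up sizes, shapes only):

x = torch.randn(2, 10, 32)                                    # (batch, seq_len, d_model)
pe_layer = LearnedPositionalEncoding(d_model=32, max_len=512)
print(pe_layer(x).shape)                                      # torch.Size([2, 10, 32]) — same shape, position info added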
8. Sinusoidal vs Learned: Ablation Study
import torch.optim as optim
def train_copy_task(model_cls, use_learned_pe=False, max_len=20):
    # model_cls is the TransformerBlock from the attention module; it is assumed
    # to accept use_learned_pe and to return (output, attention_weights)
    model = nn.Sequential(            # used only as a parameter container; layers are called one by one below
        nn.Embedding(10, 16),
        model_cls(d_model=16, num_heads=4, use_learned_pe=use_learned_pe),
        nn.Linear(16, 10)
    )
opt = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
losses = []
    for step in range(300):   # each step trains on a fresh random batch
src = torch.randint(0, 5, (32, max_len))
tgt = src.clone()
logits = model[0](src)
logits = model[1](logits)[0]
logits = model[2](logits)
loss = criterion(logits.view(-1, 10), tgt.view(-1))
opt.zero_grad()
loss.backward()
opt.step()
losses.append(loss.item())
return losses
# Run both
loss_sine = train_copy_task(TransformerBlock, use_learned_pe=False)
loss_learned = train_copy_task(TransformerBlock, use_learned_pe=True)
plt.plot(loss_sine, label="Sinusoidal PE")
plt.plot(loss_learned, label="Learned PE")
plt.legend()
plt.title("Copy Task: Sinusoidal vs Learned PE")
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.show()
Typical result (exact curves vary with seed and task):
- Sinusoidal: Faster convergence, better generalization
- Learned: Can overfit to training length
9. Extrapolation Test: Can It Handle Longer Sequences?
# Train on max_len=20
model_sine = ... # trained with sinusoidal
model_learned = ... # trained with learned (max_len=20)
# Test on length 50
long_seq = torch.randint(0, 5, (1, 50))
with torch.no_grad():
out_sine = model_sine(long_seq)
# out_learned → IndexError! (Embedding size = 20)
Sinusoidal: Works for any length
Learned: Limited to training length
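A self-contained way to see the failure mode, using just the two PE modules defined on this page (Section 7 and the full listing at the end), with no trained model required:

import torch

sine_pe = SinusoidalPositionalEncoding(d_model=16)             # buffer covers max_len=5000 positions
learned_pe = LearnedPositionalEncoding(d_model=16, max_len=20)

x_long = torch.zeros(1, 50, 16)                                # sequence of length 50 > 20

print(sine_pe(x_long).shape)                                   # torch.Size([1, 50, 16]) — works fine

try:
    learned_pe(x_long)
except IndexError as e:                                        # on CPU, an out-of-range embedding lookup raises IndexError
    print("Learned PE fails past max_len:", e)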
10. Hashing Analogy: PE as Embedding Hash
| Concept | Sinusoidal PE | Learned PE |
|---|---|---|
| Hash function | $ \sin(pos \cdot \omega_i) $ | $ E[pos] $ (table lookup) |
| Collisions | Soft (nearby positions map to similar vectors) | Exact (a distinct row per position) |
| Input domain | $ \mathbb{R} $ (any position) | $ \{0, \dots, \text{max\_len}-1\} $ |
| Similarity vs. distance | Decays as $ \lvert i-j \rvert $ grows | Arbitrary (whatever training learns) |
Sinusoidal = continuous LSH
Learned = perfect hash (but limited domain)
11. Advanced: Rotary Positional Embedding (RoPE)
Used in LLaMA, PaLM — relative + rotation
def rotate_half(x):
    # Split the last dimension into two halves and rotate: (x1, x2) → (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_emb(q, k, cos, sin):
    # q, k: (B, H, N, d_k); cos, sin: (N, d_k), broadcast over batch and heads
    # Channels j and j + d_k/2 form a pair rotated by angle pos * omega_j
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
Each position is encoded as a rotation in the complex plane — and after rotation, the score $ q \cdot k $ depends only on the relative offset between the two positions
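A small check of that claim (a sketch assuming the rotate_half helper above; the cos/sin tables use the standard $10000^{-2j/d_k}$ frequencies): placing the same query/key content at positions (3, 7) and at (13, 17) gives identical scores, because only the offset of 4 matters.

import torch

torch.manual_seed(0)
d_k, max_pos = 8, 32

# Angle tables: theta_j(p) = p * 10000^(-2j/d_k), duplicated across both halves of the channels
inv_freq = 1.0 / (10000 ** (torch.arange(0, d_k, 2).float() / d_k))    # (d_k/2,)
angles = torch.outer(torch.arange(max_pos).float(), inv_freq)          # (max_pos, d_k/2)
cos = torch.cat([angles.cos(), angles.cos()], dim=-1)                  # (max_pos, d_k)
sin = torch.cat([angles.sin(), angles.sin()], dim=-1)

def rope(v, pos):
    # Rotate a single d_k-dimensional vector as if it sat at position `pos`
    return v * cos[pos] + rotate_half(v) * sin[pos]

q0, k0 = torch.randn(d_k), torch.randn(d_k)                # fixed "content" vectors

score_a = torch.dot(rope(q0, 3), rope(k0, 7))              # positions (3, 7)   → offset 4
score_b = torch.dot(rope(q0, 13), rope(k0, 17))            # positions (13, 17) → same offset 4
print(torch.allclose(score_a, score_b, atol=1e-5))         # True: the score depends only on the offset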
12. Summary Table
| Feature | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Learnable | No | Yes | No |
| Max Length | Infinite | Fixed | Infinite |
| Relative Pos | Yes (via dot) | No | Yes (explicit) |
| Signal Theory | Fourier basis | Arbitrary | Rotation |
| Hashing | LSH | Perfect | Geometric |
| Used In | Original Transformer | BERT, GPT-2 | LLaMA, PaLM |
13. Visualization: PE Heatmap
import seaborn as sns

pe_sine = SinusoidalPositionalEncoding(128, 100).pe[0].cpu().numpy()
pe_learned = LearnedPositionalEncoding(128, 100).pe.weight.detach().cpu().numpy()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(pe_sine, ax=ax1, cmap="RdYlBu", center=0)
sns.heatmap(pe_learned, ax=ax2, cmap="RdYlBu", center=0)
ax1.set_title("Sinusoidal PE")
ax2.set_title("Learned PE (Random Init)")
plt.show()
14. Practice Exercises
- Fourier Analysis: Compute the FFT of each PE dimension across positions (a starting sketch follows this list).
- Hash Collision: Measure cosine similarity for $ |i-j| = 1, 5, 10 $.
- Ablation: Train the copy task without any PE and watch accuracy collapse toward chance.
- Hybrid: Use sinusoidal + learned (T5-style).
- RoPE: Implement and compare with sinusoidal.
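A possible starting point for the Fourier exercise — a sketch assuming the `pe` table from Section 4 (shape `(max_pos, d_model)`); `np.fft.rfft` gives each dimension's spectrum across positions:

import numpy as np
import matplotlib.pyplot as plt

pe_np = pe.numpy()                                 # (max_pos, d_model) table from Section 4
spectrum = np.abs(np.fft.rfft(pe_np, axis=0))      # magnitude spectrum of each dimension

plt.imshow(spectrum.T, aspect="auto", origin="lower", cmap="viridis")
plt.xlabel("Frequency bin")
plt.ylabel("PE dimension")
plt.title("FFT magnitude of each PE dimension")
plt.colorbar()
plt.show()
# Each dimension concentrates its energy around a single frequency — one sinusoid per dimension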
15. Key Takeaways
- Sinusoidal PE = Fourier basis + locality-sensitive hash
- Learned PE = flexible but length-limited
- Relative position emerges from the dot product
- Sinusoidal generalizes to any length
- RoPE = modern geometric alternative
Full Code: Sinusoidal vs Learned
import torch
import torch.nn as nn
# === Sinusoidal ===
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        # 10000^(-2i/d_model) for each even dimension index
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        # Buffer: saved with the model and moved to the right device, but never trained
        self.register_buffer('pe', pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

# === Learned ===
class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)   # one trainable vector per position

    def forward(self, x):
        pos = torch.arange(x.size(1), device=x.device)
        return x + self.pe(pos)   # raises an error if x.size(1) > max_len
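A quick smoke test of both modules (hypothetical sizes, just to confirm that the shapes line up):

x = torch.randn(2, 10, 64)     # (batch, seq_len, d_model)

sine = SinusoidalPositionalEncoding(d_model=64)
learned = LearnedPositionalEncoding(d_model=64, max_len=512)

print(sine(x).shape, learned(x).shape)   # both torch.Size([2, 10, 64])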
Final Words
Positional Encoding is not just a hack
→ It’s signal processing, hashing, and geometry in disguise.
You now understand:
- Why sinusoidal works
- Why learned fails to extrapolate
- How relative position emerges
- Modern RoPE alternative
End of Module
You control time in neural networks.
Next: Stack 12 layers → build a Transformer!
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
Module Objective
Deep dive into Positional Encoding — signal processing, hashing, Fourier theory, and Sinusoidal vs Learned — with math, code, visualization, and ablation.
1. The Problem: Attention is Permutation-Invariant
X = ["the", "cat", "sat"]
Attention(X) == Attention(["sat", "cat", "the"])
No order → no meaning
2. Two Solutions
| Type | Mechanism | Learnable? | Max Length |
|---|---|---|---|
| Sinusoidal (Fixed) | Wave functions | No | Infinite |
| Learned (Trainable) | Embedding table | Yes | Fixed |
3. Sinusoidal PE — Signal Processing View
Formula (Original Paper)
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
$$
Each dimension = a sine wave with different frequency
4. Signal Processing Interpretation
import torch
import matplotlib.pyplot as plt
import numpy as np
def plot_sinusoidal_pe(d_model=16, max_pos=20):
pos = torch.arange(max_pos).unsqueeze(1)
i = torch.arange(0, d_model, 2)
div_term = torch.exp(i * -torch.log(torch.tensor(10000.0)) / d_model)
pe_even = torch.sin(pos * div_term)
pe_odd = torch.cos(pos * div_term)
pe = torch.zeros(max_pos, d_model)
pe[:, 0::2] = pe_even
pe[:, 1::2] = pe_odd
plt.figure(figsize=(12, 6))
for dim in range(0, d_model, 2):
plt.plot(pos, pe[:, dim], label=f"dim {dim}" if dim < 6 else "")
plt.legend()
plt.xlabel("Position")
plt.ylabel("PE Value")
plt.title("Sinusoidal PE: Different Frequencies per Dimension")
plt.grid(True, alpha=0.3)
plt.show()
plot_sinusoidal_pe()
Low dims → slow waves → long-range patterns
High dims → fast waves → fine-grained local patterns
5. Fourier Basis: Why It Works
Any smooth function can be represented as sum of sines/cosines
→ PE spans a rich frequency space
# Relative distance encoding
pos_i, pos_j = 5, 10
pe_i = pe[pos_i]
pe_j = pe[pos_j]
# Dot product peaks at fixed relative distance
dist = 5
correlations = []
for offset in range(-10, 11):
if 0 <= pos_i + offset < max_pos:
corr = torch.dot(pe[pos_i], pe[pos_i + offset])
correlations.append((offset, corr.item()))
offsets, corrs = zip(*correlations)
plt.plot(offsets, corrs, 'o-')
plt.title("PE Correlation vs Relative Position")
plt.xlabel("Position Offset")
plt.ylabel("Dot Product")
plt.show()
Model can compute relative position via dot product!
6. Hashing Perspective: Sinusoidal PE as Locality-Sensitive Hash
Idea: Similar positions → similar PE vectors
from sklearn.metrics.pairwise import cosine_similarity
pos1, pos2 = 100, 105
pe1 = pe[pos1].unsqueeze(0)
pe2 = pe[pos2].unsqueeze(0)
sim = cosine_similarity(pe1.numpy(), pe2.numpy())[0][0]
print(f"Cosine sim(pos=100, 105) = {sim:.3f}") # ~0.999
LSH property:
$ \text{sim}(PE_i, PE_j) \propto \exp(-|i-j|) $
→ Attention can infer distance without explicit position IDs
7. Learned Positional Encoding
class LearnedPositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=512):
super().__init__()
self.pe = nn.Embedding(max_len, d_model)
nn.init.normal_(self.pe.weight, std=0.02)
def forward(self, x):
seq_len = x.size(1)
pos = torch.arange(seq_len, device=x.device)
return x + self.pe(pos)
8. Sinusoidal vs Learned: Ablation Study
import torch.optim as optim
def train_copy_task(model_cls, use_learned_pe=False, max_len=20):
model = nn.Sequential(
nn.Embedding(10, 16),
model_cls(d_model=16, num_heads=4, use_learned_pe=use_learned_pe),
nn.Linear(16, 10)
)
opt = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
losses = []
for epoch in range(300):
src = torch.randint(0, 5, (32, max_len))
tgt = src.clone()
logits = model[0](src)
logits = model[1](logits)[0]
logits = model[2](logits)
loss = criterion(logits.view(-1, 10), tgt.view(-1))
opt.zero_grad()
loss.backward()
opt.step()
losses.append(loss.item())
return losses
# Run both
loss_sine = train_copy_task(TransformerBlock, use_learned_pe=False)
loss_learned = train_copy_task(TransformerBlock, use_learned_pe=True)
plt.plot(loss_sine, label="Sinusoidal PE")
plt.plot(loss_learned, label="Learned PE")
plt.legend()
plt.title("Copy Task: Sinusoidal vs Learned PE")
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.show()
Result:
- Sinusoidal: Faster convergence, better generalization
- Learned: Can overfit to training length
9. Extrapolation Test: Can It Handle Longer Sequences?
# Train on max_len=20
model_sine = ... # trained with sinusoidal
model_learned = ... # trained with learned (max_len=20)
# Test on length 50
long_seq = torch.randint(0, 5, (1, 50))
with torch.no_grad():
out_sine = model_sine(long_seq)
# out_learned → IndexError! (Embedding size = 20)
Sinusoidal: Works for any length
Learned: Limited to training length
10. Hashing Analogy: PE as Embedding Hash
| Concept | Sinusoidal PE | Learned PE |
|---|---|---|
| Hash Function | $ \sin(pos \cdot \omega_i) $ | $ E[pos] $ |
| Collision | Smooth | Discrete |
| Range | $ \mathbb{R} $ | $ \mathbb{R}^d $ |
| Collision Probability | $ \propto \exp(- | i-j |
Sinusoidal = continuous LSH
Learned = perfect hash (but limited domain)
11. Advanced: Rotary Positional Embedding (RoPE)
Used in LLaMA, PaLM — relative + rotation
def apply_rotary_emb(q, k, freqs):
# q, k: (B, H, N, d_k)
q_real, q_imag = q[..., :d_k//2], q[..., d_k//2:]
k_real, k_imag = k[..., :d_k//2], k[..., d_k//2:]
# Rotate
q_rot = torch.cat([-q_imag, q_real], dim=-1) * freqs
k_rot = torch.cat([-k_imag, k_real], dim=-1) * freqs
return q_rot + q, k_rot + k
Preserves absolute position via rotation in complex plane
12. Summary Table
| Feature | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Learnable | No | Yes | No |
| Max Length | Infinite | Fixed | Infinite |
| Relative Pos | Yes (via dot) | No | Yes (explicit) |
| Signal Theory | Fourier basis | Arbitrary | Rotation |
| Hashing | LSH | Perfect | Geometric |
| Used In | GPT-2, BERT | Early Transformers | LLaMA, PaLM |
13. Visualization: PE Heatmap
pe_sine = SinusoidalPositionalEncoding(128, 100).pe[0].cpu().numpy()
pe_learned = LearnedPositionalEncoding(128, 100).pe.weight.detach().cpu().numpy()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(pe_sine, ax=ax1, cmap="RdYlBu", center=0)
sns.heatmap(pe_learned, ax=ax2, cmap="RdYlBu", center=0)
ax1.set_title("Sinusoidal PE")
ax2.set_title("Learned PE (Random Init)")
plt.show()
14. Practice Exercises
- Fourier Analysis: Compute FFT of PE across positions.
- Hash Collision: Measure cosine sim for $ |i-j| = 1, 5, 10 $.
- Ablation: Train without PE → accuracy drops to ~10%.
- Hybrid: Use sinusoidal + learned (T5-style).
- RoPE: Implement and compare with sinusoidal.
15. Key Takeaways
| Check | Insight |
|---|---|
| Check | Sinusoidal PE = Fourier basis + LSH |
| Check | Learned PE = flexible but length-limited |
| Check | Relative position emerges from dot product |
| Check | Sinusoidal generalizes to any length |
| Check | RoPE = modern geometric alternative |
Full Code: Sinusoidal vs Learned
import torch
import torch.nn as nn
# === Sinusoidal ===
class SinusoidalPositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
pe = torch.zeros(max_len, d_model)
pos = torch.arange(0, max_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * -(torch.log(torch.tensor(10000.0)) / d_model))
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)
self.register_buffer('pe', pe.unsqueeze(0))
def forward(self, x):
return x + self.pe[:, :x.size(1)]
# === Learned ===
class LearnedPositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=512):
super().__init__()
self.pe = nn.Embedding(max_len, d_model)
def forward(self, x):
pos = torch.arange(x.size(1), device=x.device)
return x + self.pe(pos)
Final Words
Positional Encoding is not just a hack
→ It’s signal processing, hashing, and geometry in disguise.
You now understand:
- Why sinusoidal works
- Why learned fails to extrapolate
- How relative position emerges
- Modern RoPE alternative
End of Module
You control time in neural networks.
Next: Stack 12 layers → build a Transformer!