"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
Module Objective
Deep dive into Positional Encoding — signal processing, hashing, Fourier theory, and Sinusoidal vs Learned — with math, code, visualization, and ablation.
1. The Problem: Attention is Permutation-Invariant
X = ["the", "cat", "sat"]
Attention(X) == Attention(["sat", "cat", "the"])
No order → no meaning
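A minimal numeric check of this claim (a sketch using a toy, projection-free single-head attention; real attention adds learned W_Q, W_K, W_V, but the permutation argument is unchanged):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(3, 8)            # 3 token embeddings ("the", "cat", "sat"), d_model = 8
perm = torch.tensor([2, 1, 0])   # reorder to "sat", "cat", "the"

def self_attention(x):
    # Toy single-head self-attention with identity projections
    scores = x @ x.T / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

out = self_attention(x)
out_perm = self_attention(x[perm])

# Permuting the input only permutes the output rows — the vectors themselves are identical
print(torch.allclose(out[perm], out_perm, atol=1e-6))   # True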
2. Two Solutions
| Type | Mechanism | Learnable? | Max Length |
|---|---|---|---|
| Sinusoidal (Fixed) | Wave functions | No | Infinite |
| Learned (Trainable) | Embedding table | Yes | Fixed |
3. Sinusoidal PE — Signal Processing View
Formula (Original Paper)
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
$$
Each pair of dimensions (2i, 2i+1) = one sine/cosine wave with its own frequency $\omega_i = 10000^{-2i/d}$
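A tiny numeric sketch of the formula (toy values: d = 4, so two frequency pairs, evaluated at pos = 3) just to make the indexing concrete:

import math

d = 4       # toy model dimension → two frequency pairs (i = 0, 1)
pos = 3     # example position

pe = []
for i in range(d // 2):
    omega = 1.0 / (10000 ** (2 * i / d))    # omega_i = 10000^(-2i/d)
    pe.append(math.sin(pos * omega))        # dimension 2i
    pe.append(math.cos(pos * omega))        # dimension 2i + 1

print([round(v, 4) for v in pe])
# [sin(3), cos(3), sin(0.03), cos(0.03)] ≈ [0.1411, -0.99, 0.03, 0.9996]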
4. Signal Processing Interpretation
import torch
import matplotlib.pyplot as plt
import numpy as np
def plot_sinusoidal_pe(d_model=16, max_pos=20):
    pos = torch.arange(max_pos).unsqueeze(1)   # (max_pos, 1)
    i = torch.arange(0, d_model, 2)            # even dimension indices 0, 2, ..., d_model-2
    div_term = torch.exp(i * -torch.log(torch.tensor(10000.0)) / d_model)   # 10000^(-i/d_model)
    pe = torch.zeros(max_pos, d_model)
    pe[:, 0::2] = torch.sin(pos * div_term)
    pe[:, 1::2] = torch.cos(pos * div_term)
    plt.figure(figsize=(12, 6))
    for dim in range(0, d_model, 2):
        plt.plot(pos.squeeze(), pe[:, dim], label=f"dim {dim}")
    plt.legend()
    plt.xlabel("Position")
    plt.ylabel("PE Value")
    plt.title("Sinusoidal PE: Different Frequencies per Dimension")
    plt.grid(True, alpha=0.3)
    plt.show()
    return pe   # reused in Sections 5 and 6

pe = plot_sinusoidal_pe()
Low dims (small i) → fast waves → fine-grained local patterns
High dims (large i) → slow waves → long-range patterns
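To see the scale spread explicitly, a short sketch (same d_model = 16 as above) printing each pair's wavelength $2\pi \cdot 10000^{2i/d}$:

import math

d_model = 16
for i in range(d_model // 2):
    wavelength = 2 * math.pi * (10000 ** (2 * i / d_model))
    print(f"dims ({2 * i}, {2 * i + 1}): wavelength ≈ {wavelength:,.1f} positions")
# ranges from ≈ 6.3 positions (dims 0, 1) up to ≈ 19,869.2 positions (dims 14, 15)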
5. Fourier Basis: Why It Works
Any sufficiently well-behaved function can be written as a sum of sines and cosines (a Fourier series)
→ the PE dimensions span a rich, multi-scale frequency space
# Relative distance encoding: the dot product depends only on the offset
pos_i = 5
max_pos = pe.size(0)
correlations = []
for offset in range(-10, 11):
    if 0 <= pos_i + offset < max_pos:
        corr = torch.dot(pe[pos_i], pe[pos_i + offset])
        correlations.append((offset, corr.item()))

offsets, corrs = zip(*correlations)
plt.plot(offsets, corrs, 'o-')
plt.title("PE Correlation vs Relative Position")
plt.xlabel("Position Offset")
plt.ylabel("Dot Product")
plt.show()
Model can compute relative position via dot product!
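Why this works, in one step: with $\omega_i = 10000^{-2i/d}$, the angle-difference identity gives, for each frequency pair,
$$
\sin(p\,\omega_i)\sin(q\,\omega_i) + \cos(p\,\omega_i)\cos(q\,\omega_i) = \cos\big((p-q)\,\omega_i\big)
$$
so
$$
PE_p \cdot PE_q = \sum_i \cos\big((p-q)\,\omega_i\big)
$$
— a function of the offset $p-q$ only, not of the absolute positions.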
6. Hashing Perspective: Sinusoidal PE as Locality-Sensitive Hash
Idea: Similar positions → similar PE vectors
from sklearn.metrics.pairwise import cosine_similarity

# Build a longer PE table so positions 100 and 105 exist (Section 4 only built 20 positions);
# this also redraws the plot, now over 200 positions
pe_long = plot_sinusoidal_pe(d_model=16, max_pos=200)

pos1, pos2 = 100, 105
pe1 = pe_long[pos1].unsqueeze(0)
pe2 = pe_long[pos2].unsqueeze(0)
sim = cosine_similarity(pe1.numpy(), pe2.numpy())[0][0]
print(f"Cosine sim(pos=100, 105) = {sim:.3f}")   # high — nearby positions hash to similar vectors
LSH property:
$ \text{sim}(PE_i, PE_j) $ is maximal at $ i = j $ and decays as $ \lvert i-j \rvert $ grows (see the dot-product identity above)
→ Attention can infer distance without explicit position IDs
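A quick sweep (reusing pe_long and cosine_similarity from the snippet above) that makes the decay visible; the individual cosines oscillate, so the fall-off is not perfectly monotonic, but nearby positions are clearly the most similar:

# Cosine similarity from a fixed anchor position to increasingly distant positions
anchor = 100
for dist in [1, 2, 5, 10, 20, 50]:
    a = pe_long[anchor].unsqueeze(0).numpy()
    b = pe_long[anchor + dist].unsqueeze(0).numpy()
    print(f"|i-j| = {dist:>2}: cosine sim = {cosine_similarity(a, b)[0][0]:.3f}")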
7. Learned Positional Encoding
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        # One trainable d_model-dimensional vector per position, up to max_len
        self.pe = nn.Embedding(max_len, d_model)
        nn.init.normal_(self.pe.weight, std=0.02)

    def forward(self, x):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, device=x.device)
        return x + self.pe(pos)   # (seq_len, d_model) broadcasts over the batch dimension
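A minimal usage sketch (made-up sizes, shapes only):

x = torch.randn(2, 10, 32)                                    # (batch, seq_len, d_model)
pe_layer = LearnedPositionalEncoding(d_model=32, max_len=512)
print(pe_layer(x).shape)                                      # torch.Size([2, 10, 32]) — same shape, position info added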
8. Sinusoidal vs Learned: Ablation Study
import torch.optim as optim
def train_copy_task(model_cls, use_learned_pe=False, max_len=20):
    # model_cls is the TransformerBlock from the attention module; it is assumed
    # to accept use_learned_pe and to return (output, attention_weights)
    model = nn.Sequential(            # used only as a parameter container; layers are called one by one below
        nn.Embedding(10, 16),
        model_cls(d_model=16, num_heads=4, use_learned_pe=use_learned_pe),
        nn.Linear(16, 10)
    )
opt = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
losses = []
    for step in range(300):   # each step trains on a fresh random batch
src = torch.randint(0, 5, (32, max_len))
tgt = src.clone()
logits = model[0](src)
logits = model[1](logits)[0]
logits = model[2](logits)
loss = criterion(logits.view(-1, 10), tgt.view(-1))
opt.zero_grad()
loss.backward()
opt.step()
losses.append(loss.item())
return losses
# Run both
loss_sine = train_copy_task(TransformerBlock, use_learned_pe=False)
loss_learned = train_copy_task(TransformerBlock, use_learned_pe=True)
plt.plot(loss_sine, label="Sinusoidal PE")
plt.plot(loss_learned, label="Learned PE")
plt.legend()
plt.title("Copy Task: Sinusoidal vs Learned PE")
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.show()
Typical result (exact curves vary with seed and task):
- Sinusoidal: Faster convergence, better generalization
- Learned: Can overfit to training length
9. Extrapolation Test: Can It Handle Longer Sequences?
# Train on max_len=20
model_sine = ... # trained with sinusoidal
model_learned = ... # trained with learned (max_len=20)
# Test on length 50
long_seq = torch.randint(0, 5, (1, 50))
with torch.no_grad():
out_sine = model_sine(long_seq)
# out_learned → IndexError! (Embedding size = 20)
Sinusoidal: Works for any length
Learned: Limited to training length
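A self-contained way to see the failure mode, using just the two PE modules defined on this page (Section 7 and the full listing at the end), with no trained model required:

import torch

sine_pe = SinusoidalPositionalEncoding(d_model=16)             # buffer covers max_len=5000 positions
learned_pe = LearnedPositionalEncoding(d_model=16, max_len=20)

x_long = torch.zeros(1, 50, 16)                                # sequence of length 50 > 20

print(sine_pe(x_long).shape)                                   # torch.Size([1, 50, 16]) — works fine

try:
    learned_pe(x_long)
except IndexError as e:                                        # on CPU, an out-of-range embedding lookup raises IndexError
    print("Learned PE fails past max_len:", e)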
10. Hashing Analogy: PE as Embedding Hash
| Concept | Sinusoidal PE | Learned PE |
|---|---|---|
| Hash function | $ \sin(pos \cdot \omega_i) $ | $ E[pos] $ (table lookup) |
| Collisions | Soft (nearby positions map to similar vectors) | Exact (a distinct row per position) |
| Input domain | $ \mathbb{R} $ (any position) | $ \{0, \dots, \text{max\_len}-1\} $ |
| Similarity vs. distance | Decays as $ \lvert i-j \rvert $ grows | Arbitrary (whatever training learns) |
Sinusoidal = continuous LSH
Learned = perfect hash (but limited domain)
11. Advanced: Rotary Positional Embedding (RoPE)
Used in LLaMA, PaLM — relative + rotation
def rotate_half(x):
    # Split the last dimension into two halves and rotate: (x1, x2) → (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_emb(q, k, cos, sin):
    # q, k: (B, H, N, d_k); cos, sin: (N, d_k), broadcast over batch and heads
    # Channels j and j + d_k/2 form a pair rotated by angle pos * omega_j
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
Each position is encoded as a rotation in the complex plane — and after rotation, the score $ q \cdot k $ depends only on the relative offset between the two positions
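A small check of that claim (a sketch assuming the rotate_half helper above; the cos/sin tables use the standard $10000^{-2j/d_k}$ frequencies): placing the same query/key content at positions (3, 7) and at (13, 17) gives identical scores, because only the offset of 4 matters.

import torch

torch.manual_seed(0)
d_k, max_pos = 8, 32

# Angle tables: theta_j(p) = p * 10000^(-2j/d_k), duplicated across both halves of the channels
inv_freq = 1.0 / (10000 ** (torch.arange(0, d_k, 2).float() / d_k))    # (d_k/2,)
angles = torch.outer(torch.arange(max_pos).float(), inv_freq)          # (max_pos, d_k/2)
cos = torch.cat([angles.cos(), angles.cos()], dim=-1)                  # (max_pos, d_k)
sin = torch.cat([angles.sin(), angles.sin()], dim=-1)

def rope(v, pos):
    # Rotate a single d_k-dimensional vector as if it sat at position `pos`
    return v * cos[pos] + rotate_half(v) * sin[pos]

q0, k0 = torch.randn(d_k), torch.randn(d_k)                # fixed "content" vectors

score_a = torch.dot(rope(q0, 3), rope(k0, 7))              # positions (3, 7)   → offset 4
score_b = torch.dot(rope(q0, 13), rope(k0, 17))            # positions (13, 17) → same offset 4
print(torch.allclose(score_a, score_b, atol=1e-5))         # True: the score depends only on the offset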
12. Summary Table
| Feature | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Learnable | No | Yes | No |
| Max Length | Infinite | Fixed | Infinite |
| Relative Pos | Yes (via dot) | No | Yes (explicit) |
| Signal Theory | Fourier basis | Arbitrary | Rotation |
| Hashing | LSH | Perfect | Geometric |
| Used In | Original Transformer | BERT, GPT-2 | LLaMA, PaLM |
13. Visualization: PE Heatmap
import seaborn as sns

pe_sine = SinusoidalPositionalEncoding(128, 100).pe[0].cpu().numpy()
pe_learned = LearnedPositionalEncoding(128, 100).pe.weight.detach().cpu().numpy()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(pe_sine, ax=ax1, cmap="RdYlBu", center=0)
sns.heatmap(pe_learned, ax=ax2, cmap="RdYlBu", center=0)
ax1.set_title("Sinusoidal PE")
ax2.set_title("Learned PE (Random Init)")
plt.show()
14. Practice Exercises
- Fourier Analysis: Compute the FFT of each PE dimension across positions (a starting sketch follows this list).
- Hash Collision: Measure cosine similarity for $ |i-j| = 1, 5, 10 $.
- Ablation: Train the copy task without any PE and watch accuracy collapse toward chance.
- Hybrid: Use sinusoidal + learned (T5-style).
- RoPE: Implement and compare with sinusoidal.
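A possible starting point for the Fourier exercise — a sketch assuming the `pe` table from Section 4 (shape `(max_pos, d_model)`); `np.fft.rfft` gives each dimension's spectrum across positions:

import numpy as np
import matplotlib.pyplot as plt

pe_np = pe.numpy()                                 # (max_pos, d_model) table from Section 4
spectrum = np.abs(np.fft.rfft(pe_np, axis=0))      # magnitude spectrum of each dimension

plt.imshow(spectrum.T, aspect="auto", origin="lower", cmap="viridis")
plt.xlabel("Frequency bin")
plt.ylabel("PE dimension")
plt.title("FFT magnitude of each PE dimension")
plt.colorbar()
plt.show()
# Each dimension concentrates its energy around a single frequency — one sinusoid per dimension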
15. Key Takeaways
- Sinusoidal PE = Fourier basis + locality-sensitive hash
- Learned PE = flexible but length-limited
- Relative position emerges from the dot product
- Sinusoidal generalizes to any length
- RoPE = modern geometric alternative
Full Code: Sinusoidal vs Learned
import torch
import torch.nn as nn
# === Sinusoidal ===
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        # 10000^(-2i/d_model) for each even dimension index
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        # Buffer: saved with the model and moved to the right device, but never trained
        self.register_buffer('pe', pe.unsqueeze(0))   # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]

# === Learned ===
class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)   # one trainable vector per position

    def forward(self, x):
        pos = torch.arange(x.size(1), device=x.device)
        return x + self.pe(pos)   # raises an error if x.size(1) > max_len
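A quick smoke test of both modules (hypothetical sizes, just to confirm that the shapes line up):

x = torch.randn(2, 10, 64)     # (batch, seq_len, d_model)

sine = SinusoidalPositionalEncoding(d_model=64)
learned = LearnedPositionalEncoding(d_model=64, max_len=512)

print(sine(x).shape, learned(x).shape)   # both torch.Size([2, 10, 64])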
Final Words
Positional Encoding is not just a hack
→ It’s signal processing, hashing, and geometry in disguise.
You now understand:
- Why sinusoidal works
- Why learned fails to extrapolate
- How relative position emerges
- Modern RoPE alternative
End of Module
You control time in neural networks.
Next: Stack 12 layers → build a Transformer!
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
"Attention is All You Need" — Positional Encoding
Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE
Module Objective
Deep dive into Positional Encoding — signal processing, hashing, Fourier theory, and Sinusoidal vs Learned — with math, code, visualization, and ablation.
1. The Problem: Attention is Permutation-Invariant
X = ["the", "cat", "sat"]
Attention(X) == Attention(["sat", "cat", "the"])
No order → no meaning
2. Two Solutions
| Type | Mechanism | Learnable? | Max Length |
|---|---|---|---|
| Sinusoidal (Fixed) | Wave functions | No | Infinite |
| Learned (Trainable) | Embedding table | Yes | Fixed |
3. Sinusoidal PE — Signal Processing View
Formula (Original Paper)
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
$$
Each dimension = a sine wave with different frequency
4. Signal Processing Interpretation
import torch
import matplotlib.pyplot as plt
import numpy as np
def plot_sinusoidal_pe(d_model=16, max_pos=20):
pos = torch.arange(max_pos).unsqueeze(1)
i = torch.arange(0, d_model, 2)
div_term = torch.exp(i * -torch.log(torch.tensor(10000.0)) / d_model)
pe_even = torch.sin(pos * div_term)
pe_odd = torch.cos(pos * div_term)
pe = torch.zeros(max_pos, d_model)
pe[:, 0::2] = pe_even
pe[:, 1::2] = pe_odd
plt.figure(figsize=(12, 6))
for dim in range(0, d_model, 2):
plt.plot(pos, pe[:, dim], label=f"dim {dim}" if dim < 6 else "")
plt.legend()
plt.xlabel("Position")
plt.ylabel("PE Value")
plt.title("Sinusoidal PE: Different Frequencies per Dimension")
plt.grid(True, alpha=0.3)
plt.show()
plot_sinusoidal_pe()
Low dims → slow waves → long-range patterns
High dims → fast waves → fine-grained local patterns
5. Fourier Basis: Why It Works
Any smooth function can be represented as sum of sines/cosines
→ PE spans a rich frequency space
# Relative distance encoding
pos_i, pos_j = 5, 10
pe_i = pe[pos_i]
pe_j = pe[pos_j]
# Dot product peaks at fixed relative distance
dist = 5
correlations = []
for offset in range(-10, 11):
if 0 <= pos_i + offset < max_pos:
corr = torch.dot(pe[pos_i], pe[pos_i + offset])
correlations.append((offset, corr.item()))
offsets, corrs = zip(*correlations)
plt.plot(offsets, corrs, 'o-')
plt.title("PE Correlation vs Relative Position")
plt.xlabel("Position Offset")
plt.ylabel("Dot Product")
plt.show()
Model can compute relative position via dot product!
6. Hashing Perspective: Sinusoidal PE as Locality-Sensitive Hash
Idea: Similar positions → similar PE vectors
from sklearn.metrics.pairwise import cosine_similarity
pos1, pos2 = 100, 105
pe1 = pe[pos1].unsqueeze(0)
pe2 = pe[pos2].unsqueeze(0)
sim = cosine_similarity(pe1.numpy(), pe2.numpy())[0][0]
print(f"Cosine sim(pos=100, 105) = {sim:.3f}") # ~0.999
LSH property:
$ \text{sim}(PE_i, PE_j) \propto \exp(-|i-j|) $
→ Attention can infer distance without explicit position IDs
7. Learned Positional Encoding
class LearnedPositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=512):
super().__init__()
self.pe = nn.Embedding(max_len, d_model)
nn.init.normal_(self.pe.weight, std=0.02)
def forward(self, x):
seq_len = x.size(1)
pos = torch.arange(seq_len, device=x.device)
return x + self.pe(pos)
8. Sinusoidal vs Learned: Ablation Study
import torch.optim as optim
def train_copy_task(model_cls, use_learned_pe=False, max_len=20):
model = nn.Sequential(
nn.Embedding(10, 16),
model_cls(d_model=16, num_heads=4, use_learned_pe=use_learned_pe),
nn.Linear(16, 10)
)
opt = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
losses = []
for epoch in range(300):
src = torch.randint(0, 5, (32, max_len))
tgt = src.clone()
logits = model[0](src)
logits = model[1](logits)[0]
logits = model[2](logits)
loss = criterion(logits.view(-1, 10), tgt.view(-1))
opt.zero_grad()
loss.backward()
opt.step()
losses.append(loss.item())
return losses
# Run both
loss_sine = train_copy_task(TransformerBlock, use_learned_pe=False)
loss_learned = train_copy_task(TransformerBlock, use_learned_pe=True)
plt.plot(loss_sine, label="Sinusoidal PE")
plt.plot(loss_learned, label="Learned PE")
plt.legend()
plt.title("Copy Task: Sinusoidal vs Learned PE")
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.show()
Result:
- Sinusoidal: Faster convergence, better generalization
- Learned: Can overfit to training length
9. Extrapolation Test: Can It Handle Longer Sequences?
# Train on max_len=20
model_sine = ... # trained with sinusoidal
model_learned = ... # trained with learned (max_len=20)
# Test on length 50
long_seq = torch.randint(0, 5, (1, 50))
with torch.no_grad():
out_sine = model_sine(long_seq)
# out_learned → IndexError! (Embedding size = 20)
Sinusoidal: Works for any length
Learned: Limited to training length
10. Hashing Analogy: PE as Embedding Hash
| Concept | Sinusoidal PE | Learned PE |
|---|---|---|
| Hash Function | $ \sin(pos \cdot \omega_i) $ | $ E[pos] $ |
| Collision | Smooth | Discrete |
| Range | $ \mathbb{R} $ | $ \mathbb{R}^d $ |
| Collision Probability | $ \propto \exp(- | i-j |
Sinusoidal = continuous LSH
Learned = perfect hash (but limited domain)
11. Advanced: Rotary Positional Embedding (RoPE)
Used in LLaMA, PaLM — relative + rotation
def apply_rotary_emb(q, k, freqs):
# q, k: (B, H, N, d_k)
q_real, q_imag = q[..., :d_k//2], q[..., d_k//2:]
k_real, k_imag = k[..., :d_k//2], k[..., d_k//2:]
# Rotate
q_rot = torch.cat([-q_imag, q_real], dim=-1) * freqs
k_rot = torch.cat([-k_imag, k_real], dim=-1) * freqs
return q_rot + q, k_rot + k
Preserves absolute position via rotation in complex plane
12. Summary Table
| Feature | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Learnable | No | Yes | No |
| Max Length | Infinite | Fixed | Infinite |
| Relative Pos | Yes (via dot) | No | Yes (explicit) |
| Signal Theory | Fourier basis | Arbitrary | Rotation |
| Hashing | LSH | Perfect | Geometric |
| Used In | GPT-2, BERT | Early Transformers | LLaMA, PaLM |
13. Visualization: PE Heatmap
pe_sine = SinusoidalPositionalEncoding(128, 100).pe[0].cpu().numpy()
pe_learned = LearnedPositionalEncoding(128, 100).pe.weight.detach().cpu().numpy()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(pe_sine, ax=ax1, cmap="RdYlBu", center=0)
sns.heatmap(pe_learned, ax=ax2, cmap="RdYlBu", center=0)
ax1.set_title("Sinusoidal PE")
ax2.set_title("Learned PE (Random Init)")
plt.show()
14. Practice Exercises
- Fourier Analysis: Compute FFT of PE across positions.
- Hash Collision: Measure cosine sim for $ |i-j| = 1, 5, 10 $.
- Ablation: Train without PE → accuracy drops to ~10%.
- Hybrid: Use sinusoidal + learned (T5-style).
- RoPE: Implement and compare with sinusoidal.
15. Key Takeaways
| Check | Insight |
|---|---|
| Check | Sinusoidal PE = Fourier basis + LSH |
| Check | Learned PE = flexible but length-limited |
| Check | Relative position emerges from dot product |
| Check | Sinusoidal generalizes to any length |
| Check | RoPE = modern geometric alternative |
Full Code: Sinusoidal vs Learned
import torch
import torch.nn as nn
# === Sinusoidal ===
class SinusoidalPositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
pe = torch.zeros(max_len, d_model)
pos = torch.arange(0, max_len).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * -(torch.log(torch.tensor(10000.0)) / d_model))
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)
self.register_buffer('pe', pe.unsqueeze(0))
def forward(self, x):
return x + self.pe[:, :x.size(1)]
# === Learned ===
class LearnedPositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=512):
super().__init__()
self.pe = nn.Embedding(max_len, d_model)
def forward(self, x):
pos = torch.arange(x.size(1), device=x.device)
return x + self.pe(pos)
Final Words
Positional Encoding is not just a hack
→ It’s signal processing, hashing, and geometry in disguise.
You now understand:
- Why sinusoidal works
- Why learned fails to extrapolate
- How relative position emerges
- Modern RoPE alternative
End of Module
You control time in neural networks.
Next: Stack 12 layers → build a Transformer!