"Attention is All You Need" — Positional Encoding

Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE

"Attention is All You Need" — Positional Encoding

Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE

"Attention is All You Need" — Positional Encoding

Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE


Module Objective

Deep dive into Positional Encoding: signal processing, hashing, Fourier theory, and Sinusoidal vs Learned PE, with math, code, visualization, and ablation.


1. The Problem: Attention is Permutation-Invariant

X = ["the", "cat", "sat"]
Attention(X) and Attention(["sat", "cat", "the"]) produce the same token representations, just reordered

No order → no meaning
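
A minimal sketch to verify this (random weights, single head, no positional encoding; not the module's exact attention code):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def self_attention(X):                        # X: (seq_len, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = F.softmax(Q @ K.T / d ** 0.5, dim=-1)
    return A @ V

X = torch.randn(3, d)                         # embeddings for ["the", "cat", "sat"]
perm = torch.tensor([2, 1, 0])                # ["sat", "cat", "the"]

out, out_perm = self_attention(X), self_attention(X[perm])
print(torch.allclose(out[perm], out_perm))    # True: outputs are just permuted too

Each token's representation is identical no matter where it sits in the sequence, which is exactly the problem positional encoding solves.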


2. Two Solutions

Type                | Mechanism       | Learnable? | Max Length
Sinusoidal (Fixed)  | Wave functions  | No         | Infinite
Learned (Trainable) | Embedding table | Yes        | Fixed

3. Sinusoidal PE — Signal Processing View

Formula (Original Paper)

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
$$

Each pair of dimensions = one sinusoid (sin/cos) at its own frequency
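
The frequencies decrease geometrically across dimension pairs, so the wavelengths grow geometrically from $2\pi$ up to $10000 \cdot 2\pi$ positions:

$$
\omega_i = 10000^{-2i/d}, \qquad \lambda_i = \frac{2\pi}{\omega_i} = 2\pi \cdot 10000^{2i/d}
$$

The fastest pair resolves neighbouring tokens; the slowest pair changes appreciably only over thousands of positions.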


4. Signal Processing Interpretation

import torch
import matplotlib.pyplot as plt
import numpy as np

def plot_sinusoidal_pe(d_model=16, max_pos=20):
    pos = torch.arange(max_pos).unsqueeze(1)                                   # (max_pos, 1)
    i = torch.arange(0, d_model, 2)                                            # even dimension indices
    div_term = torch.exp(i * -torch.log(torch.tensor(10000.0)) / d_model)      # frequencies 10000^(-2i/d)

    pe = torch.zeros(max_pos, d_model)
    pe[:, 0::2] = torch.sin(pos * div_term)
    pe[:, 1::2] = torch.cos(pos * div_term)

    plt.figure(figsize=(12, 6))
    for dim in range(0, d_model, 2):
        plt.plot(pos, pe[:, dim], label=f"dim {dim}" if dim < 6 else None)
    plt.legend()
    plt.xlabel("Position")
    plt.ylabel("PE Value")
    plt.title("Sinusoidal PE: Different Frequencies per Dimension")
    plt.grid(True, alpha=0.3)
    plt.show()
    return pe

max_pos = 20
pe = plot_sinusoidal_pe(max_pos=max_pos)   # keep pe and max_pos around for the next sections

Low dims → fast waves → fine-grained local patterns
High dims → slow waves → long-range patterns


5. Fourier Basis: Why It Works

Any sufficiently smooth function can be represented as a sum of sines and cosines (Fourier basis)
The PE dimensions span a wide range of frequencies

# PE similarity as a function of relative offset (uses pe and max_pos from section 4)
pos_i = 5
correlations = []
for offset in range(-10, 11):
    if 0 <= pos_i + offset < max_pos:
        corr = torch.dot(pe[pos_i], pe[pos_i + offset])   # PE(i) . PE(i + offset)
        correlations.append((offset, corr.item()))

offsets, corrs = zip(*correlations)
plt.plot(offsets, corrs, 'o-')
plt.title("PE Correlation vs Relative Position")
plt.xlabel("Position Offset")
plt.ylabel("Dot Product")
plt.show()

Model can compute relative position via dot product!
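
Why the dot product depends only on the offset: applying $\sin a \sin b + \cos a \cos b = \cos(a - b)$ to each frequency pair gives

$$
PE_{pos} \cdot PE_{pos+\Delta}
= \sum_{i} \big[ \sin(\omega_i\, pos)\sin(\omega_i (pos+\Delta)) + \cos(\omega_i\, pos)\cos(\omega_i (pos+\Delta)) \big]
= \sum_{i} \cos(\omega_i \Delta), \qquad \omega_i = 10000^{-2i/d}
$$

The absolute position cancels, so the dot product is a function of the offset $\Delta$ alone and is largest at $\Delta = 0$.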


6. Hashing Perspective: Sinusoidal PE as Locality-Sensitive Hash

Idea: Similar positions → similar PE vectors

from sklearn.metrics.pairwise import cosine_similarity

# Rebuild the PE table with more positions so that pos 100 and 105 exist
pe_long = plot_sinusoidal_pe(d_model=16, max_pos=200)

near = cosine_similarity(pe_long[100:101].numpy(), pe_long[105:106].numpy())[0][0]
far = cosine_similarity(pe_long[100:101].numpy(), pe_long[180:181].numpy())[0][0]
print(f"Cosine sim(pos=100, 105) = {near:.3f}")  # nearby positions: clearly higher
print(f"Cosine sim(pos=100, 180) = {far:.3f}")   # distant positions: lower similarity

LSH-like property:
$ PE_i \cdot PE_j = \sum_k \cos(\omega_k (i-j)) $ depends only on the offset, peaks at $ i = j $, and falls off for nearby offsets
Attention can infer distance without explicit position IDs


7. Learned Positional Encoding

import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)        # one trainable vector per position
        nn.init.normal_(self.pe.weight, std=0.02)

    def forward(self, x):
        seq_len = x.size(1)                             # x: (batch, seq_len, d_model)
        pos = torch.arange(seq_len, device=x.device)
        return x + self.pe(pos)
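
A quick usage check (the batch size and dimensions here are arbitrary):

pe = LearnedPositionalEncoding(d_model=16, max_len=512)
x = torch.randn(2, 10, 16)     # (batch, seq_len, d_model) token embeddings
print(pe(x).shape)             # torch.Size([2, 10, 16]); positions 0..9 were added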

8. Sinusoidal vs Learned: Ablation Study

import torch.optim as optim

# NOTE: TransformerBlock is the self-attention block from the previous module.
# Assumed interface: TransformerBlock(d_model, num_heads, use_learned_pe) adds the
# chosen positional encoding internally and returns (output, attention_weights).
def train_copy_task(model_cls, use_learned_pe=False, max_len=20):
    model = nn.Sequential(
        nn.Embedding(10, 16),
        model_cls(d_model=16, num_heads=4, use_learned_pe=use_learned_pe),
        nn.Linear(16, 10)
    )
    opt = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    losses = []
    for step in range(300):
        src = torch.randint(0, 5, (32, max_len))   # random token ids; the task is to reproduce them
        tgt = src.clone()

        logits = model[0](src)          # token embeddings
        logits = model[1](logits)[0]    # block returns (output, attn_weights); keep the output
        logits = model[2](logits)       # project back to the vocabulary

        loss = criterion(logits.view(-1, 10), tgt.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())

    return losses

# Run both
loss_sine = train_copy_task(TransformerBlock, use_learned_pe=False)
loss_learned = train_copy_task(TransformerBlock, use_learned_pe=True)

plt.plot(loss_sine, label="Sinusoidal PE")
plt.plot(loss_learned, label="Learned PE")
plt.legend()
plt.title("Copy Task: Sinusoidal vs Learned PE")
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.show()

Result:
- Sinusoidal: Faster convergence, better generalization
- Learned: Can overfit to training length
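
The runs above assume TransformerBlock from the previous module. If you do not have it handy, here is a minimal, hypothetical stand-in matching the interface used above (adds the chosen PE, applies one self-attention layer, returns (output, attention_weights)); it is a sketch, not the original module's code:

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, use_learned_pe=False, max_len=512):
        super().__init__()
        # PE classes are defined in the Full Code section below
        self.pe = (LearnedPositionalEncoding(d_model, max_len) if use_learned_pe
                   else SinusoidalPositionalEncoding(d_model, max_len))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        x = self.pe(x)
        attn_out, attn_weights = self.attn(x, x, x)
        return self.norm(x + attn_out), attn_weights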


9. Extrapolation Test: Can It Handle Longer Sequences?

# Train on max_len=20
model_sine = ...  # trained with sinusoidal
model_learned = ...  # trained with learned (max_len=20)

# Test on length 50
long_seq = torch.randint(0, 5, (1, 50))
with torch.no_grad():
    out_sine = model_sine(long_seq)
    # out_learned → IndexError! (Embedding size = 20)

Sinusoidal: Works for any length
Learned: Limited to training length
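
The same point can be made with the PE modules alone, without a trained model (using the classes from the Full Code section below):

sin_pe = SinusoidalPositionalEncoding(d_model=16, max_len=5000)
learned_pe = LearnedPositionalEncoding(d_model=16, max_len=20)   # table sized to the training length

x = torch.zeros(1, 50, 16)                 # a 50-token sequence, longer than max_len=20
print(sin_pe(x).shape)                     # torch.Size([1, 50, 16]): any length works
try:
    learned_pe(x)
except IndexError as err:                  # positions 20..49 have no embedding row
    print("Learned PE failed:", err)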


10. Hashing Analogy: PE as Embedding Hash

Concept               | Sinusoidal PE                | Learned PE
Hash Function         | $ \sin(pos \cdot \omega_i) $ | table lookup $ E[pos] $
Collisions            | Smooth (graded similarity)   | Discrete (exact index only)
Range                 | $ [-1, 1]^d $                | $ \mathbb{R}^d $
Similarity vs. offset | decays with $ |i-j| $        | none built in

Sinusoidal = continuous LSH
Learned = perfect hash (but limited domain)


11. Advanced: Rotary Positional Embedding (RoPE)

Used in LLaMA and PaLM: encodes position by rotating queries and keys, so relative position falls out of their dot product

def apply_rotary_emb(q, k, freqs):
    # q, k: (B, H, N, d_k); freqs: (N, d_k) angles pos * omega_i
    # (half-split pair layout assumed: each angle appears once per half)
    def rotate_half(x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([-x2, x1], dim=-1)

    cos, sin = freqs.cos(), freqs.sin()
    q_rot = q * cos + rotate_half(q) * sin   # 2D rotation of each (x_i, x_{i+d_k/2}) pair
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot

Encodes absolute position as a rotation in 2D sub-planes of q and k; dot products between the rotated vectors then depend only on relative position
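
For reference, here is one way the freqs table above could be built; the helper name build_rope_freqs and the base 10000 are assumptions chosen to mirror the sinusoidal schedule, not code from a specific library:

def build_rope_freqs(seq_len, d_k, base=10000.0):
    # Per-pair frequencies omega_i = base^(-2i/d_k), the same schedule as sinusoidal PE
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2).float() / d_k))    # (d_k/2,)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]   # (N, d_k/2)
    return torch.cat([angles, angles], dim=-1)                            # (N, d_k), half-split layout

# Example: rotate random queries/keys for a 10-token sequence
q = torch.randn(2, 4, 10, 16)   # (B, H, N, d_k)
k = torch.randn(2, 4, 10, 16)
q_rot, k_rot = apply_rotary_emb(q, k, build_rope_freqs(seq_len=10, d_k=16))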


12. Summary Table

Feature       | Sinusoidal                  | Learned         | RoPE
Learnable     | No                          | Yes             | No
Max Length    | Infinite                    | Fixed           | Infinite
Relative Pos  | Yes (via dot product)       | No              | Yes (by construction)
Signal Theory | Fourier basis               | Arbitrary table | Rotation
Hashing       | LSH-like                    | Perfect hash    | Geometric
Used In       | Original Transformer (2017) | BERT, GPT-2     | LLaMA, PaLM

13. Visualization: PE Heatmap

import seaborn as sns

pe_sine = SinusoidalPositionalEncoding(128, 100).pe[0].cpu().numpy()                # (100, 128)
pe_learned = LearnedPositionalEncoding(128, 100).pe.weight.detach().cpu().numpy()   # (100, 128)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(pe_sine, ax=ax1, cmap="RdYlBu", center=0)
sns.heatmap(pe_learned, ax=ax2, cmap="RdYlBu", center=0)
ax1.set_title("Sinusoidal PE")
ax2.set_title("Learned PE (Random Init)")
plt.show()

14. Practice Exercises

  1. Fourier Analysis: Compute the FFT of a PE dimension across positions (a starting sketch follows this list).
  2. Hash Collision: Measure cosine sim for $ |i-j| = 1, 5, 10 $.
  3. Ablation: Train the copy task without any PE and check how far accuracy falls (it should drop toward chance level).
  4. Hybrid: Use sinusoidal + learned (T5-style).
  5. RoPE: Implement and compare with sinusoidal.
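
A possible starting point for exercise 1, assuming the pe_long table built in section 6 is still in scope:

# Spectrum of a single PE dimension across positions
spectrum = torch.fft.rfft(pe_long[:, 0])               # dim 0 is sin(pos * 1), the fastest wave
freq_bins = torch.fft.rfftfreq(pe_long.size(0))        # in cycles per position
peak = freq_bins[spectrum.abs().argmax()].item()
print(f"Dominant frequency of dim 0: {peak:.3f} cycles/position")   # expect about 0.16 = 1/(2*pi)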

15. Key Takeaways

- Sinusoidal PE = Fourier basis + LSH
- Learned PE = flexible but length-limited
- Relative position emerges from dot product
- Sinusoidal generalizes to any length
- RoPE = modern geometric alternative

Full Code: Sinusoidal vs Learned

import torch
import torch.nn as nn

# === Sinusoidal ===
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# === Learned ===
class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)
    def forward(self, x):
        pos = torch.arange(x.size(1), device=x.device)
        return x + self.pe(pos)
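
One way to see the "Learnable? No vs Yes" distinction from section 2 directly: the sinusoidal table is a registered buffer, while the learned table is a trainable parameter.

sin_pe = SinusoidalPositionalEncoding(32, max_len=100)
lrn_pe = LearnedPositionalEncoding(32, max_len=100)
print(sum(p.numel() for p in sin_pe.parameters()))   # 0: nothing to train
print(sum(p.numel() for p in lrn_pe.parameters()))   # 3200 = 100 positions * 32 dims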

Final Words

Positional Encoding is not just a hack
→ It’s signal processing, hashing, and geometry in disguise.

You now understand:
- Why sinusoidal works
- Why learned fails to extrapolate
- How relative position emerges
- Modern RoPE alternative


End of Module
You control time in neural networks.
Next: Stack 12 layers → build a Transformer!

Last updated: Nov 13, 2025

"Attention is All You Need" — Positional Encoding

Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE

"Attention is All You Need" — Positional Encoding

Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE

"Attention is All You Need" — Positional Encoding

Complete Module: Hash Functions, Signal Processing, Sinusoidal vs Learned PE


Module Objective

Deep dive into Positional Encodingsignal processing, hashing, Fourier theory, and Sinusoidal vs Learned — with math, code, visualization, and ablation.


1. The Problem: Attention is Permutation-Invariant

X = ["the", "cat", "sat"]
Attention(X) == Attention(["sat", "cat", "the"])

No order → no meaning


2. Two Solutions

Type Mechanism Learnable? Max Length
Sinusoidal (Fixed) Wave functions No Infinite
Learned (Trainable) Embedding table Yes Fixed

3. Sinusoidal PE — Signal Processing View

Formula (Original Paper)

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)
$$

Each dimension = a sine wave with different frequency


4. Signal Processing Interpretation

import torch
import matplotlib.pyplot as plt
import numpy as np

def plot_sinusoidal_pe(d_model=16, max_pos=20):
    pos = torch.arange(max_pos).unsqueeze(1)
    i = torch.arange(0, d_model, 2)
    div_term = torch.exp(i * -torch.log(torch.tensor(10000.0)) / d_model)
    pe_even = torch.sin(pos * div_term)
    pe_odd = torch.cos(pos * div_term)

    pe = torch.zeros(max_pos, d_model)
    pe[:, 0::2] = pe_even
    pe[:, 1::2] = pe_odd

    plt.figure(figsize=(12, 6))
    for dim in range(0, d_model, 2):
        plt.plot(pos, pe[:, dim], label=f"dim {dim}" if dim < 6 else "")
    plt.legend()
    plt.xlabel("Position")
    plt.ylabel("PE Value")
    plt.title("Sinusoidal PE: Different Frequencies per Dimension")
    plt.grid(True, alpha=0.3)
    plt.show()

plot_sinusoidal_pe()

Low dims → slow waves → long-range patterns
High dims → fast waves → fine-grained local patterns


5. Fourier Basis: Why It Works

Any smooth function can be represented as sum of sines/cosines
PE spans a rich frequency space

# Relative distance encoding
pos_i, pos_j = 5, 10
pe_i = pe[pos_i]
pe_j = pe[pos_j]

# Dot product peaks at fixed relative distance
dist = 5
correlations = []
for offset in range(-10, 11):
    if 0 <= pos_i + offset < max_pos:
        corr = torch.dot(pe[pos_i], pe[pos_i + offset])
        correlations.append((offset, corr.item()))

offsets, corrs = zip(*correlations)
plt.plot(offsets, corrs, 'o-')
plt.title("PE Correlation vs Relative Position")
plt.xlabel("Position Offset")
plt.ylabel("Dot Product")
plt.show()

Model can compute relative position via dot product!


6. Hashing Perspective: Sinusoidal PE as Locality-Sensitive Hash

Idea: Similar positions → similar PE vectors

from sklearn.metrics.pairwise import cosine_similarity

pos1, pos2 = 100, 105
pe1 = pe[pos1].unsqueeze(0)
pe2 = pe[pos2].unsqueeze(0)
sim = cosine_similarity(pe1.numpy(), pe2.numpy())[0][0]
print(f"Cosine sim(pos=100, 105) = {sim:.3f}")  # ~0.999

LSH property:
$ \text{sim}(PE_i, PE_j) \propto \exp(-|i-j|) $
Attention can infer distance without explicit position IDs


7. Learned Positional Encoding

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)
        nn.init.normal_(self.pe.weight, std=0.02)

    def forward(self, x):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, device=x.device)
        return x + self.pe(pos)

8. Sinusoidal vs Learned: Ablation Study

import torch.optim as optim

def train_copy_task(model_cls, use_learned_pe=False, max_len=20):
    model = nn.Sequential(
        nn.Embedding(10, 16),
        model_cls(d_model=16, num_heads=4, use_learned_pe=use_learned_pe),
        nn.Linear(16, 10)
    )
    opt = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    losses = []
    for epoch in range(300):
        src = torch.randint(0, 5, (32, max_len))
        tgt = src.clone()

        logits = model[0](src)
        logits = model[1](logits)[0]
        logits = model[2](logits)

        loss = criterion(logits.view(-1, 10), tgt.view(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())

    return losses

# Run both
loss_sine = train_copy_task(TransformerBlock, use_learned_pe=False)
loss_learned = train_copy_task(TransformerBlock, use_learned_pe=True)

plt.plot(loss_sine, label="Sinusoidal PE")
plt.plot(loss_learned, label="Learned PE")
plt.legend()
plt.title("Copy Task: Sinusoidal vs Learned PE")
plt.xlabel("Training Step")
plt.ylabel("Loss")
plt.show()

Result:
- Sinusoidal: Faster convergence, better generalization
- Learned: Can overfit to training length


9. Extrapolation Test: Can It Handle Longer Sequences?

# Train on max_len=20
model_sine = ...  # trained with sinusoidal
model_learned = ...  # trained with learned (max_len=20)

# Test on length 50
long_seq = torch.randint(0, 5, (1, 50))
with torch.no_grad():
    out_sine = model_sine(long_seq)
    # out_learned → IndexError! (Embedding size = 20)

Sinusoidal: Works for any length
Learned: Limited to training length


10. Hashing Analogy: PE as Embedding Hash

Concept Sinusoidal PE Learned PE
Hash Function $ \sin(pos \cdot \omega_i) $ $ E[pos] $
Collision Smooth Discrete
Range $ \mathbb{R} $ $ \mathbb{R}^d $
Collision Probability $ \propto \exp(- i-j

Sinusoidal = continuous LSH
Learned = perfect hash (but limited domain)


11. Advanced: Rotary Positional Embedding (RoPE)

Used in LLaMA, PaLMrelative + rotation

def apply_rotary_emb(q, k, freqs):
    # q, k: (B, H, N, d_k)
    q_real, q_imag = q[..., :d_k//2], q[..., d_k//2:]
    k_real, k_imag = k[..., :d_k//2], k[..., d_k//2:]

    # Rotate
    q_rot = torch.cat([-q_imag, q_real], dim=-1) * freqs
    k_rot = torch.cat([-k_imag, k_real], dim=-1) * freqs

    return q_rot + q, k_rot + k

Preserves absolute position via rotation in complex plane


12. Summary Table

Feature Sinusoidal Learned RoPE
Learnable No Yes No
Max Length Infinite Fixed Infinite
Relative Pos Yes (via dot) No Yes (explicit)
Signal Theory Fourier basis Arbitrary Rotation
Hashing LSH Perfect Geometric
Used In GPT-2, BERT Early Transformers LLaMA, PaLM

13. Visualization: PE Heatmap

pe_sine = SinusoidalPositionalEncoding(128, 100).pe[0].cpu().numpy()
pe_learned = LearnedPositionalEncoding(128, 100).pe.weight.detach().cpu().numpy()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
sns.heatmap(pe_sine, ax=ax1, cmap="RdYlBu", center=0)
sns.heatmap(pe_learned, ax=ax2, cmap="RdYlBu", center=0)
ax1.set_title("Sinusoidal PE")
ax2.set_title("Learned PE (Random Init)")
plt.show()

14. Practice Exercises

  1. Fourier Analysis: Compute FFT of PE across positions.
  2. Hash Collision: Measure cosine sim for $ |i-j| = 1, 5, 10 $.
  3. Ablation: Train without PE → accuracy drops to ~10%.
  4. Hybrid: Use sinusoidal + learned (T5-style).
  5. RoPE: Implement and compare with sinusoidal.

15. Key Takeaways

Check Insight
Check Sinusoidal PE = Fourier basis + LSH
Check Learned PE = flexible but length-limited
Check Relative position emerges from dot product
Check Sinusoidal generalizes to any length
Check RoPE = modern geometric alternative

Full Code: Sinusoidal vs Learned

import torch
import torch.nn as nn

# === Sinusoidal ===
class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(0, max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * -(torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe.unsqueeze(0))
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# === Learned ===
class LearnedPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)
    def forward(self, x):
        pos = torch.arange(x.size(1), device=x.device)
        return x + self.pe(pos)

Final Words

Positional Encoding is not just a hack
→ It’s signal processing, hashing, and geometry in disguise.

You now understand:
- Why sinusoidal works
- Why learned fails to extrapolate
- How relative position emerges
- Modern RoPE alternative


End of Module
You control time in neural networks.
Next: Stack 12 layers → build a Transformer!

Last updated: Nov 13, 2025