Advanced SQL Window Functions for Data Science
(Phase 1.5 – Week 3 Deep Dive | 1 Week | 6–8 hrs/day)
Goal: Master window functions, an essential SQL skill for data scientists and analysts.
A staple of hard SQL interview questions
Enables rankings, running totals, cohorts, funnels, and time-series analysis
Replaces complex self-joins and subqueries
Why Window Functions?
| Without Window | With Window |
|---|---|
| 3+ subqueries + CTEs | 1 clean query |
| Hard to read | Elegant & fast |
| Slow performance | Optimized by DB engine |
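To make the contrast concrete, here is a minimal sketch (assuming the Chinook Invoice table in SQLite): the latest invoice per customer, written first with a correlated subquery and then with a window function.
-- Without window functions: correlated subquery that re-scans Invoice per customer
SELECT i.CustomerId, i.InvoiceId, i.InvoiceDate, i.Total
FROM Invoice i
WHERE i.InvoiceDate = (
    SELECT MAX(i2.InvoiceDate)
    FROM Invoice i2
    WHERE i2.CustomerId = i.CustomerId
);

-- With a window function: one pass, one ranking column
SELECT CustomerId, InvoiceId, InvoiceDate, Total
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY InvoiceDate DESC) AS rn
    FROM Invoice
) ranked
WHERE rn = 1;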
Core Syntax
function() OVER (
[PARTITION BY col1, col2]
[ORDER BY col3]
[ROWS/RANGE frame_spec]
)
| Clause | Purpose |
|---|---|
| PARTITION BY | Groups rows (like GROUP BY, but without collapsing them) |
| ORDER BY | Defines the sequence within each partition |
| ROWS BETWEEN | Defines a sliding frame (e.g., the last 3 rows) |
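A minimal sketch combining all three clauses on Chinook's Invoice table (SQLite syntax; the alias names are illustrative):
SELECT
    CustomerId,
    InvoiceDate,
    Total,
    AVG(Total) OVER (
        PARTITION BY CustomerId                    -- restart the window per customer
        ORDER BY InvoiceDate                       -- walk invoices chronologically
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW   -- frame: this row and the two before it
    ) AS moving_avg_3
FROM Invoice;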
1. Ranking Functions
| Function | Use | Gaps? | Same Rank for Ties? |
|---|---|---|---|
| ROW_NUMBER() | Unique sequential ID | No | No — ties get different numbers |
| RANK() | Competition ranking | Yes | Yes — ties share a rank, then a gap |
| DENSE_RANK() | Ranking with no gaps | No | Yes — ties share a rank, no gap |
| NTILE(n) | Quartiles, deciles | — | — |
Example: Rank Tracks by Length per Genre
SELECT
t.Name,
g.Name AS genre,
t.Milliseconds,
ROW_NUMBER() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS row_num,
RANK() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS rank_gap,
DENSE_RANK() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS dense_rank
FROM Track t
JOIN Genre g ON t.GenreId = g.GenreId
WHERE g.Name = 'Rock'
ORDER BY t.Milliseconds DESC
LIMIT 10;
| Name | Milliseconds | row_num | rank_gap | dense_rank |
|---|---|---|---|---|
| Occupation | 5286953 | 1 | 1 | 1 |
| Dazed and Confused | 4193280 | 2 | 2 | 2 |
| Dazed and Confused | 4193280 | 3 | 2 | 2 |
2. Aggregate Window Functions
Running Total (Cumulative Sum)
-- Monthly revenue with running total
SELECT
strftime('%Y-%m', InvoiceDate) AS month,
SUM(Total) AS monthly_revenue,
SUM(SUM(Total)) OVER (ORDER BY strftime('%Y-%m', InvoiceDate)) AS running_total
FROM Invoice
GROUP BY month
ORDER BY month;
Moving Average (Last 3 Months)
SELECT
month,
monthly_revenue,
AVG(monthly_revenue) OVER (
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3m
FROM (
SELECT strftime('%Y-%m', InvoiceDate) AS month, SUM(Total) AS monthly_revenue
FROM Invoice
GROUP BY month
) sub;
3. Lead / Lag – Time Series & Funnels
-- Next purchase date for each customer
SELECT
CustomerId,
InvoiceDate,
LEAD(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS next_purchase,
JULIANDAY(LEAD(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate))
- JULIANDAY(InvoiceDate) AS days_to_next
FROM Invoice
ORDER BY CustomerId, InvoiceDate;
Use Case:
Customer retention, churn prediction, session analysis
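For example, a hedged sketch of a retention-style flag using LAG (the 60-day threshold is an arbitrary choice for illustration):
-- Flag purchases that arrive more than 60 days after the previous one
SELECT
    CustomerId,
    InvoiceDate,
    JULIANDAY(InvoiceDate)
      - JULIANDAY(LAG(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate)) AS days_since_prev,
    CASE
        WHEN JULIANDAY(InvoiceDate)
             - JULIANDAY(LAG(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate)) > 60
        THEN 1 ELSE 0
    END AS long_gap_flag
FROM Invoice
ORDER BY CustomerId, InvoiceDate;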
4. First / Last Value
-- First and last purchase per customer
SELECT DISTINCT
CustomerId,
FIRST_VALUE(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS first_purchase,
LAST_VALUE(InvoiceDate) OVER (
PARTITION BY CustomerId
ORDER BY InvoiceDate
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_purchase
FROM Invoice;
Warning:
LAST_VALUE needs the full frame (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING). With the default frame, the window ends at the current row, so you just get the current row's value back.
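A minimal sketch of the pitfall; the column aliases are illustrative. The first LAST_VALUE uses the default frame and therefore only sees rows up to the current one:
SELECT
    CustomerId,
    InvoiceDate,
    LAST_VALUE(InvoiceDate) OVER (
        PARTITION BY CustomerId ORDER BY InvoiceDate
    ) AS last_purchase_broken,      -- default frame ends at the current row
    LAST_VALUE(InvoiceDate) OVER (
        PARTITION BY CustomerId ORDER BY InvoiceDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_purchase_correct      -- full frame sees the whole partition
FROM Invoice;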
5. Frame Specification (Advanced)
| Frame | Meaning |
|---|---|
| UNBOUNDED PRECEDING | From the start of the partition |
| CURRENT ROW | Up to and including this row |
| UNBOUNDED FOLLOWING | To the end of the partition |
| n PRECEDING | The n rows before the current row |
-- 3-month trailing average
AVG(revenue) OVER (
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
)
Real-World Business Problems (Solved with Window Functions)
1. Customer Lifetime Value (CLV) with Cohorts
WITH first_purchase AS (
SELECT
CustomerId,
MIN(InvoiceDate) AS cohort_date
FROM Invoice
GROUP BY CustomerId
),
monthly_spend AS (
SELECT
i.CustomerId,
strftime('%Y-%m', i.InvoiceDate) AS invoice_month,
strftime('%Y-%m', fp.cohort_date) AS cohort_month,
SUM(i.Total) AS spend
FROM Invoice i
JOIN first_purchase fp ON i.CustomerId = fp.CustomerId
GROUP BY i.CustomerId, invoice_month, cohort_month
)
SELECT
cohort_month,
invoice_month,
COUNT(DISTINCT CustomerId) AS customers,
SUM(spend) AS revenue,
ROUND(AVG(spend), 2) AS arpu
FROM monthly_spend
GROUP BY cohort_month, invoice_month
ORDER BY cohort_month, invoice_month;
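A common follow-up (see the assessment checklist) is a month index per cohort: month 0, 1, 2, … since first purchase. A self-contained sketch, assuming SQLite date functions:
-- Active customers per cohort and month index (month 0 = cohort month)
WITH first_purchase AS (
    SELECT CustomerId, MIN(InvoiceDate) AS cohort_date
    FROM Invoice
    GROUP BY CustomerId
)
SELECT
    strftime('%Y-%m', fp.cohort_date) AS cohort_month,
    (CAST(strftime('%Y', i.InvoiceDate) AS INTEGER) - CAST(strftime('%Y', fp.cohort_date) AS INTEGER)) * 12
      + (CAST(strftime('%m', i.InvoiceDate) AS INTEGER) - CAST(strftime('%m', fp.cohort_date) AS INTEGER)) AS month_index,
    COUNT(DISTINCT i.CustomerId) AS active_customers
FROM Invoice i
JOIN first_purchase fp ON i.CustomerId = fp.CustomerId
GROUP BY cohort_month, month_index
ORDER BY cohort_month, month_index;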
2. Sessionization (Web Analytics)
-- Assign session ID when gap > 30 min
WITH flagged AS (
SELECT
user_id,
event_time,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time,
CASE
WHEN JULIANDAY(event_time) - JULIANDAY(LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)) > 30.0 / 1440  -- 30 minutes, expressed in days
THEN 1 ELSE 0
END AS new_session
FROM events
),
sessioned AS (
SELECT *,
SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM flagged
)
SELECT user_id, session_id, MIN(event_time) AS session_start, MAX(event_time) AS session_end
FROM sessioned
GROUP BY user_id, session_id;
3. Top N per Group (Interview Classic)
-- Top 3 longest tracks per genre
WITH ranked AS (
SELECT
t.*,
g.Name AS genre,
ROW_NUMBER() OVER (PARTITION BY g.Name ORDER BY Milliseconds DESC) AS rn
FROM Track t
JOIN Genre g ON t.GenreId = g.GenreId
)
SELECT TrackId, Name, genre, Milliseconds
FROM ranked
WHERE rn <= 3
ORDER BY genre, rn;
Interview Questions (Solve These!)
| # | Question | Hint |
|---|---|---|
| 1 | Find 2nd highest salary per department | DENSE_RANK() |
| 2 | Running total of sales | SUM() OVER (ORDER BY) |
| 3 | Customers with 3+ consecutive months of purchase | LAG() + flag |
| 4 | % of total revenue per invoice | SUM() OVER() |
| 5 | Identify churned users (no activity in 30 days) | LEAD() |
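For instance, question 4 hinges on an empty OVER (), which spans the whole result set. A minimal sketch on Chinook:
-- Each invoice's share of total revenue
SELECT
    InvoiceId,
    Total,
    ROUND(100.0 * Total / SUM(Total) OVER (), 2) AS pct_of_total_revenue
FROM Invoice
ORDER BY pct_of_total_revenue DESC;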
Practice Plan (7 Days)
| Day | Task |
|---|---|
| 1 | ROW_NUMBER, RANK, DENSE_RANK → 20 queries |
| 2 | SUM, AVG over windows → running totals |
| 3 | LEAD/LAG → time gaps, retention |
| 4 | FIRST_VALUE/LAST_VALUE + frame |
| 5 | Top N per group → solve 10 variations |
| 6 | Cohort analysis on Chinook |
| 7 | Mock interview → explain 3 queries aloud |
Resources
| Resource | Link |
|---|---|
| Mode Analytics – Window Functions | mode.com/sql-tutorial/sql-window-functions |
| PostgreSQL Docs | postgresql.org/docs/current/tutorial-window.html |
| LeetCode: Window Problems | Search "window" in SQL |
| DB Fiddle | db-fiddle.com |
Final Project: Advanced Analytics Dashboard (SQL Only)
Repo: yourname/sql-window-mastery
-- 01_running_revenue.sql
-- 02_top3_per_genre.sql
-- 03_customer_retention_cohort.sql
-- 04_sessionization_example.sql
-- README.md
README.md Snippet
# SQL Window Functions Mastery
## Key Queries
- **Running Revenue**: Cumulative sales over time
- **Top 3 Tracks per Genre**: Using `ROW_NUMBER()`
- **Cohort Retention**: Monthly active users by signup cohort
- **Sessionization**: Group events into sessions
## How to Run
```bash
# SQLite
sqlite3 chinook.db < 01_running_revenue.sql
```
---
## Cheat Sheet (Copy-Paste)
```sql
-- RANKING
ROW_NUMBER() OVER (PARTITION BY x ORDER BY y)
RANK() OVER (PARTITION BY x ORDER BY y DESC)
DENSE_RANK() OVER (PARTITION BY x ORDER BY y DESC)
-- AGGREGATE
SUM(col) OVER (PARTITION BY cat ORDER BY date) AS running_total
AVG(col) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3
-- LEAD/LAG
LAG(col, 1) OVER (PARTITION BY user ORDER BY time) AS prev_val
LEAD(col) OVER (PARTITION BY user ORDER BY time) AS next_val
-- FRAME
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING
```
Assessment: Can You Write This?
| Query | Yes/No |
|---|---|
| Rank employees by sales, no gaps | ☐ |
| 3-month moving average | ☐ |
| Next purchase date per customer | ☐ |
| Top 2 invoices per day | ☐ |
| Cohort table (month 0,1,2…) | ☐ |
All Yes → You’re a Window Function Ninja!
Next: Phase 2 – Statistics & Math
Now you can extract and rank — time to explain with stats.
Start Now:
1. Open Chinook.db
2. Run:
SELECT
CustomerId,
InvoiceDate,
Total,
SUM(Total) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS running_spend
FROM Invoice
ORDER BY CustomerId, InvoiceDate
LIMIT 10;
Tag me when you push your window function repo!
You’re now ahead of the vast majority of SQL users.