Advanced SQL Window Functions for Data Science

(Phase 1.5 – Week 3 Deep Dive | 1 Week | 6–8 hrs/day) Goal: Master Window Functions — the secret weapon of top Data Scientists & Analysts. Used in 70% of hard SQL interviews Enables rankings, running totals, cohorts, funnels, time series Replaces complex self-joins & subqueries

Advanced SQL Window Functions for Data Science

Advanced SQL Window Functions for Data Science

Advanced SQL Window Functions for Data Science

(Phase 1.5 – Week 3 Deep Dive | 1 Week | 6–8 hrs/day)

Goal: Master Window Functions — the secret weapon of top Data Scientists & Analysts.

Used in 70% of hard SQL interviews
Enables rankings, running totals, cohorts, funnels, time series
Replaces complex self-joins & subqueries


Why Window Functions?

Without Window With Window
3+ subqueries + CTEs 1 clean query
Hard to read Elegant & fast
Slow performance Optimized by DB engine

Core Syntax

function() OVER (
    [PARTITION BY col1, col2]
    [ORDER BY col3]
    [ROWS/RANGE frame_spec]
)
Clause Purpose
PARTITION BY Group (like GROUP BY)
ORDER BY Define sequence
ROWS BETWEEN Sliding window (e.g., last 3 rows)

1. Ranking Functions

Function Use Gaps? Same Rank
ROW_NUMBER() Unique ID No Different
RANK() Competition ranking Yes Same → gap
DENSE_RANK() No gaps No Same → no gap
NTILE(n) Quartiles, deciles

Example: Rank Tracks by Length per Genre

SELECT 
    t.Name,
    g.Name AS genre,
    t.Milliseconds,
    ROW_NUMBER() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS row_num,
    RANK() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS rank_gap,
    DENSE_RANK() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS dense_rank
FROM Track t
JOIN Genre g ON t.GenreId = g.GenreId
WHERE g.Name = 'Rock'
LIMIT 10;
Name Milliseconds row_num rank_gap dense_rank
Occupation 5286953 1 1 1
Dazed and Confused 4193280 2 2 2
Dazed and Confused 4193280 3 2 2

2. Aggregate Window Functions

Running Total (Cumulative Sum)

-- Monthly revenue with running total
SELECT 
    strftime('%Y-%m', InvoiceDate) AS month,
    SUM(Total) AS monthly_revenue,
    SUM(SUM(Total)) OVER (ORDER BY strftime('%Y-%m')) AS running_total
FROM Invoice
GROUP BY month
ORDER BY month;

Moving Average (Last 3 Months)

SELECT 
    month,
    monthly_revenue,
    AVG(monthly_revenue) OVER (
        ORDER BY month 
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3m
FROM (
    SELECT strftime('%Y-%m') AS month, SUM(Total) AS monthly_revenue
    FROM Invoice
    GROUP BY month
) sub;

3. Lead / Lag – Time Series & Funnels

-- Next purchase date for each customer
SELECT 
    CustomerId,
    InvoiceDate,
    LEAD(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS next_purchase,
    JULIANDAY(LEAD(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate)) 
        - JULIANDAY(InvoiceDate) AS days_to_next
FROM Invoice
ORDER BY CustomerId, InvoiceDate;

Use Case:

Customer retention, churn prediction, session analysis


4. First / Last Value

-- First and last purchase per customer
SELECT DISTINCT
    CustomerId,
    FIRST_VALUE(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS first_purchase,
    LAST_VALUE(InvoiceDate) OVER (
        PARTITION BY CustomerId 
        ORDER BY InvoiceDate 
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_purchase
FROM Invoice;

Warning: LAST_VALUE needs full frame!


5. Frame Specification (Advanced)

Frame Meaning
UNBOUNDED PRECEDING From start
CURRENT ROW This row
UNBOUNDED FOLLOWING To end
n PRECEDING Last n rows
-- 3-month trailing average
AVG(revenue) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
)

Real-World Business Problems (Solved with Window)


1. Customer Lifetime Value (CLV) with Cohorts

WITH first_purchase AS (
    SELECT 
        CustomerId,
        MIN(InvoiceDate) AS cohort_date
    FROM Invoice
    GROUP BY CustomerId
),
monthly_spend AS (
    SELECT 
        c.CustomerId,
        strftime('%Y-%m', i.InvoiceDate) AS invoice_month,
        strftime('%Y-%m', fp.cohort_date) AS cohort_month,
        SUM(i.Total) AS spend
    FROM Invoice i
    JOIN first_purchase fp ON i.CustomerId = fp.CustomerId
    GROUP BY c.CustomerId, invoice_month, cohort_month
)
SELECT 
    cohort_month,
    invoice_month,
    COUNT(DISTINCT CustomerId) AS customers,
    SUM(spend) AS revenue,
    ROUND(AVG(spend), 2) AS arpu
FROM monthly_spend
GROUP BY cohort_month, invoice_month
ORDER BY cohort_month, invoice_month;

2. Sessionization (Web Analytics)

-- Assign session ID when gap > 30 min
WITH flagged AS (
    SELECT 
        user_id,
        event_time,
        LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time,
        CASE 
            WHEN JULIANDAY(event_time) - JULIANDAY(LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)) > 30/1440 
            THEN 1 ELSE 0 
        END AS new_session
    FROM events
),
sessioned AS (
    SELECT *,
           SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
    FROM flagged
)
SELECT user_id, session_id, MIN(event_time), MAX(event_time)
FROM sessioned
GROUP BY user_id, session_id;

3. Top N per Group (Interview Classic)

-- Top 3 longest tracks per genre
WITH ranked AS (
    SELECT 
        t.*,
        g.Name AS genre,
        ROW_NUMBER() OVER (PARTITION BY g.Name ORDER BY Milliseconds DESC) AS rn
    FROM Track t
    JOIN Genre g ON t.GenreId = g.GenreId
)
SELECT TrackId, Name, genre, Milliseconds
FROM ranked
WHERE rn <= 3
ORDER BY genre, rn;

Interview Questions (Solve These!)

# Question Hint
1 Find 2nd highest salary per department DENSE_RANK()
2 Running total of sales SUM() OVER (ORDER BY)
3 Customers with 3+ consecutive months of purchase LAG() + flag
4 % of total revenue per invoice SUM() OVER()
5 Identify churned users (no activity in 30 days) LEAD()

Practice Plan (7 Days)

Day Task
1 ROW_NUMBER, RANK, DENSE_RANK → 20 queries
2 SUM, AVG over windows → running totals
3 LEAD/LAG → time gaps, retention
4 FIRST_VALUE/LAST_VALUE + frame
5 Top N per group → solve 10 variations
6 Cohort analysis on Chinook
7 Mock interview → explain 3 queries aloud

Resources

Resource Link
Mode Analytics – Window Functions mode.com/sql-tutorial/sql-window-functions
PostgreSQL Docs postgresql.org/docs/current/tutorial-window.html
LeetCode: Window Problems Search "window" in SQL
DB Fiddle db-fiddle.com

Final Project: Advanced Analytics Dashboard (SQL Only)

Repo: yourname/sql-window-mastery

-- 01_running_revenue.sql
-- 02_top3_per_genre.sql
-- 03_customer_retention_cohort.sql
-- 04_sessionization_example.sql
-- README.md

README.md Snippet

# SQL Window Functions Mastery

## Key Queries
- **Running Revenue**: Cumulative sales over time
- **Top 3 Tracks per Genre**: Using `ROW_NUMBER()`
- **Cohort Retention**: Monthly active users by signup cohort
- **Sessionization**: Group events into sessions

## How to Run
```sql
-- SQLite
sqlite3 chinook.db < 01_running_revenue.sql
---

## Cheat Sheet (Copy-Paste)

```sql
-- RANKING
ROW_NUMBER() OVER (PARTITION BY x ORDER BY y)
RANK() OVER (PARTITION BY x ORDER BY y DESC)
DENSE_RANK() OVER (PARTITION BY x ORDER BY y DESC)

-- AGGREGATE
SUM(col) OVER (PARTITION BY cat ORDER BY date) AS running_total
AVG(col) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3

-- LEAD/LAG
LAG(col, 1) OVER (PARTITION BY user ORDER BY time) AS prev_val
LEAD(col) OVER (PARTITION BY user ORDER BY time) AS next_val

-- FRAME
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING

Assessment: Can You Write This?

Query Yes/No
Rank employees by sales, no gaps
3-month moving average
Next purchase date per customer
Top 2 invoices per day
Cohort table (month 0,1,2…)

All Yes → You’re a Window Function Ninja!


Next: Phase 2 – Statistics & Math

Now you can extract and rank — time to explain with stats.


Start Now:
1. Open Chinook.db
2. Run:

SELECT 
    CustomerId,
    InvoiceDate,
    Total,
    SUM(Total) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS running_spend
FROM Invoice
ORDER BY CustomerId, InvoiceDate
LIMIT 10;

Tag me when you push your window function repo!
You’re now in the top 10% of SQL users.

Last updated: Nov 09, 2025

Advanced SQL Window Functions for Data Science

(Phase 1.5 – Week 3 Deep Dive | 1 Week | 6–8 hrs/day) Goal: Master Window Functions — the secret weapon of top Data Scientists & Analysts. Used in 70% of hard SQL interviews Enables rankings, running totals, cohorts, funnels, time series Replaces complex self-joins & subqueries

Advanced SQL Window Functions for Data Science

Advanced SQL Window Functions for Data Science

Advanced SQL Window Functions for Data Science

(Phase 1.5 – Week 3 Deep Dive | 1 Week | 6–8 hrs/day)

Goal: Master Window Functions — the secret weapon of top Data Scientists & Analysts.

Used in 70% of hard SQL interviews
Enables rankings, running totals, cohorts, funnels, time series
Replaces complex self-joins & subqueries


Why Window Functions?

Without Window With Window
3+ subqueries + CTEs 1 clean query
Hard to read Elegant & fast
Slow performance Optimized by DB engine

Core Syntax

function() OVER (
    [PARTITION BY col1, col2]
    [ORDER BY col3]
    [ROWS/RANGE frame_spec]
)
Clause Purpose
PARTITION BY Group (like GROUP BY)
ORDER BY Define sequence
ROWS BETWEEN Sliding window (e.g., last 3 rows)

1. Ranking Functions

Function Use Gaps? Same Rank
ROW_NUMBER() Unique ID No Different
RANK() Competition ranking Yes Same → gap
DENSE_RANK() No gaps No Same → no gap
NTILE(n) Quartiles, deciles

Example: Rank Tracks by Length per Genre

SELECT 
    t.Name,
    g.Name AS genre,
    t.Milliseconds,
    ROW_NUMBER() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS row_num,
    RANK() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS rank_gap,
    DENSE_RANK() OVER (PARTITION BY g.Name ORDER BY t.Milliseconds DESC) AS dense_rank
FROM Track t
JOIN Genre g ON t.GenreId = g.GenreId
WHERE g.Name = 'Rock'
LIMIT 10;
Name Milliseconds row_num rank_gap dense_rank
Occupation 5286953 1 1 1
Dazed and Confused 4193280 2 2 2
Dazed and Confused 4193280 3 2 2

2. Aggregate Window Functions

Running Total (Cumulative Sum)

-- Monthly revenue with running total
SELECT 
    strftime('%Y-%m', InvoiceDate) AS month,
    SUM(Total) AS monthly_revenue,
    SUM(SUM(Total)) OVER (ORDER BY strftime('%Y-%m')) AS running_total
FROM Invoice
GROUP BY month
ORDER BY month;

Moving Average (Last 3 Months)

SELECT 
    month,
    monthly_revenue,
    AVG(monthly_revenue) OVER (
        ORDER BY month 
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3m
FROM (
    SELECT strftime('%Y-%m') AS month, SUM(Total) AS monthly_revenue
    FROM Invoice
    GROUP BY month
) sub;

3. Lead / Lag – Time Series & Funnels

-- Next purchase date for each customer
SELECT 
    CustomerId,
    InvoiceDate,
    LEAD(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS next_purchase,
    JULIANDAY(LEAD(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate)) 
        - JULIANDAY(InvoiceDate) AS days_to_next
FROM Invoice
ORDER BY CustomerId, InvoiceDate;

Use Case:

Customer retention, churn prediction, session analysis


4. First / Last Value

-- First and last purchase per customer
SELECT DISTINCT
    CustomerId,
    FIRST_VALUE(InvoiceDate) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS first_purchase,
    LAST_VALUE(InvoiceDate) OVER (
        PARTITION BY CustomerId 
        ORDER BY InvoiceDate 
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_purchase
FROM Invoice;

Warning: LAST_VALUE needs full frame!


5. Frame Specification (Advanced)

Frame Meaning
UNBOUNDED PRECEDING From start
CURRENT ROW This row
UNBOUNDED FOLLOWING To end
n PRECEDING Last n rows
-- 3-month trailing average
AVG(revenue) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
)

Real-World Business Problems (Solved with Window)


1. Customer Lifetime Value (CLV) with Cohorts

WITH first_purchase AS (
    SELECT 
        CustomerId,
        MIN(InvoiceDate) AS cohort_date
    FROM Invoice
    GROUP BY CustomerId
),
monthly_spend AS (
    SELECT 
        c.CustomerId,
        strftime('%Y-%m', i.InvoiceDate) AS invoice_month,
        strftime('%Y-%m', fp.cohort_date) AS cohort_month,
        SUM(i.Total) AS spend
    FROM Invoice i
    JOIN first_purchase fp ON i.CustomerId = fp.CustomerId
    GROUP BY c.CustomerId, invoice_month, cohort_month
)
SELECT 
    cohort_month,
    invoice_month,
    COUNT(DISTINCT CustomerId) AS customers,
    SUM(spend) AS revenue,
    ROUND(AVG(spend), 2) AS arpu
FROM monthly_spend
GROUP BY cohort_month, invoice_month
ORDER BY cohort_month, invoice_month;

2. Sessionization (Web Analytics)

-- Assign session ID when gap > 30 min
WITH flagged AS (
    SELECT 
        user_id,
        event_time,
        LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time,
        CASE 
            WHEN JULIANDAY(event_time) - JULIANDAY(LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)) > 30/1440 
            THEN 1 ELSE 0 
        END AS new_session
    FROM events
),
sessioned AS (
    SELECT *,
           SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
    FROM flagged
)
SELECT user_id, session_id, MIN(event_time), MAX(event_time)
FROM sessioned
GROUP BY user_id, session_id;

3. Top N per Group (Interview Classic)

-- Top 3 longest tracks per genre
WITH ranked AS (
    SELECT 
        t.*,
        g.Name AS genre,
        ROW_NUMBER() OVER (PARTITION BY g.Name ORDER BY Milliseconds DESC) AS rn
    FROM Track t
    JOIN Genre g ON t.GenreId = g.GenreId
)
SELECT TrackId, Name, genre, Milliseconds
FROM ranked
WHERE rn <= 3
ORDER BY genre, rn;

Interview Questions (Solve These!)

# Question Hint
1 Find 2nd highest salary per department DENSE_RANK()
2 Running total of sales SUM() OVER (ORDER BY)
3 Customers with 3+ consecutive months of purchase LAG() + flag
4 % of total revenue per invoice SUM() OVER()
5 Identify churned users (no activity in 30 days) LEAD()

Practice Plan (7 Days)

Day Task
1 ROW_NUMBER, RANK, DENSE_RANK → 20 queries
2 SUM, AVG over windows → running totals
3 LEAD/LAG → time gaps, retention
4 FIRST_VALUE/LAST_VALUE + frame
5 Top N per group → solve 10 variations
6 Cohort analysis on Chinook
7 Mock interview → explain 3 queries aloud

Resources

Resource Link
Mode Analytics – Window Functions mode.com/sql-tutorial/sql-window-functions
PostgreSQL Docs postgresql.org/docs/current/tutorial-window.html
LeetCode: Window Problems Search "window" in SQL
DB Fiddle db-fiddle.com

Final Project: Advanced Analytics Dashboard (SQL Only)

Repo: yourname/sql-window-mastery

-- 01_running_revenue.sql
-- 02_top3_per_genre.sql
-- 03_customer_retention_cohort.sql
-- 04_sessionization_example.sql
-- README.md

README.md Snippet

# SQL Window Functions Mastery

## Key Queries
- **Running Revenue**: Cumulative sales over time
- **Top 3 Tracks per Genre**: Using `ROW_NUMBER()`
- **Cohort Retention**: Monthly active users by signup cohort
- **Sessionization**: Group events into sessions

## How to Run
```sql
-- SQLite
sqlite3 chinook.db < 01_running_revenue.sql
---

## Cheat Sheet (Copy-Paste)

```sql
-- RANKING
ROW_NUMBER() OVER (PARTITION BY x ORDER BY y)
RANK() OVER (PARTITION BY x ORDER BY y DESC)
DENSE_RANK() OVER (PARTITION BY x ORDER BY y DESC)

-- AGGREGATE
SUM(col) OVER (PARTITION BY cat ORDER BY date) AS running_total
AVG(col) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma_3

-- LEAD/LAG
LAG(col, 1) OVER (PARTITION BY user ORDER BY time) AS prev_val
LEAD(col) OVER (PARTITION BY user ORDER BY time) AS next_val

-- FRAME
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING

Assessment: Can You Write This?

Query Yes/No
Rank employees by sales, no gaps
3-month moving average
Next purchase date per customer
Top 2 invoices per day
Cohort table (month 0,1,2…)

All Yes → You’re a Window Function Ninja!


Next: Phase 2 – Statistics & Math

Now you can extract and rank — time to explain with stats.


Start Now:
1. Open Chinook.db
2. Run:

SELECT 
    CustomerId,
    InvoiceDate,
    Total,
    SUM(Total) OVER (PARTITION BY CustomerId ORDER BY InvoiceDate) AS running_spend
FROM Invoice
ORDER BY CustomerId, InvoiceDate
LIMIT 10;

Tag me when you push your window function repo!
You’re now in the top 10% of SQL users.

Last updated: Nov 09, 2025