Cohort Analysis with SQL & Python

Real-World Examples for Data Scientists & Analysts Goal: Master cohort analysis — the #1 framework for understanding user retention, churn, LTV, and product health.

Cohort Analysis with SQL & Python

Cohort Analysis with SQL & Python

Cohort Analysis with SQL & Python

Real-World Examples for Data Scientists & Analysts

Goal: Master cohort analysis — the #1 framework for understanding user retention, churn, LTV, and product health.

Used by: Netflix, Spotify, Amazon, Uber
Appears in: 90% of Senior DS interviews
Salary impact: +15–25%


What is Cohort Analysis?

Group users by when they startedtrack behavior over time

Cohort Month 0 Month 1 Month 2 Month 3
Jan 2025 100 users 40 25 18
Feb 2025 120 users 55 32

Key Metric: Retention Rate = (Users in Month N) / (Users in Month 0)


3 Real-World Examples

# Business Cohort Type Metric
1 E-commerce First Purchase Month % of customers who buy again
2 SaaS Signup Month % active users (logins)
3 Gaming Install Day Daily Active Users (DAU)

Example 1: E-commerce Retention (SQL)

Dataset: orders table

CREATE TABLE orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    revenue DECIMAL(10,2)
);

Step 1: Find First Purchase (Cohort Month)

WITH first_purchase AS (
    SELECT 
        customer_id,
        MIN(DATE_TRUNC('month', order_date)) AS cohort_month
    FROM orders
    GROUP BY customer_id
),

Step 2: Assign Activity Month

monthly_activity AS (
    SELECT 
        o.customer_id,
        fp.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        COUNT(DISTINCT o.order_id) AS orders,
        SUM(o.revenue) AS revenue
    FROM orders o
    JOIN first_purchase fp ON o.customer_id = fp.customer_id
    GROUP BY o.customer_id, fp.cohort_month, activity_month
),

Step 3: Build Cohort Table

cohort_data AS (
    SELECT 
        cohort_month,
        activity_month,
        DATEDIFF('month', cohort_month, activity_month) AS month_number,
        COUNT(DISTINCT customer_id) AS active_customers,
        SUM(revenue) AS total_revenue
    FROM monthly_activity
    GROUP BY cohort_month, activity_month, month_number
),

Step 4: Final Retention Matrix

SELECT 
    cohort_month,
    month_number,
    active_customers,
    FIRST_VALUE(active_customers) OVER (PARTITION BY cohort_month ORDER BY month_number) AS cohort_size,
    ROUND(100.0 * active_customers / 
          FIRST_VALUE(active_customers) OVER (PARTITION BY cohort_month ORDER BY month_number), 2) AS retention_rate
FROM cohort_data
ORDER BY cohort_month, month_number;

Output

cohort_month month_number active_customers cohort_size retention_rate
2025-01-01 0 500 500 100.00
2025-01-01 1 180 500 36.00
2025-01-01 2 90 500 18.00
2025-02-01 0 520 520 100.00
2025-02-01 1 210 520 40.38

Insight: Feb cohort retains better → investigate onboarding changes!


Example 2: SaaS User Retention (Python + Pandas)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = pd.read_csv('saas_logins.csv')
data['login_date'] = pd.to_datetime(data['login_date'])
data['signup_date'] = pd.to_datetime(data['signup_date'])

# Step 1: Cohort month
data['cohort_month'] = data['signup_date'].dt.to_period('M')
data['activity_month'] = data['login_date'].dt.to_period('M')

# Step 2: Group
cohort_data = data.groupby(['cohort_month', 'activity_month']).agg(
    active_users=('user_id', 'nunique')
).reset_index()

# Step 3: Month number
cohort_data['month_number'] = (
    cohort_data['activity_month'] - cohort_data['cohort_month']
).apply(lambda x: x.n)

# Step 4: Cohort size
cohort_size = cohort_data[cohort_data['month_number'] == 0][['cohort_month', 'active_users']]
cohort_size = cohort_size.rename(columns={'active_users': 'cohort_size'})

# Merge
final = cohort_data.merge(cohort_size, on='cohort_month')
final['retention'] = final['active_users'] / final['cohort_size']

# Pivot for heatmap
pivot = final.pivot_table(
    index='cohort_month', 
    columns='month_number', 
    values='retention'
)

# Plot
plt.figure(figsize=(10, 8))
sns.heatmap(pivot, annot=True, fmt='.0%', cmap='Blues', cbar_kws={'label': 'Retention %'})
plt.title('Monthly User Retention by Signup Cohort')
plt.ylabel('Cohort Month')
plt.xlabel('Month Number')
plt.show()

Output Heatmap

Month →  0      1      2      3
Cohort ↓
2025-01  100%   45%    28%    19%
2025-02  100%   52%    35%    —
2025-03  100%   48%    —      —

Example 3: Gaming DAU Retention (Daily Cohorts)

Goal: Track Day 0, Day 1, Day 7 retention for mobile game installs

WITH installs AS (
    SELECT 
        user_id,
        MIN(event_date) AS install_date
    FROM events 
    WHERE event_type = 'install'
    GROUP BY user_id
),
daily_active AS (
    SELECT 
        i.user_id,
        i.install_date,
        e.event_date,
        DATEDIFF(e.event_date, i.install_date) AS day_number
    FROM events e
    JOIN installs i ON e.user_id = i.user_id
    WHERE e.event_type = 'session_start'
),
retention AS (
    SELECT 
        install_date,
        day_number,
        COUNT(DISTINCT user_id) AS dau
    FROM daily_active
    WHERE day_number IN (0, 1, 3, 7, 14, 30)
    GROUP BY install_date, day_number
),
cohort_size AS (
    SELECT install_date, dau AS cohort_size
    FROM retention
    WHERE day_number = 0
)
SELECT 
    r.install_date,
    r.day_number,
    r.dau,
    c.cohort_size,
    ROUND(100.0 * r.dau / c.cohort_size, 2) AS retention_pct
FROM retention r
JOIN cohort_size c ON r.install_date = c.install_date
ORDER BY install_date, day_number;

Output

install_date day_number dau cohort_size retention_pct
2025-04-01 0 1000 1000 100.00
2025-04-01 1 420 1000 42.00
2025-04-01 7 180 1000 18.00
2025-04-02 0 980 980 100.00
2025-04-02 1 460 980 46.94

Cohort Types Summary

Type Granularity Use Case
Acquisition First purchase, signup Classic retention
Behavioral First use of feature Feature adoption
Time-based Week, Day, Hour Gaming, news apps
Segmented By channel, plan Marketing ROI

Advanced: Revenue per Cohort (LTV)

-- Add revenue to e-commerce cohort
SELECT 
    cohort_month,
    month_number,
    SUM(total_revenue) AS revenue,
    SUM(total_revenue) / FIRST_VALUE(SUM(total_revenue)) OVER (PARTITION BY cohort_month ORDER BY month_number) AS revenue_index
FROM cohort_data
GROUP BY cohort_month, month_number;

LTV at Month 6 = Sum of revenue from Month 0 to 6


Interview Question (Solve Live!)

"A ride-sharing app wants to improve Day 3 retention. Using the rides table, build a daily cohort retention table and identify which cohort has the highest drop-off from Day 0 to Day 1."

Hint:

-- Use DATEDIFF, GROUP BY install_date, day_number
-- Filter day_number IN (0,1)
-- Find cohort with max(1 - retention_d1)

Project: Build Your Own Cohort Dashboard

Repo: yourname/cohort-analysis

cohort-analysis/
├── sql/
│   ├── ecommerce_retention.sql
│   ├── gaming_dau.sql
│   └── ltv_cohort.sql
├── python/
│   ├── saas_heatmap.ipynb
│   └── cohort_class.py
├── data/
│   └── sample_orders.csv
└── README.md

README.md

# Cohort Analysis Toolkit

## Features
- SQL: E-commerce, Gaming, SaaS retention
- Python: Heatmaps, LTV, segmentation
- Reusable: Works with any user-event data

## Run
```bash
# SQL
sqlite3 data.db < sql/ecommerce_retention.sql

# Python
jupyter notebook python/saas_heatmap.ipynb

```


Tools & Libraries

Tool Use
SQL Raw computation
Pandas Data wrangling
Seaborn Heatmaps
Plotly Interactive dashboards
dbt Production cohorts

Pro Tips

  1. Always normalize by cohort size → avoid misleading %
  2. Cap at 12 months → long-tail distorts
  3. Segment cohorts → by acquisition channel, region, plan
  4. Automate → schedule in Airflow/Cron
  5. Visualize → heatmaps > tables

Free Datasets to Practice

Dataset Link
Online Retail archive.ics.uci.edu/ml/datasets/online+retail
Instacart instacart.com/datasets
Kaggle: E-commerce kaggle.com/c/instacart-market-basket-analysis

Final Checklist: Can You Build This?

Task Yes/No
SQL cohort table with retention %
Python heatmap with seaborn
Daily gaming retention (D1, D7)
LTV per cohort
Identify worst-performing cohort

All Yes → You’re cohort-ready for FAANG!


Next: Phase 2 – Statistics & A/B Testing

Now you can measure retention → learn to improve it.


Start Now:
1. Download Online Retail Dataset
2. Load into SQLite
3. Run the e-commerce SQL cohort query

Tag me when you push your cohort repo!
You now speak the language of product growth.

Last updated: Nov 09, 2025

Cohort Analysis with SQL & Python

Real-World Examples for Data Scientists & Analysts Goal: Master cohort analysis — the #1 framework for understanding user retention, churn, LTV, and product health.

Cohort Analysis with SQL & Python

Cohort Analysis with SQL & Python

Cohort Analysis with SQL & Python

Real-World Examples for Data Scientists & Analysts

Goal: Master cohort analysis — the #1 framework for understanding user retention, churn, LTV, and product health.

Used by: Netflix, Spotify, Amazon, Uber
Appears in: 90% of Senior DS interviews
Salary impact: +15–25%


What is Cohort Analysis?

Group users by when they startedtrack behavior over time

Cohort Month 0 Month 1 Month 2 Month 3
Jan 2025 100 users 40 25 18
Feb 2025 120 users 55 32

Key Metric: Retention Rate = (Users in Month N) / (Users in Month 0)


3 Real-World Examples

# Business Cohort Type Metric
1 E-commerce First Purchase Month % of customers who buy again
2 SaaS Signup Month % active users (logins)
3 Gaming Install Day Daily Active Users (DAU)

Example 1: E-commerce Retention (SQL)

Dataset: orders table

CREATE TABLE orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    revenue DECIMAL(10,2)
);

Step 1: Find First Purchase (Cohort Month)

WITH first_purchase AS (
    SELECT 
        customer_id,
        MIN(DATE_TRUNC('month', order_date)) AS cohort_month
    FROM orders
    GROUP BY customer_id
),

Step 2: Assign Activity Month

monthly_activity AS (
    SELECT 
        o.customer_id,
        fp.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        COUNT(DISTINCT o.order_id) AS orders,
        SUM(o.revenue) AS revenue
    FROM orders o
    JOIN first_purchase fp ON o.customer_id = fp.customer_id
    GROUP BY o.customer_id, fp.cohort_month, activity_month
),

Step 3: Build Cohort Table

cohort_data AS (
    SELECT 
        cohort_month,
        activity_month,
        DATEDIFF('month', cohort_month, activity_month) AS month_number,
        COUNT(DISTINCT customer_id) AS active_customers,
        SUM(revenue) AS total_revenue
    FROM monthly_activity
    GROUP BY cohort_month, activity_month, month_number
),

Step 4: Final Retention Matrix

SELECT 
    cohort_month,
    month_number,
    active_customers,
    FIRST_VALUE(active_customers) OVER (PARTITION BY cohort_month ORDER BY month_number) AS cohort_size,
    ROUND(100.0 * active_customers / 
          FIRST_VALUE(active_customers) OVER (PARTITION BY cohort_month ORDER BY month_number), 2) AS retention_rate
FROM cohort_data
ORDER BY cohort_month, month_number;

Output

cohort_month month_number active_customers cohort_size retention_rate
2025-01-01 0 500 500 100.00
2025-01-01 1 180 500 36.00
2025-01-01 2 90 500 18.00
2025-02-01 0 520 520 100.00
2025-02-01 1 210 520 40.38

Insight: Feb cohort retains better → investigate onboarding changes!


Example 2: SaaS User Retention (Python + Pandas)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = pd.read_csv('saas_logins.csv')
data['login_date'] = pd.to_datetime(data['login_date'])
data['signup_date'] = pd.to_datetime(data['signup_date'])

# Step 1: Cohort month
data['cohort_month'] = data['signup_date'].dt.to_period('M')
data['activity_month'] = data['login_date'].dt.to_period('M')

# Step 2: Group
cohort_data = data.groupby(['cohort_month', 'activity_month']).agg(
    active_users=('user_id', 'nunique')
).reset_index()

# Step 3: Month number
cohort_data['month_number'] = (
    cohort_data['activity_month'] - cohort_data['cohort_month']
).apply(lambda x: x.n)

# Step 4: Cohort size
cohort_size = cohort_data[cohort_data['month_number'] == 0][['cohort_month', 'active_users']]
cohort_size = cohort_size.rename(columns={'active_users': 'cohort_size'})

# Merge
final = cohort_data.merge(cohort_size, on='cohort_month')
final['retention'] = final['active_users'] / final['cohort_size']

# Pivot for heatmap
pivot = final.pivot_table(
    index='cohort_month', 
    columns='month_number', 
    values='retention'
)

# Plot
plt.figure(figsize=(10, 8))
sns.heatmap(pivot, annot=True, fmt='.0%', cmap='Blues', cbar_kws={'label': 'Retention %'})
plt.title('Monthly User Retention by Signup Cohort')
plt.ylabel('Cohort Month')
plt.xlabel('Month Number')
plt.show()

Output Heatmap

Month →  0      1      2      3
Cohort ↓
2025-01  100%   45%    28%    19%
2025-02  100%   52%    35%    —
2025-03  100%   48%    —      —

Example 3: Gaming DAU Retention (Daily Cohorts)

Goal: Track Day 0, Day 1, Day 7 retention for mobile game installs

WITH installs AS (
    SELECT 
        user_id,
        MIN(event_date) AS install_date
    FROM events 
    WHERE event_type = 'install'
    GROUP BY user_id
),
daily_active AS (
    SELECT 
        i.user_id,
        i.install_date,
        e.event_date,
        DATEDIFF(e.event_date, i.install_date) AS day_number
    FROM events e
    JOIN installs i ON e.user_id = i.user_id
    WHERE e.event_type = 'session_start'
),
retention AS (
    SELECT 
        install_date,
        day_number,
        COUNT(DISTINCT user_id) AS dau
    FROM daily_active
    WHERE day_number IN (0, 1, 3, 7, 14, 30)
    GROUP BY install_date, day_number
),
cohort_size AS (
    SELECT install_date, dau AS cohort_size
    FROM retention
    WHERE day_number = 0
)
SELECT 
    r.install_date,
    r.day_number,
    r.dau,
    c.cohort_size,
    ROUND(100.0 * r.dau / c.cohort_size, 2) AS retention_pct
FROM retention r
JOIN cohort_size c ON r.install_date = c.install_date
ORDER BY install_date, day_number;

Output

install_date day_number dau cohort_size retention_pct
2025-04-01 0 1000 1000 100.00
2025-04-01 1 420 1000 42.00
2025-04-01 7 180 1000 18.00
2025-04-02 0 980 980 100.00
2025-04-02 1 460 980 46.94

Cohort Types Summary

Type Granularity Use Case
Acquisition First purchase, signup Classic retention
Behavioral First use of feature Feature adoption
Time-based Week, Day, Hour Gaming, news apps
Segmented By channel, plan Marketing ROI

Advanced: Revenue per Cohort (LTV)

-- Add revenue to e-commerce cohort
SELECT 
    cohort_month,
    month_number,
    SUM(total_revenue) AS revenue,
    SUM(total_revenue) / FIRST_VALUE(SUM(total_revenue)) OVER (PARTITION BY cohort_month ORDER BY month_number) AS revenue_index
FROM cohort_data
GROUP BY cohort_month, month_number;

LTV at Month 6 = Sum of revenue from Month 0 to 6


Interview Question (Solve Live!)

"A ride-sharing app wants to improve Day 3 retention. Using the rides table, build a daily cohort retention table and identify which cohort has the highest drop-off from Day 0 to Day 1."

Hint:

-- Use DATEDIFF, GROUP BY install_date, day_number
-- Filter day_number IN (0,1)
-- Find cohort with max(1 - retention_d1)

Project: Build Your Own Cohort Dashboard

Repo: yourname/cohort-analysis

cohort-analysis/
├── sql/
│   ├── ecommerce_retention.sql
│   ├── gaming_dau.sql
│   └── ltv_cohort.sql
├── python/
│   ├── saas_heatmap.ipynb
│   └── cohort_class.py
├── data/
│   └── sample_orders.csv
└── README.md

README.md

# Cohort Analysis Toolkit

## Features
- SQL: E-commerce, Gaming, SaaS retention
- Python: Heatmaps, LTV, segmentation
- Reusable: Works with any user-event data

## Run
```bash
# SQL
sqlite3 data.db < sql/ecommerce_retention.sql

# Python
jupyter notebook python/saas_heatmap.ipynb

```


Tools & Libraries

Tool Use
SQL Raw computation
Pandas Data wrangling
Seaborn Heatmaps
Plotly Interactive dashboards
dbt Production cohorts

Pro Tips

  1. Always normalize by cohort size → avoid misleading %
  2. Cap at 12 months → long-tail distorts
  3. Segment cohorts → by acquisition channel, region, plan
  4. Automate → schedule in Airflow/Cron
  5. Visualize → heatmaps > tables

Free Datasets to Practice

Dataset Link
Online Retail archive.ics.uci.edu/ml/datasets/online+retail
Instacart instacart.com/datasets
Kaggle: E-commerce kaggle.com/c/instacart-market-basket-analysis

Final Checklist: Can You Build This?

Task Yes/No
SQL cohort table with retention %
Python heatmap with seaborn
Daily gaming retention (D1, D7)
LTV per cohort
Identify worst-performing cohort

All Yes → You’re cohort-ready for FAANG!


Next: Phase 2 – Statistics & A/B Testing

Now you can measure retention → learn to improve it.


Start Now:
1. Download Online Retail Dataset
2. Load into SQLite
3. Run the e-commerce SQL cohort query

Tag me when you push your cohort repo!
You now speak the language of product growth.

Last updated: Nov 09, 2025