Cohort Analysis with SQL & Python
Real-World Examples for Data Scientists & Analysts Goal: Master cohort analysis — the #1 framework for understanding user retention, churn, LTV, and product health.
Cohort Analysis with SQL & Python
Cohort Analysis with SQL & Python
Cohort Analysis with SQL & Python
Real-World Examples for Data Scientists & Analysts
Goal: Master cohort analysis — the #1 framework for understanding user retention, churn, LTV, and product health.
Used by: Netflix, Spotify, Amazon, Uber
Appears in: 90% of Senior DS interviews
Salary impact: +15–25%
What is Cohort Analysis?
Group users by when they started → track behavior over time
| Cohort | Month 0 | Month 1 | Month 2 | Month 3 |
|---|---|---|---|---|
| Jan 2025 | 100 users | 40 | 25 | 18 |
| Feb 2025 | 120 users | 55 | 32 | — |
Key Metric: Retention Rate = (Users in Month N) / (Users in Month 0)
3 Real-World Examples
| # | Business | Cohort Type | Metric |
|---|---|---|---|
| 1 | E-commerce | First Purchase Month | % of customers who buy again |
| 2 | SaaS | Signup Month | % active users (logins) |
| 3 | Gaming | Install Day | Daily Active Users (DAU) |
Example 1: E-commerce Retention (SQL)
Dataset: orders table
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
revenue DECIMAL(10,2)
);
Step 1: Find First Purchase (Cohort Month)
WITH first_purchase AS (
SELECT
customer_id,
MIN(DATE_TRUNC('month', order_date)) AS cohort_month
FROM orders
GROUP BY customer_id
),
Step 2: Assign Activity Month
monthly_activity AS (
SELECT
o.customer_id,
fp.cohort_month,
DATE_TRUNC('month', o.order_date) AS activity_month,
COUNT(DISTINCT o.order_id) AS orders,
SUM(o.revenue) AS revenue
FROM orders o
JOIN first_purchase fp ON o.customer_id = fp.customer_id
GROUP BY o.customer_id, fp.cohort_month, activity_month
),
Step 3: Build Cohort Table
cohort_data AS (
SELECT
cohort_month,
activity_month,
DATEDIFF('month', cohort_month, activity_month) AS month_number,
COUNT(DISTINCT customer_id) AS active_customers,
SUM(revenue) AS total_revenue
FROM monthly_activity
GROUP BY cohort_month, activity_month, month_number
),
Step 4: Final Retention Matrix
SELECT
cohort_month,
month_number,
active_customers,
FIRST_VALUE(active_customers) OVER (PARTITION BY cohort_month ORDER BY month_number) AS cohort_size,
ROUND(100.0 * active_customers /
FIRST_VALUE(active_customers) OVER (PARTITION BY cohort_month ORDER BY month_number), 2) AS retention_rate
FROM cohort_data
ORDER BY cohort_month, month_number;
Output
| cohort_month | month_number | active_customers | cohort_size | retention_rate |
|---|---|---|---|---|
| 2025-01-01 | 0 | 500 | 500 | 100.00 |
| 2025-01-01 | 1 | 180 | 500 | 36.00 |
| 2025-01-01 | 2 | 90 | 500 | 18.00 |
| 2025-02-01 | 0 | 520 | 520 | 100.00 |
| 2025-02-01 | 1 | 210 | 520 | 40.38 |
Insight: Feb cohort retains better → investigate onboarding changes!
Example 2: SaaS User Retention (Python + Pandas)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data
data = pd.read_csv('saas_logins.csv')
data['login_date'] = pd.to_datetime(data['login_date'])
data['signup_date'] = pd.to_datetime(data['signup_date'])
# Step 1: Cohort month
data['cohort_month'] = data['signup_date'].dt.to_period('M')
data['activity_month'] = data['login_date'].dt.to_period('M')
# Step 2: Group
cohort_data = data.groupby(['cohort_month', 'activity_month']).agg(
active_users=('user_id', 'nunique')
).reset_index()
# Step 3: Month number
cohort_data['month_number'] = (
cohort_data['activity_month'] - cohort_data['cohort_month']
).apply(lambda x: x.n)
# Step 4: Cohort size
cohort_size = cohort_data[cohort_data['month_number'] == 0][['cohort_month', 'active_users']]
cohort_size = cohort_size.rename(columns={'active_users': 'cohort_size'})
# Merge
final = cohort_data.merge(cohort_size, on='cohort_month')
final['retention'] = final['active_users'] / final['cohort_size']
# Pivot for heatmap
pivot = final.pivot_table(
index='cohort_month',
columns='month_number',
values='retention'
)
# Plot
plt.figure(figsize=(10, 8))
sns.heatmap(pivot, annot=True, fmt='.0%', cmap='Blues', cbar_kws={'label': 'Retention %'})
plt.title('Monthly User Retention by Signup Cohort')
plt.ylabel('Cohort Month')
plt.xlabel('Month Number')
plt.show()
Output Heatmap
Month → 0 1 2 3
Cohort ↓
2025-01 100% 45% 28% 19%
2025-02 100% 52% 35% —
2025-03 100% 48% — —
Example 3: Gaming DAU Retention (Daily Cohorts)
Goal: Track Day 0, Day 1, Day 7 retention for mobile game installs
WITH installs AS (
SELECT
user_id,
MIN(event_date) AS install_date
FROM events
WHERE event_type = 'install'
GROUP BY user_id
),
daily_active AS (
SELECT
i.user_id,
i.install_date,
e.event_date,
DATEDIFF(e.event_date, i.install_date) AS day_number
FROM events e
JOIN installs i ON e.user_id = i.user_id
WHERE e.event_type = 'session_start'
),
retention AS (
SELECT
install_date,
day_number,
COUNT(DISTINCT user_id) AS dau
FROM daily_active
WHERE day_number IN (0, 1, 3, 7, 14, 30)
GROUP BY install_date, day_number
),
cohort_size AS (
SELECT install_date, dau AS cohort_size
FROM retention
WHERE day_number = 0
)
SELECT
r.install_date,
r.day_number,
r.dau,
c.cohort_size,
ROUND(100.0 * r.dau / c.cohort_size, 2) AS retention_pct
FROM retention r
JOIN cohort_size c ON r.install_date = c.install_date
ORDER BY install_date, day_number;
Output
| install_date | day_number | dau | cohort_size | retention_pct |
|---|---|---|---|---|
| 2025-04-01 | 0 | 1000 | 1000 | 100.00 |
| 2025-04-01 | 1 | 420 | 1000 | 42.00 |
| 2025-04-01 | 7 | 180 | 1000 | 18.00 |
| 2025-04-02 | 0 | 980 | 980 | 100.00 |
| 2025-04-02 | 1 | 460 | 980 | 46.94 |
Cohort Types Summary
| Type | Granularity | Use Case |
|---|---|---|
| Acquisition | First purchase, signup | Classic retention |
| Behavioral | First use of feature | Feature adoption |
| Time-based | Week, Day, Hour | Gaming, news apps |
| Segmented | By channel, plan | Marketing ROI |
Advanced: Revenue per Cohort (LTV)
-- Add revenue to e-commerce cohort
SELECT
cohort_month,
month_number,
SUM(total_revenue) AS revenue,
SUM(total_revenue) / FIRST_VALUE(SUM(total_revenue)) OVER (PARTITION BY cohort_month ORDER BY month_number) AS revenue_index
FROM cohort_data
GROUP BY cohort_month, month_number;
LTV at Month 6 = Sum of revenue from Month 0 to 6
Interview Question (Solve Live!)
"A ride-sharing app wants to improve Day 3 retention. Using the
ridestable, build a daily cohort retention table and identify which cohort has the highest drop-off from Day 0 to Day 1."
Hint:
-- Use DATEDIFF, GROUP BY install_date, day_number
-- Filter day_number IN (0,1)
-- Find cohort with max(1 - retention_d1)
Project: Build Your Own Cohort Dashboard
Repo: yourname/cohort-analysis
cohort-analysis/
├── sql/
│ ├── ecommerce_retention.sql
│ ├── gaming_dau.sql
│ └── ltv_cohort.sql
├── python/
│ ├── saas_heatmap.ipynb
│ └── cohort_class.py
├── data/
│ └── sample_orders.csv
└── README.md
README.md
# Cohort Analysis Toolkit
## Features
- SQL: E-commerce, Gaming, SaaS retention
- Python: Heatmaps, LTV, segmentation
- Reusable: Works with any user-event data
## Run
```bash
# SQL
sqlite3 data.db < sql/ecommerce_retention.sql
# Python
jupyter notebook python/saas_heatmap.ipynb
```
Tools & Libraries
| Tool | Use |
|---|---|
| SQL | Raw computation |
| Pandas | Data wrangling |
| Seaborn | Heatmaps |
| Plotly | Interactive dashboards |
| dbt | Production cohorts |
Pro Tips
- Always normalize by cohort size → avoid misleading %
- Cap at 12 months → long-tail distorts
- Segment cohorts → by acquisition channel, region, plan
- Automate → schedule in Airflow/Cron
- Visualize → heatmaps > tables
Free Datasets to Practice
| Dataset | Link |
|---|---|
| Online Retail | archive.ics.uci.edu/ml/datasets/online+retail |
| Instacart | instacart.com/datasets |
| Kaggle: E-commerce | kaggle.com/c/instacart-market-basket-analysis |
Final Checklist: Can You Build This?
| Task | Yes/No |
|---|---|
| SQL cohort table with retention % | ☐ |
| Python heatmap with seaborn | ☐ |
| Daily gaming retention (D1, D7) | ☐ |
| LTV per cohort | ☐ |
| Identify worst-performing cohort | ☐ |
All Yes → You’re cohort-ready for FAANG!
Next: Phase 2 – Statistics & A/B Testing
Now you can measure retention → learn to improve it.
Start Now:
1. Download Online Retail Dataset
2. Load into SQLite
3. Run the e-commerce SQL cohort query
Tag me when you push your cohort repo!
You now speak the language of product growth.
Cohort Analysis with SQL & Python
Real-World Examples for Data Scientists & Analysts Goal: Master cohort analysis — the #1 framework for understanding user retention, churn, LTV, and product health.
Cohort Analysis with SQL & Python
Cohort Analysis with SQL & Python
Cohort Analysis with SQL & Python
Real-World Examples for Data Scientists & Analysts
Goal: Master cohort analysis — the #1 framework for understanding user retention, churn, LTV, and product health.
Used by: Netflix, Spotify, Amazon, Uber
Appears in: 90% of Senior DS interviews
Salary impact: +15–25%
What is Cohort Analysis?
Group users by when they started → track behavior over time
| Cohort | Month 0 | Month 1 | Month 2 | Month 3 |
|---|---|---|---|---|
| Jan 2025 | 100 users | 40 | 25 | 18 |
| Feb 2025 | 120 users | 55 | 32 | — |
Key Metric: Retention Rate = (Users in Month N) / (Users in Month 0)
3 Real-World Examples
| # | Business | Cohort Type | Metric |
|---|---|---|---|
| 1 | E-commerce | First Purchase Month | % of customers who buy again |
| 2 | SaaS | Signup Month | % active users (logins) |
| 3 | Gaming | Install Day | Daily Active Users (DAU) |
Example 1: E-commerce Retention (SQL)
Dataset: orders table
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
revenue DECIMAL(10,2)
);
Step 1: Find First Purchase (Cohort Month)
WITH first_purchase AS (
SELECT
customer_id,
MIN(DATE_TRUNC('month', order_date)) AS cohort_month
FROM orders
GROUP BY customer_id
),
Step 2: Assign Activity Month
monthly_activity AS (
SELECT
o.customer_id,
fp.cohort_month,
DATE_TRUNC('month', o.order_date) AS activity_month,
COUNT(DISTINCT o.order_id) AS orders,
SUM(o.revenue) AS revenue
FROM orders o
JOIN first_purchase fp ON o.customer_id = fp.customer_id
GROUP BY o.customer_id, fp.cohort_month, activity_month
),
Step 3: Build Cohort Table
cohort_data AS (
SELECT
cohort_month,
activity_month,
DATEDIFF('month', cohort_month, activity_month) AS month_number,
COUNT(DISTINCT customer_id) AS active_customers,
SUM(revenue) AS total_revenue
FROM monthly_activity
GROUP BY cohort_month, activity_month, month_number
),
Step 4: Final Retention Matrix
SELECT
cohort_month,
month_number,
active_customers,
FIRST_VALUE(active_customers) OVER (PARTITION BY cohort_month ORDER BY month_number) AS cohort_size,
ROUND(100.0 * active_customers /
FIRST_VALUE(active_customers) OVER (PARTITION BY cohort_month ORDER BY month_number), 2) AS retention_rate
FROM cohort_data
ORDER BY cohort_month, month_number;
Output
| cohort_month | month_number | active_customers | cohort_size | retention_rate |
|---|---|---|---|---|
| 2025-01-01 | 0 | 500 | 500 | 100.00 |
| 2025-01-01 | 1 | 180 | 500 | 36.00 |
| 2025-01-01 | 2 | 90 | 500 | 18.00 |
| 2025-02-01 | 0 | 520 | 520 | 100.00 |
| 2025-02-01 | 1 | 210 | 520 | 40.38 |
Insight: Feb cohort retains better → investigate onboarding changes!
Example 2: SaaS User Retention (Python + Pandas)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data
data = pd.read_csv('saas_logins.csv')
data['login_date'] = pd.to_datetime(data['login_date'])
data['signup_date'] = pd.to_datetime(data['signup_date'])
# Step 1: Cohort month
data['cohort_month'] = data['signup_date'].dt.to_period('M')
data['activity_month'] = data['login_date'].dt.to_period('M')
# Step 2: Group
cohort_data = data.groupby(['cohort_month', 'activity_month']).agg(
active_users=('user_id', 'nunique')
).reset_index()
# Step 3: Month number
cohort_data['month_number'] = (
cohort_data['activity_month'] - cohort_data['cohort_month']
).apply(lambda x: x.n)
# Step 4: Cohort size
cohort_size = cohort_data[cohort_data['month_number'] == 0][['cohort_month', 'active_users']]
cohort_size = cohort_size.rename(columns={'active_users': 'cohort_size'})
# Merge
final = cohort_data.merge(cohort_size, on='cohort_month')
final['retention'] = final['active_users'] / final['cohort_size']
# Pivot for heatmap
pivot = final.pivot_table(
index='cohort_month',
columns='month_number',
values='retention'
)
# Plot
plt.figure(figsize=(10, 8))
sns.heatmap(pivot, annot=True, fmt='.0%', cmap='Blues', cbar_kws={'label': 'Retention %'})
plt.title('Monthly User Retention by Signup Cohort')
plt.ylabel('Cohort Month')
plt.xlabel('Month Number')
plt.show()
Output Heatmap
Month → 0 1 2 3
Cohort ↓
2025-01 100% 45% 28% 19%
2025-02 100% 52% 35% —
2025-03 100% 48% — —
Example 3: Gaming DAU Retention (Daily Cohorts)
Goal: Track Day 0, Day 1, Day 7 retention for mobile game installs
WITH installs AS (
SELECT
user_id,
MIN(event_date) AS install_date
FROM events
WHERE event_type = 'install'
GROUP BY user_id
),
daily_active AS (
SELECT
i.user_id,
i.install_date,
e.event_date,
DATEDIFF(e.event_date, i.install_date) AS day_number
FROM events e
JOIN installs i ON e.user_id = i.user_id
WHERE e.event_type = 'session_start'
),
retention AS (
SELECT
install_date,
day_number,
COUNT(DISTINCT user_id) AS dau
FROM daily_active
WHERE day_number IN (0, 1, 3, 7, 14, 30)
GROUP BY install_date, day_number
),
cohort_size AS (
SELECT install_date, dau AS cohort_size
FROM retention
WHERE day_number = 0
)
SELECT
r.install_date,
r.day_number,
r.dau,
c.cohort_size,
ROUND(100.0 * r.dau / c.cohort_size, 2) AS retention_pct
FROM retention r
JOIN cohort_size c ON r.install_date = c.install_date
ORDER BY install_date, day_number;
Output
| install_date | day_number | dau | cohort_size | retention_pct |
|---|---|---|---|---|
| 2025-04-01 | 0 | 1000 | 1000 | 100.00 |
| 2025-04-01 | 1 | 420 | 1000 | 42.00 |
| 2025-04-01 | 7 | 180 | 1000 | 18.00 |
| 2025-04-02 | 0 | 980 | 980 | 100.00 |
| 2025-04-02 | 1 | 460 | 980 | 46.94 |
Cohort Types Summary
| Type | Granularity | Use Case |
|---|---|---|
| Acquisition | First purchase, signup | Classic retention |
| Behavioral | First use of feature | Feature adoption |
| Time-based | Week, Day, Hour | Gaming, news apps |
| Segmented | By channel, plan | Marketing ROI |
Advanced: Revenue per Cohort (LTV)
-- Add revenue to e-commerce cohort
SELECT
cohort_month,
month_number,
SUM(total_revenue) AS revenue,
SUM(total_revenue) / FIRST_VALUE(SUM(total_revenue)) OVER (PARTITION BY cohort_month ORDER BY month_number) AS revenue_index
FROM cohort_data
GROUP BY cohort_month, month_number;
LTV at Month 6 = Sum of revenue from Month 0 to 6
Interview Question (Solve Live!)
"A ride-sharing app wants to improve Day 3 retention. Using the
ridestable, build a daily cohort retention table and identify which cohort has the highest drop-off from Day 0 to Day 1."
Hint:
-- Use DATEDIFF, GROUP BY install_date, day_number
-- Filter day_number IN (0,1)
-- Find cohort with max(1 - retention_d1)
Project: Build Your Own Cohort Dashboard
Repo: yourname/cohort-analysis
cohort-analysis/
├── sql/
│ ├── ecommerce_retention.sql
│ ├── gaming_dau.sql
│ └── ltv_cohort.sql
├── python/
│ ├── saas_heatmap.ipynb
│ └── cohort_class.py
├── data/
│ └── sample_orders.csv
└── README.md
README.md
# Cohort Analysis Toolkit
## Features
- SQL: E-commerce, Gaming, SaaS retention
- Python: Heatmaps, LTV, segmentation
- Reusable: Works with any user-event data
## Run
```bash
# SQL
sqlite3 data.db < sql/ecommerce_retention.sql
# Python
jupyter notebook python/saas_heatmap.ipynb
```
Tools & Libraries
| Tool | Use |
|---|---|
| SQL | Raw computation |
| Pandas | Data wrangling |
| Seaborn | Heatmaps |
| Plotly | Interactive dashboards |
| dbt | Production cohorts |
Pro Tips
- Always normalize by cohort size → avoid misleading %
- Cap at 12 months → long-tail distorts
- Segment cohorts → by acquisition channel, region, plan
- Automate → schedule in Airflow/Cron
- Visualize → heatmaps > tables
Free Datasets to Practice
| Dataset | Link |
|---|---|
| Online Retail | archive.ics.uci.edu/ml/datasets/online+retail |
| Instacart | instacart.com/datasets |
| Kaggle: E-commerce | kaggle.com/c/instacart-market-basket-analysis |
Final Checklist: Can You Build This?
| Task | Yes/No |
|---|---|
| SQL cohort table with retention % | ☐ |
| Python heatmap with seaborn | ☐ |
| Daily gaming retention (D1, D7) | ☐ |
| LTV per cohort | ☐ |
| Identify worst-performing cohort | ☐ |
All Yes → You’re cohort-ready for FAANG!
Next: Phase 2 – Statistics & A/B Testing
Now you can measure retention → learn to improve it.
Start Now:
1. Download Online Retail Dataset
2. Load into SQLite
3. Run the e-commerce SQL cohort query
Tag me when you push your cohort repo!
You now speak the language of product growth.