Behavioral Cohort Analysis: Real-World Examples

Advanced Retention & Product Analytics Goal: Go beyond acquisition → analyze what users do to predict churn, LTV, and growth.

Behavioral Cohort Analysis: Real-World Examples

Advanced Retention & Product Analytics

Goal: Go beyond acquisition → analyze what users do to predict churn, LTV, and growth.

Used by: Airbnb, Duolingo, Notion, Spotify
Interview frequency: 95% of Product DS roles
Impact: +30% accuracy in churn models

What is Behavioral Cohort Analysis?

Group users by their first meaningful action → track engagement, progression, and monetization

Cohort Type	Example
Acquisition	“Signed up in Jan”
Behavioral	“Completed first lesson in 24h”
Hybrid	“Signed up via Google + used search in first session”

3 Real-World Behavioral Cohort Examples

#	Product	Behavioral Cohort	Key Insight
1	Duolingo	First streak length	7-day streak → 5x LTV
2	Spotify	First playlist created	Creators → 3x retention
3	Notion	First page shared	Sharers → 4x viral coefficient

Example 1: Duolingo – Streak-Based Cohorts (SQL)

Goal: Do users who start with a 7-day streak retain better?

-- events table
-- event_type: 'lesson_completed', 'streak_updated'
-- event_time: timestamp

WITH first_lesson AS (
    SELECT 
        user_id,
        MIN(event_time) AS first_lesson_time
    FROM events 
    WHERE event_type = 'lesson_completed'
    GROUP BY user_id
),
streaks AS (
    SELECT 
        e.user_id,
        e.event_time::date AS activity_date,
        ROW_NUMBER() OVER (PARTITION BY e.user_id ORDER BY e.event_time::date) AS day_number
    FROM events e
    JOIN first_lesson fl ON e.user_id = fl.user_id
    WHERE e.event_type = 'lesson_completed'
      AND e.event_time::date <= fl.first_lesson_time::date + INTERVAL '30 days'
),
cohorts AS (
    SELECT 
        user_id,
        CASE 
            WHEN MAX(day_number) >= 7 THEN '7-day_streak'
            WHEN MAX(day_number) >= 3 THEN '3-day_streak'
            ELSE 'low_streak'
        END AS cohort
    FROM streaks
    GROUP BY user_id
),
daily_active AS (
    SELECT 
        c.cohort,
        DATE_TRUNC('day', e.event_time) AS activity_day,
        COUNT(DISTINCT e.user_id) AS dau
    FROM events e
    JOIN cohorts c ON e.user_id = c.user_id
    WHERE e.event_type = 'lesson_completed'
    GROUP BY c.cohort, activity_day
),
cohort_size AS (
    SELECT cohort, COUNT(*) AS size
    FROM cohorts
    GROUP BY cohort
)
SELECT 
    da.cohort,
    da.activity_day,
    da.dau,
    cs.size,
    ROUND(100.0 * da.dau / cs.size, 2) AS retention_pct
FROM daily_active da
JOIN cohort_size cs ON da.cohort = cs.cohort
WHERE da.activity_day <= '2025-05-01'
ORDER BY da.cohort, da.activity_day;

Insight

Cohort	Day 30 Retention
7-day_streak	68%
3-day_streak	42%
low_streak	15%

Action: Gamify onboarding to hit Day 7 streak.

Example 2: Spotify – Feature Adoption Cohorts (Python + Pandas)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# events: user_id, event_time, event_type
data = pd.read_csv('spotify_events.csv')
data['event_time'] = pd.to_datetime(data['event_time'])

# Step 1: First playlist creation
first_playlist = (
    data[data['event_type'] == 'playlist_created']
    .groupby('user_id')['event_time'].min()
    .reset_index()
    .rename(columns={'event_time': 'first_playlist_time'})
)

data = data.merge(first_playlist, on='user_id', how='left')

# Step 2: Define cohort
data['cohort'] = data['first_playlist_time'].apply(
    lambda x: 'creator' if pd.notna(x) else 'non_creator'
)

# Step 3: Daily active users
data['activity_day'] = data['event_time'].dt.date
daily = (
    data[data['event_type'].isin(['play_song', 'skip_track'])]
    .groupby(['cohort', 'activity_day'])['user_id']
    .nunique().reset_index(name='dau')
)

# Step 4: Cohort size
cohort_size = data.groupby('cohort')['user_id'].nunique().reset_index()
cohort_size = cohort_size.rename(columns={'user_id': 'size'})

daily = daily.merge(cohort_size, on='cohort')
daily['retention'] = daily['dau'] / daily['size']

# Plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=daily, x='activity_day', y='retention', hue='cohort')
plt.title('Retention: Playlist Creators vs Non-Creators')
plt.ylabel('Retention Rate')
plt.xlabel('Days Since First Activity')
plt.show()

Insight

Creators: 65% Day 30 retention
Non-creators: 28%
→ Prompt playlist creation in onboarding

Example 3: Notion – Collaboration Cohorts (Hybrid SQL + Python)

Cohort: Users who shared first page within 24h of signup

WITH first_signup AS (
    SELECT user_id, MIN(event_time) AS signup_time
    FROM events WHERE event_type = 'signup'
    GROUP BY user_id
),
first_share AS (
    SELECT 
        e.user_id,
        MIN(e.event_time) AS first_share_time
    FROM events e
    WHERE e.event_type = 'page_shared'
    GROUP BY e.user_id
),
behavioral_cohort AS (
    SELECT 
        fs.user_id,
        CASE 
            WHEN fsh.first_share_time IS NOT NULL 
             AND fsh.first_share_time <= fs.signup_time + INTERVAL '24 hours'
            THEN 'fast_sharer'
            ELSE 'standard_user'
        END AS cohort
    FROM first_signup fs
    LEFT JOIN first_share fsh ON fs.user_id = fsh.user_id
),
monthly_active AS (
    SELECT 
        bc.cohort,
        DATE_TRUNC('month', e.event_time) AS activity_month,
        COUNT(DISTINCT e.user_id) AS mau
    FROM events e
    JOIN behavioral_cohort bc ON e.user_id = bc.user_id
    WHERE e.event_type IN ('page_view', 'page_edit')
    GROUP BY bc.cohort, activity_month
)
SELECT 
    cohort,
    activity_month,
    mau,
    FIRST_VALUE(mau) OVER (PARTITION BY cohort ORDER BY activity_month) AS cohort_size,
    ROUND(100.0 * mau / FIRST_VALUE(mau) OVER (PARTITION BY cohort ORDER BY activity_month), 2) AS retention
FROM monthly_active
ORDER BY cohort, activity_month;

Insight

Cohort	Month 3 Retention	Viral Invites
fast_sharer	72%	2.1 per user
standard_user	31%	0.4 per user

Action: Add “Share with team” CTA on first page save.

Behavioral Cohort Framework (Reusable)

class BehavioralCohort:
    def __init__(self, events_df, behavior_event, time_window='24h'):
        self.events = events_df
        self.behavior = behavior_event
        self.window = time_window

    def define_cohort(self, first_event='signup'):
        first = (
            self.events[self.events['event_type'] == first_event]
            .groupby('user_id')['event_time'].min()
            .reset_index()
            .rename(columns={'event_time': 'first_time'})
        )

        behavior = (
            self.events[self.events['event_type'] == self.behavior]
            .groupby('user_id')['event_time'].min()
            .reset_index()
            .rename(columns={'event_time': 'behavior_time'})
        )

        merged = first.merge(behavior, on='user_id', how='left')
        merged['cohort'] = np.where(
            merged['behavior_time'] <= merged['first_time'] + pd.Timedelta(self.window),
            f'{self.behavior}_within_{self.window}',
            'control'
        )
        return merged[['user_id', 'cohort']]

    def retention_heatmap(self, activity_events=['page_view']):
        # ... same as before
        pass

Top 10 Behavioral Cohorts to Track

Product	Cohort	Metric
E-commerce	Added to cart in 1st session	Conversion rate
SaaS	Connected integration	Upgrade rate
Fintech	Linked bank account	Deposit volume
Health	Logged first workout	30-day consistency
Education	Submitted first assignment	Completion rate
Social	Sent first message	Network growth
Gaming	Won first match	Pay rate
News	Read 5+ articles	Subscription
Ride-share	Completed first ride	Frequency
Food	Ordered from 3+ restaurants	LTV

Interview Question (Solve in 10 Mins)

"An edtech app sees 80% drop-off after signup. Define a behavioral cohort based on 'quiz_started' within 1 hour of signup. Build a retention curve and recommend 1 onboarding change."

Answer Structure:
1. SQL: CTE for first quiz → cohort label
2. Python: Plot retention
3. Insight: Fast starters → 3x retention
4. Action: Auto-start quiz after signup

Project: Behavioral Cohort Dashboard

Repo: yourname/behavioral-cohorts

behavioral-cohorts/
├── sql/
│   ├── duolingo_streak.sql
│   ├── spotify_creator.sql
│   └── notion_sharer.sql
├── python/
│   ├── cohort_framework.py
│   └── dashboards/
│       ├── streak_retention.ipynb
│       └── creator_ltv.ipynb
├── data/
│   └── sample_events.csv
└── README.md

`README.md`

# Behavioral Cohort Analysis

## Cohorts Included
- **Duolingo**: 7-day streak starters
- **Spotify**: First playlist creators
- **Notion**: Fast sharers (within 24h)

## Key Finding
> Users who take a **high-intent action early** retain **3–5x better**

## Run
```bash
python python/cohort_framework.py --product duolingo

```

Tools & Best Practices

Tool	Use
Amplitude / Mixpanel	Event tracking
Snowflake / BigQuery	Scale
dbt models	Reusable cohorts
Looker / Mode	Dashboards
A/B test	Validate changes

Pro Tips

Define "meaningful" → tied to monetization or retention
Time-bound → within 1h, 24h, 7d
Combine behaviors → “searched + saved item”
Use in ML → fast_sharer = 1 as feature
Automate alerts → “7-day streak cohort down 10%”

Free Datasets to Practice

Dataset	Behavioral Event
Kaggle: Event Data	`addtocart`
Duolingo Shared Data	`lesson_completed`
Board Game Reviews	`rated_game`

Final Checklist

Task	Yes/No
Define cohort by first behavior	☐
SQL: time-window filter	☐
Python: retention curve	☐
Compare 2+ cohorts	☐
Recommend product change	☐

All Yes → You’re a Product Analytics Pro!

Next: A/B Testing & Causality

You can measure behavior → now change it.

Start Now:
1. Pick a dataset (e.g., Kaggle e-commerce)
2. Define cohort: “Added to cart in first session”
3. Build retention curve

Tag me when you push your behavioral cohort repo!
You now think like a Product Data Scientist.

Last updated: Nov 09, 2025

Behavioral Cohort Analysis: Real-World Examples

Advanced Retention & Product Analytics Goal: Go beyond acquisition → analyze what users do to predict churn, LTV, and growth.