Phase 2: Statistics & Math for Data Science

(Months 2–3 | 8 Weeks | 5–7 hrs/day)

Goal: Don’t just run models — understand them.
Master the math & stats behind ML, A/B tests, and causal inference.

Why?
- 90% of DS interviews test stats intuition
- Avoid p-hacking, overfitting, spurious correlations
- Explain "Why did the model predict X?"


Week-by-Week Roadmap

| Week | Focus | Hours |
|------|-------|-------|
| 1 | Descriptive Stats + Distributions | 30 |
| 2 | Probability & Bayes | 30 |
| 3 | Hypothesis Testing & p-values | 35 |
| 4 | Confidence Intervals & Power | 30 |
| 5 | A/B Testing Deep Dive | 35 |
| 6 | Correlation vs Causation | 30 |
| 7 | Linear Algebra for ML | 35 |
| 8 | Capstone: A/B Test Report | 25 |

Week 1: Descriptive Statistics & Distributions

Core Concepts

| Concept | Formula | Intuition |
|---------|---------|-----------|
| Mean | μ = Σx / n | Average |
| Median | Middle value | Robust to outliers |
| Variance | σ² = Σ(x − μ)² / n | Spread |
| Std Dev | σ = √σ² | Typical deviation |
| Skewness | ≈ 3(mean − median)/σ (Pearson) | Tail direction |
| Kurtosis | E[(x − μ)⁴] / σ⁴ | Outlier proneness |
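These map one-to-one onto NumPy calls; a quick sanity check on a toy sample (values chosen so the results come out exact):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mean = data.mean()                    # Σx / n
median = np.median(data)              # middle value, robust to outliers
variance = data.var()                 # Σ(x − μ)² / n (population form)
std = data.std()                      # √variance

print(mean, median, variance, std)   # 5.0 4.5 4.0 2.0
```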

Distributions

| Distribution | When | PMF/PDF |
|--------------|------|---------|
| Normal | Heights, errors | Bell curve |
| Binomial | Coin flips | P(k) = C(n,k) p^k (1-p)^(n-k) |
| Poisson | Events in time | P(k) = λ^k e^(-λ) / k! |
| Exponential | Time between events | f(x) = λ e^(-λx) |
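SciPy exposes each of these as a `stats` distribution object; a few spot checks against the formulas above:

```python
from scipy import stats

# Binomial: P(1 head in 2 fair-coin flips) = C(2,1)·0.5·0.5
print(stats.binom.pmf(k=1, n=2, p=0.5))   # 0.5

# Poisson: P(0 events) with rate λ = 2 is e^(-2)
print(stats.poisson.pmf(k=0, mu=2))       # ≈ 0.1353

# Exponential: density at x = 0 with λ = 1 (scale = 1/λ) is λ·e^0
print(stats.expon.pdf(0, scale=1.0))      # 1.0
```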

Practice

import numpy as np
import seaborn as sns

data = np.random.normal(100, 15, 1000)
sns.histplot(data, kde=True)
print(f"Mean: {data.mean():.1f}, Std: {data.std():.1f}")

Resources:
- StatQuest: Descriptive Stats
- Kaggle: Statistics Course


Week 2: Probability & Bayes’ Theorem

Key Rules

| Rule | Formula |
|------|---------|
| Addition | P(A∪B) = P(A) + P(B) - P(A∩B) |
| Multiplication | P(A∩B) = P(A)P(B\|A) |
| Complement | P(A') = 1 - P(A) |
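The rules are easy to verify on a single fair die (A = "even", B = "greater than 3"), using exact fractions:

```python
from fractions import Fraction

p_a = Fraction(3, 6)           # A = {2, 4, 6}
p_b = Fraction(3, 6)           # B = {4, 5, 6}
p_a_and_b = Fraction(2, 6)     # A∩B = {4, 6}

# Addition rule: matches counting A∪B = {2, 4, 5, 6} directly
p_a_or_b = p_a + p_b - p_a_and_b
print(p_a_or_b)                # 2/3

# Multiplication rule: P(A∩B) = P(A)·P(B|A)
p_b_given_a = p_a_and_b / p_a
print(p_a * p_b_given_a)       # 1/3
```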

Bayes’ Theorem

P(A|B) = [P(B|A) * P(A)] / P(B)

Example:

Spam filter:
- P(Spam) = 20%
- P("win" | Spam) = 80%
- P("win" | Ham) = 5%
→ P(Spam | "win") = ?

p_spam = 0.2
p_win_spam = 0.8
p_win_ham = 0.05
p_win = p_win_spam * p_spam + p_win_ham * (1 - p_spam)

p_spam_win = (p_win_spam * p_spam) / p_win
print(f"P(Spam|'win') = {p_spam_win:.1%}")
# → 80.0%  (0.16 / 0.20)

Resources:
- Khan Academy: Probability
- 3Blue1Brown: Bayes Video


Week 3: Hypothesis Testing & p-values

Framework

  1. Null (H₀): No effect
  2. Alternative (H₁): Effect exists
  3. Compute test statistic → p-value
  4. α = 0.05 → reject H₀ if p < 0.05

Common Tests

| Test | Use |
|------|-----|
| t-test | Compare means (σ unknown, small n) |
| z-test | Compare means (σ known or large n) |
| Chi-square | Categorical data |
| ANOVA | 3+ groups |

from scipy import stats

group_a = [25, 30, 28, 35]
group_b = [20, 22, 19, 25]
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"p-value: {p_val:.4f}")  # p ≈ 0.018 < 0.05 → reject H₀
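For categorical data, the analogous workflow uses the chi-square test. A hypothetical 2×2 click-through table (numbers are illustrative) with `scipy.stats.chi2_contingency`:

```python
from scipy.stats import chi2_contingency

# Rows: two page layouts; columns: clicked vs. did not click
table = [[30, 70],    # layout A
         [45, 55]]    # layout B

chi2, p_val, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_val:.4f}")
```

`expected` holds the cell counts you would see if layout and clicking were independent; the test measures how far the observed table is from that.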

Resources:
- StatQuest: p-values
- Book: Practical Statistics for Data Scientists (Ch 3–4)


Week 4: Confidence Intervals & Statistical Power

Confidence Interval (95%)

mean ± 1.96 * (σ / √n)

import numpy as np

data = np.random.normal(100, 15, 100)
se = data.std(ddof=1) / np.sqrt(len(data))
ci = (data.mean() - 1.96 * se, data.mean() + 1.96 * se)
print(f"95% CI: [{ci[0]:.1f}, {ci[1]:.1f}]")
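When the formula's assumptions feel shaky, the bootstrap gives a formula-free interval: resample the data with replacement many times and take percentiles of the resampled means. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(100, 15, 100)

# Resample with replacement, recompute the mean each time
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(5_000)]

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI: [{lo:.1f}, {hi:.1f}]")
```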

Power = 1 - β

- Probability of detecting an effect if it exists
- 80% power is the conventional target

Factors:
- Effect size ↑ → Power ↑
- Sample size ↑ → Power ↑
- α ↑ → Power ↑

Resources:
- StatQuest: Power
- G*Power (free software)


Week 5: A/B Testing Deep Dive

End-to-End Process

graph TD
    A[Define Metric] --> B[Random Split]
    B --> C[Run Test]
    C --> D[A/A Sanity Check]
    D --> E[t-test / z-test]
    E --> F[p < 0.05?]
    F -->|Yes| G[Winner]
    F -->|No| H[Inconclusive]

Practical Example

Goal: Does new checkout button increase conversion?

| Group | Users | Conversions | Rate |
|-------|-------|-------------|------|
| A (Control) | 10,000 | 420 | 4.20% |
| B (Variant) | 10,000 | 485 | 4.85% |

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count = np.array([485, 420])
nobs = np.array([10000, 10000])
z_stat, p_val = proportions_ztest(count, nobs)
print(f"p-value: {p_val:.4f}")  # p ≈ 0.027 < 0.05 → significant
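The z-test answers "is there an effect?"; a 95% Wald confidence interval for the difference in rates, computed by hand from the same table, shows how large the lift plausibly is:

```python
import numpy as np

# Observed rates from the table above
p_a, p_b, n = 420 / 10_000, 485 / 10_000, 10_000

diff = p_b - p_a
se = np.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"Δ = {diff:.2%}, 95% CI: [{ci[0]:.2%}, {ci[1]:.2%}]")
```

The interval excluding zero is the same conclusion the p-value gives, but in business units (percentage points of conversion).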

Resources:
- Google A/B Testing Course (free)
- Evan Miller’s Calculator (online)


Week 6: Correlation ≠ Causation

Common Pitfalls

| Example | Correlation | Causation? |
|---------|-------------|------------|
| Ice cream sales ↑ → Shark attacks ↑ | 0.9 | No (both caused by summer) |
| Storks → Babies | 0.8 | No (both in rural areas) |
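The pattern is easy to reproduce: simulate a confounder that drives two otherwise unrelated variables and watch the correlation appear. Coefficients below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(25, 5, 500)                      # the confounder
ice_cream = 10 * temperature + rng.normal(0, 20, 500)     # driven by heat
sharks = 0.5 * temperature + rng.normal(0, 1, 500)        # also driven by heat

r = np.corrcoef(ice_cream, sharks)[0, 1]
print(f"correlation: {r:.2f}")   # high, yet neither causes the other
```

Condition on the confounder (e.g. compare days with the same temperature) and the association largely disappears.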

Tools to Infer Causation

| Method | Use |
|--------|-----|
| RCT | Gold standard |
| Propensity Score Matching | Observational data |
| Difference-in-Differences | Policy changes |
| Instrumental Variables | Natural experiments |
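For difference-in-differences, the core estimate is just a double subtraction: the treated group's change minus the control group's change. A toy sketch with made-up numbers:

```python
# A policy hits the "treated" city between the two periods;
# the control city captures the trend both cities share.
control = {"pre": 10.0, "post": 12.0}
treated = {"pre": 11.0, "post": 16.0}

did = (treated["post"] - treated["pre"]) - (control["post"] - control["pre"])
print(did)  # 3.0: the effect net of the shared trend
```

The key assumption is parallel trends: absent the policy, the treated city would have moved like the control city.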

Resources:
- Causal Inference Book (free PDF)
- StatQuest: Correlation vs Causation


Week 7: Linear Algebra for ML

Why It Matters

| ML Concept | Linear Algebra |
|------------|----------------|
| Features | Vectors |
| Dataset | Matrix |
| Weights | Vector |
| Prediction | Dot product |
| PCA | Eigenvectors |

Key Operations

import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])
x = np.dot(A, b)                     # matrix-vector product
eigvals, eigvecs = np.linalg.eig(A)  # eigendecomposition (the basis of PCA)
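Tying the table and the operations together, PCA itself is only a few lines: center the data, eigendecompose its covariance matrix, and the eigenvectors are the principal directions. The data below is synthetic, for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])  # correlated features
X = X - X.mean(axis=0)                                              # center each column

cov = np.cov(X, rowvar=False)           # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eig(cov)   # eigenvectors = principal directions

order = np.argsort(eigvals)[::-1]       # sort by explained variance
pc1 = eigvecs[:, order[0]]
print("first principal direction:", pc1)
```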

Resources:
- 3Blue1Brown: Essence of Linear Algebra
- MIT 18.06 (free)


Week 8: Capstone – A/B Test Report

Deliverable: ab_test_report.pdf

# A/B Test: New Checkout Button

## Hypothesis
H₀: Conversion rate same  
H₁: Variant > Control

## Results
| Group | n | Conversions | Rate |
|-------|----|--------------|------|
| A     | 10,000 | 420 | 4.20% |
| B     | 10,000 | 485 | 4.85% |

- **Lift**: +15.5%  
- **p-value**: 0.027  
- **95% CI**: [0.1%, 1.2%]  
- **Power**: 84%  
→ **Reject H₀**

## Recommendation
Roll out new button → **+6,500 conversions/year**

GitHub Repo: yourname/ab-test-capstone


Daily Schedule

| Time | Task |
|------|------|
| 9–10 AM | Watch video (StatQuest / 3B1B) |
| 10–12 PM | Code + solve 10 problems |
| 1–3 PM | Read book chapter |
| 3–4 PM | Explain concept aloud |
| 4–5 PM | Apply to dataset |

Practice Problems (Solve 100+)

| Platform | Link |
|----------|------|
| StrataScratch | stratascratch.com |
| DataCamp | Stats Track |
| HackerRank | SQL + Stats |
| LeetCode | Medium SQL |

Assessment: Can You Explain?

| Question | Yes/No |
|----------|--------|
| Why is p < 0.05 not proof? | |
| Bayes: P(A\|B) vs P(B\|A) | |
| 95% CI interpretation | |
| t-test vs z-test | |
| Matrix multiplication in NN | |

All Yes → You passed Phase 2!


Free Resources Summary

| Topic | Link |
|-------|------|
| StatQuest | youtube.com/c/joshstarmer |
| 3Blue1Brown | youtube.com/c/3blue1brown |
| Khan Academy | khanacademy.org |
| Practical Stats Book | PDF |
| A/B Calculator | evanmiller.org/ab-testing |

Pro Tips

  1. Teach it → record yourself explaining p-values
  2. Use real data → analyze your own A/B test
  3. Build a cheat sheet → stats_cheat_sheet.pdf
  4. Interview prep → “Explain t-test in 2 mins”

Next: Phase 3 – Data Visualization

You understand the why → now show it.


Start Today:
1. Watch StatQuest: Mean, Variance, Std Dev
2. Open Jupyter:

import numpy as np
data = np.random.normal(100, 15, 1000)
print(f"Mean: {data.mean():.1f}, 95% in [{data.mean()-1.96*15/np.sqrt(1000):.1f}, {data.mean()+1.96*15/np.sqrt(1000):.1f}]")

Tag me when you finish your A/B report!
You now think like a Data Scientist.

Last updated: Nov 09, 2025
