Phase 2: Statistics & Math for Data Science
(Months 2–3 | 8 Weeks | 5–7 hrs/day)
Goal: Don’t just run models — understand them.
Master the math & stats behind ML, A/B tests, and causal inference.
Why?
- Most DS interviews test statistical intuition
- Avoid p-hacking, overfitting, spurious correlations
- Explain "Why did the model predict X?"
Week-by-Week Roadmap
| Week | Focus | Hours |
|---|---|---|
| 1 | Descriptive Stats + Distributions | 30 |
| 2 | Probability & Bayes | 30 |
| 3 | Hypothesis Testing & p-values | 35 |
| 4 | Confidence Intervals & Power | 30 |
| 5 | A/B Testing Deep Dive | 35 |
| 6 | Correlation vs Causation | 30 |
| 7 | Linear Algebra for ML | 35 |
| 8 | Capstone: A/B Test Report | 25 |
Week 1: Descriptive Statistics & Distributions
Core Concepts
| Concept | Formula | Intuition |
|---|---|---|
| Mean | μ = Σx / n | Average |
| Median | Middle value | Robust to outliers |
| Variance | σ² = Σ(x−μ)²/n | Spread |
| Std Dev | σ = √σ² | Typical deviation |
| Skewness | (mean − median)/σ | Tail direction |
| Kurtosis | Heavy tails? | Outlier proneness |
Distributions
| Distribution | When | PMF/PDF |
|---|---|---|
| Normal | Heights, errors | Bell curve |
| Binomial | Coin flips | P(k) = C(n,k)p^k(1-p)^(n-k) |
| Poisson | Events in time | P(k) = λ^k e^(-λ)/k! |
| Exponential | Time between events | f(x) = λe^(-λx) |
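A quick way to build intuition for these distributions is to sample each one and check that the empirical mean matches theory. The parameters below are arbitrary, chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Binomial: n=10 flips, p=0.5 → theoretical mean = n*p = 5
binom = rng.binomial(n=10, p=0.5, size=100_000)

# Poisson: λ=3 → theoretical mean = λ = 3
pois = rng.poisson(lam=3, size=100_000)

# Exponential: λ=2 → theoretical mean = 1/λ = 0.5 (NumPy takes scale = 1/λ)
expo = rng.exponential(scale=0.5, size=100_000)

print(f"Binomial mean:    {binom.mean():.2f} (theory: 5.00)")
print(f"Poisson mean:     {pois.mean():.2f} (theory: 3.00)")
print(f"Exponential mean: {expo.mean():.2f} (theory: 0.50)")
```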
Practice
```python
import numpy as np
import seaborn as sns

data = np.random.normal(100, 15, 1000)  # mean=100, std=15, n=1000
sns.histplot(data, kde=True)
print(f"Mean: {data.mean():.1f}, Std: {data.std():.1f}")
```
Resources:
- StatQuest: Descriptive Stats
- Kaggle: Statistics Course
Week 2: Probability & Bayes’ Theorem
Key Rules
| Rule | Formula |
|---|---|
| Addition | P(A∪B) = P(A) + P(B) - P(A∩B) |
| Multiplication | P(A∩B) = P(A)P(B\|A) |
| Complement | P(A') = 1 - P(A) |
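On a small finite sample space these rules can be verified exactly, not just simulated. A sketch with two events on a fair six-sided die (the events are arbitrary examples):

```python
from fractions import Fraction

# Sample space: one fair six-sided die
omega = set(range(1, 7))
A = {x for x in omega if x % 2 == 0}  # even: {2, 4, 6}
B = {x for x in omega if x > 3}       # high: {4, 5, 6}

def prob(event):
    return Fraction(len(event), 6)

# Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Complement rule: P(A') = 1 − P(A)
assert prob(omega - A) == 1 - prob(A)

print(prob(A | B))  # → 2/3
```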
Bayes’ Theorem
P(A|B) = [P(B|A) * P(A)] / P(B)
Example:
Spam filter:
- P(Spam) = 20%
- P("win" | Spam) = 80%
- P("win" | Ham) = 5%
→ P(Spam | "win") = ?
```python
p_spam = 0.2
p_win_spam = 0.8
p_win_ham = 0.05

# Law of total probability for the denominator
p_win = p_win_spam * p_spam + p_win_ham * (1 - p_spam)
p_spam_win = (p_win_spam * p_spam) / p_win
print(f"P(Spam|'win') = {p_spam_win:.1%}")
# → 80.0%  (0.16 / 0.20)
```
Resources:
- Khan Academy: Probability
- 3Blue1Brown: Bayes Video
Week 3: Hypothesis Testing & p-values
Framework
- Null (H₀): No effect
- Alternative (H₁): Effect exists
- Test Statistic → p-value
- α = 0.05 → reject H₀ if p < 0.05
Common Tests
| Test | Use |
|---|---|
| t-test | Compare means (small n) |
| z-test | Compare means (large n) |
| Chi-square | Categorical data |
| ANOVA | 3+ groups |
```python
from scipy import stats

group_a = [25, 30, 28, 35]
group_b = [20, 22, 19, 25]
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"p-value: {p_val:.4f}")  # p < 0.05 → reject H₀
```
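The table also lists the chi-square test for categorical data. A minimal sketch on made-up signup counts (the numbers are illustrative, not from any real dataset):

```python
from scipy.stats import chi2_contingency

# Contingency table (hypothetical data): conversions by device
#            converted  not converted
# mobile         30          170
# desktop        50          150
table = [[30, 170], [50, 150]]

chi2, p_val, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_val:.4f}")
```

A small p-value here would suggest conversion rate depends on device; `expected` holds the counts you'd see under independence.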
Resources:
- StatQuest: p-values
- Book: Practical Statistics for Data Scientists (Ch 3–4)
Week 4: Confidence Intervals & Statistical Power
Confidence Interval (95%)
mean ± 1.96 * (σ / √n)
```python
import numpy as np

data = np.random.normal(100, 15, 100)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))  # estimate SE from the sample
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(f"95% CI: [{ci[0]:.1f}, {ci[1]:.1f}]")
```
Power = 1 − β: the probability of detecting an effect if it truly exists.
80% power is the standard target.
Factors:
- Effect size ↑ → Power ↑
- Sample size ↑ → Power ↑
- α ↑ → Power ↑
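Those three factors come together in the most common power question: how many users per group do I need? A sketch with `statsmodels`, assuming a medium effect size (Cohen's d = 0.5) as an example:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the sample size that gives 80% power at α = 0.05
# for a two-sample t-test with a medium effect (d = 0.5)
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"n per group: {n_per_group:.0f}")  # ≈ 64
```

Halve the effect size and the required n roughly quadruples, which is why detecting small lifts is expensive.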
Resources:
- StatQuest: Power
- G*Power (free software)
Week 5: A/B Testing Deep Dive
End-to-End Process
```mermaid
graph TD
    A[Define Metric] --> B[Random Split]
    B --> C[A/A Sanity Check]
    C --> D[Run Test]
    D --> E[t-test / z-test]
    E --> F[p < 0.05?]
    F -->|Yes| G[Winner]
    F -->|No| H[Inconclusive]
```
Practical Example
Goal: Does new checkout button increase conversion?
| Group | Users | Conversions | Rate |
|---|---|---|---|
| A (Control) | 10,000 | 420 | 4.20% |
| B (Variant) | 10,000 | 485 | 4.85% |
```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

count = np.array([485, 420])
nobs = np.array([10000, 10000])
z_stat, p_val = proportions_ztest(count, nobs)
print(f"p-value: {p_val:.4f}")  # ≈ 0.027 → significant
```
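The z-test says whether the lift is real; a confidence interval says how big it plausibly is. A sketch using the Wald interval, a large-sample approximation:

```python
import numpy as np

# 95% Wald CI for the difference in conversion rates (B − A)
p_a, n_a = 420 / 10_000, 10_000
p_b, n_b = 485 / 10_000, 10_000

diff = p_b - p_a
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"Lift: {diff:.2%}, 95% CI: [{lo:.2%}, {hi:.2%}]")
```

The interval excludes zero, which agrees with the significant z-test above.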
Resources:
- Google A/B Testing Course (free)
- Evan Miller’s Calculator (online)
Week 6: Correlation ≠ Causation
Common Pitfalls
| Example | Correlation | Causation? |
|---|---|---|
| Ice cream sales ↑ → Shark attacks ↑ | 0.9 | No (both caused by summer) |
| Storks → Babies | 0.8 | No (both in rural areas) |
Tools to Infer Causation
| Method | Use |
|---|---|
| RCT | Gold standard |
| Propensity Score Matching | Observational |
| Difference-in-Differences | Policy changes |
| Instrumental Variables | Natural experiments |
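Of these, difference-in-differences is the easiest to see in one calculation. A toy sketch with made-up averages (illustrative numbers only):

```python
# Average outcome before/after a policy change,
# for a treated group and an untreated control group
treated_pre, treated_post = 10.0, 15.0
control_pre, control_post = 10.0, 12.0

# The control group's change estimates the trend that would
# have happened anyway; subtracting it isolates the effect.
did = (treated_post - treated_pre) - (control_post - control_pre)
print(f"DiD estimate of the causal effect: {did:.1f}")  # → 3.0
```

A naive before/after comparison on the treated group alone would claim an effect of 5.0, overstating it by the shared trend.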
Resources:
- Causal Inference Book (free PDF)
- StatQuest: Correlation vs Causation
Week 7: Linear Algebra for ML
Why It Matters
| ML Concept | Linear Algebra |
|---|---|
| Features | Vectors |
| Dataset | Matrix |
| Weights | Vector |
| Prediction | Dot product |
| PCA | Eigenvectors |
Key Operations
```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])
x = A @ b                            # matrix-vector product → [17, 39]
eigvals, eigvecs = np.linalg.eig(A)  # eigendecomposition (PCA applies this to the covariance matrix)
```
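The "Prediction = dot product" row from the table, made concrete: a linear model scores every sample in one matrix-vector product (the numbers are arbitrary examples):

```python
import numpy as np

# X: 3 samples × 2 features; w: one weight per feature
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])

# Each prediction is the dot product of one row of X with w
y_hat = X @ w
print(y_hat)  # → [-1.5 -2.5 -3.5]
```

Neural networks repeat exactly this operation layer after layer, which is why matrix multiplication dominates their compute.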
Resources:
- 3Blue1Brown: Essence of Linear Algebra
- MIT 18.06 (free)
Week 8: Capstone – A/B Test Report
Deliverable: ab_test_report.pdf
```markdown
# A/B Test: New Checkout Button

## Hypothesis
H₀: Conversion rates are equal
H₁: Variant > Control

## Results
| Group | n | Conversions | Rate |
|-------|--------|-------------|-------|
| A | 10,000 | 420 | 4.20% |
| B | 10,000 | 485 | 4.85% |

- **Lift**: +15.5% (relative)
- **p-value**: 0.027
- **95% CI**: [0.1%, 1.2%] (absolute lift)
- **Power**: 84%

→ **Reject H₀**

## Recommendation
Roll out new button → **+6,500 conversions/year**
```
GitHub Repo: yourname/ab-test-capstone
Daily Schedule
| Time | Task |
|---|---|
| 9–10 AM | Watch video (StatQuest / 3B1B) |
| 10 AM–12 PM | Code + solve 10 problems |
| 1–3 PM | Read book chapter |
| 3–4 PM | Explain concept aloud |
| 4–5 PM | Apply to dataset |
Practice Problems (Solve 100+)
| Platform | Link |
|---|---|
| StrataScratch | stratascratch.com |
| DataCamp | Stats Track |
| HackerRank | SQL + Stats |
| LeetCode | Medium SQL |
Assessment: Can You Explain?
| Question | Yes/No |
|---|---|
| Why is p < 0.05 not proof? | ☐ |
| Bayes: P(A|B) vs P(B|A) | ☐ |
| 95% CI interpretation | ☐ |
| t-test vs z-test | ☐ |
| Matrix multiplication in NN | ☐ |
All Yes → You passed Phase 2!
Free Resources Summary
| Topic | Link |
|---|---|
| StatQuest | youtube.com/c/joshstarmer |
| 3Blue1Brown | youtube.com/c/3blue1brown |
| Khan Academy | khanacademy.org |
| Practical Stats Book | |
| A/B Calculator | evanmiller.org/ab-testing |
Pro Tips
- Teach it → record yourself explaining p-values
- Use real data → analyze your own A/B test
- Build a cheat sheet → stats_cheat_sheet.pdf
- Interview prep → “Explain a t-test in 2 minutes”
Next: Phase 3 – Data Visualization
You understand the why → now show it.
Start Today:
1. Watch StatQuest: Mean, Variance, Std Dev
2. Open Jupyter:
```python
import numpy as np

data = np.random.normal(100, 15, 1000)
se = 15 / np.sqrt(1000)  # known σ=15 for this demo
print(f"Mean: {data.mean():.1f}, 95% CI for the mean: [{data.mean()-1.96*se:.1f}, {data.mean()+1.96*se:.1f}]")
```
Tag me when you finish your A/B report!
You now think like a Data Scientist.