Machine Learning Core
Goal: Build & Evaluate Models Like a Pro
Focus: Scikit-learn + Real Projects
Why?
- The large majority of data science roles require hands-on model building & evaluation
- Strong, properly validated metrics (e.g., AUC > 0.9 on a standard benchmark) make your portfolio stand out
- A Kaggle Top 20% finish is concrete, verifiable evidence of applied skill in interviews
Week-by-Week Roadmap
| Week | Focus | Hours |
|---|---|---|
| 1–2 | Regression (Linear + Logistic) | 60 |
| 3–4 | Classification (Trees, SVM, KNN) | 60 |
| 5–6 | Model Evaluation & Cross-Validation | 60 |
| 7–8 | Ensemble Methods (RF, XGBoost) | 60 |
| 9–10 | Hyperparameter Tuning & Pipelines | 60 |
| 11–12 | Capstone: 2 Kaggle Competitions | 80 |
Tools Setup (Day 1)
pip install scikit-learn pandas numpy matplotlib seaborn xgboost optuna kaggle
# config.py
import os
os.environ['KAGGLE_USERNAME'] = 'yourname'
os.environ['KAGGLE_KEY'] = 'yourkey'
Week 1–2: Regression Deep Dive
1. Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error, r2_score  # needs scikit-learn >= 1.4; on older versions use mean_squared_error(..., squared=False)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"RMSE: {root_mean_squared_error(y_test, y_pred):.2f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
Project: House Prices
Goal: RMSE < 25,000 → Top 20%
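The snippet above assumes X_train, X_test, y_train, y_test already exist. A minimal sketch of how you might build them from the competition's train.csv (the file path and feature choice are assumptions; the column names come from the Kaggle House Prices data dictionary):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle House Prices training data (path is an assumption)
df = pd.read_csv('train.csv')

# Start with a few numeric features; expand this list as you engineer more
features = ['GrLivArea', 'OverallQual', 'YearBuilt']
X = df[features]
y = df['SalePrice']  # raw dollars here; note the Kaggle leaderboard itself scores RMSE on log(SalePrice)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)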
2. Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
from sklearn.metrics import roc_auc_score
print(f"AUC: {roc_auc_score(y_test, y_prob):.4f}")
Project: Titanic
Goal: AUC > 0.85 → Top 10%
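Likewise, a minimal sketch of preparing X_train/y_train for the logistic regression above from the Titanic train.csv (column names are from the Kaggle dataset; the feature choice and imputation strategy are just illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')  # Kaggle Titanic training file (path is an assumption)

# Simple baseline features: impute missing ages, one-hot encode the categoricals
df['Age'] = df['Age'].fillna(df['Age'].median())
X = pd.get_dummies(df[['Pclass', 'Sex', 'Age', 'Fare', 'Embarked']], drop_first=True)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)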
Resources:
- Andrew Ng ML Course (Weeks 1–3) – coursera.org
- Hands-On ML – Ch 2–4
Week 3–4: Classification Algorithms
| Algorithm | Use Case | Code |
|---|---|---|
| Decision Tree | Interpretable | DecisionTreeClassifier(max_depth=5) |
| Random Forest | Robust | RandomForestClassifier(n_estimators=100) |
| SVM | Small, clean data | SVC(kernel='rbf', probability=True) |
| KNN | Simple baseline | KNeighborsClassifier(n_neighbors=5) |
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
models = {
    'RF': RandomForestClassifier(n_estimators=200),
    'SVM': SVC(probability=True),
    'KNN': KNeighborsClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name} AUC: {auc:.4f}")
Project: Customer Churn
Goal: F1 > 0.65
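Churn data is usually imbalanced, which is why the goal is F1 rather than accuracy. A minimal sketch of checking it, assuming `model` is one of the classifiers trained above on the churn data (the 0.3 threshold is only an illustration; tune it on a validation set):

from sklearn.metrics import f1_score, classification_report

# F1 at the default 0.5 decision threshold
y_pred = model.predict(X_test)
print(f"F1 @ 0.5: {f1_score(y_test, y_pred):.3f}")

# Lowering the threshold trades precision for recall, which often lifts F1 on imbalanced churn data
y_prob = model.predict_proba(X_test)[:, 1]
y_pred_low = (y_prob >= 0.3).astype(int)
print(f"F1 @ 0.3: {f1_score(y_test, y_pred_low):.3f}")
print(classification_report(y_test, y_pred_low))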
Resources:
- Hands-On ML – Ch 5–6
- Kaggle Intermediate ML – kaggle.com/learn/intermediate-machine-learning
Week 5–6: Model Evaluation Masterclass
Confusion Matrix
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True')
plt.xlabel('Predicted')
plt.show()
Key Metrics
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP+TN)/(Total) | Balanced |
| Precision | TP/(TP+FP) | Minimize false positives |
| Recall | TP/(TP+FN) | Catch all positives |
| F1 | 2×(P×R)/(P+R) | Imbalanced |
| AUC-ROC | Area under ROC | Ranking quality |
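All of these are available in sklearn.metrics, so you can sanity-check the formulas against the confusion matrix above. A quick sketch, assuming y_pred holds predicted labels and y_prob predicted probabilities from the earlier models:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1:        {f1_score(y_test, y_pred):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_test, y_prob):.3f}")  # needs predicted probabilities, not hard labels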
Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
StatQuest Videos:
- ROC & AUC
- Precision & Recall
Week 7–8: Ensemble Power (RF + XGBoost)
import xgboost as xgb
model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='auc',
    early_stopping_rounds=50   # XGBoost >= 2.0 expects this in the constructor, not in fit()
)
# Ideally use a separate validation split for eval_set rather than the test set, to avoid leaking it into training
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          verbose=False)
Project: Porto Seguro
Goal: Gini > 0.28 → Top 5%
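The competition's normalized Gini is just a rescaled AUC (Gini = 2·AUC − 1), so you can track it from the same probability outputs. A minimal sketch:

from sklearn.metrics import roc_auc_score

y_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
gini = 2 * auc - 1  # normalized Gini, the metric used on the Porto Seguro leaderboard
print(f"AUC: {auc:.4f} | Gini: {gini:.4f}")  # Gini > 0.28 corresponds to AUC > 0.64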
Week 9–10: Pipelines & Hyperparameter Tuning
Scikit-learn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'fare']
categorical_features = ['sex', 'embarked']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)  # ignore categories unseen at predict time
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', xgb.XGBClassifier())
])
Hyperparameter Tuning
# Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {'model__max_depth': [3, 5, 7]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid.fit(X, y)
# Optuna (Faster!)
import optuna

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3)
    }
    model = xgb.XGBClassifier(**params)
    return cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)
Week 11–12: Capstone – Kaggle Top 20% in 2 Comps
Project 1: House Prices
- Feature engineering: TotalSF, Age, HasPool (see the sketch below)
- Model: XGBoost + Optuna
- Target: RMSE < 0.12 (log scale)
Project 2: Santander Customer Transaction
- Anonymized features → PCA + XGBoost (see the sketch below)
- Target: AUC > 0.90
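A minimal sketch of the PCA → XGBoost idea for the anonymized features (the 0.95 variance threshold and the scale_pos_weight value are assumptions to tune, not a fixed recipe):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
import xgboost as xgb

pca_xgb = Pipeline([
    ('scale', StandardScaler()),          # PCA is scale-sensitive, so standardize first
    ('pca', PCA(n_components=0.95)),      # keep enough components to explain ~95% of variance
    ('model', xgb.XGBClassifier(
        n_estimators=500,
        learning_rate=0.05,
        scale_pos_weight=9,               # rough negative/positive ratio for the imbalanced target; tune this
        eval_metric='auc'
    ))
])
pca_xgb.fit(X_train, y_train)
print(f"AUC: {roc_auc_score(y_test, pca_xgb.predict_proba(X_test)[:, 1]):.4f}")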
Deliverables (GitHub: yourname/ml-core-capstone)
ml-core-capstone/
├── house_prices/
│ ├── notebook.ipynb
│ ├── submission.csv (RMSE: 0.118)
│ └── model.pkl
├── santander/
│ ├── notebook.ipynb
│ └── submission.csv (AUC: 0.902)
└── README.md
README.md (Hiring Manager Magnet)
# ML Core Capstone: Kaggle Top 20%
## House Prices (RMSE: 0.118 – Top 18%)
- Feature eng: TotalSF, Age, Neighborhood encoding
- XGBoost + Optuna (50 trials)
- Cross-validation: 5-fold
## Santander (AUC: 0.902 – Top 15%)
- PCA on 200 anon features
- Early stopping + class weights
**Tech**: Scikit-learn, XGBoost, Optuna, Pandas
**Live**: [kaggle.com/yourname](https://www.kaggle.com/yourname)
Interview Prep: Can You Answer?
| Question | Your Answer |
|---|---|
| "Explain overfitting" | High train acc, low test → use CV |
| "AUC vs Accuracy" | AUC robust to imbalance |
| "Why XGBoost?" | Gradient boosting + regularization |
| "Pipeline benefits" | Reproducible, prevents leakage |
| "Optuna vs GridSearch" | Bayesian, faster convergence |
Assessment: Can You Do This?
| Task | Yes/No |
|---|---|
| Build end-to-end pipeline | ☐ |
| Achieve AUC > 0.85 on Titanic | ☐ |
| Tune XGBoost with Optuna | ☐ |
| Explain confusion matrix | ☐ |
| Submit Kaggle (Top 20%) | ☐ |
All Yes → You passed Phase 4!
Free Resources Summary
| Resource | Link |
|---|---|
| Andrew Ng ML | coursera.org/learn/machine-learning |
| Hands-On ML Book (notebooks) | github.com/ageron/handson-ml3 |
| Kaggle Learn | kaggle.com/learn |
| StatQuest | youtube.com/c/joshstarmer |
| Optuna Docs | optuna.org |
Pro Tips
- Always use pipelines → no data leakage
- Log everything → MLflow (next phase)
- Submit early, submit often → Kaggle leaderboard
- Write blogs → "How I got Top 20% with XGBoost"
Next: Phase 5 – Advanced ML & MLOps
You can build models → now deploy them.
Start Now:
kaggle competitions download -c house-prices-advanced-regression-techniques
unzip house-prices-advanced-regression-techniques.zip
Tag me when you hit Kaggle Top 20%!
You’re now a real Machine Learning engineer.