Machine Learning Core

Goal: Build & Evaluate Models Like a Pro

Focus: Scikit-learn + Real Projects

Why?
- The vast majority of DS roles require building & evaluating models
- Strong, well-evaluated results (e.g., AUC > 0.9 on a known benchmark) make a portfolio stand out
- A Kaggle Top 20% finish is credible evidence in interviews


Week-by-Week Roadmap

| Week | Focus | Hours |
|------|-------|-------|
| 1–2 | Regression (Linear + Logistic) | 60 |
| 3–4 | Classification (Trees, SVM, KNN) | 60 |
| 5–6 | Model Evaluation & Cross-Validation | 60 |
| 7–8 | Ensemble Methods (RF, XGBoost) | 60 |
| 9–10 | Hyperparameter Tuning & Pipelines | 60 |
| 11–12 | Capstone: 2 Kaggle Competitions | 80 |

Tools Setup (Day 1)

pip install scikit-learn pandas numpy matplotlib seaborn xgboost optuna kaggle

# config.py – set Kaggle API credentials
import os
os.environ['KAGGLE_USERNAME'] = 'yourname'
os.environ['KAGGLE_KEY'] = 'yourkey'
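
The snippets below assume X_train, X_test, y_train, y_test already exist. A minimal setup sketch (the file name and target column are placeholders for whatever dataset you're working on):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')              # placeholder path
X = df.drop(columns=['target'])            # 'target' is a placeholder column name
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)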

Week 1–2: Regression Deep Dive

1. Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False):.2f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")

Project: House Prices

Goal: RMSE < 25,000 → Top 20%


2. Logistic Regression

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

from sklearn.metrics import roc_auc_score
print(f"AUC: {roc_auc_score(y_test, y_prob):.4f}")

Project: Titanic

Goal: AUC > 0.85 → Top 10%

Resources:
- Andrew Ng ML Course (Weeks 1–3) – coursera.org
- Hands-On ML – Ch 2–4


Week 3–4: Classification Algorithms

| Algorithm | Use Case | Code |
|-----------|----------|------|
| Decision Tree | Interpretable | DecisionTreeClassifier(max_depth=5) |
| Random Forest | Robust | RandomForestClassifier(n_estimators=100) |
| SVM | Small, clean data | SVC(kernel='rbf', probability=True) |
| KNN | Simple baseline | KNeighborsClassifier(n_neighbors=5) |

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

models = {
    'RF': RandomForestClassifier(n_estimators=200),
    'SVM': SVC(probability=True),
    'KNN': KNeighborsClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name} AUC: {auc:.4f}")

Project: Customer Churn

Goal: F1 > 0.65

Resources:
- Hands-On ML – Ch 5–6
- Kaggle Intermediate ML – kaggle.com/learn/intermediate-machine-learning


Week 5–6: Model Evaluation Masterclass

Confusion Matrix

from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True')
plt.xlabel('Predicted')
plt.show()
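
classification_report (imported above) prints precision, recall, and F1 per class in one call – a quick summary before diving into the metric table below:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))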

Key Metrics

| Metric | Formula | When to Use |
|--------|---------|-------------|
| Accuracy | (TP+TN)/Total | Balanced classes |
| Precision | TP/(TP+FP) | Minimize false positives |
| Recall | TP/(TP+FN) | Catch all positives |
| F1 | 2×(P×R)/(P+R) | Imbalanced classes |
| AUC-ROC | Area under ROC curve | Ranking quality |
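
Each row of the table maps to a scikit-learn helper; a quick sketch using the y_pred and y_prob arrays from the earlier snippets:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1:        {f1_score(y_test, y_pred):.3f}")
print(f"AUC-ROC:   {roc_auc_score(y_test, y_prob):.3f}")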

Cross-Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")

StatQuest Videos:
- ROC & AUC
- Precision & Recall


Week 7–8: Ensemble Power (RF + XGBoost)

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='auc',
    early_stopping_rounds=50  # in XGBoost 2.x this goes in the constructor, not fit()
)
model.fit(X_train, y_train,
          eval_set=[(X_test, y_test)],
          verbose=False)
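
A quick check of what early stopping found – best_iteration is the boosting round that scored best on the eval set, and predict_proba should use it automatically:

from sklearn.metrics import roc_auc_score

y_prob = model.predict_proba(X_test)[:, 1]
print(f"Best iteration: {model.best_iteration}")
print(f"Test AUC: {roc_auc_score(y_test, y_prob):.4f}")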

Project: Porto Seguro

Goal: Gini > 0.28 → Top 5%


Week 9–10: Pipelines & Hyperparameter Tuning

Scikit-learn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_features = ['age', 'fare']
categorical_features = ['sex', 'embarked']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', xgb.XGBClassifier())
])
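
The whole pipeline cross-validates as a single estimator, so scaling and encoding are re-fit inside every fold and never see the validation rows – that's the leakage protection mentioned in the Pro Tips:

from sklearn.model_selection import cross_val_score

# X is the raw DataFrame with the columns listed above; preprocessing happens per fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"Pipeline CV AUC: {scores.mean():.4f}")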

Hyperparameter Tuning

# Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {'model__max_depth': [3, 5, 7]}
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid.fit(X, y)

# Optuna (Faster!)
import optuna
def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3)
    }
    model = xgb.XGBClassifier(**params)
    return cross_val_score(model, X, y, cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
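
When the study finishes, the best parameters feed straight back into the classifier (this works because the suggest names above match XGBoost's argument names):

print(study.best_params)
best_model = xgb.XGBClassifier(**study.best_params)
best_model.fit(X_train, y_train)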

Week 11–12: Capstone – Kaggle Top 20% in 2 Comps

Project 1: House Prices

  • Feature Engineering: TotalSF, Age, HasPool (see the sketch below)
  • Model: XGBoost + Optuna
  • Target: RMSE < 0.12 (log scale)
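
A minimal sketch of that feature engineering plus the log-scale target, assuming the standard competition column names (TotalBsmtSF, 1stFlrSF, 2ndFlrSF, YrSold, YearBuilt, PoolArea, SalePrice):

import numpy as np
import pandas as pd

train = pd.read_csv('house_prices/train.csv')   # path is an assumption

# Engineered features from the bullets above
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
train['Age'] = train['YrSold'] - train['YearBuilt']
train['HasPool'] = (train['PoolArea'] > 0).astype(int)

# The competition scores RMSE on the log of SalePrice, so train against the log target
y = np.log1p(train['SalePrice'])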

Project 2: Santander Customer Transaction

  • Anonymized features → PCA + XGBoost (see the sketch below)
  • Target: AUC > 0.90
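
One way to wire PCA into the same pipeline pattern from Week 9–10 (the component count is a starting guess to tune, not a recommendation):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import xgboost as xgb

santander_pipeline = Pipeline([
    ('scale', StandardScaler()),      # PCA is sensitive to feature scale
    ('pca', PCA(n_components=50)),    # component count is an assumption – tune it
    ('model', xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, eval_metric='auc'))
])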

Deliverables (GitHub: yourname/ml-core-capstone)

ml-core-capstone/
├── house_prices/
│   ├── notebook.ipynb
│   ├── submission.csv (RMSE: 0.118)
│   └── model.pkl
├── santander/
│   ├── notebook.ipynb
│   └── submission.csv (AUC: 0.902)
└── README.md

README.md (Hiring Manager Magnet)

# ML Core Capstone: Kaggle Top 20%

## House Prices (RMSE: 0.118 – Top 18%)
- Feature eng: TotalSF, Age, Neighborhood encoding
- XGBoost + Optuna (50 trials)
- Cross-validation: 5-fold

## Santander (AUC: 0.902 – Top 15%)
- PCA on 200 anon features
- Early stopping + class weights

**Tech**: Scikit-learn, XGBoost, Optuna, Pandas  
**Live**: [kaggle.com/yourname](https://www.kaggle.com/yourname)

Interview Prep: Can You Answer?

| Question | Your Answer |
|----------|-------------|
| "Explain overfitting" | High train accuracy, low test accuracy → use CV |
| "AUC vs Accuracy" | AUC is robust to class imbalance |
| "Why XGBoost?" | Gradient boosting + regularization |
| "Pipeline benefits" | Reproducible, prevents leakage |
| "Optuna vs GridSearch" | Bayesian search, faster convergence |

Assessment: Can You Do This?

| Task | Yes/No |
|------|--------|
| Build an end-to-end pipeline | |
| Achieve AUC > 0.85 on Titanic | |
| Tune XGBoost with Optuna | |
| Explain a confusion matrix | |
| Submit to Kaggle (Top 20%) | |

All Yes → You passed Phase 4!


Free Resources Summary

| Resource | Link |
|----------|------|
| Andrew Ng ML | coursera.org/learn/machine-learning |
| Hands-On ML | Book GitHub |
| Kaggle Learn | kaggle.com/learn |
| StatQuest | youtube.com/c/joshstarmer |
| Optuna Docs | optuna.org |

Pro Tips

  1. Always use pipelines → no data leakage
  2. Log everything → MLflow (next phase)
  3. Submit early, submit often → Kaggle leaderboard
  4. Write blogs → "How I got Top 20% with XGBoost"

Next: Phase 5 – Advanced ML & MLOps

You can build models → now deploy them.


Start Now:

kaggle competitions download -c house-prices-advanced-regression-techniques
unzip house-prices-advanced-regression-techniques.zip

Tag me when you hit Kaggle Top 20%!
You’re now a real Machine Learning engineer.

Last updated: Nov 12, 2025
