Phase 5: Advanced ML & MLOps
Goal: Production-Ready Models
Why?
- 80% of ML projects fail in production — master MLOps to join the top 20%
- $150K+ salaries for roles like "MLOps Engineer"
- 2025 Trends: AutoML pipelines, federated learning, edge deployment
Week-by-Week Roadmap
| Week | Focus | Hours |
|---|---|---|
| 1–2 | XGBoost / LightGBM Mastery | 60 |
| 3–4 | Feature Engineering & NLP Basics | 60 |
| 5–6 | Time Series Forecasting | 60 |
| 7–8 | Docker & Containerization | 60 |
| 9–10 | MLflow / DVC + FastAPI Deployment | 60 |
| 11–12 | Capstone: End-to-End Fraud System | 80 |
Tools Setup (Day 1)
pip install xgboost lightgbm feature-engine transformers datasets scikit-learn pandas numpy matplotlib seaborn optuna mlflow dvc fastapi uvicorn
# Docker itself is not a pip package — install Docker Desktop / Docker Engine separately
# config.py
import os
os.environ['MLFLOW_TRACKING_URI'] = 'http://localhost:5000'  # point MLflow clients at the local tracking server started later with `mlflow ui`
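The Dockerfile in Week 7–8 copies a requirements.txt into the image, so capture the versions you actually tested with:
pip freeze > requirements.txt   # snapshot exact, tested versions for the Week 7–8 Dockerfile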
Week 1–2: XGBoost / LightGBM – Kaggle Competition Level
XGBoost vs LightGBM (2025 Comparison)
| Aspect | XGBoost | LightGBM |
|---|---|---|
| Speed | Fast, but slower on very large data | 2–10x faster (leaf-wise growth) |
| Memory | Higher on large datasets | Lower (histogram-based) |
| Accuracy | Excellent, robust | Often better on tabular data |
| GPU Support | Native CUDA (`device='cuda'`) | GPU builds (OpenCL / CUDA) |
XGBoost Example
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# X, y: your feature matrix and binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    tree_method='hist',          # histogram algorithm — the modern default
    device='cuda',               # GPU; use 'cpu' if no CUDA device is available
    eval_metric='auc',
    early_stopping_rounds=50     # constructor argument; the fit-time argument was removed in XGBoost 2.x
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.4f}")
LightGBM Example (Faster!)
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
valid_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'device': 'gpu'  # requires a GPU-enabled LightGBM build; drop for CPU
}
model = lgb.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[valid_data],
    callbacks=[lgb.early_stopping(50)]  # stop when validation AUC stops improving
)
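Optuna is already in the Day-1 install list; here is a minimal sketch of tuning a few LightGBM parameters with it, reusing train_data, valid_data, X_test and y_test from the example above (the search ranges and 50-trial budget are illustrative, not tuned recommendations):
import optuna
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

def objective(trial):
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'num_leaves': trial.suggest_int('num_leaves', 16, 256),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.6, 1.0),
    }
    booster = lgb.train(
        params, train_data, num_boost_round=500,
        valid_sets=[valid_data], callbacks=[lgb.early_stopping(50)]
    )
    preds = booster.predict(X_test, num_iteration=booster.best_iteration)
    return roc_auc_score(y_test, preds)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)   # feed the winners back into the full 1000-round training run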
Project: Kaggle: Porto Seguro
Goal: Gini > 0.30 with LightGBM GPU → Top 5%
Resources:
- Machine Learning Mastery: Gradient Boosting Tutorial
- Kaggle Kernels: LightGBM vs XGBoost
- GPU Guide: 10x Speed Tutorial
Week 3–4: Feature Engineering & NLP Basics
Feature Engineering with Feature-Engine
from feature_engine.imputation import MeanMedianImputer
from feature_engine.encoding import OneHotEncoder, RareLabelEncoder
from feature_engine.creation import MathFeatures  # called MathematicalCombination in older feature-engine releases
from sklearn.pipeline import Pipeline

# Pipeline — column names 'cat_var', 'num1', 'num2' are placeholders for your dataset
pipe = Pipeline([
    ('imputer', MeanMedianImputer(imputation_method='median')),
    ('rare', RareLabelEncoder(tol=0.05, n_categories=5)),
    ('ohe', OneHotEncoder(top_categories=5, variables=['cat_var'])),
    ('combo', MathFeatures(variables=['num1', 'num2'], func=['sum']))
])
X_transformed = pipe.fit_transform(X)
Project: Titanic + Feature-Engine → AUC > 0.90
NLP Basics with Hugging Face
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:1000]")  # small subset for the fine-tuning project below

# Quick demo with an already fine-tuned checkpoint — the bare "distilbert-base-uncased"
# classification head is randomly initialised until you fine-tune it
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
results = classifier("This movie is amazing!")
print(results)  # [{'label': 'POSITIVE', 'score': 0.999...}]

# Starting point for fine-tuning on your own data:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
Project: Sentiment Analysis on Tweets → Fine-tune DistilBERT
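For the fine-tuning project, a minimal sketch using the Trainer API on the IMDB subset and base model loaded above (epochs, batch size and output_dir are illustrative):
from transformers import Trainer, TrainingArguments

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

# shuffle first — the raw IMDB train split is ordered by label
tokenized = dataset.shuffle(seed=42).map(tokenize, batched=True)
split = tokenized.train_test_split(test_size=0.2)

args = TrainingArguments(
    output_dir="distilbert-sentiment",   # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,                         # the base DistilBERT loaded above
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()
print(trainer.evaluate())                # eval loss on the held-out split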
Resources:
- Feature-Engine GitHub: Examples Repo
- Hugging Face LLM Course: Free 2025 Update (Now covers LLMs + NLP foundations)
Week 5–6: Time Series Forecasting
Store Item Demand Kaggle
import pandas as pd
from prophet import Prophet
from sklearn.metrics import mean_squared_error

df = pd.read_csv('train.csv')  # Kaggle "Store Item Demand" dataset
df['date'] = pd.to_datetime(df['date'])

# Prophet models one series at a time and expects columns named 'ds' and 'y'
series = (df[(df['store'] == 1) & (df['item'] == 1)]
          .rename(columns={'date': 'ds', 'sales': 'y'})[['ds', 'y']])

model = Prophet(daily_seasonality=True)
model.fit(series)
future = model.make_future_dataframe(periods=90)  # history + 90 days ahead
forecast = model.predict(future)

# In-sample fit quality; for a real score, hold out the last 90 days instead
rmse = mean_squared_error(series['y'], forecast['yhat'].iloc[:len(series)]) ** 0.5
print(f"RMSE: {rmse:.2f}")
Advanced: XGBoost for Multi-Series
import numpy as np
from sktime.forecasting.compose import make_reduction
from xgboost import XGBRegressor

# y_train: a pandas Series for one store-item, indexed by date
forecaster = make_reduction(XGBRegressor(), window_length=90, strategy="recursive")
forecaster.fit(y_train)
y_pred = forecaster.predict(fh=np.arange(1, 91))  # forecast the next 90 steps
Project: Store Item Demand → competitive SMAPE (the competition's metric)
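The competition scores submissions with SMAPE; a small helper to sanity-check forecasts locally, reusing series and forecast from the Prophet example (this is the standard SMAPE formula, not Kaggle's exact scoring code):
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    diff = np.abs(y_true - y_pred) / np.where(denom == 0, 1, denom)  # treat 0/0 as 0
    return 100 * np.mean(np.where(denom == 0, 0, diff))

print(f"SMAPE: {smape(series['y'], forecast['yhat'].iloc[:len(series)]):.2f}%")  # in-sample, as above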
Resources:
- Kaggle Kernels: Time Series Tutorial
- GitHub Repo: Full Solution
Week 7–8: Docker for Data Science
Dockerfile for ML Project
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Build & Run
docker build -t ml-app .
docker run -p 8000:8000 ml-app
Multi-Container with Docker Compose:
# docker-compose.yml
services:
  app:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - db
  db:
    image: postgres:13
    environment:
      POSTGRES_DB: ml_db
      POSTGRES_PASSWORD: change_me   # the postgres image refuses to start without a password
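Bring both services up with the file above:
docker compose up --build -d   # build the app image and start app + db in the background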
Resources:
- YouTube Tutorial: Krish Naik: Complete Docker for DS
- Towards DS Guide: Docker Basics
Week 9–10: MLflow / DVC + FastAPI Deployment
MLflow for Experiment Tracking
import mlflow
import mlflow.xgboost

with mlflow.start_run():
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("auc", auc)
    mlflow.xgboost.log_model(model, "model")

Then, in a terminal (not Python), start the tracking UI:
mlflow ui   # serves http://localhost:5000
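The FastAPI example below loads the model from the MLflow Model Registry, so register it when logging — a minimal sketch; the registry name "xgboost_model" is an assumption and must match whatever URI the serving code uses:
with mlflow.start_run():
    mlflow.xgboost.log_model(
        model,
        "model",
        registered_model_name="xgboost_model",  # creates/updates this entry in the Model Registry
    )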
DVC for Data/Model Versioning
dvc init
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"
dvc remote add -d storage s3://my-bucket/dvc-store   # example remote: S3, GCS, SSH, ...
dvc push    # upload data to the DVC remote (Git only stores the small .dvc pointer)
dvc repro   # re-run the pipeline defined in dvc.yaml
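dvc repro needs a pipeline definition; a minimal dvc.yaml sketch (script names and paths are placeholders for your own project layout, matching the capstone's FE + Train stages):
# dvc.yaml
stages:
  featurize:
    cmd: python featurize.py data/train.csv data/features.parquet
    deps:
      - featurize.py
      - data/train.csv
    outs:
      - data/features.parquet
  train:
    cmd: python train.py data/features.parquet models/model.pkl
    deps:
      - train.py
      - data/features.parquet
    outs:
      - models/model.pkl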
FastAPI for Model API
from fastapi import FastAPI
from pydantic import BaseModel
import pandas as pd
import mlflow.pyfunc

app = FastAPI()
# Stage-based URI; newer MLflow prefers versions/aliases, e.g. "models:/xgboost_model/1"
model = mlflow.pyfunc.load_model("models:/xgboost_model/Production")

class InputData(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(data: InputData):
    # one-row frame; if the model was logged with a signature, column names/order must match training
    pred = model.predict(pd.DataFrame([data.features]))
    return {"prediction": float(pred[0])}
Project: Deploy XGBoost to FastAPI + Docker
Resources:
- MLflow + DVC Tutorial: Experiment Tracking
- FastAPI ML Deployment: GeeksforGeeks Guide
Week 11–12: Capstone – End-to-End Fraud Detection System
Repo: yourname/fraud-mlops-capstone
Stack: LightGBM + Feature-Engine + Hugging Face NLP + Prophet TS + Docker + MLflow/DVC + FastAPI
Deliverables:
- Pipeline: dvc.yaml for FE + Train
- API: /predict endpoint (FastAPI)
- Dashboard: Streamlit monitoring app, plus the MLflow UI for experiment history
- Docker: Multi-container deploy
- Kaggle Submission: Top 10% on Fraud Dataset
README Snippet:
# Fraud Detection MLOps System
- **AUC: 0.95** (LightGBM + NLP features)
- **Deployed**: Docker + FastAPI
- **Tracked**: MLflow experiments + DVC data
- **Live**: http://localhost:8000/docs
Interview Prep: Key Questions
| Question | Answer |
|---|---|
| "XGBoost vs LightGBM?" | LightGBM faster for large data; XGBoost more robust |
| "Why DVC?" | Git for code, DVC for large data/models |
| "FastAPI advantages?" | Async, auto-docs, Pydantic validation |
| "MLOps pipeline?" | FE → Train (MLflow) → Deploy (Docker/FastAPI) → Monitor |
Assessment: Can You Build?
| Task | Yes/No |
|---|---|
| LightGBM GPU train <5min | ☐ |
| Feature-Engine pipeline | ☐ |
| Fine-tune DistilBERT | ☐ |
| Prophet forecast RMSE <10 | ☐ |
| Dockerized FastAPI API | ☐ |
| MLflow + DVC repro | ☐ |
All Yes → Production-Ready!
Free Resources Summary
| Topic | Link |
|---|---|
| XGBoost/LightGBM | Machine Learning Mastery |
| Feature-Engine | GitHub Examples |
| Hugging Face NLP | LLM Course |
| Time Series Kaggle | Demand Forecasting |
| Docker Tutorial | Krish Naik YouTube |
| MLflow/DVC | Tracking Guide |
| FastAPI Deploy | GeeksforGeeks |
Pro Tips
- GPU Everywhere: LightGBM CUDA for 10x speed
- Version Everything: DVC for data, MLflow for models
- Auto-Docs: FastAPI's /docs = instant portfolio
- Kaggle Compete: Submit weekly → build resume
Next: Phase 6 – Big Data & Cloud
You deploy single models → now scale to petabytes.
Start Now:
dvc init && mlflow ui
Tag me on LinkedIn with your deployed API!
You're now an MLOps Engineer.