LoRA Fine-Tuning Tutorial (2025 Edition)
Fine-Tune LLMs with 99% Less GPU Memory — From Zero to Production
Goal: Master LoRA (Low-Rank Adaptation), the go-to technique for parameter-efficient fine-tuning of LLMs.
Why LoRA?
- Train BERT/GPT-class models on a single GPU (as little as 4GB VRAM)
- Only ~1% of the original parameters updated → fast & cheap
- Widely adopted: Hugging Face PEFT, Meta's Llama fine-tunes, and most open-source LLM workflows
- 2025 Standard: Most production LLM fine-tuning uses LoRA/DoRA/QLoRA
- Salary Impact: "LoRA + PEFT" is one of the most in-demand skills on ML resumes
LoRA in 3 Minutes
| Full Fine-Tuning | LoRA |
|---|---|
| Update all 7B parameters | Update ~1M parameters in low-rank matrices |
| 28GB+ VRAM (FP16 weights + gradients) | Fits a 24GB card; under 10GB with QLoRA |
| 10+ hours | 30 mins |
| Overwrites base model | Merges cleanly |
Math:
Instead of updating the weight W (d×k) directly, LoRA freezes W and injects a low-rank update:
W' = W + ΔW = W + (α/r)·BA
where B is d×r, A is r×k, and r ≪ min(d, k). A rank of r = 8 is typical, and α (lora_alpha) scales the update.
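A minimal sketch of that update in plain PyTorch (shapes, inits, and the α/r scaling follow the LoRA paper's convention; nothing here is library code):

```python
import torch

d, k, r = 768, 768, 8         # r << min(d, k)
alpha = 16                    # scaling factor, i.e. lora_alpha
W = torch.randn(d, k)         # frozen pretrained weight
A = torch.randn(r, k) * 0.01  # small random init (trainable)
B = torch.zeros(d, r)         # zero init, so ΔW starts at 0 (trainable)

x = torch.randn(k)            # one input vector
y = (W + (alpha / r) * (B @ A)) @ x
# Equivalent factored form, which is how it's computed in practice:
y2 = W @ x + (alpha / r) * (B @ (A @ x))
assert torch.allclose(y, y2, atol=1e-5)
```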
Tutorial: Fine-Tune DistilBERT on IMDB (1 GPU)
Step 0: Install & Setup
```bash
pip install transformers peft datasets accelerate wandb
wandb login
```
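A quick sanity check that the stack installed cleanly (version numbers will differ on your machine):

```python
import torch, transformers, peft, datasets

# Confirm everything imports and see what you're running
print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| peft:", peft.__version__)
```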
Step 1: Load Dataset & Model
```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(preprocess, batched=True)

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```
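Before training, it's worth eyeballing one tokenized example; a quick sanity check (field names are those of the IMDB dataset after the map above):

```python
sample = tokenized["train"][0]
print(sample["label"], len(sample["input_ids"]))   # 0/1 label, 512 tokens after padding
print(tokenizer.decode(sample["input_ids"][:20]))  # first few tokens decoded back to text
```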
Step 2: Apply LoRA with PEFT
```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,                               # rank of the update matrices
    lora_alpha=32,                      # scaling (effective scale = alpha / r)
    target_modules=["q_lin", "v_lin"],  # DistilBERT attention projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS,
)

lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()
# trainable params: 1,181,954 || all params: 67,740,172 || trainable%: 1.74
```
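To confirm LoRA attached where intended, list the trainable parameters; with PEFT these are the injected lora_A/lora_B matrices plus the fresh classification head. A quick sketch:

```python
for name, param in lora_model.named_parameters():
    if param.requires_grad:  # everything else is frozen
        print(name, tuple(param.shape))
# Expect names containing "lora_A" / "lora_B" under q_lin and v_lin,
# e.g. shape (16, 768) for lora_A (r × in_features) with r=16
```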
Step 3: Train with Trainer
```python
import numpy as np
from transformers import TrainingArguments, Trainer

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {"accuracy": (preds == labels).mean()}

training_args = TrainingArguments(
    output_dir="./lora-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",   # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=2e-4,
    weight_decay=0.01,
    fp16=True,               # mixed precision
    report_to="wandb",
    run_name="lora-distilbert-imdb",
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized["train"].shuffle(seed=42).shard(num_shards=10, index=0),  # 10% for speed
    eval_dataset=tokenized["test"].shard(num_shards=10, index=0),
    compute_metrics=compute_metrics,
)

trainer.train()
```
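After training, you can pull the final metrics explicitly; a sketch using the standard Trainer API:

```python
metrics = trainer.evaluate()
print(metrics)                            # includes eval_loss and our eval_accuracy
trainer.save_model("./lora-imdb/final")   # with a PEFT model this writes just the adapter
```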
Result:
Epoch 3 | Eval Accuracy: 93.2%
VRAM: ~3.8GB | Time: 28 mins
Step 4: Save & Merge LoRA Adapter
# Save only adapter
lora_model.save_pretrained("./lora-adapter")
# Load + merge
from peft import PeftModel
base_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
merged_model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = merged_model.merge_and_unload() # Full model, no adapter
merged_model.save_pretrained("./merged-imdb-model")
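The payoff is visible on disk: the adapter is a few MB while the merged model is a few hundred MB. A quick size check using the paths from this step:

```python
import os

def dir_mb(path):
    # Total size of all files under path, in megabytes
    total = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path) for f in files
    )
    return total / 1e6

print(f"adapter:      {dir_mb('./lora-adapter'):.1f} MB")
print(f"merged model: {dir_mb('./merged-imdb-model'):.1f} MB")
```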
Step 5: Inference (1 Line)
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="./merged-imdb-model")
print(classifier("This movie was amazing!"))
# [{'label': 'LABEL_1', 'score': 0.999}]  ← LABEL_1 = positive unless you set id2label
```
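If you'd rather skip the merge (e.g., to hot-swap adapters), you can run the PEFT model directly; a sketch reusing the adapter saved above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load base + adapter without merging; the weights stay separate in memory
base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
model = PeftModel.from_pretrained(base, "./lora-adapter")
model.eval()

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
inputs = tok("This movie was amazing!", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)
print(probs)  # column 1 ≈ probability of the positive class
```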
Advanced: QLoRA (4-bit) on 1 GPU (8B Model!)
Fine-tune Llama 3 8B on a single RTX 3090 (24GB)
```bash
pip install bitsandbytes
```
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply QLoRA
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```
Memory: ~9GB for the quantized model + adapters (full FP16 fine-tuning of an 8B model needs 60GB+ once gradients and optimizer states are counted)
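To verify the footprint on your own card, PyTorch's allocator counters report peak tensor memory; a rough sketch (this excludes the CUDA context itself):

```python
import torch

torch.cuda.reset_peak_memory_stats()  # call before loading the model / training
# ... load model, apply QLoRA, train ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"peak GPU memory: {peak_gb:.1f} GB")
```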
LoRA Config Cheat Sheet
| Parameter | Typical Value | Effect |
|---|---|---|
| `r` | 8, 16, 32, 64 | Higher = more capacity |
| `lora_alpha` | 16, 32 | Scaling factor |
| `target_modules` | `["q_lin", "v_lin"]` (BERT), `["q_proj", "v_proj"]` (Llama) | Which layers to adapt (see sketch below) |
| `lora_dropout` | 0.05–0.1 | Regularization |
| `bias` | `"none"` | Usually not needed |
Production Deployment
Option 1: Hugging Face Inference API
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/yourname/lora-imdb"
headers = {"Authorization": "Bearer hf_..."}  # your HF token

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

print(query({"inputs": "I loved this film!"}))
```
Option 2: FastAPI + vLLM
```python
from fastapi import FastAPI
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="./merged-llama3-lora")  # path to your merged model

@app.post("/generate")
def generate(prompt: str):
    outputs = llm.generate([prompt], SamplingParams(max_tokens=100))
    return {"response": outputs[0].outputs[0].text}
```
Capstone Project: "Personal AI Assistant"
Task: Fine-tune Mistral-7B with LoRA on your personal emails + notes
Goal: Generate replies in your style
Stack:
- QLoRA + bitsandbytes
- PEFT + Hugging Face
- Deploy on RunPod ($0.39/hr A100)
Deliverable:
https://yourname-assistant.hf.space — "Write email to boss about delay"
Interview Questions (Solve in 5 Mins)
| Question | Answer |
|---|---|
| "Full vs LoRA fine-tuning?" | LoRA: ~100x less memory, adapters merge cleanly |
| "How to choose r?" | Start with 16, increase if underfitting |
| "QLoRA vs LoRA?" | QLoRA adds 4-bit quantization → fits 70B on 1 GPU |
| "Merge LoRA weights?" | `merge_and_unload()` |
| "Target modules for Llama?" | `q_proj`, `k_proj`, `v_proj`, `o_proj` |
Free Resources Summary
| Resource | Link |
|---|---|
| PEFT Docs | huggingface.co/docs/peft |
| QLoRA Paper | arxiv.org/abs/2305.14314 |
| Hugging Face Course | huggingface.co/course |
| RunPod | runpod.io (A100 $0.39/hr) |
| Colab Pro+ | $50/mo → A100 access |
Pro Tips
- Always use `prepare_model_for_kbit_training()` with QLoRA
- Merge before inference → faster, no adapter overhead
- Log `r`, `alpha`, and `target_modules` in WandB
- Use `push_to_hub()` → share your adapter in one line (sketch below)
- Resume line: "Fine-tuned Llama 3 8B with QLoRA on 1 GPU — 94% accuracy in 2 hours"
Final Checklist
| Task | Done? |
|---|---|
| Apply LoRA to BERT | ☐ |
| Train on 10% IMDB | ☐ |
| Save & merge adapter | ☐ |
| QLoRA on 7B model | ☐ |
| Deploy to HF Space | ☐ |
All Yes → You’re a PEFT Expert!
Next: MLOps & Production Monitoring
You can fine-tune → now serve at scale.
Start Now:
```bash
git clone https://huggingface.co/spaces/huggingface/peft-lora-demo
cd peft-lora-demo
pip install -r requirements.txt
```
Tag me when you deploy your LoRA model!
You now fine-tune LLMs like OpenAI.