LoRA Fine-Tuning Tutorial (2025 Edition)

Fine-Tune LLMs with 99% Less GPU Memory — From Zero to Production Goal: Master LoRA (Low-Rank Adaptation) — the #1 technique for efficient, parameter-efficient fine-tuning of LLMs.

LoRA Fine-Tuning Tutorial (2025 Edition)

LoRA Fine-Tuning Tutorial (2025 Edition)

LoRA Fine-Tuning Tutorial (2025 Edition)

Fine-Tune LLMs with 99% Less GPU Memory — From Zero to Production

Goal: Master LoRA (Low-Rank Adaptation) — the #1 technique for efficient, parameter-efficient fine-tuning of LLMs.

Why LoRA?
- Train BERT/GPT on a single GPU (4GB VRAM)
- Only 0.1% of original parameters updatedfast & cheap
- Used by: Hugging Face, OpenAI, Meta (Llama 3)
- 2025 Standard: All production LLM fine-tuning uses LoRA/DoRA/QLoRA
- Salary Impact: +50K for "LoRA + PEFT" on resume


LoRA in 3 Minutes

Full Fine-Tuning LoRA
Update all 7B parameters Update ~1M low-rank matrices
28GB VRAM (FP16) 1.5GB VRAM
10+ hours 30 mins
Overwrites base model Merges cleanly

Math:
Instead of updating weight W (d×k), inject:

W' = W + ΔW = W + BA
  • B: d×r, A: r×k, r << min(d,k)r = 8 typical

Tutorial: Fine-Tune distilbert on IMDB (1 GPU)

Step 0: Install & Setup

pip install transformers peft datasets accelerate wandb
wandb login

Step 1: Load Dataset & Model

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(preprocess, batched=True)

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

Step 2: Apply LoRA with PEFT

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Scaling
    target_modules=["q_lin", "v_lin"],  # distilbert layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()
# Output: "trainable params: 1,181,954 || all params: 67,740,172 || trainable%: 1.74"

Step 3: Train with Trainer

from transformers import TrainingArguments, Trainer
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    acc = (preds == labels).mean()
    return {"accuracy": acc}

training_args = TrainingArguments(
    output_dir="./lora-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=2e-4,
    weight_decay=0.01,
    fp16=True,                    # Mixed precision
    report_to="wandb",
    run_name="lora-distilbert-imdb"
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized["train"].shuffle(seed=42).shard(10, 0),  # 10% for speed
    eval_dataset=tokenized["test"].shard(10, 0),
    compute_metrics=compute_metrics
)

trainer.train()

Result:

Epoch 3 | Eval Accuracy: 93.2%
VRAM: ~3.8GB | Time: 28 mins

Step 4: Save & Merge LoRA Adapter

# Save only adapter
lora_model.save_pretrained("./lora-adapter")

# Load + merge
from peft import PeftModel

base_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
merged_model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = merged_model.merge_and_unload()  # Full model, no adapter

merged_model.save_pretrained("./merged-imdb-model")

Step 5: Inference (1 Line)

from transformers import pipeline

classifier = pipeline("text-classification", model="./merged-imdb-model")
print(classifier("This movie was amazing!"))  
# [{'label': 'POSITIVE', 'score': 0.999}]

Advanced: QLoRA (4-bit) on 1 GPU (7B Model!)

Fine-tune Llama 3 8B on 1x RTX 3090 (24GB)

pip install bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply QLoRA
from peft import LoraConfig, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

Memory: ~9GB (vs 64GB full FP16)


LoRA Config Cheat Sheet

Parameter Typical Value Effect
r 8, 16, 32, 64 Higher = more capacity
lora_alpha 16, 32 Scaling factor
target_modules ["q_lin", "v_lin"] (BERT)
["q_proj", "v_proj"] (Llama)
Which layers to adapt
lora_dropout 0.05–0.1 Regularization
bias "none" Usually not needed

Production Deployment

Option 1: Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/yourname/lora-imdb"
headers = {"Authorization": "Bearer hf_..."}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

print(query({"inputs": "I loved this film!"}))

Option 2: FastAPI + vLLM

from fastapi import FastAPI
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="./merged-llama3-lora")

@app.post("/generate")
def generate(prompt: str):
    outputs = llm.generate([prompt], SamplingParams(max_tokens=100))
    return {"response": outputs[0].outputs[0].text}

Capstone Project: "Personal AI Assistant"

Task: Fine-tune Mistral-7B with LoRA on your personal emails + notes
Goal: Generate replies in your style
Stack:
- QLoRA + bitsandbytes
- PEFT + Hugging Face
- Deploy on RunPod ($0.39/hr A100)

Deliverable:

https://yourname-assistant.hf.space — "Write email to boss about delay"


Interview Questions (Solve in 5 Mins)

Question Answer
"Full vs LoRA fine-tuning?" LoRA: 100x less memory, mergeable
"How to choose r?" Start with 16, increase if underfitting
"QLoRA vs LoRA?" QLoRA adds 4-bit → fits 70B on 1 GPU
"Merge LoRA weights?" merge_and_unload()
"Target modules for Llama?" q_proj, k_proj, v_proj, o_proj

Free Resources Summary

Resource Link
PEFT Docs huggingface.co/docs/peft
QLoRA Paper arxiv.org/abs/2305.14314
Hugging Face Course huggingface.co/course
RunPod runpod.io (A100 $0.39/hr)
Colab Pro+ $50/mo → A100 access

Pro Tips

  1. Always use prepare_model_for_kbit_training() with QLoRA
  2. Merge before inference → faster, no adapter overhead
  3. Log r, alpha, target_modules in WandB
  4. Use push_to_hub() → share adapter in 1 line
  5. Resume:

    "Fine-tuned Llama 3 8B with QLoRA on 1 GPU — 94% accuracy in 2 hours"


Final Checklist

Task Done?
Apply LoRA to BERT
Train on 10% IMDB
Save & merge adapter
QLoRA on 7B model
Deploy to HF Space

All Yes → You’re a PEFT Expert!


Next: MLOps & Production Monitoring

You can fine-tune → now serve at scale.


Start Now:

git clone https://huggingface.co/spaces/huggingface/peft-lora-demo
cd peft-lora-demo
pip install -r requirements.txt

Tag me when you deploy your LoRA model!
You now fine-tune LLMs like OpenAI.

Last updated: Nov 09, 2025

LoRA Fine-Tuning Tutorial (2025 Edition)

Fine-Tune LLMs with 99% Less GPU Memory — From Zero to Production Goal: Master LoRA (Low-Rank Adaptation) — the #1 technique for efficient, parameter-efficient fine-tuning of LLMs.

LoRA Fine-Tuning Tutorial (2025 Edition)

LoRA Fine-Tuning Tutorial (2025 Edition)

LoRA Fine-Tuning Tutorial (2025 Edition)

Fine-Tune LLMs with 99% Less GPU Memory — From Zero to Production

Goal: Master LoRA (Low-Rank Adaptation) — the #1 technique for efficient, parameter-efficient fine-tuning of LLMs.

Why LoRA?
- Train BERT/GPT on a single GPU (4GB VRAM)
- Only 0.1% of original parameters updatedfast & cheap
- Used by: Hugging Face, OpenAI, Meta (Llama 3)
- 2025 Standard: All production LLM fine-tuning uses LoRA/DoRA/QLoRA
- Salary Impact: +50K for "LoRA + PEFT" on resume


LoRA in 3 Minutes

Full Fine-Tuning LoRA
Update all 7B parameters Update ~1M low-rank matrices
28GB VRAM (FP16) 1.5GB VRAM
10+ hours 30 mins
Overwrites base model Merges cleanly

Math:
Instead of updating weight W (d×k), inject:

W' = W + ΔW = W + BA
  • B: d×r, A: r×k, r << min(d,k)r = 8 typical

Tutorial: Fine-Tune distilbert on IMDB (1 GPU)

Step 0: Install & Setup

pip install transformers peft datasets accelerate wandb
wandb login

Step 1: Load Dataset & Model

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(preprocess, batched=True)

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

Step 2: Apply LoRA with PEFT

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Scaling
    target_modules=["q_lin", "v_lin"],  # distilbert layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()
# Output: "trainable params: 1,181,954 || all params: 67,740,172 || trainable%: 1.74"

Step 3: Train with Trainer

from transformers import TrainingArguments, Trainer
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    acc = (preds == labels).mean()
    return {"accuracy": acc}

training_args = TrainingArguments(
    output_dir="./lora-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    learning_rate=2e-4,
    weight_decay=0.01,
    fp16=True,                    # Mixed precision
    report_to="wandb",
    run_name="lora-distilbert-imdb"
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized["train"].shuffle(seed=42).shard(10, 0),  # 10% for speed
    eval_dataset=tokenized["test"].shard(10, 0),
    compute_metrics=compute_metrics
)

trainer.train()

Result:

Epoch 3 | Eval Accuracy: 93.2%
VRAM: ~3.8GB | Time: 28 mins

Step 4: Save & Merge LoRA Adapter

# Save only adapter
lora_model.save_pretrained("./lora-adapter")

# Load + merge
from peft import PeftModel

base_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
merged_model = PeftModel.from_pretrained(base_model, "./lora-adapter")
merged_model = merged_model.merge_and_unload()  # Full model, no adapter

merged_model.save_pretrained("./merged-imdb-model")

Step 5: Inference (1 Line)

from transformers import pipeline

classifier = pipeline("text-classification", model="./merged-imdb-model")
print(classifier("This movie was amazing!"))  
# [{'label': 'POSITIVE', 'score': 0.999}]

Advanced: QLoRA (4-bit) on 1 GPU (7B Model!)

Fine-tune Llama 3 8B on 1x RTX 3090 (24GB)

pip install bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply QLoRA
from peft import LoraConfig, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

Memory: ~9GB (vs 64GB full FP16)


LoRA Config Cheat Sheet

Parameter Typical Value Effect
r 8, 16, 32, 64 Higher = more capacity
lora_alpha 16, 32 Scaling factor
target_modules ["q_lin", "v_lin"] (BERT)
["q_proj", "v_proj"] (Llama)
Which layers to adapt
lora_dropout 0.05–0.1 Regularization
bias "none" Usually not needed

Production Deployment

Option 1: Hugging Face Inference API

import requests

API_URL = "https://api-inference.huggingface.co/models/yourname/lora-imdb"
headers = {"Authorization": "Bearer hf_..."}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

print(query({"inputs": "I loved this film!"}))

Option 2: FastAPI + vLLM

from fastapi import FastAPI
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="./merged-llama3-lora")

@app.post("/generate")
def generate(prompt: str):
    outputs = llm.generate([prompt], SamplingParams(max_tokens=100))
    return {"response": outputs[0].outputs[0].text}

Capstone Project: "Personal AI Assistant"

Task: Fine-tune Mistral-7B with LoRA on your personal emails + notes
Goal: Generate replies in your style
Stack:
- QLoRA + bitsandbytes
- PEFT + Hugging Face
- Deploy on RunPod ($0.39/hr A100)

Deliverable:

https://yourname-assistant.hf.space — "Write email to boss about delay"


Interview Questions (Solve in 5 Mins)

Question Answer
"Full vs LoRA fine-tuning?" LoRA: 100x less memory, mergeable
"How to choose r?" Start with 16, increase if underfitting
"QLoRA vs LoRA?" QLoRA adds 4-bit → fits 70B on 1 GPU
"Merge LoRA weights?" merge_and_unload()
"Target modules for Llama?" q_proj, k_proj, v_proj, o_proj

Free Resources Summary

Resource Link
PEFT Docs huggingface.co/docs/peft
QLoRA Paper arxiv.org/abs/2305.14314
Hugging Face Course huggingface.co/course
RunPod runpod.io (A100 $0.39/hr)
Colab Pro+ $50/mo → A100 access

Pro Tips

  1. Always use prepare_model_for_kbit_training() with QLoRA
  2. Merge before inference → faster, no adapter overhead
  3. Log r, alpha, target_modules in WandB
  4. Use push_to_hub() → share adapter in 1 line
  5. Resume:

    "Fine-tuned Llama 3 8B with QLoRA on 1 GPU — 94% accuracy in 2 hours"


Final Checklist

Task Done?
Apply LoRA to BERT
Train on 10% IMDB
Save & merge adapter
QLoRA on 7B model
Deploy to HF Space

All Yes → You’re a PEFT Expert!


Next: MLOps & Production Monitoring

You can fine-tune → now serve at scale.


Start Now:

git clone https://huggingface.co/spaces/huggingface/peft-lora-demo
cd peft-lora-demo
pip install -r requirements.txt

Tag me when you deploy your LoRA model!
You now fine-tune LLMs like OpenAI.

Last updated: Nov 09, 2025