QLoRA Implementation Details (2025 Edition)
Fine-Tune 70B LLMs on a Single GPU: Full Technical Deep Dive
Goal: Master QLoRA — the gold standard for parameter-efficient, memory-efficient fine-tuning of massive language models.
Why QLoRA?
- 70B model on a single GPU (a 48 GB-class card is enough; smaller models fit in 24 GB)
- Less than 1% of weights updated (LoRA adapters) + 4-bit quantization of the frozen base
- Performance within 1% of full fine-tuning
- Used by: Mistral, Llama 3, Phi-3, Gemma
- Paper: QLoRA: Efficient Finetuning of Quantized LLMs (NeurIPS 2023)
QLoRA Architecture: 4 Key Innovations
| Component | What It Does | Memory Saved |
|---|---|---|
| 4-bit NormalFloat (NF4) | Optimal 4-bit datatype | 4x vs FP16 |
| Double Quantization | Quantize the quantization constants | ~0.37 bits/param saved |
| Paged Optimizers | Page optimizer states to CPU RAM on demand | Prevents OOM spikes |
| LoRA | Low-rank adapters on a frozen base | >99% of weights frozen |
Full FP16 weights (70B): 140 GB
QLoRA (70B): ~35 GB of 4-bit weights, ~44 GB peak during training → fits on a single 48 GB-class GPU
Full QLoRA Pipeline (Code + Math)
Step 1: 4-bit Quantization with NF4
from transformers import BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4: optimal for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True          # double-quantize the quantization constants
)
NF4 Math:
- Weights are roughly normally distributed → NF4 places its 16 quantization levels at the quantiles of N(0,1), rescaled to [-1, 1]
- Block-wise quantization: one FP32 absmax scale per block of 64 values
- Double Quantization: the absmax scales are themselves quantized (8-bit, blocks of 256) → saves ~0.37 bits/param (arithmetic sketch below)
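To see where the 0.37 figure comes from, here is the bits-per-parameter accounting from the QLoRA paper as a tiny Python sketch (block size 64 for the weights, 8-bit second-level quantization with block size 256):
# Bits-per-parameter accounting for block-wise NF4 storage
weight_bits = 4                            # NF4 payload per weight
absmax_plain = 32 / 64                     # one FP32 absmax per 64-value block -> 0.500 bits/param
absmax_double = 8 / 64 + 32 / (64 * 256)   # 8-bit absmax + one FP32 scale per 256 absmax values -> ~0.127
print(weight_bits + absmax_plain)          # 4.500 bits/param without double quantization
print(weight_bits + absmax_double)         # ~4.127 bits/param with double quantization
print(absmax_plain - absmax_double)        # ~0.373 bits/param saved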
Step 2: Load 70B Model in 4-bit
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=quantization_config,
    device_map="auto",                 # Auto-split across available GPUs/CPU
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")
tokenizer.pad_token = tokenizer.eos_token   # Llama has no pad token by default
Memory Breakdown (70B, during training):
| Component | VRAM |
|---|---|
| 4-bit base weights | ~35 GB |
| LoRA adapters, gradients, paged optimizer states | ~5 GB |
| Activations (with gradient checkpointing) | ~4 GB |
| Total | ~44 GB → fits a 48 GB-class GPU (A6000, A100 80GB) |
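A quick back-of-the-envelope check of the weight figures (a sketch; real usage also includes activations, the CUDA context, and fragmentation):
params = 70e9                      # Llama-3-70B parameter count, approximate
print(params * 2 / 1e9)            # FP16/BF16 weights: ~140 GB
print(params * 4.127 / 8 / 1e9)    # NF4 + double-quant storage (~4.127 bits/param): ~36 GB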
Step 3: Prepare for QLoRA (Freeze + LoRA)
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
What this does:
- Freezes all base-model (4-bit) parameters
- Casts layer norms and the LM head to FP32 for training stability
- Enables gradient checkpointing and input gradients, so the LoRA layers added in Step 4 can backprop through the frozen base (sanity-check snippet below)
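A minimal sanity check: right after prepare_model_for_kbit_training every parameter should be frozen; rerun the same lines after Step 4 and only the LoRA adapters (plus any modules_to_save) should show up as trainable.
# Count trainable parameters (should print 0% at this point)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / max(total, 1):.2f}%)")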
Step 4: Apply LoRA Adapters
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=64,                     # Rank (higher for larger models)
    lora_alpha=16,            # Scaling factor: alpha/r = 0.25
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # Llama attention projections
        "gate_proj", "up_proj", "down_proj"       # MLP projections
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    modules_to_save=["lm_head"]   # Optional: also train the full LM head (adds noticeably to trainable params and adapter size)
)
qlora_model = get_peft_model(model, lora_config)
qlora_model.print_trainable_parameters()
# Output: "trainable params: 340M || total params: 70.6B || trainable%: 0.48"
Step 5: Paged Optimizer (Avoid OOM)
from transformers import TrainingArguments
from trl import SFTTrainer
training_args = TrainingArguments(
    output_dir="./qlora-llama3",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # effective batch size = 16
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,                        # match bnb_4bit_compute_dtype; never enable fp16 and bf16 together
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_8bit",         # Paged 8-bit AdamW
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    report_to="wandb"
)
Paged Adam:
- Allocates optimizer states in paged (unified) memory
- Pages them out to CPU RAM under GPU memory pressure and back in when needed
- Prevents hard OOMs during memory spikes (long sequences, gradient checkpointing); see the standalone bitsandbytes sketch below
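Outside the Trainer you can build the same optimizer directly from bitsandbytes; a minimal sketch, assuming bitsandbytes is installed and the LoRA adapters from Step 4 are the only trainable parameters:
import bitsandbytes as bnb

# Paged 8-bit AdamW over the trainable (LoRA) parameters only
optimizer = bnb.optim.PagedAdamW8bit(
    (p for p in qlora_model.parameters() if p.requires_grad),
    lr=2e-4,
    weight_decay=0.0,
)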
Step 6: Train with SFTTrainer
from datasets import load_dataset
dataset = load_dataset("timdettmers/openassistant-guanaco")
trainer = SFTTrainer(
    model=qlora_model,
    args=training_args,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    packing=True   # Pack sequences → faster
)
trainer.train()
Training Speed (rough guide):
- Throughput depends heavily on sequence length, packing, and hardware
- A ~10k-example instruction dataset finishes in a few hours on a single A100-class GPU
Step 7: Merge & Save
# Save adapter
qlora_model.save_pretrained("./qlora-adapter")
# Merge (for inference); merging on a 4-bit base dequantizes it, so for best fidelity reload the base in BF16 and merge there (sketch below)
merged_model = qlora_model.merge_and_unload()
merged_model.save_pretrained("./merged-llama3-70b")
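If you want the highest-fidelity merged weights, a common pattern is to reload the base model in BF16 and attach the saved adapter before merging. A sketch under those assumptions (for 70B this needs roughly 140 GB of system RAM):
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Reload the base without 4-bit quantization and merge the adapter into it
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",              # merging does not need a GPU, just enough RAM
)
merged = PeftModel.from_pretrained(base, "./qlora-adapter").merge_and_unload()
merged.save_pretrained("./merged-llama3-70b-bf16")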
Inference with the Merged Model
from transformers import pipeline
pipe = pipeline(
    "text-generation",
    model="./merged-llama3-70b",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
print(pipe("Explain QLoRA in one sentence:", max_new_tokens=100)[0]["generated_text"])
QLoRA Config Cheat Sheet (70B vs 7B)
| Model | r | alpha | target_modules | VRAM | Trainable % |
|---|---|---|---|---|---|
| 7B | 32 | 16 | q_proj, v_proj | ~9 GB | ~0.3% |
| 13B | 64 | 16 | q,k,v,o_proj | ~16 GB | ~0.4% |
| 70B | 64 | 16 | all attention + MLP proj | ~44 GB | ~0.48% |
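A small convenience helper (hypothetical, just encoding the table above) so you can swap model sizes without retyping the config:
from peft import LoraConfig

# Hypothetical helper mirroring the cheat sheet above
def make_qlora_config(model_size: str) -> LoraConfig:
    presets = {
        "7b":  dict(r=32, targets=["q_proj", "v_proj"]),
        "13b": dict(r=64, targets=["q_proj", "k_proj", "v_proj", "o_proj"]),
        "70b": dict(r=64, targets=["q_proj", "k_proj", "v_proj", "o_proj",
                                   "gate_proj", "up_proj", "down_proj"]),
    }
    p = presets[model_size.lower()]
    return LoraConfig(
        r=p["r"], lora_alpha=16, target_modules=p["targets"],
        lora_dropout=0.1, bias="none", task_type="CAUSAL_LM",
    )

lora_config = make_qlora_config("70b")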
Advanced: DoRA (2024) — Weight-Decomposed LoRA
lora_config = LoraConfig(
    use_dora=True,   # Weight decomposition
    ...
)
DoRA decomposes each adapted weight into a magnitude and a direction component (formula below) and typically edges out plain LoRA by a couple of accuracy points at the same rank.
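The decomposition as described in the DoRA paper: the updated weight is a learned per-column magnitude times the unit direction of the LoRA-updated weight,

W' = m \cdot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}

where m is a trainable magnitude vector (initialized to the column-wise norms of W_0), BA is the usual LoRA update, and \lVert \cdot \rVert_c is the column-wise norm.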
Production Deployment
vLLM + Merged QLoRA Model (high-throughput serving)
pip install vllm
from vllm import LLM
llm = LLM(model="./merged-llama3-70b", quantization="bitsandbytes")   # in-flight 4-bit quantization at load time
outputs = llm.generate(["Hello!"])
print(outputs[0].outputs[0].text)
Debugging QLoRA OOM
| Issue | Fix |
|---|---|
| OOM during forward | gradient_checkpointing=True |
| OOM in optimizer | optim="paged_adamw_8bit" |
| NaN loss | bnb_4bit_compute_dtype=torch.bfloat16 |
| Slow training | packing=True, torch.compile() |
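When chasing OOMs it helps to watch peak VRAM alongside the training loss. A minimal sketch using a Transformers TrainerCallback (pass it to SFTTrainer via callbacks=[...]):
import torch
from transformers import TrainerCallback

class VRAMLoggerCallback(TrainerCallback):
    """Prints peak GPU memory (GB) every time the Trainer logs metrics."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        if torch.cuda.is_available():
            peak_gb = torch.cuda.max_memory_allocated() / 1e9
            print(f"step {state.global_step}: peak VRAM {peak_gb:.1f} GB")

# Usage: SFTTrainer(..., callbacks=[VRAMLoggerCallback()])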
Benchmark: QLoRA vs Full FT
| Method | VRAM | Time | MMLU | GPU |
|---|---|---|---|---|
| Full FT (FP16) | 560GB | 48h | 68.2 | 8x H100 |
| QLoRA | 44GB | 3h | 67.8 | 1x A100 (80GB) |
Capstone: "Your Personal AI Tutor"
Task: QLoRA fine-tune Llama 3 70B on your lecture notes + Q&A
Goal: Answer student questions in your teaching style
Stack:
- QLoRA + NF4 + DoRA
- vLLM inference
- Deploy on RunPod A100 ($0.79/hr)
# Generate with the fine-tuned tutor (vLLM)
outputs = llm.generate(["Explain backpropagation like I'm 10:"])
print(outputs[0].outputs[0].text)
Interview Questions (Solve in 10 Mins)
| Question | Answer |
|---|---|
| "NF4 vs INT4?" | NF4 optimal for normal dist, +1% accuracy |
| "Double quantization?" | Quantize constants → 0.37 bits/param saved |
| "Paged optimizer?" | CPU offload → no OOM |
"Why prepare_model_for_kbit_training?" |
Enables grad checkpointing on 4-bit |
| "Merge QLoRA?" | merge_and_unload() → full FP16 model |
Free Resources Summary
| Resource | Link |
|---|---|
| QLoRA Paper | arxiv.org/abs/2305.14314 |
| PEFT QLoRA Guide | huggingface.co/docs/peft/en/quantization |
| Bitsandbytes | github.com/TimDettmers/bitsandbytes |
| RunPod | runpod.io (A100 40GB $0.79/hr) |
| Colab Pro+ | A100 access |
Pro Tips
- Use bnb_4bit_compute_dtype=torch.bfloat16 → stable training
- Always set packing=True → ~2x faster
- Log VRAM with torch.cuda.max_memory_allocated()
- Merge before sharing → smaller, faster to load
- Resume line: "Fine-tuned Llama 3 70B with QLoRA on a single A100: 67.8 MMLU in 3 hours"
Final Checklist
| Task | Done? |
|---|---|
| Load 70B in 4-bit | ☐ |
| Apply QLoRA (r=64) | ☐ |
| Train with paged Adam | ☐ |
| Merge & infer | ☐ |
| Deploy with vLLM | ☐ |
All Yes → You’re a QLoRA Master!
Next: Federated Learning & On-Device
You can now fine-tune a 70B model on a single GPU; next, running and adapting models on-device.
Start Now:
pip install bitsandbytes peft transformers accelerate trl datasets
import torch
print(torch.cuda.get_device_name(0))
Tag me when you fine-tune 70B on 1 GPU!
You now train models bigger than GPT-3 on consumer hardware.