Fine-Tuning LLMs on Consumer Hardware: LoRA and QLoRA Guide
Fine-tuning used to require datacenter hardware. A 7B model needs ~60 GB of VRAM for full fine-tuning; that’s multiple A100s. Consumer GPUs couldn’t touch it.
LoRA changed that in 2021. QLoRA made it practical on consumer cards in 2023. Now you can fine-tune a 7B model on an RTX 3060 12GB in a few hours. The barrier isn’t hardware anymore; it’s knowing what actually works.
This guide covers the practical path: what fine-tuning does, when it’s worth it, and how to actually do it on hardware you already own.
What Fine-Tuning Actually Does
Fine-tuning takes a pre-trained model and adjusts its weights using your data. The model learns your patterns: your writing style, your domain terminology, your specific task format.
Fine-tuning is NOT:
- Training a model from scratch (that requires millions of dollars)
- Adding new knowledge (use RAG for that)
- Making a small model as smart as a large one
Fine-tuning IS:
- Teaching a model to follow a specific format
- Adapting behavior to your domain
- Improving performance on narrow, well-defined tasks
- Making a model sound like you
When Fine-Tuning Makes Sense
| Use Case | Fine-Tune? | Why |
|---|---|---|
| Match your writing style | Yes | Style is learnable from examples |
| Follow a specific output format | Yes | Consistent structure benefits from training |
| Domain-specific terminology | Maybe | Try RAG first, fine-tune if insufficient |
| Teach new factual knowledge | No | Use RAG instead |
| General improvement | No | Just use a better base model |
| Single task with clear patterns | Yes | Sweet spot for fine-tuning |
The honest assessment: most people who think they need fine-tuning actually need better prompts or RAG. Fine-tuning is for when you’ve tried everything else and need behavior that can’t be prompted.
LoRA Explained
Low-Rank Adaptation (LoRA) is why consumer fine-tuning is possible.
The Problem
Full fine-tuning updates every weight in the model. A 7B model has 7 billion parameters. At 16-bit precision, that’s ~14 GB just for the weights, plus gradients, optimizer states, and activations. Total: ~60 GB VRAM.
The Solution
LoRA doesn’t update the original weights. Instead, it trains small “adapter” matrices that modify the model’s behavior. These adapters have millions of parameters instead of billions, typically well under 1% of the base model’s parameter count.
The math: instead of updating a weight matrix W directly, LoRA learns two small matrices A and B where the update is A × B. If W is 4096×4096 (~16.8M params) and the “rank” is 8, then A is 4096×8 and B is 8×4096, only ~65K parameters total.
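To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in plain Python (illustrative numbers only, no libraries needed):
# Parameter count for one LoRA-adapted weight matrix
d_out, d_in = 4096, 4096                        # shape of the frozen weight W
rank = 8                                        # LoRA rank r
full_update = d_out * d_in                      # what full fine-tuning would train
lora_update = d_out * rank + rank * d_in        # A (4096x8) plus B (8x4096)
print(f"Full update: {full_update:,} params")   # 16,777,216
print(f"LoRA update: {lora_update:,} params")   # 65,536
print(f"Reduction:   {full_update // lora_update}x")  # 256x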
Quality Trade-off
LoRA typically recovers 90-95% of full fine-tuning quality. For most practical applications, this is indistinguishable. The remaining 5-10% only matters for pushing state-of-the-art benchmarks.
QLoRA: The Consumer Hardware Breakthrough
QLoRA combines LoRA with quantization. The base model runs in 4-bit precision while training the LoRA adapters in 16-bit.
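As a minimal sketch of what “4-bit base model, 16-bit adapters” looks like with the standard Hugging Face stack (the model ID and hyperparameters here are illustrative, not prescriptive):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# Quantize the frozen base weights to 4-bit NF4; do compute in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",   # example model ID
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# The LoRA adapters sit on top of the quantized weights and train in higher precision
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of total params
The tutorial later in this guide uses Unsloth instead, which wraps the same idea with faster kernels.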
Memory Savings
| Method | 7B Model VRAM | 13B Model VRAM |
|---|---|---|
| Full fine-tuning | ~60 GB | ~120 GB |
| LoRA (16-bit) | ~16-20 GB | ~32-40 GB |
| QLoRA (4-bit) | ~6-10 GB | ~10-16 GB |
Compared to 16-bit LoRA, QLoRA cuts VRAM requirements by roughly half or more; compared to full fine-tuning, the reduction is over 80%. A 7B model that needed a multi-GPU setup now fits on an RTX 3060.
Quality Trade-off
QLoRA typically achieves 80-90% of full fine-tuning quality. The 4-bit quantization introduces some approximation error, but it’s small enough that most tasks don’t notice.
For practical purposes: if your task is clear and well-defined (format following, style matching, domain adaptation), QLoRA quality is more than sufficient.
Hardware Requirements
Realistic VRAM Needs
| Model Size | QLoRA VRAM | Suitable GPUs |
|---|---|---|
| 3B | ~4 GB | RTX 3060 8GB, any 8GB+ card |
| 7B | ~6-10 GB | RTX 3060 12GB, RTX 3070 |
| 13B | ~10-16 GB | RTX 3090, RTX 4090 |
| 32B | ~24-32 GB | RTX 4090 (tight), dual GPU |
| 70B | ~46-48 GB | A100 80GB, multi-GPU |
What Each GPU Tier Can Train
| GPU | VRAM | Realistic Training |
|---|---|---|
| RTX 3060 12GB | 12 GB | 7B models with QLoRA |
| RTX 3070/3080 | 8-10 GB | 7B with small batch size |
| RTX 3090 | 24 GB | 13B with QLoRA, 7B comfortably |
| RTX 4090 | 24 GB | 13B with QLoRA, ~1.5-2x faster than 3090 |
| 2x RTX 3090 | 48 GB | 32B models, 70B with heavy quantization |
Beyond VRAM
- CPU RAM: At least 32 GB system RAM for 7B, 64 GB for 13B+
- Storage: A fast SSD helps with data loading; NVMe preferred
- Batch size: Lower VRAM means smaller batches, longer training
The Fine-Tuning Stack
Unsloth (Recommended)
Unsloth is the current best option for consumer fine-tuning:
- 2-5x faster training than standard Hugging Face
- 30-70% less VRAM with no accuracy loss
- Free Colab notebooks for fine-tuning up to 14B models
- Supports Llama, Qwen, Mistral, Gemma, and more
Unsloth achieves this through optimized Triton kernels that fuse operations and reduce memory overhead. It’s not magic; it’s better engineering.
Other Options
| Tool | Best For | Notes |
|---|---|---|
| Unsloth | Most users | Fastest, easiest |
| Axolotl | Complex setups | More configuration options |
| Hugging Face PEFT | Maximum flexibility | Standard but slower |
| LLaMA-Factory | GUI-based training | Good for beginners |
For your first fine-tune, use Unsloth. Graduate to Axolotl if you need features Unsloth doesn’t support.
Dataset Preparation
Quality Over Quantity
The single most important insight: you need fewer examples than you think.
| Study | Dataset Size | Result |
|---|---|---|
| LIMA (Meta) | 1,000 samples | Competitive with top models in human evals |
| Stanford Alpaca | 52,000 samples | Strong instruction-following |
| Practical minimum | 100-200 samples | Viable for simple tasks |
The LIMA paper showed that 1,000 carefully curated examples beat 50,000 mediocre ones. Quality is everything.
How Much Data You Actually Need
| Task | Recommended Size | Notes |
|---|---|---|
| Style adaptation | 100-200 samples | Examples of your writing |
| Format following | 200-500 samples | Input/output pairs |
| Domain adaptation | 500-1,000 samples | Domain-specific Q&A |
| Complex instruction-following | 1,000-5,000 samples | More for edge cases |
Unsloth’s recommendation:
- 1,000+ rows → Train on a base model
- 300-1,000 rows → Either base or instruct model
- <300 rows → Use an instruct model (it already knows how to follow instructions)
Dataset Formats
Alpaca format (single-turn, most common):
{
"instruction": "Summarize the following text in one sentence.",
"input": "The quick brown fox jumps over the lazy dog...",
"output": "A fox demonstrates agility by leaping over a resting dog."
}
ShareGPT format (multi-turn conversations):
{
"conversations": [
{"from": "human", "value": "What is the capital of France?"},
{"from": "gpt", "value": "The capital of France is Paris."},
{"from": "human", "value": "What's its population?"},
{"from": "gpt", "value": "Paris has approximately 2.1 million residents..."}
]
}
Use Alpaca for single-turn tasks (most fine-tuning). Use ShareGPT if you’re training a conversational assistant.
Creating Quality Data
- Start with real examples: use actual inputs and outputs from your use case
- Review manually: every example should be correct and representative
- Include edge cases: don’t just train on easy examples
- Diversify inputs: vary phrasing, length, and complexity
- Keep outputs consistent: the same task should produce the same output style
Red flag: If you’re generating training data with another LLM, you’re probably just teaching your model to imitate that LLM. Use real data from your actual use case.
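A few minutes of automated sanity checking catches most dataset problems before you waste a training run. A minimal sketch, assuming your_data.json holds a JSON array of Alpaca-format records:
import json
from collections import Counter
with open("your_data.json") as f:
    rows = json.load(f)
required = {"instruction", "input", "output"}
for i, row in enumerate(rows):
    missing = required - row.keys()
    if missing:
        print(f"Row {i}: missing fields {missing}")
    if not str(row.get("output", "")).strip():
        print(f"Row {i}: empty output")
# Exact duplicate instructions are usually copy-paste mistakes, not useful repetition
dupes = [k for k, v in Counter(r.get("instruction", "") for r in rows).items() if v > 1]
print(f"{len(rows)} rows, {len(dupes)} duplicated instructions")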
Step-by-Step Tutorial: Unsloth + QLoRA
This tutorial fine-tunes Llama 3.1 8B on a custom dataset using QLoRA. Works on RTX 3060 12GB or better.
1. Install Dependencies
pip install unsloth
pip install --upgrade transformers datasets accelerate peft bitsandbytes
2. Load Model with 4-bit Quantization
from unsloth import FastLanguageModel
import torch
# Load model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3.1-8b-bnb-4bit",
max_seq_length=2048,
dtype=None, # Auto-detect
load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
)
3. Prepare Dataset
from datasets import load_dataset
# Load your dataset (Alpaca format)
dataset = load_dataset("json", data_files="your_data.json", split="train")
# Format for training; append the EOS token so the model learns when to stop generating
alpaca_prompt = """### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""
EOS_TOKEN = tokenizer.eos_token
def formatting_func(examples):
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"],
        examples["input"],
        examples["output"]
    ):
        text = alpaca_prompt.format(
            instruction=instruction,
            input=input_text if input_text else "",
            output=output
        ) + EOS_TOKEN  # without EOS, the fine-tune tends to generate forever
        texts.append(text)
    return {"text": texts}
dataset = dataset.map(formatting_func, batched=True)
4. Configure Training
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=100, # Adjust based on dataset size
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=10,
output_dir="outputs",
optim="adamw_8bit",
),
)
5. Train
trainer.train()
On an RTX 3090 with 500 examples, this takes ~30-60 minutes. On an RTX 3060 12GB, expect 1-2 hours.
6. Save the LoRA Adapter
# Save just the LoRA weights (small, ~50-200 MB)
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")
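Before converting anything, it’s worth a quick smoke test on a held-out prompt. A sketch using the objects already defined above (the instruction text is just an example):
# Switch Unsloth into inference mode and generate from the trained adapter
FastLanguageModel.for_inference(model)
prompt = alpaca_prompt.format(
    instruction="Summarize the following text in one sentence.",
    input="The quick brown fox jumps over the lazy dog.",
    output="",  # leave the response empty; the model fills it in
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))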
Converting and Using Your Fine-Tune
Merge LoRA into Base Model
To run your fine-tune in Ollama or llama.cpp, you need to merge the LoRA adapter into the base model and convert to GGUF.
# Merge LoRA into base model
model.save_pretrained_merged(
"merged-model",
tokenizer,
save_method="merged_16bit",
)
Convert to GGUF
# Clone llama.cpp if you haven't
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert to GGUF
python convert_hf_to_gguf.py ../merged-model --outfile my-model.gguf
# Quantize (optional, for smaller size)
./llama-quantize my-model.gguf my-model-q4_k_m.gguf q4_k_m
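If you’d rather skip the manual llama.cpp steps, Unsloth also exposes a one-call GGUF export; supported quantization methods can change between releases, so check the Unsloth docs:
# One-step GGUF export straight from the trained model (runs llama.cpp under the hood)
model.save_pretrained_gguf("my-model-gguf", tokenizer, quantization_method="q4_k_m")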
Run in Ollama
Create a Modelfile:
FROM ./my-model-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 2048
SYSTEM "You are a helpful assistant fine-tuned for [your task]."
ollama create my-fine-tune -f Modelfile
ollama run my-fine-tune
Common Mistakes
Overfitting
Symptom: Model memorizes training examples verbatim, fails on new inputs.
Fix:
- Use more diverse training data
- Reduce training steps
- Increase LoRA dropout (0.05-0.1)
- Lower learning rate
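The most reliable way to see overfitting as it happens is to hold out a small validation split and watch eval loss. A sketch layered on the tutorial code above:
# Hold out ~10% of the data for evaluation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = split["train"], split["test"]
# Pass train_dataset=train_ds and eval_dataset=eval_ds to SFTTrainer, and set
# eval_strategy="steps", eval_steps=20 in TrainingArguments (older transformers
# versions call it evaluation_strategy). Eval loss rising while train loss keeps
# falling is the classic overfitting signal.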
Catastrophic Forgetting
Symptom: Model loses general capabilities after fine-tuning.
Fix:
- Train for fewer steps
- Use lower learning rate (1e-5 instead of 2e-4)
- Include some general-purpose examples in your dataset
- Start from an instruct model, not base
Learning Rate Too High
Symptom: Training loss spikes or doesn’t decrease.
Fix:
- Start with 2e-4 for QLoRA
- Reduce to 1e-4 or 5e-5 if unstable
- Use warmup steps (5-10% of total steps)
Too Few Examples
Symptom: Model doesn’t learn the pattern you want.
Fix:
- Add more examples (aim for 200+ minimum)
- Ensure examples are diverse
- Check that examples are actually correct
Wrong Base Model
Symptom: Fine-tune underperforms expectations.
Fix:
- For format/style tasks: use instruct models
- For complex reasoning: use larger base models
- For coding: start from a code-specialized model
When NOT to Fine-Tune
Fine-tuning isn’t always the answer. Consider alternatives first:
Prompt Engineering Might Be Enough
If you can describe what you want in a prompt, try that before fine-tuning. Modern models are remarkably good at following detailed instructions.
Example: Instead of fine-tuning for JSON output, try:
Respond ONLY with valid JSON in this exact format:
{"field1": "value", "field2": "value"}
No explanations, no markdown, just the JSON object.
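To check whether prompting alone gets you there, you can test structured output against a local model before committing to a fine-tune. A sketch against Ollama’s HTTP API, assuming Ollama is running locally with a model named llama3.1 pulled:
import json
import requests
prompt = (
    "Respond ONLY with valid JSON in this exact format:\n"
    '{"field1": "value", "field2": "value"}\n'
    "No explanations, no markdown, just the JSON object.\n\n"
    "Describe today's weather in Paris."
)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": prompt, "format": "json", "stream": False},
)
data = json.loads(resp.json()["response"])  # raises if the output isn't valid JSON
print(data)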
RAG for New Knowledge
Fine-tuning doesn’t reliably add new factual knowledge. If you need the model to know about your company’s products, internal documents, or recent information, RAG is the right approach.
A Better Base Model Might Suffice
Before fine-tuning Llama 3.1 8B for coding, try Qwen 2.5 Coder 32B. The better base model might already do what you need.
The 80/20 Rule
Fine-tuning typically improves performance by 10-30% on specific tasks. If you need 2x improvement, fine-tuning alone won’t get you there. Consider:
- Better base model + fine-tuning
- RAG + fine-tuning
- Multiple specialized models
LoRA Hyperparameters
Rank (r)
| Rank | Use Case | Notes |
|---|---|---|
| r=4-8 | Simple tasks, style adaptation | Maximum efficiency |
| r=16-32 | Most tasks | Recommended default |
| r=64 | Complex tasks | More capacity |
| r=128+ | Rarely needed | Diminishing returns |
Research shows little practical difference between r=8 and r=256 for most tasks. Start with r=16.
Alpha
Common rule: alpha = 2 × rank
- r=8 → alpha=16
- r=16 → alpha=32
- r=32 → alpha=64
This is a community heuristic rather than a rule from the original LoRA paper; alpha = rank (as in the tutorial code above) also works well in practice.
Target Modules
For most transformer models, target these layers:
- q_proj, k_proj, v_proj, o_proj (attention)
- gate_proj, up_proj, down_proj (MLP)
Training all of these gives best results. Training only attention layers uses less memory but may reduce quality.
Bottom Line
Fine-tuning is more accessible than ever. An RTX 3060 12GB can train a 7B model with QLoRA. An RTX 3090 handles 13B models comfortably. You don’t need thousands of examples โ 500 high-quality samples often suffice.
The realistic path:
- Start with a good instruct model (Qwen 3, Llama 3)
- Collect 200-500 real examples from your use case
- Fine-tune with Unsloth + QLoRA
- Convert to GGUF and run in Ollama
Before you start:
- Try prompt engineering first; it might be enough
- Consider RAG for knowledge tasks
- Make sure your examples are high quality
- Start small (fewer steps, lower rank) and iterate
Fine-tuning is a tool, not a solution. It works well for teaching models specific formats, styles, and behaviors. It doesn’t work for adding knowledge or making small models smarter. Use it when it fits.
# Get started with Unsloth
pip install unsloth
# Or use their free Colab notebooks:
# https://github.com/unslothai/unsloth