Fine-Tuning on Mac: LoRA & QLoRA with MLX
We already have a general LoRA/QLoRA guide that covers fine-tuning on NVIDIA GPUs with Unsloth. This is the Mac version. Different framework, different constraints, different advantages.
The short version: Apple’s MLX framework lets you fine-tune models on Apple Silicon using LoRA and QLoRA. The unified memory architecture means your entire RAM pool is available for training – no separate VRAM limit. A 32GB MacBook Pro can fine-tune models that would crash a 24GB RTX 3090. The tradeoff is speed. NVIDIA hardware trains 2-4x faster when the model fits in VRAM. But if the model doesn’t fit in VRAM, NVIDIA can’t train it at all without multi-GPU setups. That’s where Mac wins.
Why Mac works for fine-tuning
On a PC, fine-tuning lives or dies by VRAM. An RTX 3090 has 24GB. A 7B model at full precision needs ~28GB for training – it doesn’t fit. QLoRA drops that to ~7GB, making it possible, but you’re always fighting the VRAM ceiling.
On Mac, there’s no separate VRAM. The GPU shares the same memory pool as the CPU. A MacBook Pro with 32GB of unified memory can use all 32GB (minus what macOS needs) for training. No copying data between CPU and GPU memory. No out-of-memory crashes because the GPU buffer ran out while system RAM sat idle.
This matters most for larger models. A 14B model needs roughly 18-24GB for LoRA training. That's tight on a 24GB GPU but comfortable on a 32GB Mac. A 32B model at QLoRA needs ~20-25GB. Out of reach on a 24GB consumer GPU. Straightforward on a 48GB Mac.
The catch: Apple Silicon’s memory bandwidth is lower than dedicated GPUs. An M3 Max pushes 400 GB/s. An RTX 4090 pushes 1,008 GB/s. Training is a bandwidth-heavy operation, so the Mac is slower per step. You get there – it just takes longer.
What you need
Hardware
Any Apple Silicon Mac works. The question is how large a model you can fine-tune:
| Unified memory | LoRA (full precision base) | QLoRA (4-bit base) | Training speed |
|---|---|---|---|
| 8 GB | 1B-3B (tight) | 3B | Slow, limited batch size |
| 16 GB | 3B-7B | 7B-8B | Workable |
| 24 GB | 7B-8B | 8B-14B | Comfortable |
| 32 GB | 8B-14B | 14B-32B (tight) | Good |
| 48 GB | 14B-32B | 32B | Good |
| 64 GB | 32B | 32B-70B (tight) | Fast |
| 96 GB | 70B (tight) | 70B | Fast |
| 128 GB | 70B | 70B+ comfortably | Fast |
For comparison, here’s what the same tasks require on NVIDIA:
| Mac (unified) | Equivalent NVIDIA setup |
|---|---|
| 16 GB Mac (QLoRA 8B) | RTX 3060 12GB or RTX 4060 Ti 16GB |
| 32 GB Mac (QLoRA 14B) | RTX 3090 24GB (tight) |
| 48 GB Mac (QLoRA 32B) | No single consumer GPU – need dual GPUs |
| 64 GB Mac (LoRA 32B) | RTX A6000 48GB ($4,000+) |
| 96 GB Mac (LoRA 70B) | Multi-GPU setup ($3,000+) |
The Mac doesn’t train faster. It trains models that don’t fit elsewhere.
Software
Python 3.10+ and pip. That’s it.
pip install "mlx-lm[train]"
This installs mlx-lm with training dependencies, including the mlx_lm.lora command. Verify it works:
python -c "import mlx.core as mx; print(mx.__version__)"
You should see version 0.30+ (current is 0.30.6 as of early 2026).
Choosing a base model
MLX fine-tuning requires HuggingFace-format models. GGUF files won’t work – you need the original safetensors weights.
Supported architectures: Llama, Mistral, Qwen2, Phi, Gemma, Mixtral, OLMo, MiniCPM, InternLM2.
| Model | Size | HuggingFace ID | Best for |
|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | meta-llama/Llama-3.1-8B-Instruct | General fine-tuning, good baseline |
| Qwen 2.5 7B Instruct | 7B | Qwen/Qwen2.5-7B-Instruct | Multilingual, strong coding |
| Mistral 7B Instruct v0.3 | 7B | mistralai/Mistral-7B-Instruct-v0.3 | Instruction following |
| Phi-4 Mini | 3.8B | microsoft/Phi-4-mini-instruct | Reasoning on low memory |
| Qwen 2.5 14B Instruct | 14B | Qwen/Qwen2.5-14B-Instruct | Best quality for 32GB Macs |
| Llama 3.3 70B Instruct | 70B | meta-llama/Llama-3.3-70B-Instruct | 96GB+ Macs only |
Start with an 8B model. It’s the sweet spot for learning the workflow and iterating fast. Move to 14B or 32B once you’ve nailed your dataset and training config.
For quantized models (QLoRA), use MLX-format quantized versions from the mlx-community org on HuggingFace, or quantize yourself:
mlx_lm.convert --hf-path meta-llama/Llama-3.1-8B-Instruct -q
This creates a 4-bit quantized version in MLX format. Point mlx_lm.lora at the quantized model and it automatically uses QLoRA.
Preparing training data
mlx-lm accepts three JSONL formats. Pick the one that matches your task.
Chat format (recommended for instruction tuning)
{"messages": [{"role": "system", "content": "You are a helpful legal assistant."}, {"role": "user", "content": "What is consideration in contract law?"}, {"role": "assistant", "content": "Consideration is something of value exchanged between parties..."}]}
Completions format (for prompt-response pairs)
{"prompt": "Summarize this contract clause:", "completion": "This clause establishes..."}
Text format (for continued pretraining)
{"text": "The Treaty of Westphalia (1648) established the principle of..."}
Put your data in a folder with these files:
- train.jsonl (required)
- valid.jsonl (optional but recommended – 10-15% of your data)
- test.jsonl (optional – for evaluation after training)
Each example must be a single line. No multi-line JSON.
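Before kicking off a training run, it's worth catching format problems up front. A minimal sketch (plain Python, no mlx dependency; the file path is a placeholder) that checks every line is valid single-line JSON with one of the three key sets mlx-lm accepts:

```python
import json

VALID_KEYS = ("messages", "prompt", "text")  # chat, completions, text formats

def validate_jsonl(path):
    """Return a list of (line_number, error) tuples; an empty list means the file is clean."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((i, f"invalid JSON: {e}"))
                continue
            if not any(k in obj for k in VALID_KEYS):
                errors.append((i, "missing 'messages', 'prompt', or 'text' key"))
            elif "prompt" in obj and "completion" not in obj:
                errors.append((i, "'prompt' present but 'completion' missing"))
    return errors

# Usage: errors = validate_jsonl("my-dataset/train.jsonl")
```

Run it on all three files before training; a single malformed line is enough to derail a run.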
How much data you need
Less than you think:
| Examples | What to expect |
|---|---|
| 50-100 | Enough to shift tone and format. Won’t learn new knowledge. |
| 200-500 | The practical sweet spot. Enough to specialize behavior on a well-defined task. |
| 500-1,000 | Good for complex tasks with varied inputs. Diminishing returns beyond this for most LoRA fine-tunes. |
| 1,000+ | Only if your task has high variability. More data doesn’t always help – 200 clean examples beat 2,000 sloppy ones. |
Quality matters more than quantity. Every example should represent exactly the behavior you want. If you’re fine-tuning a model to write SQL from natural language, every example should be a clean natural-language-to-SQL pair. No filler, no duplicates, no contradictory examples.
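If your examples live in one file, carving out the recommended 10-15% validation split is a few lines of Python. A sketch (the file names in the usage comment are hypothetical; mlx-lm only cares that train.jsonl and valid.jsonl land in the same folder):

```python
import random

def split_dataset(lines, valid_fraction=0.1, seed=42):
    """Shuffle JSONL lines and split into (train, valid) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

# Usage (hypothetical file names):
#   lines = [l for l in open("all.jsonl", encoding="utf-8") if l.strip()]
#   train, valid = split_dataset(lines)
#   open("my-dataset/train.jsonl", "w").writelines(train)
#   open("my-dataset/valid.jsonl", "w").writelines(valid)
```

Shuffling before splitting matters: if your file is sorted by topic, an unshuffled tail split gives you a validation set the model never trained anything like.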
Running LoRA fine-tuning
The basic command:
mlx_lm.lora \
--model meta-llama/Llama-3.1-8B-Instruct \
--train \
--data ./my-dataset \
--iters 500
This downloads the model from HuggingFace (first run only), trains LoRA adapters for 500 iterations, and saves them to ./adapters/.
Key parameters
| Parameter | Default | What it does |
|---|---|---|
| --iters | 1000 | Training iterations. 300-600 is usually enough for 200-500 examples. |
| --batch-size | 4 | Reduce to 2 or 1 if you hit memory pressure. |
| --num-layers | 16 | Number of layers to apply LoRA to. Reduce to 8 or 4 to save memory. |
| --learning-rate | 1e-5 | Default works for most cases. |
| --fine-tune-type | lora | Options: lora, dora, full. Stick with lora unless you know why you need the others. |
| --grad-checkpoint | off | Enable with --grad-checkpoint to trade speed for memory. |
| --mask-prompt | off | Enable to compute loss only on the assistant's response, not the prompt. Recommended for chat format. |
A realistic example
Fine-tuning Llama 3.1 8B on a 32GB Mac to write SQL:
mlx_lm.lora \
--model meta-llama/Llama-3.1-8B-Instruct \
--train \
--data ./sql-dataset \
--iters 500 \
--batch-size 2 \
--num-layers 16 \
--mask-prompt
During training you’ll see output like:
Iter 10: Train loss 2.431, It/sec 0.89, Tokens/sec 285
Iter 20: Train loss 1.872, It/sec 0.91, Tokens/sec 291
Iter 100: Train loss 0.843, It/sec 0.92, Tokens/sec 294
...
Iter 500: Train loss 0.312, It/sec 0.93, Tokens/sec 296
Loss should drop steeply in the first 100 iterations and flatten by 300-500. If it’s still dropping at 500, increase --iters. If it flatlines at a high value, your data may have issues.
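If you capture the console output to a file, you can check programmatically whether the curve has flattened. A minimal sketch, assuming the "Iter N: Train loss X" line format shown above; the flatness heuristic and its threshold are arbitrary choices of this example, not mlx-lm behavior:

```python
import re

LOSS_RE = re.compile(r"Iter (\d+): Train loss ([\d.]+)")

def parse_losses(log_text):
    """Extract (iteration, loss) pairs from mlx_lm.lora console output."""
    return [(int(i), float(l)) for i, l in LOSS_RE.findall(log_text)]

def still_improving(losses, window=5, threshold=0.01):
    """Heuristic: did average loss drop by more than `threshold` between the
    last two windows of reports? If so, consider raising --iters."""
    if len(losses) < 2 * window:
        return True  # not enough data points to call it flat
    recent = [l for _, l in losses[-window:]]
    earlier = [l for _, l in losses[-2 * window:-window]]
    return (sum(earlier) - sum(recent)) / window > threshold
```

Pipe the training output through `tee train.log` and run `still_improving(parse_losses(open("train.log").read()))` after the run to decide whether a longer run is worth it.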
Training speed depends on your chip. Rough numbers for LoRA on an 8B model:
| Chip | Tokens/sec (training) |
|---|---|
| M1 Max 32GB | ~250 |
| M2 Pro 16GB | ~200 |
| M2 Ultra 192GB | ~475 |
| M3 Max 48GB | ~320 |
| M4 Max 128GB | ~380 |
A 500-iteration run on 300 examples takes roughly 15-45 minutes depending on your hardware and sequence length.
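Those run times follow from simple arithmetic: tokens processed ≈ iterations × batch size × average sequence length, divided by your chip's training throughput. A back-of-envelope sketch (the throughput figure in the comment is the rough M3 Max number from the table above):

```python
def estimate_minutes(iters, batch_size, avg_tokens_per_example, tokens_per_sec):
    """Rough wall-clock estimate for a LoRA run; ignores warmup and validation passes."""
    total_tokens = iters * batch_size * avg_tokens_per_example
    return total_tokens / tokens_per_sec / 60

# e.g. 500 iterations, batch size 2, ~350-token examples, M3 Max (~320 tok/s):
# estimate_minutes(500, 2, 350, 320) -> about 18 minutes
```

The same formula explains why long examples hurt: doubling average sequence length doubles the run time.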
QLoRA: when memory is tight
QLoRA quantizes the base model to 4-bit and trains LoRA adapters on top. The base model weights stay frozen at 4-bit. Only the small LoRA adapter trains at full precision.
The memory savings are dramatic:
| Model | LoRA memory | QLoRA memory | Savings |
|---|---|---|---|
| Llama 3.1 8B | ~14 GB | ~7 GB | 50% |
| Qwen 2.5 14B | ~24 GB | ~12 GB | 50% |
| Llama 3.3 70B | ~130 GB | ~45 GB | 65% |
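The pattern in the table comes down to bytes per parameter: the rough figures above work out to about 1.75 GB per billion parameters for fp16 LoRA and about 0.8 GB/B for 4-bit QLoRA. A ballpark sketch (these coefficients are fitted to the table above, not MLX-reported numbers):

```python
def estimate_training_gb(params_billions, qlora=False):
    """Very rough training footprint, fitted to the table above:
    ~1.75 GB per billion parameters for fp16 LoRA (frozen base weights
    plus activations and adapter optimizer state), ~0.8 GB/B for QLoRA."""
    return params_billions * (0.8 if qlora else 1.75)

# estimate_training_gb(8)              -> ~14 GB (LoRA)
# estimate_training_gb(8, qlora=True)  -> ~6-7 GB (QLoRA)
```

Compare the result against your unified memory minus a few GB for macOS to decide between LoRA and QLoRA before downloading anything.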
To use QLoRA, point mlx_lm.lora at a quantized model. If the model is already quantized, it uses QLoRA automatically:
# First, quantize the model (one-time)
mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
-q \
--q-bits 4
# Then train with QLoRA (automatic when model is quantized)
mlx_lm.lora \
--model mlx_model \
--train \
--data ./my-dataset \
--iters 500
Or use pre-quantized models from mlx-community on HuggingFace:
mlx_lm.lora \
--model mlx-community/Llama-3.1-8B-Instruct-4bit \
--train \
--data ./my-dataset \
--iters 500
When to use QLoRA vs LoRA
- 16GB Mac + 8B model: QLoRA (LoRA won’t fit)
- 32GB Mac + 8B model: LoRA (better quality, you have the memory)
- 32GB Mac + 14B model: QLoRA (LoRA is too tight)
- 64GB Mac + 32B model: Either works. QLoRA for speed, LoRA for quality.
QLoRA produces slightly lower quality than full-precision LoRA because the base weights are approximated at 4-bit. For most practical tasks the difference is small. If you’re on the edge of fitting, use QLoRA and don’t worry about it.
Memory-saving tricks
If you’re still running out of memory:
- Reduce batch size: --batch-size 1
- Fewer LoRA layers: --num-layers 8 or --num-layers 4
- Enable gradient checkpointing: --grad-checkpoint
- Close other apps. Safari with 30 tabs eats 4-8GB. Close everything non-essential.
- Watch Activity Monitor. Green memory pressure means you're fine. Yellow means tight. Red means swap – your training will crawl.
Testing your fine-tuned model
After training completes, test with the adapters loaded:
mlx_lm.generate \
--model meta-llama/Llama-3.1-8B-Instruct \
--adapter-path ./adapters \
--prompt "Write a SQL query to find the top 5 customers by total spend"
Compare the output with and without the adapter to see what changed. Run without --adapter-path to see the base model’s response.
For quantitative evaluation, use the test set:
mlx_lm.lora \
--model meta-llama/Llama-3.1-8B-Instruct \
--adapter-path ./adapters \
--data ./my-dataset \
--test
This reports test loss. Lower is better, but the real test is whether the model’s output matches what you want. Read the actual generations, not just the loss number.
Deploying with Ollama
Your fine-tuned adapters are only useful inside mlx-lm until you export them. To use the model with Ollama, you need to fuse the adapters into the base model and convert to GGUF.
Step 1: Fuse adapters
mlx_lm.fuse \
--model meta-llama/Llama-3.1-8B-Instruct \
--adapter-path ./adapters \
--save-path ./fused-model \
--de-quantize
The --de-quantize flag is important if you trained with QLoRA. It converts the fused model back to fp16, which is needed for clean GGUF conversion.
Step 2: Export to GGUF
mlx_lm.fuse \
--model meta-llama/Llama-3.1-8B-Instruct \
--adapter-path ./adapters \
--export-gguf
This produces a GGUF file (fp16) in the output directory. Note: GGUF export currently works with Llama, Mistral, and Mixtral architectures. For other architectures, use llama.cpp’s convert_hf_to_gguf.py on the fused safetensors model.
Step 3: Create an Ollama model
Write a Modelfile:
FROM ./fused-model/ggml-model-f16.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|eot_id|>"
PARAMETER temperature 0.7
Then create and run:
ollama create my-sql-model -f Modelfile
ollama run my-sql-model
Your fine-tuned model is now available through Ollama’s API, accessible by any tool that speaks the Ollama or OpenAI-compatible protocol.
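Any HTTP client can hit it from there. A minimal standard-library sketch against Ollama's native /api/chat endpoint, assuming the default port 11434 and the model name created above:

```python
import json
import urllib.request

def build_chat_request(model, prompt, host="http://localhost:11434"):
    """Build a non-streaming request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete JSON response instead of a stream
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With Ollama running locally:
#   req = build_chat_request("my-sql-model", "Top 5 customers by total spend, as SQL")
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["message"]["content"])
```

Tools that expect the OpenAI protocol can instead point at Ollama's /v1/chat/completions endpoint with the same model name.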
Alternative: quantize the GGUF
The fp16 GGUF is large (~16GB for an 8B model). For daily use, quantize it:
# Using llama.cpp's llama-quantize tool
./llama-quantize fused-model/ggml-model-f16.gguf fused-model/model-q4_k_m.gguf q4_k_m
Then point your Modelfile at the quantized version.
Mac vs NVIDIA: honest comparison
| | Mac (MLX) | NVIDIA (CUDA) |
|---|---|---|
| Max model for LoRA (single device) | 70B on 128GB Mac | 70B only with multi-GPU ($5,000+) |
| Training speed (8B LoRA) | ~250-400 tok/s | ~800-1,500 tok/s |
| Memory efficiency | Unified – all RAM available | VRAM-limited (24GB max consumer) |
| Ecosystem | mlx-lm only | Unsloth, Axolotl, HuggingFace, dozens of tools |
| Multi-GPU training | Not supported | Standard with FSDP, DeepSpeed |
| Setup complexity | pip install, done | CUDA, cuDNN, driver versions, venv management |
| Cost for 8B fine-tuning | $0 (use existing Mac) | $700-900 (used RTX 3090) + PC |
Where Mac is genuinely better
- Models that don’t fit in 24GB VRAM. Fine-tuning a 14B model with full LoRA needs ~24GB. That’s the absolute ceiling for an RTX 3090 and it’ll likely OOM with any real batch size. A 32GB Mac handles it without drama.
- No driver/CUDA hassle. Install mlx-lm, run the command. No CUDA toolkit version conflicts, no cuDNN mismatches, no “which PyTorch build do I need” headaches.
- Silent training. A multi-hour training run on Mac produces zero fan noise. An RTX 3090 under training load sounds like a desk fan on high.
- You already own the hardware. If you have a 32GB+ Mac, fine-tuning is free to try. No GPU purchase, no separate PC.
Where Mac falls short
- Raw speed. An RTX 3090 trains 2-4x faster than an M2 Max on the same model. For production workflows where you’re iterating on datasets and hyperparameters, this adds up.
- Ecosystem. Unsloth (2-5x speedup on NVIDIA) doesn’t work on Mac. Axolotl, a popular fine-tuning framework, is CUDA-only. HuggingFace Trainer works on Mac via MPS but is slower and less tested than CUDA.
- Multi-GPU scaling. If you need to train 70B models fast, NVIDIA scales to 2, 4, 8 GPUs. MLX has no multi-device training support.
- Community. Most fine-tuning tutorials, guides, and debugging resources assume CUDA. When something goes wrong on MLX, you’re searching through GitHub issues.
The practical verdict
For most people reading this article, the answer is straightforward: you own a Mac with 32GB+, you want to fine-tune a 7B-14B model for a specific task, and MLX is the path of least resistance. Install one package, run one command, get a fine-tuned model. No hardware purchase needed.
The calculus changes if fine-tuning is a core part of your workflow and you’re iterating on datasets daily. The 2-4x speed advantage of NVIDIA hardware adds up across dozens of training runs, and the broader ecosystem (Unsloth, Axolotl) saves real time. But if the model you want to fine-tune doesn’t fit on any single consumer GPU, Mac is your only option short of cloud rentals.
Troubleshooting
Out-of-memory / system becomes unresponsive
Your model is too large for available memory. Fixes in order:
- Switch to QLoRA (4-bit base model)
- --batch-size 1
- --num-layers 4
- --grad-checkpoint
- Close all other apps
- Use a smaller model
“Model type not supported”
MLX LoRA supports: Llama, Mistral, Qwen2, Phi, Gemma, Mixtral, OLMo, MiniCPM, InternLM2. If your model isn’t one of these architectures, it won’t work. Check the model card on HuggingFace for the architecture.
Training loss doesn’t decrease
- Check your data format. Each line must be valid JSON with the correct keys.
- Make sure examples are consistent. Contradictory examples confuse training.
- Try increasing --learning-rate slightly (e.g., 3e-5 instead of 1e-5).
- If loss is extremely high from the start (>10), the model may not match your data format. Try a different chat template.
“You must use HuggingFace format models”
GGUF files don’t work with mlx-lm training. You need the original safetensors weights from HuggingFace. Download with:
mlx_lm.convert --hf-path <model-name>
Or let mlx_lm.lora download automatically by passing the HuggingFace model ID.
GGUF export fails for non-Llama models
The built-in --export-gguf flag in mlx_lm.fuse only supports Llama, Mistral, and Mixtral. For Qwen, Phi, or Gemma:
- Fuse the model: mlx_lm.fuse --model <model> --save-path ./fused
- Convert using llama.cpp: python convert_hf_to_gguf.py ./fused --outtype f16
- Quantize: ./llama-quantize fused.gguf fused-q4.gguf q4_k_m
Training is extremely slow
- Check Activity Monitor for memory pressure. Yellow/red means the system is swapping, which destroys training speed.
- Make sure examples are under 3,500 tokens each. Very long sequences slow training and eat memory.
- --grad-checkpoint saves memory but adds ~30% training time. Only use it if you need to.
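To spot oversized examples without loading a tokenizer, a crude character-count check works: roughly 4 characters per token is a common English-text approximation (an assumption of this sketch, not an mlx-lm rule):

```python
import json

def flag_long_examples(path, max_tokens=3500, chars_per_token=4):
    """Flag JSONL lines whose text length suggests they exceed the token budget."""
    flagged = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue
            obj = json.loads(line)
            # Concatenate the text content for whichever format the line uses
            if "messages" in obj:
                text = " ".join(m["content"] for m in obj["messages"])
            elif "prompt" in obj:
                text = obj["prompt"] + " " + obj.get("completion", "")
            else:
                text = obj.get("text", "")
            approx_tokens = len(text) / chars_per_token
            if approx_tokens > max_tokens:
                flagged.append((i, int(approx_tokens)))
    return flagged
```

Trim or split anything it flags; one 10,000-token outlier can dominate memory use for the whole run.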
Bottom line
MLX fine-tuning on Mac is the simplest path to a custom model if you already own the hardware. The entire workflow is four commands: install, train, fuse, deploy. No GPU purchase, no CUDA setup, no driver conflicts.
The practical setup: a 32GB Mac, 200-500 clean examples in JSONL, QLoRA on an 8B model, 300-500 iterations. You’ll have a fine-tuned model running in Ollama within a couple of hours.
It’s not as fast as NVIDIA. The ecosystem isn’t as mature. But for a first fine-tune, or for models that don’t fit in GPU VRAM, it’s the easiest way to get there.