Ollama on Mac: Setup and Optimization Guide (2026)
📚 More on this topic: Best Local LLMs for Mac 2026 · Running LLMs on Mac M-Series · Ollama Troubleshooting Guide · llama.cpp vs Ollama vs vLLM
Ollama is the fastest path from “I want to try local AI” to a model running on your Mac. One install, one command, and you’re talking to a model. No Python, no Docker, no CUDA drivers.
The generic Ollama docs work fine for getting started. What they skip is the Mac-specific stuff: how unified memory changes the rules, why your environment variables aren’t taking effect, which models fit your RAM, and how to confirm the GPU is actually being used.
Install Ollama and run your first model
Homebrew vs direct download
Both work. The differences are minor:
| Method | Command | Auto-updates | Menubar app | Service management |
|---|---|---|---|---|
| Homebrew | brew install ollama | brew upgrade | No | brew services start ollama |
| Direct download | ollama.com/download | Yes (automatic) | Yes | Built-in |
The direct download gives you a menubar app that auto-updates and auto-starts at login. Homebrew gives you CLI-only control and explicit version management.
Pick direct download if you want it running in the background without thinking about it. Pick Homebrew if you want to control exactly when it updates or if you manage everything through Homebrew already.
Either way, Metal GPU acceleration is compiled into the binary. No extra setup.
Pull your first model
After installing, open Terminal and pull a model matched to your RAM:
| Your RAM | Model | Pull command | Size on disk |
|---|---|---|---|
| 8 GB | Llama 3.2 3B | ollama pull llama3.2 | 2.0 GB |
| 16 GB | Qwen 3 8B | ollama pull qwen3:8b | 4.9 GB |
| 24 GB | Qwen 3 14B | ollama pull qwen3:14b | 9.0 GB |
| 32 GB | Qwen 3 32B | ollama pull qwen3:32b | 19 GB |
| 48 GB+ | Qwen 3 32B (Q6_K) | ollama pull qwen3:32b-q6_K | 24 GB |
| 64 GB+ | Llama 3.3 70B | ollama pull llama3.3:70b | 40 GB |
Then run it:
ollama run qwen3:8b
You should see output within a few seconds. If it takes 30+ seconds before the first token, something’s wrong (probably CPU fallback; see troubleshooting below).
Verify Metal GPU is working
Most people skip this. Don’t. Run:
ollama ps
Check the Processor column:
100% GPUmeans Metal is active. You’re good.100% CPUmeans it’s using system RAM only. Something is wrong.48%/52% CPU/GPUmeans the model is partially offloaded. Expected if the model barely fits.
If you see CPU when you expected GPU, jump to the troubleshooting section below.
For more detail, enable debug logging:
OLLAMA_DEBUG=1 ollama serve
Then pull up the logs. You’ll see messages about Metal device detection, memory allocation, and layer offloading.
How Ollama uses your Mac’s hardware
Unified memory changes the rules
On a PC, your GPU has dedicated VRAM (usually 8-24GB). Models that don’t fit in VRAM either won’t load or run at 2-3 tok/s on CPU fallback.
On Apple Silicon, there’s no separate GPU memory. Your entire RAM pool is shared between CPU and GPU. This means a 48GB Mac Mini can load a 32B model that would require a $700+ RTX 3090 on PC.
The downside: that same 48GB also runs macOS, your browser, and everything else. You don’t get the full amount for models.
The 66%/75% GPU memory rule
macOS caps how much unified memory the GPU can use:
- 36GB RAM or less: GPU gets about 66% (~21GB on a 32GB machine)
- More than 36GB: GPU gets about 75% (~48GB on a 64GB machine)
These are system-enforced limits. Anything beyond that spills to CPU processing, which is slower but still works. Ollama handles the split automatically.
What this means in practice:
| Your RAM | GPU allocation | Largest model (Q4) |
|---|---|---|
| 8 GB | ~5 GB | 3B comfortably, 7B tight |
| 16 GB | ~10 GB | 8B comfortably |
| 24 GB | ~16 GB | 14B comfortably |
| 32 GB | ~21 GB | 14B-20B range |
| 48 GB | ~36 GB | 32B comfortably |
| 64 GB | ~48 GB | 32B-70B range |
| 96 GB | ~72 GB | 70B comfortably |
| 128 GB | ~96 GB | 70B at high quant, or 2-3 models |
These numbers assume some headroom for macOS and apps. Close Chrome before loading large models. (Seriously. Browsers eat 4-8GB on their own.)
Memory bandwidth determines speed
An M3 Max generates tokens faster than an M4 Pro, even though the M4 Pro is a newer chip. Token generation speed is bottlenecked by memory bandwidth, not compute. This surprises people.
| Chip | Memory bandwidth | Rough 8B Q4 speed |
|---|---|---|
| M1 | 68 GB/s | ~15 tok/s |
| M1 Pro | 200 GB/s | ~25 tok/s |
| M2 | 100 GB/s | ~18 tok/s |
| M2 Pro | 200 GB/s | ~28 tok/s |
| M3 | 100 GB/s | ~20 tok/s |
| M3 Pro | 150 GB/s | ~25 tok/s |
| M3 Max | 400 GB/s | ~45 tok/s |
| M4 | 120 GB/s | ~22 tok/s |
| M4 Pro | 273 GB/s | ~35 tok/s |
| M4 Max | 546 GB/s | ~58 tok/s |
These are approximations for 8B Q4 models. Larger models will be slower. The pattern holds: bandwidth predicts speed better than chip generation.
What Activity Monitor should show
Open Activity Monitor while running a model and check:
- Memory tab: Memory Pressure should be green or yellow. Red means you’re swapping, and performance will tank.
- GPU tab (if visible): You should see Ollama or
ollama_llama_serverusing GPU. If it only appears in the CPU tab, Metal isn’t active. - Memory Used: Compare against your model size. A 4.9GB model should increase memory usage by roughly that amount.
If Memory Pressure hits red, you need a smaller model or lower quantization. There’s no way to force more memory; macOS manages the split.
Mac-specific optimization
Environment variables that matter
Ollama’s behavior is controlled through environment variables. The ones that matter most on Mac:
| Variable | Default | Recommended for Mac | What it does |
|---|---|---|---|
OLLAMA_KEEP_ALIVE | 5m | -1 (infinite) or 24h | How long a model stays loaded after last request |
OLLAMA_FLASH_ATTENTION | 0 | 1 | Reduces memory usage, slight speed boost |
OLLAMA_KV_CACHE_TYPE | f16 | q8_0 | Halves KV cache memory with minimal quality loss |
OLLAMA_NUM_PARALLEL | auto (1-4) | Leave default | Max parallel requests per model |
OLLAMA_MAX_LOADED_MODELS | 3 | 1-2 on 8-16GB | Max models in memory simultaneously |
OLLAMA_HOST | 127.0.0.1 | 0.0.0.0 if serving LAN | Network bind address |
OLLAMA_CONTEXT_LENGTH | 4096 | Depends on task | Token window size |
Setting environment variables on Mac (not .zshrc)
This trips up almost everyone. If you installed Ollama via the .dmg (direct download), it runs as a macOS app, not from your shell. Setting variables in .zshrc or .bash_profile does nothing because the app doesn’t read shell config files.
For the macOS app (direct download):
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
Then restart the Ollama app (click the menubar icon, select “Quit Ollama”, reopen it).
For Homebrew (brew services):
Create or edit ~/.ollama/env (or set them before starting the service):
# In your shell config (.zshrc):
export OLLAMA_KEEP_ALIVE=-1
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
Then restart:
brew services restart ollama
To verify a variable is set:
# Check what the running Ollama process sees
launchctl getenv OLLAMA_KEEP_ALIVE
Keep models in memory
By default, Ollama unloads models after 5 minutes of inactivity. Every time you come back, there’s a 2-10 second reload delay (depending on model size and disk speed).
Set OLLAMA_KEEP_ALIVE=-1 to keep models loaded indefinitely. On a Mac you use primarily for AI work, this is worth the memory tradeoff. The model sits in RAM ready to respond instantly.
To unload manually when you need the memory back:
ollama stop qwen3:8b
Flash attention and KV cache quantization
These two settings reduce memory usage with minimal quality impact:
OLLAMA_FLASH_ATTENTION=1 optimizes the attention computation to use less memory. No quality downside. Enable it unconditionally.
OLLAMA_KV_CACHE_TYPE=q8_0 quantizes the KV cache, which stores context during generation. At default FP16, a 32K context window can eat 2-4GB of memory. Setting it to q8_0 halves that with negligible quality impact. Setting it to q4_0 cuts it to a quarter but has a noticeable quality reduction on long conversations.
Combined, these two settings can free 2-6GB of memory depending on your context length. That can be the difference between a model fitting in GPU allocation or spilling to CPU.
One caveat: there have been reports of slight performance regressions with KV cache quantization on Apple’s Metal backend specifically. Test with your model. If you notice slower generation after enabling q8_0, revert to f16.
Best models by Mac configuration
For detailed model recommendations with tok/s benchmarks, see our Best Local LLMs for Mac 2026 guide. Here’s the quick reference:
| Mac config | RAM | Best general model | Best coding model | Best reasoning model |
|---|---|---|---|---|
| MacBook Air M1/M2 | 8 GB | Llama 3.2 3B | Qwen 2.5 Coder 3B | Phi-4 Mini 3.8B |
| MacBook Pro M2/M3 Pro | 16 GB | Qwen 3 8B | Qwen 2.5 Coder 7B | DeepSeek-R1-Distill-8B |
| MacBook Pro M3/M4 Pro | 24 GB | Qwen 3 14B | Qwen 2.5 Coder 14B | DeepSeek-R1-Distill-14B |
| Mac Mini/Studio M4 Pro | 32 GB | Qwen 3 14B (Q6) | Qwen 2.5 Coder 14B (Q6) | DeepSeek-R1-Distill-14B (Q6) |
| Mac Mini/Studio M4 Pro | 48 GB | Qwen 3 32B | Qwen 2.5 Coder 32B | QwQ 32B |
| Mac Studio M4 Max | 64 GB | Qwen 3 32B (Q8) | Qwen 2.5 Coder 32B (Q8) | QwQ 32B (Q8) |
| Mac Studio M3/M4 Ultra | 96 GB | Llama 3.3 70B | Qwen 2.5 Coder 32B + second model | DeepSeek-R1-Distill-70B |
| Mac Studio Ultra | 128 GB | Llama 3.3 70B (Q6) | Multi-model setup | DeepSeek-R1-Distill-70B (Q6) |
The 48GB tier is the price/performance sweet spot on Mac. A Mac Mini M4 Pro with 48GB ($1,199) runs 32B models that need a $700+ GPU on PC, and does it with zero fan noise.
At 96-128GB, you’re running 70B models that require $3,000+ in multi-GPU setups on PC. No other consumer platform gives you that much memory for AI at this price.
Troubleshooting Mac-specific issues
For the full troubleshooting guide covering all platforms, see our Ollama Troubleshooting Guide.
“ollama ps” shows CPU instead of GPU
The most common Mac problem. Symptoms: slow generation (3-5 tok/s when you expected 20+), high CPU usage in Activity Monitor, no GPU activity.
First, is the model too large for GPU allocation?
ollama ps
If it shows a CPU/GPU split like 23%/77%, that’s normal. The model partially fits in GPU memory, the rest runs on CPU. If it shows 100% CPU for a model that should fit, keep going.
Second, are you running through Docker? Docker on Mac has no access to Metal GPU acceleration. This is an Apple/Docker limitation, not an Ollama bug. If Ollama is inside Docker, it will always use CPU. Run it natively instead. The performance difference is 5-6x.
Third, did a recent Ollama update break things? Version 0.12.9 had a known regression where CPU fallback logic accidentally triggered on Apple Silicon. Some users went from 53 tok/s to 7 tok/s overnight. Update to the latest version. If that doesn’t fix it, remove any custom num_gpu parameters from Modelfiles, as restrictive GPU layer settings can force CPU fallback.
Memory pressure warnings
If Activity Monitor shows red memory pressure while running a model:
- Close browsers and heavy apps. Chrome alone can use 4-8GB.
- Use a smaller quantization. Switch from Q6 to Q4:
ollama pull qwen3:14b-q4_K_Minstead of the default. - Reduce context length. Set
OLLAMA_CONTEXT_LENGTH=2048if you don’t need long conversations. - Use a smaller model. Going from 14B to 8B frees several GB immediately.
Yellow memory pressure is usually fine. The model might swap slightly, but generation speed stays reasonable. Red means active swapping and the model will feel unusable.
Performance drops after updates
Both macOS updates and Ollama updates can affect performance.
After a macOS update, Metal drivers sometimes change behavior. Restart your Mac (not just sleep/wake) after major updates. Check ollama ps to confirm GPU is still active.
After an Ollama update, check the Ollama releases page for known issues. If performance dropped, you can pin a specific version via Homebrew:
brew install ollama@0.15.4 # whatever version worked
Or download a specific .dmg from the GitHub releases page.
Where Mac excels and where it falls short
Mac loads what NVIDIA can’t
A 32B model needs roughly 18-20GB of memory at Q4. On PC, that requires an RTX 3090 ($700+ used) or RTX 4090 ($1,800+). On Mac, a $1,199 Mac Mini with 48GB handles it with room to spare.
A 70B model needs roughly 40GB at Q4. On PC, you need two GPUs and a system that supports multi-GPU inference. On Mac, a Mac Studio with 96GB ($3,000-4,000) runs it natively.
No other consumer platform lets you load models this large at these price points.
The tradeoff: token speed per dollar
For models that fit in 24GB, an RTX 3090 is faster. A 7B model runs at 80-100+ tok/s on an RTX 3090 vs 15-45 tok/s on Apple Silicon (depending on your chip). CUDA’s memory bandwidth advantage is real.
If your workload stays within 24GB of VRAM, a used RTX 3090 ($450-550) plus a cheap PC build delivers more tok/s per dollar than any Mac.
What only Mac can do at consumer prices
| Capability | Mac cost | PC equivalent cost |
|---|---|---|
| Run 32B models | $1,199 (Mac Mini 48GB) | $700-1,800 (3090/4090) |
| Run 70B models | $3,000-4,000 (Mac Studio 96GB) | $3,000+ (dual GPU + board) |
| Run 70B at high quant | $4,000-5,000 (Mac Studio 128GB) | Not practical on consumer PC |
| Silent operation | Built-in | Aftermarket cooling needed |
| Multi-model serving | 96GB+ covers 2-3 models | Each model needs its own GPU |
The 96-128GB tier is where Mac has no PC equivalent at a similar price. Running two 32B models simultaneously, or one 70B model at Q6 quality, simply requires more memory than any single consumer GPU offers.
For a deeper comparison of Mac vs PC tradeoffs, see our Mac M-Series guide.
Get notified when we publish new guides.
Subscribe — free, no spam