
Three tools dominate local LLM inference: llama.cpp, Ollama, and vLLM. They solve different problems, and picking the wrong one either wastes your hardware or makes your life harder than it needs to be.

Here’s the short version: Ollama is llama.cpp with training wheels. vLLM is for serving models to multiple users. llama.cpp is the engine under Ollama’s hood that you can also drive directly.

This guide covers when each one makes sense, with actual benchmarks instead of vibes.


What Each Tool Actually Is

llama.cpp

The original. Georgi Gerganov wrote it to run LLaMA on a MacBook, and it’s become the de facto standard for running quantized models on consumer hardware. Written in C++, it handles GGUF model files and runs on basically anything: CPUs, NVIDIA GPUs, AMD GPUs, Apple Silicon, even phones.

Key insight: llama.cpp is an inference engine, not a user-facing application. You interact with it via command line or build applications on top of it.

Ollama

A friendly wrapper around llama.cpp. It handles model downloads, versioning, and provides a simple CLI and REST API. When you run ollama run llama3, it’s using llama.cpp underneath to actually do the inference.

Key insight: Ollama adds convenience, not performance. The inference speed is llama.cpp speed with a small overhead from the Go-based server layer.

vLLM

A production inference engine built for serving. Uses PagedAttention to efficiently handle many concurrent requests, tensor parallelism for multi-GPU setups, and continuous batching to maximize throughput. Written mostly in Python with custom CUDA kernels, and optimized for NVIDIA GPUs.

Key insight: vLLM is designed for “many users, one server”, the opposite of the desktop use case Ollama targets.


Head-to-Head Comparison

| Feature | llama.cpp | Ollama | vLLM |
|---|---|---|---|
| Primary use | Engine/library | Desktop local AI | Production serving |
| Setup difficulty | Medium | Easy | Medium-Hard |
| Model format | GGUF | GGUF (via llama.cpp) | HuggingFace, GPTQ, AWQ, GGUF |
| Single-user speed | Fastest | ~Same as llama.cpp | Slightly slower |
| Multi-user throughput | Poor | Poor | Excellent |
| Multi-GPU support | Limited | Limited | Excellent |
| CPU inference | Yes | Yes | Limited |
| Memory efficiency | Good | Good | Best (PagedAttention) |
| API | HTTP server optional | REST API built-in | OpenAI-compatible |
| Model management | Manual | Automatic | Manual |

Performance Benchmarks

Real numbers, measured with Qwen 2.5 3B on an RTX 4090:

Single User (You Alone)

| Scenario | llama.cpp | vLLM | Winner |
|---|---|---|---|
| 2K prompt + 256 gen | 90.0s | 94.4s | llama.cpp (4.7% faster) |
| 30K prompt + 256 gen | 231.6s | 245.8s | llama.cpp (5.8% faster) |

For single-user desktop use, llama.cpp (and therefore Ollama) is slightly faster. The difference is small, under 6%, but it’s real.

Multiple Users (Concurrent Requests)

| Scenario | llama.cpp | vLLM | Winner |
|---|---|---|---|
| 16 concurrent, 2K prompt | 265.5s | 215.3s | vLLM (23% faster) |
| 16 concurrent, 24K prompt | 3640.7s | 3285.3s | vLLM (11% faster) |

Once you have multiple users hitting the same model, vLLM pulls ahead significantly. At 16 concurrent requests, it’s 23% faster. At higher concurrency, the gap widens further.
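You can run a crude version of this comparison yourself with nothing but curl. A minimal sketch, assuming a vLLM server already running on localhost:8000 (its default port) and a placeholder model name you’d swap for whatever you’re actually serving:

# fire 16 identical requests in parallel and time the whole batch
time seq 16 | xargs -P 16 -I{} curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "YOUR-MODEL", "prompt": "Explain KV caches briefly.", "max_tokens": 256}' \
  -o /dev/null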

The PagedAttention Advantage

vLLM’s secret weapon is PagedAttention. Conventional KV-cache allocation wastes roughly 60-80% of that memory to fragmentation and over-reservation; PagedAttention manages the cache in fixed-size blocks and recovers most of the waste. In practice:

  • Standard implementation on 24GB VRAM: ~32 concurrent sequences
  • vLLM with PagedAttention: ~128 concurrent sequences

Same hardware, 4x the capacity. That’s why production deployments use vLLM.
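The arithmetic behind that: for a hypothetical Llama-7B-class model (32 layers, 32 KV heads, head dimension 128, fp16), the KV cache costs about half a megabyte per token, so an allocator that reserves the full context up front pins roughly 2 GB per 4K-token sequence whether or not the sequence ever grows that long. Illustrative numbers only:

# K and V, per layer, per head, per token, at 2 bytes each (fp16)
echo $((2 * 32 * 32 * 128 * 2))                       # 524288 bytes, ~0.5 MiB per token
echo $((2 * 32 * 32 * 128 * 2 * 4096 / 1024 / 1024))  # ~2048 MiB reserved per 4K-token sequence
# PagedAttention hands out the cache in small blocks as tokens arrive,
# so short or half-finished sequences don't pin gigabytes of VRAM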


Ollama: When to Use It

Use Ollama when:

  • You’re running models for yourself on your own machine
  • You want model management handled automatically
  • You’re building local apps that need an LLM API
  • You don’t want to think about quantization formats or compile flags
  • You’re using Open WebUI or similar frontends

Don’t use Ollama when:

  • You’re serving multiple users simultaneously
  • You need multi-GPU tensor parallelism
  • You need maximum control over inference parameters
  • You’re building a production API

Ollama’s Real Value

Ollama’s killer feature isn’t performance; it’s convenience. Compare the workflows:

With Ollama:

ollama run qwen3:8b

With llama.cpp directly:

# Find and download GGUF file manually
# Figure out the right quantization
./llama-cli -m ./models/qwen3-8b-q4_k_m.gguf \
  -c 8192 -n 256 --temp 0.7 -p "Your prompt here"

Ollama handles model discovery, downloads, versioning, and provides a consistent interface. For desktop use, that convenience is worth the tiny overhead.
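The API story is similar. Ollama’s server listens on port 11434 by default and exposes a simple REST interface; a minimal sketch (the model tag is whatever you’ve already pulled):

# one-shot generation against Ollama's REST API
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen3:8b", "prompt": "Why is the sky blue?", "stream": false}'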

Ollama Limitations

  • No tensor parallelism: Multi-GPU support exists but doesn’t split models across GPUs efficiently
  • Limited quantization control: You get what’s in the Ollama library, can’t easily use custom quants
  • Server overhead: The Go layer adds some latency, noticeable at very high request rates
  • Batching: Handles concurrent requests poorly compared to vLLM

llama.cpp: When to Use It Directly

Use llama.cpp when:

  • You need CPU inference or heavy CPU offloading
  • You want maximum control over quantization and inference settings
  • You’re building something that needs to embed inference directly
  • You’re on unusual hardware (ARM, older GPUs, etc.)
  • You need features Ollama hasn’t exposed yet

Don’t use llama.cpp when:

  • You just want to chat with a model (use Ollama)
  • You need high-concurrency serving (use vLLM)
  • You’re doing multi-GPU inference (use vLLM or ExLlamaV2)

llama.cpp’s Unique Strengths

CPU offloading: llama.cpp is the only one of the three that gracefully handles models too large for your VRAM by offloading layers to system RAM. Yes, it’s slow (~1 tok/s for huge models), but it works. vLLM isn’t built for this.

Hardware compatibility: Runs on NVIDIA, AMD, Intel, Apple Silicon, and even pure CPU. If you have unusual hardware, llama.cpp probably supports it.

Quantization ecosystem: The GGUF format and llama.cpp’s quantization tools are the standard. K-quants (Q4_K_M, Q5_K_M) and I-quants give you fine-grained control over the size/quality tradeoff:

| Quant | Bits/Weight | Perplexity Impact | Use Case |
|---|---|---|---|
| Q4_K_M | ~4.5 bpw | +0.05 ppl | Default recommendation |
| Q5_K_M | ~5.3 bpw | +0.01 ppl | Quality sweet spot |
| Q3_K_M | ~3.7 bpw | +0.66 ppl | VRAM constrained |
| Q6_K | ~6.0 bpw | +0.004 ppl | Near-lossless |
| IQ2_XS | ~2.3 bpw | Higher | Extreme compression |
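Producing one of these quants yourself is a one-liner with llama.cpp’s quantize tool; a sketch, assuming you already have an fp16 GGUF of the model (the binary was called quantize in older builds, llama-quantize in current ones):

# re-quantize an fp16 GGUF down to Q4_K_M
./llama-quantize ./models/model-f16.gguf ./models/model-q4_k_m.gguf Q4_K_M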

Speculative decoding: Use a small draft model to propose tokens that the main model verifies in a single pass, speeding up generation. Users report around 12 tok/s versus an 8-9 tok/s baseline when the draft model is a good match for the main one.
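A sketch of what that looks like on the command line; the draft-model flags have moved around between llama.cpp releases, so treat the exact names as an assumption and check --help for your build:

# serve a large model with a small same-family draft model (-md / --model-draft)
./llama-server -m big-model-q4_k_m.gguf -md small-draft-q4_k_m.gguf \
  -ngl 99 --draft-max 16 --port 8080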

Basic llama.cpp Usage

# Run inference
./llama-cli -m model.gguf -p "Your prompt" -n 256

# Start a server
./llama-server -m model.gguf -c 4096 --host 0.0.0.0 --port 8080

# With GPU layers (offload 35 layers to GPU)
./llama-cli -m model.gguf -ngl 35 -p "Your prompt"

# Enable flash attention
./llama-cli -m model.gguf -fa -p "Your prompt"
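Once llama-server is running, it also speaks an OpenAI-compatible HTTP API, so you can test it with curl. A minimal sketch against the server started above (the model field can be omitted, since the server only has one model loaded):

# chat completion against llama-server's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'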

vLLM: When to Use It

Use vLLM when:

  • You’re serving a model to multiple users
  • You need an OpenAI-compatible API
  • You have multi-GPU setups and want tensor parallelism
  • Throughput and cost efficiency matter more than setup simplicity
  • You’re running a production inference service

Don’t use vLLM when:

  • You’re the only user (Ollama is simpler)
  • You need CPU inference or offloading
  • You’re on non-NVIDIA hardware (limited support)
  • You want the simplest possible setup

vLLM’s Production Numbers

The throughput improvements are dramatic:

  • 14-24x higher throughput vs standard HuggingFace Transformers
  • 3-10x improvement from continuous batching alone
  • Stripe reduced inference costs by 73% using vLLM

Concrete example: Serving Mixtral-8x7B on 2x A100 at 100 requests/second:

  • vLLM P50 latency: 180ms
  • Standard serving P50 latency: 650ms

At scale, vLLM isn’t optional; it’s required to make the math work.

vLLM Setup

# Install
pip install vllm

# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct

# With tensor parallelism (multi-GPU)
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4

# With quantization
vllm serve model-name --quantization awq

The API is OpenAI-compatible, so existing code that calls OpenAI can point at your vLLM server instead:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

vLLM Hardware Requirements

vLLM is GPU-hungry and primarily targets NVIDIA:

| Model Size | Minimum VRAM | Recommended |
|---|---|---|
| 7B | 16 GB | 24 GB |
| 13B | 24 GB | 40 GB |
| 30B+ | 40 GB+ | 80 GB+ |

For consumer GPUs, an RTX 3090 or 4090 can run 7-13B models in vLLM. Larger models need professional cards or multi-GPU setups.
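When VRAM is tight, vLLM’s sizing flags matter as much as the card itself: cap the context length and tune how much of the GPU memory vLLM is allowed to claim. A sketch with illustrative values:

# squeeze an 8B model onto a 24 GB consumer card with a shorter context window
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90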


Multi-GPU: Critical Guidance

Do NOT use llama.cpp or Ollama for multi-GPU inference.

This is the most common mistake. llama.cpp’s multi-GPU support exists but isn’t optimized for tensor parallelism. Benchmark data shows it performs poorly compared to purpose-built solutions.

For multi-GPU setups:

  • vLLM: Best for FP16/BF16 models, excellent tensor parallelism
  • ExLlamaV2: Best for quantized models (EXL2 format), good multi-GPU support

If you have 2+ GPUs and want to run large models across them, skip Ollama entirely.


Other Tools Worth Knowing

ExLlamaV2

The speed king for quantized inference on NVIDIA GPUs. Uses EXL2 format (similar to GPTQ but better quality). Benchmark: 120-150 tok/s on RTX 4090 for 13B models vs 80-100 tok/s for llama.cpp.

Use when: You need maximum speed for quantized models on NVIDIA, especially multi-GPU.

Skip when: You need CPU offloading or non-NVIDIA hardware.

kobold.cpp

A fork of llama.cpp focused on creative writing and roleplay. Adds features like soft prompts, memory management, and UI tailored for story generation.

Use when: Fiction writing, roleplay, story continuation.

text-generation-inference (TGI)

HuggingFace’s production inference server. Similar to vLLM in goals, different implementation. Good HuggingFace integration.

Use when: You’re already deep in the HuggingFace ecosystem.


Decision Flowchart

Are you the only user?

  • Yes → Ollama (or llama.cpp if you need more control)
  • No → Continue

Do you have multiple GPUs you want to use together?

  • Yes → vLLM (or ExLlamaV2 for quantized models)
  • No → Continue

Are you serving >5 concurrent users?

  • Yes → vLLM
  • No → Ollama is probably fine

Do you need CPU inference or offloading?

  • Yes → llama.cpp (only real option)
  • No → See above

Are you building a production API?

  • Yes → vLLM
  • No → Ollama

Using Multiple Tools Together

You don’t have to pick just one. A common setup:

  1. Ollama for development: Quick testing, trying new models, local experimentation
  2. vLLM for production: Serving the final model to users

Or for power users:

  1. Ollama for daily use: Chat, quick queries, Open WebUI
  2. llama.cpp directly: When you need specific quantization or settings Ollama doesn’t expose

Ollama → llama.cpp Migration

If you’ve been using Ollama and want to try llama.cpp directly, your models are already downloaded. Find them at:

  • macOS: ~/.ollama/models/
  • Linux: ~/.ollama/models/ or /usr/share/ollama/.ollama/models/
  • Windows: C:\Users\<user>\.ollama\models\

The actual GGUF files are in blobs/ with hash names. You can use them directly with llama.cpp.
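To find the right blob without decoding hashes by hand, ollama show prints the Modelfile, whose FROM line contains the path to the underlying GGUF; pass that path to llama-cli with -m. The sketch below reuses the qwen3:8b tag from earlier:

# locate the GGUF blob behind an Ollama model
ollama show qwen3:8b --modelfile | grep FROM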


Bottom Line

The choice is simpler than it looks:

Ollama: You’re running models locally for yourself. It just works. Start here.

llama.cpp: You need control Ollama doesn’t give you, or you need CPU inference/offloading. Power user territory.

vLLM: You’re serving models to other people, or you have multi-GPU setups. Production territory.

The mistake to avoid: using Ollama for production APIs when vLLM would handle 4x the load. Ollama is fantastic at what it was designed for: desktop local AI. It’s not designed for serving hundreds of concurrent users, and it shows in benchmarks.

For most readers of this site (hobbyists running models on their own hardware), Ollama is the right answer. You’ll know when you’ve outgrown it.

# Start here
ollama run qwen3:8b

# Graduate to this when you need to
vllm serve Qwen/Qwen3-8B