Running Vision Models Locally: Image-Understanding AI on Your GPU
More on this topic: VRAM Requirements · Ollama vs LM Studio · Best Models Under 3B · Quantization Explained
You can ask a local model “what’s in this image?” now. Describe a photo, read a chart, extract text from a screenshot, convert a handwritten note to text, all of it running on your GPU with nothing sent to the cloud.
Vision language models (VLMs) have gotten dramatically better since LLaVA proved the concept in 2023. Qwen2.5-VL 7B now scores 95.7 on the DocVQA document-understanding benchmark, matching what cloud-only models achieved a year ago. Gemma 3’s 4B model runs in under 4GB of VRAM. And Ollama makes all of them a one-command install.
This guide covers every vision model worth running locally in 2026, with real VRAM numbers, benchmark comparisons, and setup instructions for Ollama, LM Studio, and llama.cpp.
What Vision Models Do (and Don’t Do)
Vision language models take an image plus a text prompt and return a text response. They understand images; they don’t generate them.
What works well:
- Describe what’s in a photo
- Read text from screenshots and documents (OCR)
- Interpret charts, graphs, and diagrams
- Analyze UI screenshots and identify elements
- Convert code screenshots to actual code
- Answer questions about image content
What doesn’t work:
- Generating images (that’s Stable Diffusion or Flux)
- Real-time video analysis (frame-by-frame is possible but slow)
- Editing or modifying images
- Reliably counting large numbers of objects
If you want to generate images, see our Stable Diffusion guide or Flux guide. This article is about models that look at images and tell you things.
The Best Vision Models for Local Use (2026)
| Model | Params | Ollama Size | Min VRAM (Q4) | MMMU | DocVQA | Best For |
|---|---|---|---|---|---|---|
| Qwen2.5-VL 7B | 8.3B | 6.0 GB | ~8 GB | 58.6 | 95.7 | Best all-rounder |
| Gemma 3 27B | 27B | 17 GB | ~14 GB | 64.9 | 86.6 | General understanding |
| Gemma 3 12B | 12B | 8.1 GB | ~7 GB | 59.6 | 87.1 | Sweet spot for 8GB GPUs |
| Llama 3.2 Vision 11B | 10.7B | 7.8 GB | ~8 GB | 50.7 | 88.4 | Meta ecosystem |
| Pixtral 12B | 12.4B | ~8 GB | ~10 GB | 52.5 | 90.7 | Document extraction |
| Gemma 3 4B | 4B | 3.3 GB | ~3 GB | 48.8 | 75.8 | Low VRAM setups |
| Moondream 2 | 2B | ~2 GB | ~2.5 GB | – | – | Edge devices, <4GB |
| Qwen2.5-VL 72B | 72B | 49 GB | ~36 GB | 70.2 | 96.4 | Maximum quality |
MMMU measures general visual understanding. DocVQA measures document reading accuracy. Higher is better for both.
Qwen2.5-VL 7B: The One to Beat
Alibaba’s Qwen2.5-VL 7B is the current king of local vision models. It outperforms Llama 3.2 Vision 11B on almost every benchmark while being smaller. The document understanding score (95.7 DocVQA) is absurd for a 7B model: it reads documents more accurately than Llama 3.2 Vision 90B (90.1), a model more than ten times its size.
| Benchmark | Qwen2.5-VL 7B | Llama 3.2 Vision 11B |
|---|---|---|
| MMMU | 58.6 | 50.7 |
| DocVQA | 95.7 | 88.4 |
| ChartQA | 87.3 | – |
| MathVista | 68.2 | 51.5 |
| OCRBench | 86.4 | – |
Strengths: OCR, document reading, chart interpretation, math with visual content. Handles 125K context.
Weaknesses: Slightly slower than text-only models of the same size due to the vision encoder overhead.
Run it:
ollama run qwen2.5vl:7b
Also available at 3B (great for testing) and 72B (for multi-GPU setups).
Gemma 3: Google’s Strong Entry
Gemma 3 brought native multimodal support across three sizes: 4B, 12B, and 27B. All three handle both text and images. The 27B model scores highest on MMMU (64.9) among mid-size models, making it the best for general visual reasoning if you have the VRAM.
The real story is Google’s QAT (quantization-aware training) versions: models trained to keep their quality at int4 precision rather than quantized after the fact. The Gemma 3 27B QAT fits in 14GB VRAM with near-BF16 quality.
| Gemma 3 Size | BF16 VRAM | Int4/QAT VRAM | MMMU |
|---|---|---|---|
| 4B | ~8 GB | ~2.6 GB | 48.8 |
| 12B | ~24 GB | ~6.6 GB | 59.6 |
| 27B | ~54 GB | ~14.1 GB | 64.9 |
Strengths: Best general visual understanding at the 27B tier. QAT quantization preserves quality. 128K context.
Weaknesses: DocVQA scores trail Qwen2.5-VL significantly (86.6 vs 95.7 at comparable sizes).
Run it:
ollama run gemma3:4b # Fits on 4GB VRAM
ollama run gemma3:12b # Fits on 8GB VRAM
ollama run gemma3:27b # Needs 16GB+ VRAM
Llama 3.2 Vision: Meta’s Multimodal Play
Meta’s Llama 3.2 Vision comes in 11B and 90B. The 11B model fits on 8GB VRAM at Q4 and handles basic vision tasks competently. It’s a solid choice if you’re already in the Llama ecosystem, but benchmark-for-benchmark it loses to Qwen2.5-VL 7B despite being larger.
Strengths: 128K context. Good at natural image description. Strong community support.
Weaknesses: Trails Qwen2.5-VL on OCR, document understanding, and math reasoning. The 90B model needs 55GB+ of VRAM, which puts it out of reach for most consumer setups.
Run it:
ollama run llama3.2-vision # 11B default
ollama run llama3.2-vision:90b # Multi-GPU only
Pixtral 12B: The Document Specialist
Mistral’s Pixtral 12B pairs a 12B language model with a dedicated 400M-parameter vision encoder. Its DocVQA score (90.7) places it ahead of Llama 3.2 Vision 11B for document work.
The catch: Pixtral isn’t in Ollama’s official library. You need to use a community GGUF from HuggingFace or run it through vLLM or llama.cpp directly.
Strengths: Strong document understanding (90.7 DocVQA). Good chart reading (81.8 ChartQA).
Weaknesses: No official Ollama support. Limited community adoption compared to Qwen and Llama. Needs ~10GB VRAM at Q4.
Moondream 2: The Tiny Vision Model
Moondream 2 squeezes vision understanding into 2B parameters. At 4-bit quantization it uses just 2.45GB VRAM and runs at 184 tokens/sec on an RTX 3090. It’s the only viable option for truly constrained hardware.
Moondream 3 (preview, late 2025) bumps to 9B total parameters but only 2B active via mixture-of-experts, adding object detection and UI understanding.
Strengths: Runs on almost anything. 184 tok/s at 4-bit on an RTX 3090. Good enough for captioning and basic VQA.
Weaknesses: Substantially less accurate than larger models. OCRBench score of 61.2 vs Qwen2.5-VL’s 86.4.
LLaVA: The Pioneer
LLaVA proved back in 2023 that open-source vision models could work. LLaVA 1.6 (LLaVA-NeXT) improved resolution handling and reasoning. It’s still in Ollama and still functions, but the field has moved on. Qwen2.5-VL, Gemma 3, and Llama 3.2 Vision all outperform it.
Use LLaVA if: You’re following older tutorials that reference it, or you need the 34B variant for maximum quality on legacy setups.
Otherwise: Start with Qwen2.5-VL 7B or Gemma 3 12B instead.
ollama run llava # LLaVA 1.6 7B
ollama run llava:34b # LLaVA 1.6 34B (needs 24GB+ VRAM)
VRAM Requirements
This is what you can actually run on your GPU.
| Your VRAM | Best Vision Model | Ollama Command | Notes |
|---|---|---|---|
| 4 GB | Moondream 2 (4-bit) | ollama run moondream | 2.45 GB, basic captioning/VQA |
| 4 GB | Gemma 3 4B (int4) | ollama run gemma3:4b | 2.6 GB, better quality than Moondream |
| 8 GB | Qwen2.5-VL 7B (Q4) | ollama run qwen2.5vl:7b | 6 GB, best overall at this tier |
| 8 GB | Gemma 3 12B (int4) | ollama run gemma3:12b | 6.6 GB, tight fit but works |
| 12 GB | Same as 8GB + headroom | – | More context, less swapping |
| 16 GB | Gemma 3 27B (QAT int4) | ollama run gemma3:27b | 14 GB, highest MMMU at this tier |
| 24 GB | Qwen2.5-VL 32B (Q4) | ollama run qwen2.5vl:32b | 21 GB, best all-rounder at 24GB |
For a full breakdown of what text models fit on each GPU tier, see our VRAM requirements guide.
Key points:
- Vision models use more VRAM than text-only models of the same parameter count because of the vision encoder
- The VRAM numbers above assume the image is already loaded; complex or very high-resolution images can spike usage temporarily
- Leave 1-2GB headroom for the OS and the vision encoder’s processing overhead (the rough estimator below shows where these numbers come from)
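As a rough rule of thumb, weight memory is parameter count times bits-per-weight divided by 8, plus an allowance for the vision encoder, KV cache, and runtime buffers. Here’s a minimal back-of-envelope sketch in Python; the 2GB overhead figure is an assumption rather than a measurement, so treat the results as ballpark numbers, not guarantees.
def estimate_vram_gb(params_billion, bits_per_weight=4, overhead_gb=2.0):
    # Quantized weights: bits/8 bytes per parameter (4-bit is ~0.5 GB per billion params)
    weight_gb = params_billion * bits_per_weight / 8
    # Assumed allowance for the vision encoder, KV cache, and runtime buffers
    return weight_gb + overhead_gb
print(estimate_vram_gb(8.3))   # ~6 GB: Qwen2.5-VL 7B (8.3B params) at Q4
print(estimate_vram_gb(27))    # ~15.5 GB: Gemma 3 27B at int4 (QAT lands a bit lower, ~14 GB)
print(estimate_vram_gb(4))     # ~4 GB: Gemma 3 4B at int4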
How to Run Vision Models
Ollama (Easiest)
Ollama added vision support in late 2024. You pass images via the CLI, the Python SDK, or the REST API.
CLI:
# Pull the model
ollama pull qwen2.5vl:7b
# Chat with text only
ollama run qwen2.5vl:7b
# Describe an image (include the file path in the prompt; Ollama detects it)
ollama run qwen2.5vl:7b "What's in this image? ./photo.jpg"
# Multiple images work too
ollama run qwen2.5vl:7b "Compare these two screenshots: ./before.png ./after.png"
Python SDK:
import ollama
response = ollama.chat(
model='qwen2.5vl:7b',
messages=[{
'role': 'user',
'content': 'What text is in this screenshot?',
'images': ['screenshot.png']
}]
)
print(response['message']['content'])
REST API:
# Images must be base64-encoded (no data:image prefix)
curl http://localhost:11434/api/chat -d '{
"model": "qwen2.5vl:7b",
"messages": [{
"role": "user",
"content": "Describe this image",
"images": ["BASE64_STRING_HERE"]
}]
}'
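The same call from plain Python, including the base64 step. This is a minimal sketch, assuming the requests package is installed and a local file named screenshot.png exists:
import base64
import requests
# Ollama wants raw base64 with no data:image prefix (see the gotchas below)
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5vl:7b",
        "stream": False,  # one JSON response instead of streamed chunks
        "messages": [{
            "role": "user",
            "content": "Describe this image",
            "images": [image_b64],
        }],
    },
)
print(resp.json()["message"]["content"])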
Gotchas:
- Don’t include the data:image/jpeg;base64, prefix in API calls; Ollama expects raw base64 only
- Unsupported image formats can hang the server instead of returning an error. Stick to JPEG, PNG, and WebP
- Vision models load slower than text-only models on first inference due to the vision encoder initialization
For detailed Ollama setup, see our beginner tutorial.
LM Studio
LM Studio supports vision models with drag-and-drop image input. As of early 2026:
- Supports Gemma 3, Qwen2.5-VL, LLaVA, and other GGUF vision models
- Drag images directly into the chat window
- PDFs and text files also supported
- On Mac, the MLX engine provides native Apple Silicon acceleration for vision models
LM Studio is the better choice if you prefer a GUI. See our LM Studio tips guide for optimization.
llama.cpp (Advanced)
llama.cpp added full multimodal support in mid-2025 through the libmtmd library. Vision models need two files: the main model GGUF and a separate mmproj (multimodal projector) GGUF.
# Auto-download model + mmproj from HuggingFace
llama-server -hf ggml-org/gemma-3-4b-it-GGUF
# Or specify files manually
llama-server --model gemma-3-4b-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf
# Interactive CLI
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
# Then type /image path-to-image.jpg in the chat
llama.cpp supports Gemma 3, Qwen2.5-VL, Pixtral 12B, and LLaVA. The -hf flag auto-downloads both the model and the mmproj file.
Use llama.cpp if you need maximum control over inference parameters or if you’re running Pixtral 12B (which isn’t in Ollama’s official library). For most people, Ollama is simpler. See our llama.cpp vs Ollama vs vLLM comparison for when to use each.
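If you’d rather script against llama-server than use the CLI, recent builds expose an OpenAI-compatible endpoint, and multimodal-enabled builds accept images in the standard OpenAI content format (check the server docs for your version). A hedged sketch using the openai Python client, assuming a multimodal build listening on the default port 8080 and a local file named invoice.png; note that, unlike Ollama’s native API, this format wants the full data URI prefix:
import base64
from openai import OpenAI
# Point the client at the local server; the API key is ignored by llama-server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
with open("invoice.png", "rb") as f:
    data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")
resp = client.chat.completions.create(
    model="gemma-3-4b-it",  # name is informational here; the server answers with whatever model it loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all text from this document."},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }],
)
print(resp.choices[0].message.content)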
Use Cases That Actually Work
Screenshot and UI Analysis
Ask a vision model to describe a UI, identify buttons, or explain what an error dialog says. Qwen2.5-VL and Gemma 3 handle this well, and Moondream 3 added specific UI-understanding capability.
Prompt: "What error is shown in this screenshot? What should I click?"
Document OCR and Text Extraction
This is where Qwen2.5-VL dominates. Its 95.7 DocVQA score means it reads documents nearly as well as dedicated OCR tools. Feed it receipts, invoices, handwritten notes, or scanned pages.
Prompt: "Extract all text from this document. Preserve the formatting."
Chart and Diagram Reading
Feed a bar chart, line graph, or architecture diagram and ask the model to interpret it. Qwen2.5-VL scores 87.3 on ChartQA. It can identify trends, read values, and describe relationships.
Prompt: "What trend does this chart show? What are the highest and lowest values?"
Code Screenshot to Text
Got a screenshot of code from a tutorial, tweet, or presentation? Vision models convert it to actual copyable code. This works surprisingly well for clean screenshots.
Prompt: "Convert the code in this screenshot to text. Preserve indentation."
Photo Description and Accessibility
Describe photos for alt text, accessibility, or cataloging. All the models handle this; even Moondream at 2B produces usable descriptions.
Prompt: "Describe this photo in detail for someone who can't see it."
Benchmarks Compared
How the models stack up across different visual tasks:
| Model | MMMU (General) | DocVQA (Documents) | ChartQA (Charts) | MathVista (Math) |
|---|---|---|---|---|
| Qwen2.5-VL 72B | 70.2 | 96.4 | – | 74.8 |
| Gemma 3 27B | 64.9 | 86.6 | 78.0 | – |
| Llama 3.2 90B | 60.3 | 90.1 | 85.5 | – |
| Gemma 3 12B | 59.6 | 87.1 | 75.7 | – |
| Qwen2.5-VL 7B | 58.6 | 95.7 | 87.3 | 68.2 |
| Pixtral 12B | 52.5 | 90.7 | 81.8 | 58.0 |
| Llama 3.2 11B | 50.7 | 88.4 | – | 51.5 |
| Gemma 3 4B | 48.8 | 75.8 | 68.8 | – |
What stands out:
- Qwen2.5-VL 7B’s document score (95.7) is higher than Llama 3.2 90B (90.1). A 7B model reading documents better than a 90B model.
- Gemma 3 27B leads on MMMU (general visual understanding) among consumer-runnable models
- Pixtral 12B beats Llama 3.2 11B on every benchmark where they overlap
- For OCR and documents specifically, Qwen2.5-VL wins at every size
The Bottom Line
Most people (8GB+ VRAM): Run Qwen2.5-VL 7B. It’s the best overall vision model you can run locally. One command: ollama run qwen2.5vl:7b
Tight VRAM (4-6GB): Gemma 3 4B at int4 (2.6GB) gives you real vision capability. Moondream 2 (2.45GB at 4-bit) if you need the absolute minimum.
Maximum quality (16GB+): Gemma 3 27B QAT at 14GB for best general understanding, or Qwen2.5-VL 32B at 21GB if you have 24GB VRAM.
Documents and OCR specifically: Qwen2.5-VL wins at every tier. Its document understanding is unmatched among open models.
The landscape has shifted fast. Not long ago, LLaVA was the only game in town. Now you have four competitive model families, all running in Ollama with a single command, all fitting on consumer GPUs. If you’re running text models locally and haven’t tried vision yet, ollama run qwen2.5vl:7b with an image path in your prompt is all it takes.