
You can point a local model at an image and ask “what’s in this?” — describe a photo, extract text from a screenshot, read a chart, convert handwritten notes, analyze a UI mockup. All of it runs on your GPU. Nothing leaves your machine.

A lot changed in early 2026. Qwen3-VL replaced Qwen2.5-VL as the vision model to beat. Phi-4-reasoning-vision can actually solve math problems from photographs now. PaddleOCR-VL made dedicated document OCR nearly free to run. And Qwen 3.5 baked vision directly into its base architecture, though Ollama support for that is still catching up.

This guide covers which model for which GPU, how to set each one up, what speed to expect, and where local still falls short of cloud APIs.


Which vision model should you run?

Short version, by GPU tier:

| Your VRAM | Best Pick | Ollama Command | Why |
|---|---|---|---|
| 4 GB | Gemma 3 4B (int4) | ollama run gemma3:4b | 2.6 GB, real vision at minimum VRAM |
| 4 GB | SmolVLM2 2.2B | — (HuggingFace) | ~2 GB, edge-grade but functional |
| 8 GB | Qwen3-VL 8B (Q4) | ollama run qwen3-vl:8b | Best all-around, fits comfortably |
| 8 GB | Qwen2.5-VL 7B (Q4) | ollama run qwen2.5vl:7b | Still excellent, proven track record |
| 10-12 GB | Phi-4-reasoning-vision 15B (Q4) | — (llama.cpp) | Math and science diagrams, ~10 GB |
| 16 GB | Gemma 3 27B (QAT int4) | ollama run gemma3:27b | Highest general visual understanding |
| 24 GB | Qwen2.5-VL 32B (Q4) | ollama run qwen2.5vl:32b | Best all-around at 24 GB |
| 48 GB+ | Qwen2.5-VL 72B (Q4) | ollama run qwen2.5vl:72b | Maximum quality, needs serious hardware |
| Any (OCR only) | PaddleOCR-VL 0.9B | pip install | Runs on CPU, 92.6 on OmniDocBench |

If you just want an answer: ollama run qwen3-vl:8b and start asking it about images.


What vision models do (and don’t do)

Vision language models take an image plus a text prompt and return text. They understand images. They don’t generate them.

What works:

  • Describe photos, scenes, objects
  • Read text from screenshots and scanned documents (OCR)
  • Interpret charts, graphs, diagrams
  • Analyze UI screenshots and identify elements
  • Convert code screenshots to copyable text
  • Answer questions about image content
  • Solve math problems from photographed pages

What doesn’t work:

  • Generating images (that’s Stable Diffusion or Flux)
  • Real-time video analysis (frame-by-frame is possible but slow)
  • Editing or modifying images
  • Reliably counting large numbers of small objects

Every vision model worth running locally

Qwen3-VL 8B — the new default

Released late 2025, Qwen3-VL replaced Qwen2.5-VL at the top. The 8B model matches or beats its predecessor on OCR, charts, screenshots, and math across the board.

| Benchmark | Qwen3-VL 8B | Qwen2.5-VL 7B | Difference |
|---|---|---|---|
| MathVista | 85.8 | 68.2 | +17.6 |
| MMMU | — | 58.6 | — |
| DocVQA | 95+ | 95.7 | Comparable |
| ChartQA | 88+ | 87.3 | Comparable |

The MathVista jump matters most. Going from 68.2 to 85.8 means Qwen3-VL can actually reason about math problems in photographs, not just read the numbers.

Available at 2B, 8B, and 32B. The 30B-A3B MoE variant uses only 3B active parameters, so it fits in ~14-16 GB despite the total parameter count.

ollama run qwen3-vl:8b     # Best for 8GB GPUs
ollama run qwen3-vl:2b     # Tight VRAM
ollama run qwen3-vl:32b    # 24GB GPUs

It’s the best local all-rounder right now. Strong OCR, strong math reasoning from images, dynamic resolution support. The main downside is that it’s newer, so community tooling and tutorials still reference Qwen2.5-VL more often. The 32B variant needs 21+ GB.


Qwen2.5-VL 7B — still the proven workhorse

Qwen2.5-VL 7B held the crown for months and earned its reputation. Its 95.7 DocVQA score was the number that made people take local vision models seriously — a 7B model reading documents better than many 70B text-only models handle their own tasks.

| Benchmark | Qwen2.5-VL 7B | Llama 3.2 Vision 11B |
|---|---|---|
| MMMU | 58.6 | 50.7 |
| DocVQA | 95.7 | 88.4 |
| ChartQA | 87.3 | — |
| MathVista | 68.2 | 51.5 |
| OCRBench | 86.4 | — |

Still worth running if you’re already using it, if you need the most battle-tested option, or if Qwen3-VL hasn’t landed in your toolchain yet. Available at 3B, 7B, 32B, and 72B.

ollama run qwen2.5vl:7b     # Best for 8GB GPUs
ollama run qwen2.5vl:3b     # Testing and low VRAM
ollama run qwen2.5vl:32b    # 24GB GPUs

For LM Studio setup specifically, see our Qwen2.5-VL LM Studio guide.


Gemma 3 — Google’s QAT advantage

Gemma 3 comes in three sizes with native vision: 4B, 12B, 27B. The 27B scores highest on MMMU (64.9) among mid-size consumer models, so if you care about general visual reasoning and have the VRAM, it’s the pick.

The interesting part is Google’s QAT (Quantization-Aware Trained) versions. These models were trained to maintain quality at int4, not just quantized after the fact. Gemma 3 27B QAT fits in 14 GB with near-BF16 quality.

| Gemma 3 Size | BF16 VRAM | Int4/QAT VRAM | MMMU |
|---|---|---|---|
| 4B | ~8 GB | ~2.6 GB | 48.8 |
| 12B | ~24 GB | ~6.6 GB | 59.6 |
| 27B | ~54 GB | ~14.1 GB | 64.9 |

ollama run gemma3:4b    # Fits on 4GB VRAM
ollama run gemma3:12b   # Fits on 8GB VRAM
ollama run gemma3:27b   # Needs 16GB+ VRAM

Best general visual understanding at the 27B tier, and QAT preserves quality well. 128K context. The downside: DocVQA trails Qwen2.5-VL significantly (86.6 vs 95.7). If document reading is your main use, Qwen wins.


Phi-4-reasoning-vision 15B — the math specialist

Microsoft released this in March 2026. It pairs Phi-4-Reasoning with a SigLIP-2 vision encoder. The model decides when to engage deep reasoning and when a quick answer is enough, which Microsoft calls “adaptive thinking.”

| Benchmark | Phi-4-reasoning-vision 15B | Qwen2.5-VL 7B |
|---|---|---|
| MathVista | 75.2 | 68.2 |
| AI2D (science diagrams) | 84.8 | — |
| ChartQA | 83.3 | 87.3 |
| MMMU | 54.3 | 58.6 |

The science diagram score (84.8 AI2D) is where this model stands out. If you’re feeding it photographed textbook pages, chemistry diagrams, or physics problems, it outperforms the generalists.

~9-10 GB at Q4_K_M. Fits on a 12 GB GPU comfortably.

Gotcha: Not in Ollama’s official library as of March 2026. Run it through llama.cpp with the mmproj file, or check HuggingFace for GGUF conversions. The separate mmproj architecture means Ollama support will lag behind.

Strong at math reasoning from images and science diagrams. Adaptive thinking saves tokens on simple tasks. But MMMU trails Qwen2.5-VL, it’s not in Ollama, and it’s brand new with limited community experience.


PaddleOCR-VL 0.9B — the document specialist

This is not a general-purpose vision model. It’s an OCR engine that uses a vision-language architecture. If you need to extract text, tables, formulas, and charts from documents, PaddleOCR-VL is absurdly good for its size.

0.9 billion parameters. Runs on CPU. Scores 92.6 on OmniDocBench. Supports 109 languages.

| Benchmark | PaddleOCR-VL 0.9B | Qwen2.5-VL 7B |
|---|---|---|
| OmniDocBench | 92.6 | — |
| Text Edit Distance | 0.035 | — |
| Formula CDM | 91.4 | — |
| DocVQA | — | 95.7 |

The trade-off is clear: PaddleOCR-VL extracts document content with higher fidelity (tables, formulas, layouts) than a general VLM, but it can’t describe a photo, read a chart contextually, or answer open-ended questions about images.

pip install "paddleocr[doc-parser]"
# Launch as a server:
paddleocr genai_server --model_name PaddleOCR-VL-0.9B --backend vllm --port 8118

Gotchas: Requires PaddlePaddle, not PyTorch. Not available in Ollama or LM Studio. vLLM backend supported. Think of it as specialized infrastructure, not a chat model.

For the full setup walkthrough, see our PaddleOCR-VL guide.


SmolVLM2 — vision on nothing

HuggingFace’s SmolVLM2 comes in 256M, 500M, and 2.2B sizes. The 2.2B version uses about 2 GB of VRAM and produces usable results for captioning, basic VQA, and simple document reading.

This is the model for Raspberry Pis, old phones, and situations where you genuinely have no VRAM budget. It won’t compete with Qwen3-VL on quality, but it runs where nothing else fits.

It runs on almost anything and even handles basic video understanding. Accuracy drops hard compared to 7B+ models, though. Not in Ollama, so you’ll need HuggingFace transformers to run it.


Llama 3.2 Vision 11B — falling behind

Meta’s entry still works and has strong community support, but benchmark-for-benchmark it loses to Qwen2.5-VL 7B on nearly every task despite being larger. With Qwen3-VL now available, Llama 3.2 Vision is two generations behind on performance.

ollama run llama3.2-vision        # 11B default
ollama run llama3.2-vision:90b    # Multi-GPU only

Use this if: You’re locked into the Llama ecosystem or following tutorials that reference it. Otherwise: Qwen3-VL 8B or Qwen2.5-VL 7B.


Qwen 3.5 — native multimodal (with a catch)

Qwen 3.5 is different from Qwen3-VL. It’s not a vision model bolted onto a text model — vision is baked into the architecture from the start using “early fusion” training. Every Qwen 3.5 model, from the 0.8B to the 397B-A17B MoE flagship, handles images natively.

The catch: as of March 2026, Qwen 3.5 GGUF does not work for vision tasks in Ollama. The separate mmproj (multimodal projector) files aren’t handled correctly. You need llama.cpp directly or wait for Ollama to catch up.

For text-only Qwen 3.5, see our Qwen 3.5 guide. For vision right now, stick with Qwen3-VL or Qwen2.5-VL through Ollama.


LLaVA — the pioneer, now legacy

LLaVA proved open-source vision models could work in 2023. LLaVA-OneVision (2024) improved things with multi-image support and higher resolutions. But the field moved past it. Qwen2.5-VL, Qwen3-VL, Gemma 3, and Phi-4 all outperform LLaVA variants at comparable sizes.

LLaVA-OneVision isn’t even in Ollama’s official library — there’s an open GitHub issue (#6255) for it. The older LLaVA 1.6 is still available:

ollama run llava           # LLaVA 1.6 7B
ollama run llava:34b       # Needs 24GB+ VRAM

Use LLaVA if: You’re following older tutorials that reference it. Otherwise: Start with Qwen3-VL 8B.


VRAM requirements table

What you can actually run on your GPU, with real numbers:

| Model | Params | Q4 VRAM | Q8 VRAM | FP16 VRAM | Ollama? |
|---|---|---|---|---|---|
| SmolVLM2 2.2B | 2.2B | ~2 GB | ~3 GB | ~4.5 GB | No |
| PaddleOCR-VL 0.9B | 0.9B | ~1-2 GB | ~2 GB | ~2 GB | No |
| Gemma 3 4B | 4B | ~2.6 GB | ~5 GB | ~8 GB | Yes |
| Qwen3-VL 2B | 2B | ~2 GB | ~3 GB | ~4 GB | Yes |
| Phi-4-multimodal | 5.6B | ~4 GB | ~7 GB | ~11 GB | No |
| Qwen2.5-VL 7B | 7B | ~6 GB | ~8 GB | ~14 GB | Yes |
| Qwen3-VL 8B | 8B | ~6 GB | ~9 GB | ~16 GB | Yes |
| Gemma 3 12B | 12B | ~6.6 GB | ~13 GB | ~24 GB | Yes |
| Llama 3.2 Vision 11B | 10.7B | ~8 GB | ~12 GB | ~22 GB | Yes |
| Phi-4-reasoning-vision | 15B | ~10 GB | ~16 GB | ~30 GB | No |
| Gemma 3 27B (QAT) | 27B | ~14 GB | ~28 GB | ~54 GB | Yes |
| Qwen3-VL 32B | 32B | ~21 GB | ~34 GB | ~64 GB | Yes |
| Qwen2.5-VL 72B | 72B | ~40 GB | ~71 GB | ~144 GB | Yes |

Key points:

  • Vision models use more VRAM than text-only models of the same size because the vision encoder adds overhead
  • High-resolution images spike VRAM temporarily — leave 1-2 GB headroom
  • The “Q4 VRAM” column is what matters for consumer GPUs — that’s the sweet spot between quality and fit
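If your GPU or model isn't in the table, a back-of-envelope estimate gets you close. A minimal sketch, assuming ~0.6 GB per billion parameters at Q4 plus fixed allowances for the vision encoder and headroom (all three constants are rules of thumb, not measured values):

```python
def estimate_q4_vram_gb(params_billions, vision_overhead_gb=1.0, headroom_gb=1.5):
    # Q4 weights run roughly 0.6 GB per billion parameters once
    # quantization scales are included; the vision encoder and the
    # temporary spikes from high-resolution images add overhead on top.
    return params_billions * 0.6 + vision_overhead_gb + headroom_gb

print(round(estimate_q4_vram_gb(8), 1))  # → 7.3
```

For an 8B model this lands around 7 GB, which is why the guide calls 8 GB cards "fits comfortably" rather than "barely fits."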

For text model VRAM numbers, see our VRAM requirements guide. Use our planning tool to check exact fit for your setup.


Speed expectations

Vision tasks are slower than pure text because images consume hundreds to thousands of tokens. A single image might cost 729 to 3,600 tokens depending on the model and resolution. That means higher first-token latency and more compute per turn.
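You can estimate the image cost yourself. A rough sketch, assuming a Qwen-style grid of one token per 28x28-pixel patch with a cap of a few thousand tokens (the exact scheme and cap differ per model):

```python
import math

def estimate_image_tokens(width, height, patch_px=28, max_tokens=3600):
    # One token per 28x28-pixel patch is a reasonable rule of thumb for
    # Qwen-style dynamic-resolution encoders (14px ViT patches merged 2x2).
    # Real counts vary by model and many models cap or downscale inputs.
    tokens = math.ceil(width / patch_px) * math.ceil(height / patch_px)
    return min(tokens, max_tokens)

print(estimate_image_tokens(1024, 768))  # → 1036
```

A 1024x768 screenshot costs about as much context as a thousand words of text, which is most of why first-token latency jumps.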

Approximate generation speed (tokens/second, Q4 quantization):

| Model | RTX 4090 (24 GB) | RTX 3090 (24 GB) | M4 Pro (24 GB) | CPU only |
|---|---|---|---|---|
| Qwen3-VL 8B | 100-140 t/s | 80-120 t/s | 40-60 t/s | 8-15 t/s |
| Gemma 3 12B | 60-80 t/s | 50-65 t/s | 30-45 t/s | 5-10 t/s |
| Phi-4-vision 15B | 40-60 t/s | 35-50 t/s | 25-35 t/s | 4-8 t/s |
| Gemma 3 27B | 30-40 t/s | 25-35 t/s | 15-25 t/s | 2-5 t/s |

These are output token speeds. The first token takes noticeably longer with vision because the model needs to process the image through the vision encoder before generating. Expect 2-5 seconds for first token on a 7-8B model with a standard image, longer for high-resolution or multi-image inputs.


How to run vision models

Ollama (easiest)

Pass images via the CLI, Python SDK, or REST API.

CLI:

# Pull the model
ollama pull qwen3-vl:8b

# Describe an image (the CLI picks up image file paths included in the prompt)
ollama run qwen3-vl:8b "What's in this image? ./photo.jpg"

# Multiple images
ollama run qwen3-vl:8b "Compare these two screenshots: ./before.png ./after.png"

Python SDK:

import ollama

response = ollama.chat(
    model='qwen3-vl:8b',
    messages=[{
        'role': 'user',
        'content': 'What text is in this screenshot?',
        'images': ['screenshot.png']
    }]
)
print(response['message']['content'])

REST API:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3-vl:8b",
  "messages": [{
    "role": "user",
    "content": "Describe this image",
    "images": ["BASE64_STRING_HERE"]
  }]
}'
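The images field takes raw base64 strings. A minimal Python helper for building that payload (actually sending it assumes an Ollama server running on the default port):

```python
import base64

def image_to_ollama_b64(path):
    # Ollama's API wants raw base64 in the "images" array. Do NOT prepend
    # a "data:image/jpeg;base64," URI prefix; Ollama rejects it.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Payload for POST http://localhost:11434/api/chat:
# payload = {
#     "model": "qwen3-vl:8b",
#     "messages": [{"role": "user",
#                   "content": "Describe this image",
#                   "images": [image_to_ollama_b64("photo.jpg")]}],
# }
```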

Gotchas:

  • Don’t include data:image/jpeg;base64, in API calls — Ollama expects raw base64 only
  • Unsupported image formats can hang the server instead of returning an error. Stick to JPEG, PNG, and WebP
  • Vision models load slower than text-only on first inference due to vision encoder initialization
  • Qwen 3.5 vision does NOT work in Ollama as of March 2026 — use Qwen3-VL instead

For detailed Ollama setup, see our beginner tutorial.

LM Studio

LM Studio supports vision models with drag-and-drop image input, including Gemma 3, Qwen2.5-VL, LLaVA, and other GGUF vision models. Drag images into the chat window, or drop PDFs for document analysis. On Mac, the MLX engine provides native Apple Silicon acceleration.

For Qwen2.5-VL specifically in LM Studio, see our dedicated setup guide.

llama.cpp (advanced)

llama.cpp handles vision models through its libmtmd library. You need two files: the main model GGUF and a separate mmproj (multimodal projector) GGUF.

# Auto-download model + mmproj from HuggingFace
llama-server -hf ggml-org/gemma-3-4b-it-GGUF

# Or specify files manually
llama-server --model gemma-3-4b-Q4_K_M.gguf --mmproj mmproj-model-f16.gguf

# Interactive CLI with vision
llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF
# Then type /image path-to-image.jpg in the chat

This is how you run Phi-4-reasoning-vision and other models not yet in Ollama. The -hf flag auto-downloads both files. See our llama.cpp vs Ollama vs vLLM comparison for when to use each.


Use cases with real examples

Document OCR and text extraction

This is where local vision models are genuinely competitive with cloud APIs. Feed the model receipts, invoices, handwritten notes, scanned pages, or PDFs.

Best models: PaddleOCR-VL 0.9B (specialized, 109 languages), Qwen2.5-VL 7B / Qwen3-VL 8B (general purpose)

Prompt: "Extract all text from this document. Preserve the table structure."

PaddleOCR-VL is better when you need exact layout preservation — tables, formulas, column structures. Qwen models are better when you need the model to understand the document and answer questions about its content.
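When you ask a general VLM to preserve table structure, the answer usually comes back as a markdown pipe table. A small post-processing sketch to turn that into CSV (assumes well-formed output with no escaped pipes inside cells):

```python
import csv
import io

def pipe_table_to_csv(table_text):
    # Convert a markdown pipe table (the format VLMs usually emit for
    # tables) into CSV text, skipping the |---|---| separator row.
    out = io.StringIO()
    writer = csv.writer(out)
    for line in table_text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row like |---|---| (or a blank line)
        writer.writerow(cells)
    return out.getvalue()
```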

For privacy-sensitive documents (medical records, legal filings, financial statements), local vision models are the only option that guarantees nothing leaves your network.

Screenshot and UI analysis

Point a model at a screenshot and ask what’s on screen. Identify errors, describe UI layouts, extract data from application windows.

Best models: Qwen3-VL 8B, Phi-4-reasoning-vision 15B (strongest at GUI grounding)

Prompt: "What error is shown in this dialog? What should I click to fix it?"

Code screenshot to copyable text

Someone posts code as a screenshot on Twitter or in a presentation. Feed it to a vision model and get actual text back.

Prompt: "Convert the code in this screenshot to text. Preserve indentation."

This works well for clean screenshots. Messy or low-resolution images drop accuracy. Qwen2.5-VL and Qwen3-VL are the most reliable here thanks to their OCR training.
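Replies often arrive with prose wrapped around a fenced code block, so a little cleanup helps before you paste. A sketch that pulls out the first fenced block from a reply:

```python
import re

FENCE = "`" * 3  # built this way so the fence doesn't break this snippet

def extract_code(reply):
    # Grab the body of the first fenced code block in a model reply,
    # ignoring an optional language tag after the opening fence.
    # Fall back to the whole reply if there is no fence at all.
    m = re.search(FENCE + r"[\w+-]*\n(.*?)" + FENCE, reply, re.DOTALL)
    return m.group(1).rstrip("\n") if m else reply.strip()
```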

Chart and diagram reading

Feed a bar chart, line graph, or architecture diagram and ask the model to interpret it.

Best models: Qwen3-VL 8B (ChartQA 88+), Qwen2.5-VL 7B (ChartQA 87.3)

Prompt: "What trend does this chart show? What are the highest and lowest values?"

Math problems from photographs

Photograph a textbook page, a whiteboard, or a problem set. The model reads the problem and works through it.

Best models: Qwen3-VL 8B (MathVista 85.8), Phi-4-reasoning-vision 15B (MathVista 75.2, better on science diagrams)

Prompt: "Solve the math problem shown in this image. Show your work."

This is where the generational improvement matters. Qwen2.5-VL scored 68.2 on MathVista. Qwen3-VL scores 85.8. That’s the difference between “sometimes gets it right” and “usually gets it right.”

Design-to-code

Convert a mockup, wireframe, or screenshot into HTML/CSS/React components.

Prompt: "Convert this UI screenshot to a React component with Tailwind CSS."

This works for simple layouts. Complex designs with animations, gradients, and precise spacing still need human refinement. The model gets the structure right maybe 70-80% of the time, which saves real typing but isn’t something you’d ship directly.


How local vision compares to cloud APIs

How local stacks up against GPT-4V, Claude Vision, and Gemini:

| Capability | Local (Qwen3-VL 8B) | Cloud (GPT-4V / Claude) |
|---|---|---|
| Photo description | Good | Better — more nuance, fewer errors |
| Document OCR | Excellent (95+ DocVQA) | Excellent — roughly equivalent |
| Chart reading | Good (88+ ChartQA) | Better — handles complex multi-chart layouts |
| Math from images | Good (85.8 MathVista) | Better — GPT-4V scores 90+ |
| Multi-image reasoning | Basic | Much better — handles 10+ images with context |
| Speed (first token) | 2-5 seconds | 1-3 seconds |
| Speed (generation) | 80-140 t/s on good GPU | Varies, often faster |
| Privacy | Complete — nothing leaves your machine | Data goes to cloud provider |
| Cost | Free after hardware | $0.01-0.04 per image |
| Availability | Always, no internet needed | Requires API access and connectivity |

Local matches cloud on document OCR and text extraction. A good local model reads receipts and invoices about as well as GPT-4V. If documents are your main use case, local wins on privacy and cost.

Local falls short on complex reasoning across multiple images, subtle visual humor, cluttered or low-quality images, and edge cases. Cloud models have been trained on vastly more visual data.

The practical split: use local for document processing, screenshot analysis, and anything involving sensitive images. Use cloud for one-off complex queries where you need maximum accuracy and don’t mind the cost.
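The cost trade-off is easy to make concrete. A toy break-even sketch (the $0.02-per-image midpoint and the GPU price are assumptions, and electricity is ignored):

```python
def breakeven_images(gpu_cost_usd, cloud_cost_per_image=0.02):
    # Images you would need to process before a local GPU pays for
    # itself versus a cloud API charging ~$0.01-0.04 per image.
    return round(gpu_cost_usd / cloud_cost_per_image)

print(breakeven_images(1600))  # → 80000
```

At a hypothetical $1,600 for a 24 GB card, local wins after roughly 80,000 images, so the economics only favor local at real volume; for occasional use, privacy is the stronger argument.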


Benchmarks compared

Full comparison across all major local vision models:

| Model | MMMU (General) | DocVQA (Documents) | ChartQA (Charts) | MathVista (Math) | Size |
|---|---|---|---|---|---|
| Qwen3-VL 8B | — | 95+ | 88+ | 85.8 | 8B |
| Qwen2.5-VL 72B | 70.2 | 96.4 | — | 74.8 | 72B |
| Gemma 3 27B | 64.9 | 86.6 | 78.0 | — | 27B |
| Llama 3.2 90B | 60.3 | 90.1 | 85.5 | — | 90B |
| Gemma 3 12B | 59.6 | 87.1 | 75.7 | — | 12B |
| Qwen2.5-VL 7B | 58.6 | 95.7 | 87.3 | 68.2 | 7B |
| Phi-4-reasoning-vision | 54.3 | — | 83.3 | 75.2 | 15B |
| Pixtral 12B | 52.5 | 90.7 | 81.8 | 58.0 | 12B |
| Llama 3.2 11B | 50.7 | 88.4 | — | 51.5 | 11B |
| Gemma 3 4B | 48.8 | 75.8 | 68.8 | — | 4B |

What jumps out:

  • Qwen3-VL 8B’s MathVista score (85.8) is the highest on this table. An 8B model doing math from photos better than a 72B model.
  • Qwen2.5-VL 7B’s document score (95.7) is still higher than Llama 3.2 90B (90.1). A 7B model reading documents better than a 90B model.
  • Phi-4 occupies a weird niche: strong on math (75.2) and science diagrams (84.8 AI2D) but below average on general understanding (54.3 MMMU).
  • For OCR specifically, Qwen wins at every tier.

The bottom line

Most people (8 GB+ VRAM): ollama run qwen3-vl:8b. Best all-around. If Qwen3-VL isn’t available in your setup yet, ollama run qwen2.5vl:7b is still excellent.

Tight VRAM (4-6 GB): Gemma 3 4B at int4 (2.6 GB) gives you real vision capability. SmolVLM2 2.2B (~2 GB) if you’re on a Raspberry Pi or old phone.

Documents and OCR: PaddleOCR-VL 0.9B runs on CPU and scores 92.6 on document benchmarks. Use it alongside a general VLM — PaddleOCR for extraction, Qwen for understanding.

Math and science diagrams: Phi-4-reasoning-vision 15B if you have 10+ GB VRAM. Qwen3-VL 8B if you don’t.

Maximum quality (16 GB+): Gemma 3 27B QAT at 14 GB for general understanding. Qwen2.5-VL 32B at 21 GB if you have 24 GB VRAM and want the best document accuracy.

A year ago, LLaVA was the only option. Now there are five competitive model families in Ollama, specialized OCR models that run on CPU, and math reasoning from photos that actually works. If you’re already running text models locally and haven’t tried vision yet, ollama run qwen3-vl:8b with an image path in your prompt is all it takes.